Replace column values of one file with values present in another file


Consider two files

file 1:

name|gender|phone|email|city|country 
abc|F|11111|ldiskmsjdh|pune|india
xyz|M|22222|wassrrrrtf|delhi|india

file2:

sno|name|email|country 
1|abc|zzzzzzzz|USA
2|mnq|tttttttt|UK

I need the output below using unix:

name|gender|phone|email|city|country
abc|F|11111|zzzzzzzz|pune|USA
xyz|M|22222|wassrrrrtf|delhi|india

NOTE: Match based on the primary key, i.e. "name".

I was able to replace the entire line, but I want to replace only the columns of rows where there is a match, updating only those columns that are present in the second file. The original format should not change.

I am looking for general code that will work for any number (N) of columns. The key position will remain the same.


There are 5 best solutions below

Ed Morton On BEST ANSWER

I wouldn't normally post an answer when the OP hasn't provided any attempt of their own but since there are multiple answers posted already...

This may be what you want, using any awk:

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR == FNR {
    if ( FNR == 1 ) {
        for ( i=1; i<=NF; i++ ) {
            tags2aFldNrs[$i] = i
        }
    }
    else {
        name = $(tags2aFldNrs["name"])
        names2aVals[name] = $0
    }
    next
}
{
    if ( FNR == 1 ) {
        for ( i=1; i<=NF; i++ ) {
            tags2bFldNrs[$i] = i
            if ( $i in tags2aFldNrs ) {
                bFldNrs2aFldNrs[i] = tags2aFldNrs[$i]
            }
        }
    }
    else {
        name = $(tags2bFldNrs["name"])
        if ( name in names2aVals ) {
            split(names2aVals[name],aVals)
            for ( bFldNr in bFldNrs2aFldNrs ) {
                aFldNr = bFldNrs2aFldNrs[bFldNr]
                $bFldNr = aVals[aFldNr]
            }
        }
    }
    print
}

$ awk -f tst.awk file2 file1
name|gender|phone|email|city|country
abc|F|11111|zzzzzzzz|pune|USA
xyz|M|22222|wassrrrrtf|delhi|india

That'll work efficiently no matter how many fields you have on each line, as it only loops through the subset of fields whose column header tags are common between the 2 files, and only when the name matches. It'll also work no matter where the name column appears in each file.
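For a quick check of that position-independence claim, here is a condensed, self-contained variant of the same header-mapping technique (not the original tst.awk; the file names and the shortened script are illustrative), run against a copy of file2 whose columns have been deliberately reordered. The result is unchanged, because fields are matched by header name rather than by position:

```shell
# Sample data from the question, with file2's columns shuffled on purpose
cat > f1 <<'EOF'
name|gender|phone|email|city|country
abc|F|11111|ldiskmsjdh|pune|india
xyz|M|22222|wassrrrrtf|delhi|india
EOF
cat > f2_reordered <<'EOF'
country|name|email|sno
USA|abc|zzzzzzzz|1
UK|mnq|tttttttt|2
EOF
cat > patch_demo.awk <<'EOF'
BEGIN { FS = OFS = "|" }
NR == FNR {                              # first file: the patch values
    if (FNR == 1) { for (i=1; i<=NF; i++) aPos[$i] = i }  # header name -> field nr
    else          { rows[$(aPos["name"])] = $0 }          # key rows by name
    next
}
FNR == 1 {                               # second file: remember its header
    for (i=1; i<=NF; i++) { hdr[i] = $i; if ($i == "name") keyCol = i }
    print; next
}
{                                        # second file: patch matching rows
    if ($keyCol in rows) {
        split(rows[$keyCol], vals)       # split() uses FS by default
        for (i=1; i<=NF; i++)
            if (hdr[i] in aPos) $i = vals[aPos[hdr[i]]]
    }
    print
}
EOF
awk -f patch_demo.awk f2_reordered f1
```

The output is still `abc|F|11111|zzzzzzzz|pune|USA` and the unmatched xyz row passes through untouched, exactly as with file2 in its original column order.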

Arnaud Valmary On

Something like this:

Program: merge.awk

BEGIN {
    FS  = "|"
    OFS = "|"
    FILE1_KEY_COL    = 1
    FILE1_COL_NAME   = 1
    FILE1_COL_GENDER = 2
    FILE1_COL_PHONE  = 3
    FILE1_COL_CITY   = 5
    FILE2_KEY_COL     = 2
    FILE2_COL_NAME    = 2
    FILE2_COL_EMAIL   = 3
    FILE2_COL_COUNTRY = 4
}
NR == FNR && FNR == 1 {
    # Ignore first line
    next
}
NR == FNR {
    FILE_2_NAME[$FILE2_KEY_COL]    = $FILE2_COL_NAME
    FILE_2_EMAIL[$FILE2_KEY_COL]   = $FILE2_COL_EMAIL
    FILE_2_COUNTRY[$FILE2_KEY_COL] = $FILE2_COL_COUNTRY
}
NR != FNR && FNR == 1 {
    print $0
    next
}
NR != FNR && FILE_2_NAME[$FILE1_KEY_COL] != $FILE1_KEY_COL {
    print $FILE1_KEY_COL > "/dev/stderr"
    next
}
NR != FNR {
    print $FILE1_COL_NAME,$FILE1_COL_GENDER,$FILE1_COL_PHONE,FILE_2_EMAIL[$FILE1_KEY_COL],$FILE1_COL_CITY,FILE_2_COUNTRY[$FILE1_KEY_COL]
}

Executed like this:

awk -f merge.awk file2 file1 >new_records 2>unknown_records

Warning: file2 (the reference with the new values) must come before file1 on the command line.

UPDATE

In order to handle more fields, you could generate parts of the awk script from the file headers, like this:

Declarations:

$ cat file1 | head -1 | tr '|' "\n" | tr '[a-z]' '[A-Z]' | awk '{printf("FILE_1_COL_%-20s = %d\n", $0, i+1); i++}'
FILE_1_COL_NAME                 = 1
FILE_1_COL_GENDER               = 2
FILE_1_COL_PHONE                = 3
FILE_1_COL_EMAIL                = 4
FILE_1_COL_CITY                 = 5
FILE_1_COL_COUNTRY              = 6

$ cat file2 | head -1 | tr '|' "\n" | tr '[a-z]' '[A-Z]' | awk '{printf("FILE_2_COL_%-20s = %d\n", $0, i+1); i++}'
FILE_2_COL_SNO                  = 1
FILE_2_COL_NAME                 = 2
FILE_2_COL_EMAIL                = 3
FILE_2_COL_COUNTRY              = 4

File 2 memorization:

$ cat file2 | head -1 | tr '|' "\n" | tr '[a-z]' '[A-Z]' | awk '{printf("FILE_2_%s[$FILE_2_KEY_COL] = $FILE_2_COL_%s\n", $0, $0); i++}'
FILE_2_SNO[$FILE_2_KEY_COL] = $FILE_2_COL_SNO
FILE_2_NAME[$FILE_2_KEY_COL] = $FILE_2_COL_NAME
FILE_2_EMAIL[$FILE_2_KEY_COL] = $FILE_2_COL_EMAIL
FILE_2_COUNTRY[$FILE_2_KEY_COL] = $FILE_2_COL_COUNTRY

Print:

$ cat file1 | head -1 | tr '|' "\n" | tr '[a-z]' '[A-Z]' | awk 'BEGIN {printf("print ")} i>0 {printf(",")} {printf("$FILE_1_COL_%s", $0); i++} END {printf("\n")}'
print $FILE_1_COL_NAME,$FILE_1_COL_GENDER,$FILE_1_COL_PHONE,$FILE_1_COL_EMAIL,$FILE_1_COL_CITY,$FILE_1_COL_COUNTRY
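To close the loop on this code-generation approach, the generated fragments can be redirected, wrapped in fixed boilerplate, into an actual script file. A sketch for the BEGIN block only (merge_generated.awk is a hypothetical output name; file1 is recreated inline so the snippet is self-contained):

```shell
# Recreate file1 from the question so the snippet runs standalone
cat > file1 <<'EOF'
name|gender|phone|email|city|country
abc|F|11111|ldiskmsjdh|pune|india
EOF
# Emit fixed boilerplate around the generated column declarations
{
  echo 'BEGIN {'
  echo '    FS  = "|"'
  echo '    OFS = "|"'
  head -1 file1 | tr '|' '\n' | tr '[a-z]' '[A-Z]' |
    awk '{ printf("    FILE_1_COL_%-20s = %d\n", $0, NR) }'
  echo '}'
} > merge_generated.awk
cat merge_generated.awk
```

The remaining blocks (memorization, print) could be appended the same way before running the result with `awk -f merge_generated.awk`.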
Amadan On

Here is an awkward but general solution.

# patch_csv.awk

BEGIN {
  FS="|"
}

FNR==NR {
  if (FNR == 1) {
    nf1 = NF
    print
    for (i = 1; i <= NF; i++) {
      header1[i] = $i
      if ($i == "name") {
        key = i
      }
    }
  } else {
    data[$key] = $0
  }
}

FNR!=NR {
  if (FNR == 1) {
    for (j = 1; j <= NF; j++) {
      header2[j] = $j
      if ($j == "name") {
        key = j
      }
      for (i = 1; i <= nf1; i++) {
        if (header1[i] == header2[j]) {
          colmap[i] = j
        }
      }
    }
  } else if (data[$key]) {
    split(data[$key], record, FS)
    for (i = 1; i <= nf1; i++) {
      printf "%s%s", (colmap[i] ? $colmap[i] : record[i]), (i == nf1 ? "\n" : FS)
    }
  }
}

Run like this:

awk -f patch_csv.awk file1 file2

Output:

name|gender|phone|email|city|country
abc|F|11111|zzzzzzzz|pune|USA

Note that this is a very primitive handling of the CSV format, as it does not support quoting or multiline values.
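To illustrate that limitation concretely, plain FS-based splitting miscounts fields as soon as a quoted value contains the delimiter (the two-column sample here is made up for the demonstration):

```shell
# The quoted data field "hello|world" is split in two by FS="|",
# so the header line counts 2 fields but the data line counts 3
printf '%s\n' 'name|note' 'abc|"hello|world"' | awk -F'|' '{ print NF }'
```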

Of course, this would be much simpler in, say, Python or Ruby. Not to mention they also handle the full CSV specification correctly.

But if you are willing to hardcode column numbers, then this might be simpler:

head -1 file1.csv && \
join -t '|' -1 1 -2 2 \
  <(tail -n +2 file1.csv | sort -t '|' -k 1,1) \
  <(tail -n +2 file2.csv | sort -t '|' -k 2,2) | \
awk -v FS='|' \
  '{ print($1 FS $2 FS $3 FS ($8 ? $8 : $4) FS $5 FS ($9 ? $9 : $6)) }'

Print the header using head -1. Then use join to display the rows of the first file joined to the second file, using the first column of the first file (-1 1) and the second column of the second file (-2 2). For join, the files must be sorted; so first tear off the header (tail -n +2), then sort each file on the relevant column (sort -k). Both join and sort should use | as the separator (-t '|'). The join output contains columns from both files, so we need to merge them ourselves, and we can use awk to do so, e.g. by printing the 4th field if the 8th one is empty. Note that the row order is not preserved.
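For reference, the pipeline above can be made runnable under plain sh by swapping the process substitutions for temporary files (a portability tweak, not part of the original answer); the question's sample data is written inline as file1.csv and file2.csv:

```shell
cat > file1.csv <<'EOF'
name|gender|phone|email|city|country
abc|F|11111|ldiskmsjdh|pune|india
xyz|M|22222|wassrrrrtf|delhi|india
EOF
cat > file2.csv <<'EOF'
sno|name|email|country
1|abc|zzzzzzzz|USA
2|mnq|tttttttt|UK
EOF
# Strip headers and sort each file on its join column
tail -n +2 file1.csv | sort -t '|' -k 1,1 > f1.sorted
tail -n +2 file2.csv | sort -t '|' -k 2,2 > f2.sorted
head -1 file1.csv
# The raw join output for the matching row is file1's fields followed
# by file2's non-key fields:
#   abc|F|11111|ldiskmsjdh|pune|india|1|zzzzzzzz|USA
join -t '|' -1 1 -2 2 f1.sorted f2.sorted |
awk -v FS='|' \
  '{ print($1 FS $2 FS $3 FS ($8 ? $8 : $4) FS $5 FS ($9 ? $9 : $6)) }'
```

Only the matching abc row survives this inner join, matching the output shown above.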

EDIT: If you have 300 fields... just use a proper programming language lol

EDIT2: Ruby version, because I have been challenged. It also does an outer join, since OP clarified:

require 'csv'

abort "Usage: #{__FILE__} <data_csv> <patch_csv> <key_field>" unless ARGV.length == 3
key = ARGV[2].to_sym
data_table = CSV.table(ARGV[0], col_sep: '|')
patch_table = CSV.table(ARGV[1], col_sep: '|')
common_fields = data_table.headers & patch_table.headers
patch_hash = patch_table.map { |row| [row[key], row.to_h.slice(*common_fields)] }.to_h
result_csv = CSV.generate(col_sep: "|", headers: true ) { |csv|
  csv << data_table.headers
  data_table.each { |row|
    csv << row.to_h.merge(patch_hash[row[key]] || {}) }
}
puts result_csv
jhnc On

This seems like a database operation on CSV data:

sqlite3 '' <<'EOD'
.mode csv
.separator |
.import "file 1" t1
.import file2 t2
update t1
    set
        email = t2.email,
        "country " = t2."country "
    from ( select * from t2 ) as t2
    where t1.name = t2.name;
.headers on
.output file1.new
select * from t1 order by rowid;
EOD

which writes file1.new:

name|gender|phone|email|city|"country "
abc|F|11111|zzzzzzzz|pune|USA
xyz|M|22222|wassrrrrtf|delhi|india

I assume values in the name column in file2 are distinct, so there is a 1:1 name→country mapping. If this is not the case, one country will be picked randomly. (Likewise for email.)

Adjust the column name if you didn't intend the trailing space.
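The distinctness assumption is easy to verify up front. A small sketch that prints each name occurring more than once in file2 (name is column 2 there; a duplicate row is deliberately injected into this copy of the sample data so the check has something to report):

```shell
cat > file2 <<'EOF'
sno|name|email|country
1|abc|zzzzzzzz|USA
2|mnq|tttttttt|UK
3|abc|yyyyyyyy|France
EOF
# Print a name each extra time it is seen after its first occurrence
awk -F'|' 'FNR > 1 && seen[$2]++ { print $2 }' file2
```

Empty output would mean the names are distinct and the update is unambiguous.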

glenn jackman On

Using join, with a little awk to tidy up:

join --header -t '|' -1 1 -2 2 -o 1.1,1.2,1.3,2.3,1.4,1.5,2.4,1.6 -a 1 file1 file2 |
   awk -F '|' -v OFS='|' '{ if ($4 == "") $4 = $5; if ($7 == "") $7 = $8; print $1, $2, $3, $4, $6, $7 }'

name|gender|phone|email|city|country
abc|F|11111|zzzzzzzz|pune|USA
xyz|M|22222|wassrrrrtf|delhi|india

For each replaced column, the file2 field (2.3 email, 2.4 country) is placed just before its file1 fallback; the awk falls back to the file1 value when the row had no match in file2 (-a 1 keeps such rows, with the 2.x fields empty).

Unfortunately, with join you have to be explicit about the positions of the join fields and the selected fields.
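The repetitive 1.x part of that -o list can at least be generated from file1's header instead of being typed by hand (a hypothetical helper; the 2.x entries for replaced columns still have to be spliced in manually; file1 is recreated inline so the snippet is self-contained):

```shell
cat > file1 <<'EOF'
name|gender|phone|email|city|country
abc|F|11111|ldiskmsjdh|pune|india
xyz|M|22222|wassrrrrtf|delhi|india
EOF
# Emit "1.1,1.2,...,1.N" where N is the number of header fields
head -1 file1 |
awk -F'|' '{ for (i = 1; i <= NF; i++) printf "1.%d%s", i, (i < NF ? "," : "\n") }'
```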