how to find out common columns and its records from two files using awk

439 Views Asked by At

I have two files:

File 1:

id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio

File 2(only headers):

id|name|country

Now, I want an output like:

OUTPUT:

id|name|country
1|abc|xyz
2|asd|uio

Basically, I have a user record file(file1) and a header file(file2).Now, I want to extract only those records from (file1) whose columns match with that in the header file.

I want to do this using awk or bash.

I tried using:

awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt

and have no idea what to do next.

Thank You

5

There are 5 best solutions below

5
RavinderSingh13 On BEST ANSWER

Following awk may help you on same.

awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2  File1

Adding a non-one liner form of solution too now.

awk -F"|" '
FNR==NR{
   for(i=1;i<=NF;i++){
     a[$i]};
   next}
FNR==1 && FNR!=NR{
   for(j=1;j<=NF;j++){
     if($j in a){ b[++p]=j }}
}
{
   for(o=1;o<=p;o++){
     printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2  File1

Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:

BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
1
KamilCuk On

Solution using bash>4:

IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'


# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
        for j in $(seq 0 $((${#headers1[@]}-1))); do
                if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
                        idx+=($j)
                        break
                fi
        done
done
# idx=(0 1 3) for example

# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }

# first line - output headers
join_by '|' "${headers2[@]}"

isfirst=true
while IFS='|' read -a vals; do
        # ignore first (header line)
        if $isfirst; then
                isfirst=false
                continue;
        fi;
        # filter from line only columns with idx indices
        tmp=()
        for i in ${idx[@]}; do 
             tmp+=("${vals[$i]}")
        done
        # join ouptut with '|'
        join_by '|' "${tmp[@]}"
done < $file1
11
Kusalananda On

This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.

It however does the output differently. Instead of constructing a string, it shifts the columns to the left if it does not want to include a particular field.

BEGIN       { FS = OFS = "|" }

# read headers from first file
NR == FNR   { for (i = 1; i <= NF; ++i) header[$i]; next }

# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1    {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}

{
    skip = 0
    pos  = 1
    for (i = 1; i <= NF; ++i)
        if (!select[i]) {          # we don't want this field
            ++skip
            $pos = $(pos + skip)   # shift fields left
        } else
            ++pos

    NF -= skip  # adjust number of fields
    print
}

Running this:

$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio
0
James Brown On

This one respects the order of columns in file1, changed the order:

$ cat file1
id|country|name

The awk:

$ awk '
BEGIN { FS=OFS="|" }
NR==1 {                                             # file1
    n=split($0,a)
    next
}
NR==2 {                                             # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
} 
{                                                   # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd

(Another version using cut for outputing in revisions)

0
karakfa On

with (lot's of) unix pipes as Doug McIlroy intended...

$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file

id|name|country
1|abc|xyz
2|asd|uio