How to find the common columns and their records from two files using awk

Question

439 Views Asked by Saman At 16 May 2018 at 12:47

I have two files:

File 1:

id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio

File 2 (only headers):

id|name|country

Now, I want an output like:

OUTPUT:

id|name|country
1|abc|xyz
2|asd|uio

Basically, I have a user record file (file1) and a header file (file2). I want to extract from file1 only those columns whose names also appear in the header file, together with their records.

I want to do this using awk or bash.

I tried using:

awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt

and have no idea what to do next.

Thank You

KamilCuk On 16 May 2018 at 13:12

Solution using bash>4:

# read the first line of each file into an array, splitting on '|'
IFS='|' read -r -a headers1 < "$file1"
IFS='|' read -r -a headers2 < "$file2"
IFS=$'\n'


# find the indices we want to output, i.e. the position in headers1 of each header in headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
        for j in $(seq 0 $((${#headers1[@]}-1))); do
                if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
                        idx+=($j)
                        break
                fi
        done
done
# idx=(0 1 3) for example

# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }

# first line - output headers
join_by '|' "${headers2[@]}"

isfirst=true
while IFS='|' read -r -a vals; do
        # ignore the first (header) line
        if $isfirst; then
                isfirst=false
                continue
        fi
        # keep only the columns whose indices are in idx
        tmp=()
        for i in "${idx[@]}"; do
             tmp+=("${vals[$i]}")
        done
        # join output with '|'
        join_by '|' "${tmp[@]}"
done < "$file1"
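
The nested loops that build idx can also be written as a lookup table. A minimal sketch of that alternative (not the author's code), assuming bash 4 or later for associative arrays and that no header name is empty; the array name pos is made up for illustration and the header names come from the question's sample:

```shell
#!/usr/bin/env bash
# Assumption: bash >= 4 (associative arrays). Sample headers from the question.
headers1=(id name address country)   # columns of the data file
headers2=(id name country)           # columns to keep, in output order

declare -A pos=()                    # header name -> index in headers1
for j in "${!headers1[@]}"; do
    pos[${headers1[j]}]=$j
done

idx=()
for name in "${headers2[@]}"; do
    # skip names that the data file does not have
    if [[ -n ${pos[$name]+set} ]]; then
        idx+=("${pos[$name]}")
    fi
done

printf '%s\n' "${idx[*]}"
```

With the sample headers this gives the same idx=(0 1 3) as the loop version; the rest of the script can stay unchanged.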

Kusalananda On 16 May 2018 at 13:43

This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.

However, it produces the output differently. Instead of building an output string, it shifts fields to the left over any field that should not be included, and then reduces NF to drop what is left over on the right.

BEGIN       { FS = OFS = "|" }

# read headers from first file
NR == FNR   { for (i = 1; i <= NF; ++i) header[$i]; next }

# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1    {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}

{
    pos = 0
    for (i = 1; i <= NF; ++i)
        if (select[i])
            $(++pos) = $i   # move each selected field left; pos <= i, so nothing unread is overwritten

    NF = pos    # drop the leftover fields on the right
    print
}

Running this:

$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

James Brown On 16 May 2018 at 14:13

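This one respects the order of the columns in file1 (here file1 is the header-only file and file2 holds the records). To show that, the column order in file1 has been changed: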

$ cat file1
id|country|name

The awk:

$ awk '
BEGIN { FS=OFS="|" }
NR==1 {                                             # file1
    n=split($0,a)
    next
}
NR==2 {                                             # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
} 
{                                                   # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd

(Another version, which uses cut for the output, is in the revision history.)
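
One thing to watch in the script above: if file1 names a column that file2 does not have, b[a[i]] is empty, so $b[a[i]] evaluates to $0 and the whole record is printed in that position. A minimal sketch of a variant that skips such names instead; the array names want, col and out, and the extra phone column, are made up for illustration:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
# file1: wanted columns in wanted order; "phone" does not exist in file2
printf '%s\n' 'id|country|phone|name' > "$workdir/file1"
printf '%s\n' 'id|name|address|country' '1|abc|efg|xyz' '2|asd|dfg|uio' > "$workdir/file2"

reordered=$(awk '
BEGIN   { FS = OFS = "|" }
NR == 1 { n = split($0, want); next }        # header-only file
FNR == 1 {                                   # header line of the data file
    for (i = 1; i <= NF; i++) col[$i] = i
    for (i = 1; i <= n; i++)
        if (want[i] in col) out[++m] = col[want[i]]
}
{
    for (i = 1; i <= m; i++)
        printf "%s%s", $(out[i]), (i == m ? ORS : OFS)
}' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$reordered"
rm -rf "$workdir"
```

Like the original, this assumes the header-only file has exactly one line.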

karakfa On 16 May 2018 at 15:15

With (lots of) Unix pipes, as Doug McIlroy intended...

$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file

id|name|country
1|abc|xyz
2|asd|uio
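
To see how the one-liner arrives at the field list, the intermediate result can be printed on its own. A sketch assuming bash (for process substitution) and the question's sample data; the files header and file are created in a temporary directory here, and paste is given an explicit - operand because some implementations require it:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf '%s\n' 'id|name|country' > "$workdir/header"
printf '%s\n' 'id|name|address|country' '1|abc|efg|xyz' '2|asd|dfg|uio' > "$workdir/file"

# p: print the header names of a file one per line, numbered, sorted by name for join
p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }

# join on the name, restore the header file's order, keep the position in the data file
fields=$(join -j2 <(p "$workdir/header") <(p "$workdir/file") \
         | sort -k2n | cut -d' ' -f3 | paste -sd, -)
printf 'fields: %s\n' "$fields"

selected=$(cut -d'|' -f"$fields" "$workdir/file")
printf '%s\n' "$selected"
rm -rf "$workdir"
```

Note that cut always emits fields in the order they appear in the data file, so the sort -k2n step affects only the order of the list, not the output.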

RavinderSingh13 On 16 May 2018 at 12:52 (accepted answer)

The following awk may help you with this.

awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2  File1

Here is the same solution in a multi-line form:

awk -F"|" '
FNR==NR{
   for(i=1;i<=NF;i++){
     a[$i]};
   next}
FNR==1 && FNR!=NR{
   for(j=1;j<=NF;j++){
     if($j in a){ b[++p]=j }}
}
{
   for(o=1;o<=p;o++){
     printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2  File1

Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:

BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
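
The script above is shown without an invocation. A minimal sketch of running it on the question's sample data; the names common.awk, header.txt and file.txt are hypothetical, and the awk program is a condensed copy of the script above. The header-only file has to come first on the command line so that the NR==FNR block sees it:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf '%s\n' 'id|name|address|country' '1|abc|efg|xyz' '2|asd|dfg|uio' > "$workdir/file.txt"
printf '%s\n' 'id|name|country' > "$workdir/header.txt"

# condensed copy of the script, saved under a hypothetical name
cat > "$workdir/common.awk" <<'EOF'
BEGIN   { FS=OFS="|" }
NR==FNR { for (i=1; i<=NF; i++) names[$i]; next }
FNR==1  { for (i=1; i<=NF; i++) if ($i in names) f[++numFlds] = i }
{ for (i=1; i<=numFlds; i++) printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS) }
EOF

# header file first, data file second
result=$(awk -f "$workdir/common.awk" "$workdir/header.txt" "$workdir/file.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Columns come out in the order they have in the data file, since f is filled while scanning the data file's header.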
