Compute the symmetric matrix from a large file using awk


I have an OD matrix (origin-destination matrix) written in list form in inputfile.csv:

"origin_id","destination_id","trips"
"0","0","20"
"0","1","12"
"0","2","8"
"1","0","23"
"1","1","50"
"1","2","6"
"2","1","9"
"2","2","33"

This reads as:

  • There were 20 trips from origin_id=0 to destination_id=0,
  • There were 12 trips from origin_id=0 to destination_id=1,
  • And so on.

Please note that origin-destination pairs with 0 trips are not present in the input file (the zero elements of the matrix are omitted).

I need to compute the symmetric matrix as S=(OD+DO)/2, but the main problem is that inputfile.csv is 31 GB. I thought a tool like awk could be a good fit, but I don't know how to proceed: I have 32 GB of RAM, which is barely more than the input file size, and I keep running out of memory.
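
Concretely, with the sample above (writing T[i][j] for the trips from i to j; a pair missing from the file counts as 0):

S[0][1] = (T[0][1] + T[1][0]) / 2 = (12 + 23) / 2 = 17.5
S[0][2] = (T[0][2] + T[2][0]) / 2 = (8 + 0) / 2 = 4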

The desired output should ideally contain only the diagonal and one triangle of the matrix (origin_id <= destination_id), to avoid repeating values since the matrix is symmetric. The desired final output for the provided input file is:

"origin_id","destination_id","trips"
"0","0","20"  
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"

As a complementary note, I can compute the symmetric matrix for smaller input files with this script (symm.awk):

BEGIN {
    FS = OFS = ","
}

# pass the header through untouched
NR==1 {
    print; next
}

{
    gsub("\"", "", $3)       # strip the quotes from the trips value
    a[$1 FS $2] = $3         # trips keyed by (origin,destination)
    b[$2 FS $1] = $3         # the same trips keyed by (destination,origin)
}

END {
    for (i in a) {
        if (i in b) {
            # both directions exist: average them
            print i, "\"" (a[i] + b[i]) / 2 "\""
        }
        else {
            # the reverse pair is missing, i.e. 0 trips
            print i, "\"" a[i] / 2 "\""
        }
    }
}

and then keep the diagonal and one triangle of the matrix by piping the result into awk again, like this:

awk -f symm.awk inputfile.csv | awk -F'"' 'NR==1{print;next} $2<=$4' > output.csv

But the output is not sorted, and it feels clumsy to run an awk script and then pipe the result into awk again.

I would appreciate help avoiding this "pipe awk into awk" pattern, and any clues on how to deal with such a large input file.


Answer from markp-fuso:

One awk | sort idea: print the header first with head, then process and sort the body:

head -1 inputfile.csv

awk '
BEGIN { FS=OFS="," }
NR==1 { next }                            # skip the header
      { gsub(/"/,"")
        if   ($1 <= $2) a[$1,$2]+=$3      # accumulate both directions of a
        else            a[$2,$1]+=$3      # pair under the same (min,max) key
      }
END   { for (i in a) {
            split(i,ndx,SUBSEP)
            # halve off-diagonal sums; diagonal sums are kept as-is
            printf "\"%s\",\"%s\",\"%s\"\n", ndx[1], ndx[2],
                    a[ndx[1],ndx[2]] / (ndx[1]==ndx[2] ? 1 : 2 )
        }
      }
' inputfile.csv | sort -t, -k1,1V -k2,2V

This generates:

"origin_id","destination_id","trips"
"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"
Answer from dawg:

This pipe should work with a file of any size: awk holds at most two lines in memory at any time, and sort does an external merge sort, spilling to temporary files on disk rather than loading everything into RAM:

awk -F, 'FNR>1{
      gsub(/"/,"")
      # decorate: prepend a canonical "min:max" key so both directions of a
      # pair sort next to each other; the ":" separator avoids key collisions
      # (with plain concatenation, e.g. pairs (2,222) and (22,22) would both
      # get the key "2222")
      printf("%s,%s\n", (($1<$2) ? $1 ":" $2 : $2 ":" $1), $0)}' file.csv |
sort -t, -k1,1 |
awk 'BEGIN{FS=OFS=","}

function pp(){
    if (cnt==2) {
        # both directions are present: average the two trip counts
        split(lines[1],a,",")
        split(lines[2],b,",")
        lines[1]=a[1] OFS a[2] OFS (a[3]+b[3])/2
    }
    else if (cnt==1) {
        # only one direction is present: the missing one counts as 0 trips,
        # so off-diagonal values are halved; diagonal values stay as they are
        split(lines[1],a,",")
        if (a[1]!=a[2]) lines[1]=a[1] OFS a[2] OFS a[3]/2
    }
    gsub(/,/,"\",\"", lines[1])
    print "\"" lines[1] "\""
    split("",lines)
    cnt=0
}
NR==1{lines[++cnt]=$2 OFS $3 OFS $4; prev=$1; next}
$1!=prev{ pp() }    # the key changed: flush the previous group
{
    lines[++cnt]=$2 OFS $3 OFS $4
    prev=$1
}
END{pp()}'

With the example input this prints:

"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"

You can add the header with head -n 1 file.csv as a separate step.

It works with a Decorate / Sort / Undecorate approach that groups the (origin,destination) and (destination,origin) lines of each pair together.

Here is the Decorate / Sort step:

awk -F, 'FNR>1{
            gsub(/"/,"")
            printf("%s,%s\n", (($1<$2) ? $1 ":" $2 : $2 ":" $1), $0)}' file.csv |
sort -t, -k1,1

Prints:

0:0,0,0,20
0:1,0,1,12
0:1,1,0,23
0:2,0,2,8
1:1,1,1,50
1:2,1,2,6
1:2,2,1,9
2:2,2,2,33

The final awk then detects each group as a run of one or two lines sharing the same value in $1, and averages or halves accordingly.
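
As a side note for the 31 GB input: assuming GNU sort, you can tune how much memory the sort step uses and where its temporary spill files go (the values below are illustrative, not requirements):

sort -t, -k1,1 -S 4G -T /path/to/big/tmpdir --parallel=4

Here -S sets the in-memory buffer size, -T points the temporary files at a filesystem with enough free space, and --parallel sets the number of sorting threads.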