Faster way of Awking a file from End to Beginning?


I want to get results starting at the bottom of a file and working my way up to the beginning. I tried using tac and piping it into my awk command, but it's very slow (15 seconds for a 2 GB file), compared to searching forward normally (3 seconds for the same file). I'm also piping the awk output into tail -n +1 | head -n 50 to stop after 50 results.

Is there a faster way to tac a file, or at least to start searching from the bottom up?

The big picture is to create a Python script that takes arguments (start date, end date, search terms) and uses them to search through a date-organized log file, returning 50 results at a time.

I need to read from end to beginning in case a user wants to search in Descending order (newest date to oldest date).

An example command for ascending results (oldest to newest); I'm using find because the filename is a user-specified argument and can potentially reference all files (*.txt):

  • Start Date: 2018-03-04T03:45
  • End Date: 2018-03-05T16:24
  • Search Term: Potato

find '/home/logs/' -type f -name 'log_file.txt' -exec cat {} \+ 2>&1 | LANG=C fgrep 'Potato' | LC_ALL=C IGNORECASE=1 awk -v start="2018-03-04T03:45:00" -v stop="2018-03-05T16:24:59" 'BEGIN{IGNORECASE=1;} {line=$0; xz=" "; for(i=4;i<=NF;i++){xz=xz" "$i};} ($1>=start&&$1<=stop) && (tolower(xz) ~ /Potato/) {print line}' | tail -n +1 | head -n 50

The tail -n +1 | head -n 50 is to return the first 50 matches.

This command takes about 3-4 seconds to find results, but if I substitute in tac, it takes closer to 20 seconds.
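For context, a minimal Python sketch of what that ascending search boils down to (the path, dates, and term are just the example values above; it assumes ISO-8601 timestamps in the first field, which compare correctly as plain strings, and it matches the term anywhere in the line rather than only in fields 4..NF):

    start = "2018-03-04T03:45:00"
    stop = "2018-03-05T16:24:59"
    term = "potato"
    matches = []

    # Hypothetical single log file; the real script would expand a user-supplied glob.
    with open("/home/logs/log_file.txt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            # ISO timestamps sort lexically, so plain string comparison works here
            if start <= fields[0] <= stop and term in line.lower():
                matches.append(line.rstrip("\n"))
                if len(matches) == 50:   # same effect as | head -n 50
                    break

    print("\n".join(matches))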


3 Answers

BEST ANSWER

It is much faster to open the file and seek to some offset before the end of the file. Perl is handy here:

perl -Mautodie -se '
    $size = -s $file;
    $blocksize = 64000;
    open $fh, "<", $file;
    seek $fh, $size - $blocksize, 0;   # jump to one block before the end of the file
    read $fh, $data, $blocksize;
    @lines = split "\n", $data;
    # last 50 lines, newest first
    print join "\n", reverse @lines[-50..-1];
' -- -file="filename"

We can throw a loop in there so that after it reads the last block, it seeks to the end minus two blocks, reads a block, and so on.
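A minimal Python sketch of that loop, in case it helps with the script mentioned in the question (the lines_reversed helper and the path are hypothetical): it reads fixed-size blocks from the end of the file toward the beginning and yields whole lines, newest first.

    import os

    def lines_reversed(path, blocksize=64000):
        with open(path, "rb") as fh:
            fh.seek(0, os.SEEK_END)
            pos = fh.tell()
            if pos > 0:
                fh.seek(pos - 1)
                if fh.read(1) == b"\n":    # ignore the file's trailing newline
                    pos -= 1
            tail = b""                      # partial line carried into the next (earlier) block
            while pos > 0:
                read_size = min(blocksize, pos)
                pos -= read_size
                fh.seek(pos)
                block = fh.read(read_size) + tail
                lines = block.split(b"\n")
                tail = lines.pop(0)         # first piece may be incomplete; keep it for later
                for line in reversed(lines):
                    yield line.decode(errors="replace")
            if tail:
                yield tail.decode(errors="replace")

    # Usage: print the last 50 lines of the file, newest first.
    for _, line in zip(range(50), lines_reversed("/home/logs/log_file.txt")):
        print(line)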

But if you want to process the entire gigantic file from bottom to top, you'll have to expect it to take time.

ANSWER

Well, if you have the memory, hash the records and process them backwards in the END section:

$ for i in {a..e} ; do echo $i ; done |   
  awk '{ a[NR]=$0 }       # hash to a, NR as key
  END {                   # in the end
      for(i=NR;i>=1;i--)  # process a in descending order
          c++             # process
      print c
}'
5

Update: I tested the above with a 1 GB file (36 M records). It got hashed and counted in about a minute and used roughly 4.5 GB of memory.
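For comparison, a rough Python counterpart of the same buffer-everything idea (hypothetical path; memory use grows with file size, much like the awk test above):

    # Read every record into memory, then walk the list backwards.
    with open("/home/logs/log_file.txt", encoding="utf-8", errors="replace") as fh:
        records = fh.readlines()

    count = 0
    for line in reversed(records):
        count += 1          # stand-in for real per-record processing
    print(count)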

ANSWER

Everything depends a bit on the awk code you have, but some solutions that come to mind are:

  • if you print every line:

    tac <file> | awk '(NR > 50){exit}{do-your-stuff}'
    
  • if you print only a few lines with awk:

    tac <file> | awk '(c > 50){exit} 
                      { do-part-of stuff;
                        print foobar; c++;
                        do-remaining part}'
    

Both solutions terminate awk after the first 50 printed lines. This way you do not have to process the full 2 GB file. The termination after 50 printed lines mimics the tail -n +1 | head -n 50.
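The same early-exit idea sketched from Python, in case it helps with the script mentioned in the question (hypothetical path and search term):

    import subprocess

    # Stream tac's output and stop consuming it after 50 matches.
    proc = subprocess.Popen(["tac", "/home/logs/log_file.txt"],
                            stdout=subprocess.PIPE, text=True)
    matches = []
    for line in proc.stdout:
        if "potato" in line.lower():
            matches.append(line.rstrip("\n"))
            if len(matches) == 50:   # stop early, like exiting awk after 50 prints
                break
    proc.terminate()                 # done early; stop tac instead of draining the rest
    proc.wait()

    print("\n".join(matches))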