compare rows in two files in unix shell script and merge without redundant data


There is an old report file residing on a drive. Every time a new report is generated, it should be compared against the contents of this old file. If the new report contains an account row that is not already in the old file, that row should be appended to the old file; otherwise it is skipped. Both files have the same title and headers. E.g. old report:

RUN DATE:xyz                FEE ASSESSMENT REPORT

fee calculator

ACCOUNT NUMBER      DELVRY DT     TOTAL FEES     
=======================================================

123456      2014-06-27      110.0   

The new report might be

RUN DATE:xyz                FEE ASSESSMENT REPORT

fee calculator

ACCOUNT NUMBER      DELVRY DT     TOTAL FEES     
=======================================================

898989      2014-06-26      11.0 

So after the merge the old report should contain both rows under the header: the account number rows for 123456 and 898989.

I am new to shell scripting. I don't know whether I should use the diff command, a while read LINE loop, or awk?

Thanks!


This is best done with several commands combined into an actual script, rather than with a clever one-liner of command-line fu.

Assuming the number of lines in the header section of the report is consistent, you can use tail -n +7 to print everything from line 7 onward, skipping the six header lines shown in your example.
If the headers are not the same length, but always end with the "==========" separator line you've shown above, you can use grep -n to find that line's number and start parsing the account rows after it.
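A minimal sketch of the grep -n approach (the file name and sample contents below are made up for illustration):

```shell
# Build a tiny sample report (hypothetical contents).
report="sample_report.txt"
printf 'RUN DATE:xyz\n\nfee calculator\n\nACCOUNT NUMBER\n======\n123456 2014-06-27 110.0\n' > "$report"

# grep -n prints "LINENO:match"; cut keeps the line number of the ====== separator.
sep_line=$(grep -n '^==*$' "$report" | head -n 1 | cut -d: -f1)

# Account rows start on the line after the separator.
tail -n +"$((sep_line + 1))" "$report"
```

This finds the separator wherever it is, so the header can grow or shrink without breaking the script.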

#!/usr/bin/env bash
OLD_FILE="ancient_report.log"
NEW_FILE="latest_and_greatest.log"
tmp_ext=".tmp"
tail -n +7 "${OLD_FILE}" > "${OLD_FILE}${tmp_ext}"
tail -n +7 "${NEW_FILE}" >> "${OLD_FILE}${tmp_ext}"
sort -u "${OLD_FILE}${tmp_ext}" > "${OLD_FILE}${tmp_ext}.unique"
mv -f "${OLD_FILE}${tmp_ext}.unique" "${OLD_FILE}"

To illustrate this script:

#!/usr/bin/env bash

The shebang line above tells the OS which interpreter to run the script with.

OLD_FILE="ancient_report.log"
NEW_FILE="latest_and_greatest.log"
tmp_ext=".tmp"

Declare starting variables. You can also take the file names as arguments: OLD_FILE=${1} picks up the first argument on the command line.
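For example, a sketch of the argument form (the set -- line only simulates how the script might be invoked; the file names are hypothetical):

```shell
set -- my_old.log my_new.log             # stand-in for real command-line arguments
OLD_FILE="${1:-ancient_report.log}"      # first argument, with a fallback default
NEW_FILE="${2:-latest_and_greatest.log}" # second argument, same idea
echo "old=$OLD_FILE new=$NEW_FILE"
```

The ${1:-default} form falls back to the default when no argument is given.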

tail -n +7 "${OLD_FILE}" > "${OLD_FILE}${tmp_ext}"
tail -n +7 "${NEW_FILE}" >> "${OLD_FILE}${tmp_ext}"

Put the tails of the two files (everything after the header) into a single 'tmp' file.

sort -u "${OLD_FILE}${tmp_ext}" > "${OLD_FILE}${tmp_ext}.unique"

Sort and retain only the unique entries with -u. If your OS's version of sort does not have -u, you can get the same result with: sort <filename> | uniq
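Note that sort -u also reorders the lines. If the original row order should be preserved, a common awk idiom drops duplicates without sorting (sample data below is made up):

```shell
# Two distinct rows plus one duplicate, deliberately out of sorted order.
printf '898989 row\n123456 row\n898989 row\n' > dedupe_demo.tmp

# Print each line only the first time it is seen, keeping input order.
awk '!seen[$0]++' dedupe_demo.tmp
```

The expression increments a counter per line and prints only when the counter was still zero.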

mv -f "${OLD_FILE}${tmp_ext}.unique" "${OLD_FILE}"

Replace the old file with the new uniqued file.

There are of course simpler ways to do this, but this sequence of commands gets the job done step by step.
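One such shorter route, sketched below with made-up file names: grep can append only the rows of the new report that are not already present in the old one, and because the header lines are identical in both files they are skipped along with any duplicate account rows.

```shell
old="old_report.txt"; new="new_report.txt"
printf 'HEADER\n123456 2014-06-27 110.0\n' > "$old"
printf 'HEADER\n898989 2014-06-26 11.0\n' > "$new"

# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file:
# append the lines of $new that do not appear verbatim in $old.
grep -Fxvf "$old" "$new" >> "$old"
cat "$old"
```

This keeps the old file's header and original order, with no temp files needed.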

Edit:
To preserve the header portion of the file with the latest report date, instead of mv-ing the new tmp file over the old one, do:

head -n 6 "${NEW_FILE}" > "${OLD_FILE}"
cat "${OLD_FILE}${tmp_ext}.unique" >> "${OLD_FILE}"

This overwrites OLD_FILE with the six-line header of the new file (for the date; head -n 6 is the complement of tail -n +7) and then appends the entire contents of the unique tmp file. The > redirection truncates the old file on its own, so a separate rm beforehand is not needed. After this you can do general file cleanup such as removing any new files you've created. To preserve/debug any changes, you can add a datestamp to each 'uniqued' file name and keep them as an audit trail of all report additions.
This removes the OLD_FILE (can't overwrite without deleting first) and cats together the header of the new file (for date) and the entire contents of the unique tmp file. After this you can do general file cleanup such as removing any new files you've created. To preserve/debug any changes, you can add a datestamp to each 'uniqued' file name and keep them as an audit trail of all report additions.