I have 2 csv files of multiple fields(approx 30 fields), and huge size ( approx 4GB ).
File1:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
Karthick,10,10.245,"140,North Street,India"
File2:
EmployeeName,Age,Salary,Address
Vinoth,12,2548.245,"140,North Street,USA"
Karthick,10,10.245,"140,North Street,India"
Vivek,40,2548.245,"140,North Street,India"
I want to compare these 2 files and report the differences into another csv file. In the above example, Employee Vivek and Karthick details are present in different row numbers but still the record data is same, so it should be considered as match. Employee Vinoth record should be considered as a mismatch since there is a mismatch in the address.
Output diff.csv file can contain the mismatched record from the File1 and File 2 as below.
Diff.csv
EmployeeName,Age,Salary,Address
F1, Vinoth,12,2548.245,"140,North Street,India"
F2, Vinoth,12,2548.245,"140,North Street,USA"
I've written the code so far as below. After this I'm confused which option to choose whether a Binary Search or any other efficient way to do this. Could you please help me?
My approach
1. Load the File2 in memory as hashes of hashes.
2.Read line by line from File1 and match it with the hash of hashes in memory.
use strict;
use warnings;
use Text::CSV_XS;
use Getopt::Long;
use Data::Dumper;
use Text::CSV::Hashify;
use List::BinarySearch qw( :all );
# Get Command Line Parameters
my %opts = ();
GetOptions( \%opts, "file1=s", "file2=s", )
or die("Error in command line arguments\n");
if ( !defined $opts{'file1'} ) {
die "CSV file --file1 not specified.\n";
}
if ( !defined $opts{'file2'} ) {
die "CSV file --file2 not specified.\n";
}
my $file1 = $opts{'file1'};
my $file2 = $opts{'file2'};
my $file3 = 'diff.csv';
print $file2 . "\n";
my $csv1 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csv2 =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
my $csvout =
Text::CSV_XS->new(
{ binary => 1, auto_diag => 1, sep_char => ',', eol => $/ } );
open( my $fh1, '<:encoding(utf8)', $file1 )
or die "Cannot not open '$file1' $!.\n";
open( my $fh2, '<:encoding(utf8)', $file2 )
or die "Cannot not open '$file2' $!.\n";
open( my $fh3, '>:encoding(utf8)', $file3 )
or die "Cannot not open '$file3' $!.\n";
binmode( STDOUT, ":utf8" );
my $f1line = undef;
my $f2line = undef;
my $header1 = undef;
my $f1empty = 'false';
my $f2empty = 'false';
my $reccount = 0;
my $hash_ref = hashify( "$file2", 'EmployeeName' );
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
while (1) {
if ( $f1empty eq 'false' ) {
$f1line = $csv1->getline($fh1);
}
if ( !defined $f1line ) {
$f1empty = 'true';
}
if ( $f1empty eq 'true' ) {
last;
}
else {
## Read each line from File1 and match it with the File 2 which is loaded as hashes of hashes in perl. Need help here.
}
}
print "End of Program" . "\n";
Storing data of such magnitude in database is most correct approach to tasks of this kind. At minimum SQLlite is recommended but other databases MariaDB, MySQL, PostgreSQL will work quite well.
Following code demonstrates how desired output can be achieved without special modules, but it does not take in account possibly messed up input data. This script will report data records as different even if difference can be just one extra space.
Default output is into console window unless you specify option output.
NOTE: Whole file #1 is read into memory, please be patient processing big files can take a while.
Output for test files