Find mismatch on 2nd column between 2 text files

209 Views Asked by At

I have these 2 text files and I would like to find any mismatch on 2nd column between files. The mismatch to be identified is based on type of F ,P and N regardless which lines they occur. I have 1F, 3P in first file while 2P,1N and 1F in second file. When do comparison, both files should have equal occurrence of type 1F, 3P and 1N.

Text1:

f0x11 F
f0x34 P
drx99 
dex67 P
edx43 P
sdx33 

Text2:

1 P
2 N
4 
5 F
6 
7 P

Expected Output:

Text 1 has missing type of N
Text 2 has missing type of P 

What I have tried so far does not produce desired output.

code:

use strict;
my %ref_data;
my %ref_data2;
open my $fh, '<', 'Text1' or die "Could not open file to read:$!";
while (<$fh>) {
  chomp;
    my ($res, $type) = split;
    if (defined $type){
             $ref_data{$type} = "$type"; 
            }           
 }
our ($data,$data2);
open $fh, '<', 'Text2' or die "Could not open file to read:$!";
while (<$fh>) {
  chomp;
 my ($res, $type) = split;
    if (defined $type){
                $ref_data2{$type}= "$type";
                $data2= $ref_data2{$type};
                $data = $ref_data{$type};
                print "File 2 has missing type of $type\n" unless $data;
         }
  }
foreach ($data){
print "File 1 has missing type of $_\n" if $data ne $data2;
}
2

There are 2 best solutions below

2
On BEST ANSWER

I've refactored your code where you seem to be duplicating the same behavior.

The output isn't to spec, but should be clear enough for you to understand and finish up yourself.

I added a close $fh; and use warnings; as well

#!/usr/bin/perl

use strict;
use warnings;

#run
my %max; # hash of combined data
my $file_data_1 = parse_file_into_hash("text1", \%max);
my $file_data_2 = parse_file_into_hash("text2", \%max);
diff_hashes(\%max, $file_data_1, $file_data_2);

# diff_hashes($max, $h1, $h2)
#
# diffs 2 hash refs against a combined $max hash and prints results
sub diff_hashes {
    my ($max, $h1, $h2) = @_;

    # TODO - do all the comparisios and some error checking (if keys exist etc...) here
    for my $key (keys %$max) {
        print "max/combined: $key = $max->{$key}\n";

        my $h1_print = exists $h1->{$key} ? $h1->{$key} : "0";
        my $h2_print = exists $h2->{$key} ? $h2->{$key} : "0";

        print "h1: $key = $h1_print\n";
        print "h2: $key = $h2_print\n";
    }
}

# parse_file_into_hash($file, $max)
#
# $max is a hash reference (passed by reference) so you can count occurences over 
# multiple files...
# returns reference of hash ( $line_number => $data_value )
sub parse_file_into_hash {
    my ($file, $max) = @_;
    my %ref_data;

    open my $fh, '<', $file or die "Could not open file to read:$!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($res, $type) = split /\s+/, $line;

        if ($type) {
            $ref_data{$type}++;

            if (!exists $max->{$type} || $ref_data{$type} > $max->{$type}) {
                $max->{$type} = $ref_data{$type};
            }
        }
    }
    close $fh;

    return \%ref_data;
}

Output ran against your example files:

$ ./example.pl
max/combined: F = 1
h1: F = 1
h2: F = 1
max/combined: N = 1
h1: N = 0
h2: N = 1
max/combined: P = 3
h1: P = 3
h2: P = 2
0
On

You appear to want to keep track of how many times the values in Column 2 occur within each file -- for example, in a comment you write, "I have 1F, 3P in first file while 2P,1N and 1Fin second file". If that's the case, you need a better data structure.

Specifically, one that counts occurrences of the values in Column 2, and you need those counts to be tracked separately for each file. That suggests a hash-of-hashes.

use strict;
use warnings;

# Example usage:
# perl YOUR_SCRIPT.pl a.txt b.txt
my @files = @ARGV;   

# Count the values in Column 2, organizing the tallies like this:
# $tallies{COL_2}{FILE_NAME} = N
my %tallies;
while (<>) {
    my @cols = split;
    $tallies{$cols[1]}{$ARGV} ++ if @cols > 1;
}

# Print discrepancies.
for my $c2 (keys %tallies) {
    my @t = map { $tallies{$c2}{$_} || 0 } @files;
    next if $t[0] == $t[1];
    print "$c2: $files[0] has $t[0]; $files[1] has $t[1]\n";
}

Example output:

N: a.txt has 0; b.txt has 1
P: a.txt has 3; b.txt has 2

Also worth noting: this code does not have to open any files explicitly, and file names are not hard-coded in the program. Instead, we pass input file names as command-line arguments, get those arguments via @ARGV, process lines in those files via <>, and know which file we're currently processing via $ARGV.