Using grep with a pattern file: print single and duplicate entries


Let me start off by saying I don't want to print only the duplicate lines nor do I want to remove them.

I am trying to use grep with a pattern file to parse a large data file.

The Pattern file for example may look like this:

1243
1234
1234
1234
1354
1356
1356
1677

etc. with more single and duplicate entries.

The Input data file might look like this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ttttt   1555    bbbbbb
ppppp   1354    pppppp
yyyyy   3333    zzzzzz
qqqqq   1677    eeeeee
iiiii   4444    iiiiii

etc. for 27000 lines.

When I use

grep -f 'Patternfile.txt' 'Inputfile.txt' > 'Outputfile.txt'

I get an output file that resembles this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ppppp   1354    pppppp

How can I get it to also report the duplicates, so that I end up with something like this?

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
ppppp   1354    pppppp


qqqqq   1677    eeeeee

Additionally, I would like a blank line to be printed whenever an entry in the pattern file does not match anything in the input file.

Thank you!

There are 2 answers below.

Best answer

One solution, not with grep, but with perl:

Assume patternfile.txt and inputfile.txt contain the data from your original post. The following script.pl should do the job. I assume that the string to match is always the second column; otherwise the script would have to be modified to use a regexp instead, but the column lookup is faster:

use warnings;
use strict;

## Check arguments.
die qq[Usage: perl $0 <pattern-file> <input-file>\n] unless @ARGV == 2;

## Open input files.
open my $pattern_fh, qq[<], shift @ARGV or die qq[Cannot open pattern file\n];
open my $input_fh, qq[<], shift @ARGV or die qq[Cannot open input file\n];

## Hash to save patterns.
my (%pattern, %input);

## Read each pattern; save the line number of its first appearance and how
## many times it appears in the file.
while ( <$pattern_fh> ) { 
    chomp;
    if ( exists $pattern{ $_ } ) { 
        $pattern{ $_ }->[1]++;
    }   
    else {
        $pattern{ $_ } = [ $., 1 ];
    }   
}

## Read file with data and save them in another hash.
while ( <$input_fh> ) { 
    chomp;
    my @f = split;
    $input{ $f[1] } = $_; 
}

## For each pattern, in order of first appearance, look it up in the data.
## If found, print the line as many times as the pattern occurred; otherwise
## print that many blank lines.
for my $p ( sort { $pattern{ $a }->[0] <=> $pattern{ $b }->[0] } keys %pattern ) { 
    if ( $input{ $p } ) { 
        printf qq[%s\n], $input{ $p } for ( 1 .. $pattern{ $p }->[1] );
    }   
    else {
         # Old behaviour.
         # printf qq[\n];

         # New requirement.
         printf qq[\n] for ( 1 .. $pattern{ $p }->[1] );
    }   
}

Run it like:

perl script.pl patternfile.txt inputfile.txt

It gives the following output:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
ppppp   1354    pppppp


qqqqq   1677    eeeeee
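
For comparison, the same lookup idea (store each data line under its second column, then walk the pattern file in order) can be sketched as an awk one-liner in the shell. This is only a sketch, not part of the script above: it assumes the key is always the second whitespace-separated column, that each key appears at most once in the data, and it uses a throwaway directory with the sample data from the question.

```shell
# Throwaway copies of the question's sample files (paths are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' 1243 1234 1234 1234 1354 1356 1356 1677 > "$tmpdir/Patternfile.txt"
cat > "$tmpdir/Inputfile.txt" <<'EOF'
aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ttttt   1555    bbbbbb
ppppp   1354    pppppp
yyyyy   3333    zzzzzz
qqqqq   1677    eeeeee
iiiii   4444    iiiiii
EOF

# First file (data): remember each line under its second column.
# Second file (patterns): print the stored line, or a blank line if none.
result=$(awk '
    NR == FNR { line[$2] = $0; next }
    { print (($1 in line) ? line[$1] : "") }
' "$tmpdir/Inputfile.txt" "$tmpdir/Patternfile.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the pattern file is read line by line, duplicates and the original order are preserved without any sorting.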

Second answer

You aren't so much grepping for the patterns as you are left-joining the data in the input file to the data in the pattern file.
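
The reason is that grep -f treats the lines of the pattern file as alternatives and prints each matching input line once, no matter how often a pattern is repeated. A minimal sketch with made-up two-line data (file names and contents are only for illustration):

```shell
# Throwaway files; names and contents are made up for illustration.
tmpdir=$(mktemp -d)
printf '1234\n1234\n1234\n9999\n' > "$tmpdir/patterns.txt"
printf 'yyyyy   1234    vvvvvv\nttttt   1555    bbbbbb\n' > "$tmpdir/input.txt"

# grep walks the input once and prints a line if any pattern matches it.
matches=$(grep -f "$tmpdir/patterns.txt" "$tmpdir/input.txt")
printf '%s\n' "$matches"

# Count the output lines: repeating 1234 three times does not repeat the line,
# and the unmatched pattern 9999 leaves no trace in the output.
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "lines printed: $match_count"

rm -rf "$tmpdir"
```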

You can (mostly) accomplish this with join, a handy Unix utility I've come to know pretty well since I've been trying to solve a problem similar to yours.

There are a couple small differences, though.

First the command:

join -a 1 -2 2 <(sort Patternfile.txt) <(sort -k2,2 Inputfile.txt)

And explanation:

  • -a 1 means to also include unjoinable lines from file 1 (Patternfile.txt). I added this because you wanted to include "blank" lines for unmatchable rows, and this was the closest I could get.
  • -2 2 means to join on field 2 of file 2, because the key you are joining on in Inputfile.txt is in the second column. (You can set the field for either file with -1 FIELD and -2 FIELD; the default is field 1.)
  • <(sort Patternfile.txt): the files must be sorted on the join field for join to work correctly.
  • <(sort -k2,2 Inputfile.txt): sorts the input file on field 2 only, which is its join field.

Output:

1234 yyyyy vvvvvv
1234 yyyyy vvvvvv
1234 yyyyy vvvvvv
1243 aatta qqqqqq
1354 ppppp pppppp
1356
1356
1677 qqqqq eeeeee

Differences

Slight differences between your specified output and this result:

  • It's sorted by the key order.
  • Unjoinable rows still contain their original key. If that's a problem, you can clear the unmatched rows by piping through a simple awk:

    ... | awk '{ if ($2 != "") print; else print ""  }'
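
The key-order difference can also be worked around: number the pattern lines first, carry that number through the join, sort the result back by it, and then drop it. The following is a sketch only. It assumes exactly three whitespace-separated columns in the input file, as in the question, uses temporary files instead of process substitution so that it runs in plain sh, and prints the columns separated by single spaces. The temporary file names are hypothetical.

```shell
export LC_ALL=C   # keep sort and join in agreement on ordering
tmpdir=$(mktemp -d)
printf '%s\n' 1243 1234 1234 1234 1354 1356 1356 1677 > "$tmpdir/Patternfile.txt"
cat > "$tmpdir/Inputfile.txt" <<'EOF'
aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ttttt   1555    bbbbbb
ppppp   1354    pppppp
yyyyy   3333    zzzzzz
qqqqq   1677    eeeeee
iiiii   4444    iiiiii
EOF

# Prefix each pattern with its line number ("N key"), then sort on the key.
awk '{ print NR, $1 }' "$tmpdir/Patternfile.txt" | sort -b -k2,2 > "$tmpdir/pat.sorted"
sort -b -k2,2 "$tmpdir/Inputfile.txt" > "$tmpdir/in.sorted"

# join prints: key, line number, then the other input columns (if matched).
# Sorting numerically on the line number restores the pattern file's order;
# awk then rebuilds the original column order or emits a blank line.
ordered=$(join -a 1 -1 2 -2 2 "$tmpdir/pat.sorted" "$tmpdir/in.sorted" |
    sort -k2,2n |
    awk '{ if (NF > 2) print $3, $1, $4; else print "" }')

printf '%s\n' "$ordered"
rm -rf "$tmpdir"
```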