Regex/Algorithm to find 'n' repeated lines in a file

563 Views Asked by At

I am looking for an advanced version of this.

Basically, if I have a file with text:

abc
ghi
fed
jkl
abc
ghi
fed

I want the output to be:(for n=3)

Duplicated Lines
abc
ghi
fed
Times = 2
2

There are 2 best solutions below

4
On BEST ANSWER

One way is splitting your text based on your n then count the number of your elements that all is depending this counting you can use some data structures that use hash-table like dictionary in python that is much efficient for such tasks.

The task is that you create a dictionary that keeps the keys unique and then loop over the list of splitted text and increase the count of each item every time you see a duplicate.

At last you'll have a dictionary contain the unique items with those count as the values of dictionary.

Some langs like python provides good tools like Counter for count the elements within an iterable and islice for slicing and iterable that returns a generator and is very efficient for long iterables :

>>> from collections import Counter
>>> from itertools import islice

>>> s="""abc
... ghi
... fed
... jkl
... abc
... ghi
... fed"""
>>> sp=s.split()
>>> Counter('\n'.join(islice(sp,i,i+3)) for i in range(len(sp)))
Counter({'abc\nghi\nfed': 2, 'fed': 1, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1})

Or you can do it custom :

>>> a=['\n'.join(sp[i:i+3] for i in range(len(sp))]
>>> a
['abc\nghi\nfed', 'ghi\nfed\njkl', 'fed\njkl\nabc', 'jkl\nabc\nghi', 'abc\nghi\nfed', 'ghi\nfed', 'fed']
>>> d={}
>>> for i in a:
...    if i in d:
...       d[i]+=1
...    else :
...       d[i]=1
... 
>>> d
{'fed': 1, 'abc\nghi\nfed': 2, 'jkl\nabc\nghi': 1, 'ghi\nfed': 1, 'fed\njkl\nabc': 1, 'ghi\nfed\njkl': 1}
>>> 
2
On

So, something like this (in perl):

#!/usr/bin/perl
use strict;
use warnings;

my %seen; 
my @order; 

while ( my $line = <DATA> ) {
   chomp ( $line ); 
   push ( @order, $line ) unless $seen{$line}++; 

}

foreach my $element ( @order ) { 
    print "$element, $seen{$element}\n" if $seen{$element} > 1;
}

__DATA__
abc
ghi
fed
jkl
abc
ghi
fed

This can turn into a shorter snippet by:

perl -e 'while ( <> ) { push ( @order, $_ ) unless $seen{$_}++; } for (@order) {print if $seen{$_} > 1}' myfile