php: find duplicate content in files/nested looping


I have scraped 5000 pages and stored each one in its own file (0.txt-4999.txt). Now I need to find duplicate content among them, so I am comparing each file with every other file in a nested loop, which is roughly 5000 * 4999 / 2, or about 12.5 million comparisons (ETA 82 hours). This approach will clearly take far too long. My main concern is the number of iterations. Can anyone suggest a better approach to cut down the iterations and reduce the time taken?

current code: NCD algorithm

function ncd_new($sx, $sy, $prec=0, $MAXLEN=9000) {
# NCD with gzip artifact correction and percentual return.
# $sx, $sy = strings to compare.
# Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage.
# For the NCD definition see http://arxiv.org/abs/0809.2553
  $x = $min = strlen(gzcompress($sx));
  $y = $max = strlen(gzcompress($sy));
  $xy = strlen(gzcompress($sx.$sy));
  $a = $sx;
  if ($x > $y) { # swap min/max so $a holds the string with the smaller compressed size
    $min = $y;
    $max = $x;
    $a = $sy;
  }
  $res = ($xy - $min) / $max; # NCD definition.
  if ($MAXLEN < 0 || $xy < $MAXLEN) {
    $aa = strlen(gzcompress($a.$a));
    $ref = ($aa - $min) / $min;
    $res = $res - $ref; # gzip artifact correction
  }
  return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}
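
For reference, a quick usage sketch of the function above (the file names are just examples):

$a = file_get_contents("0.txt");
$b = file_get_contents("1.txt");
echo ncd_new($a, $b) . "%\n"; // ~0% for near-duplicates, closer to 100% for unrelated pages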

looping over each file:

$totalScraped = 5000;
$stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');

for($fileC=0;$fileC<$totalScraped;$fileC++)
{
    $f1 = file_get_contents($fileC.".txt");
    $file1 = preg_replace($stripstr, '', $f1);

    // fileC+1 => exclude the file itself and already compared files
    // e.g. if fileC=10, the inner loop runs from 11 to 4999
    for($fileD=$fileC+1;$fileD<$totalScraped;$fileD++)
    {
            $f2 = file_get_contents($fileD.".txt", FILE_USE_INCLUDE_PATH);
            $file2 = preg_replace($stripstr, '', $f2);

            $total = ncd_new($file1, $file2);

            echo "$fileC.txt vs $fileD.txt is: $total%\n";
    }
}

There are 2 answers below.

BEST ANSWER

Another process that I tried was:

  1. strip the HTML tags from the page
  2. replace \s{2,} with a single space and \n{2,} with a single \n, so that the text between each tag ends up on a single line (almost)
  3. compare two such generated files line by line: take a line and preg_match it against the other file; if found -> duplicate; else break the line into an array of words, calculate the array_intersect against the other file's words, and if the count is 70% or more of the line's length -> duplicate
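
In outline, the steps look roughly like this (a simplified sketch; the helper names, exact regexes and matching rules are only illustrative):

// Sketch of steps 1-2: strip tags and collapse whitespace so each tag's text
// sits on (roughly) one line.
function normalize_page($html) {
    $text = strip_tags($html);                        // step 1: drop HTML tags
    $text = preg_replace('/[ \t]{2,}/', ' ', $text);  // step 2: collapse runs of spaces/tabs
    $text = preg_replace('/\n{2,}/', "\n", $text);    //         and collapse blank lines
    return $text;
}

// Sketch of step 3: exact line match first, then a 70% word-overlap fallback.
function lines_match($line, $otherText) {
    if (preg_match('/' . preg_quote($line, '/') . '/', $otherText)) {
        return true; // line appears verbatim in the other file
    }
    $words      = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    $otherWords = preg_split('/\s+/', $otherText, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return false;
    }
    $common = array_intersect($words, $otherWords);
    return count($common) >= 0.7 * count($words); // 70% threshold
}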

This approach was very efficient: I could compare 5000 files in about 10 minutes,

but it was still too slow for my requirements.

So I implemented the first logic, the NCD algorithm, in C, and it completes the task in 5-10 seconds (depending on the average page size).

SECOND ANSWER

You may want to find a way to distinguish likely candidates from unlikely ones. Maybe there is a value you can compute for each file (say: a word count, a count of sentences or paragraphs, maybe even counts of individual letters) that lets you rule out the unlikely candidates beforehand. If you could achieve this, you could reduce the number of comparisons by ordering your files by this computed number.
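
For example, a rough sketch of that idea (the word-count fingerprint and the 10% tolerance are just illustrative choices):

// Compute a cheap fingerprint (here: a word count) for every file once,
// sort by it, and only run the expensive NCD comparison on files whose
// counts are close together.
$totalScraped = 5000;
$counts = array();
for ($i = 0; $i < $totalScraped; $i++) {
    $counts[$i] = str_word_count(file_get_contents($i . ".txt"));
}
asort($counts);                 // file indices ordered by word count
$keys = array_keys($counts);

for ($a = 0; $a < count($keys); $a++) {
    for ($b = $a + 1; $b < count($keys); $b++) {
        $i = $keys[$a];
        $j = $keys[$b];
        // stop as soon as the counts drift too far apart (tolerance is a guess)
        if ($counts[$j] - $counts[$i] > 0.1 * $counts[$i]) {
            break;
        }
        // ... only now run the expensive ncd_new() comparison on files $i and $j
    }
}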