php: find duplicate content in files/nested looping

Question

php: find duplicate content in files/nested looping

147 Views Asked by Sachin Upmanyu At 31 July 2025 at 14:57

I have scraped 5000 files, stored them in individual files (0-4999.txt), now i need to find duplicate content in them. so i am comparing each file with one another in nested loop (ETA 82 hours). This approach will definitely take hours to complete. My main concern here is the no. of iterations. Can anyone suggest a better approach to cut down iterations and reduce time taken?

current code: NCD algorithm

function ncd_new($sx, $sy, $prec=0, $MAXLEN=9000) {
# NCD with gzip artifact correctoin and percentual return.
# sx,sy = strings to compare. 
# Use $prec=-1 for result range [0-1], $pres=0 for percentual, 
# For NCD definition see http://arxiv.org/abs/0809.2553
  $x = $min = strlen(gzcompress($sx));
  $y = $max = strlen(gzcompress($sy));
  $xy= strlen(gzcompress($sx.$sy));
  $a = $sx;
  if ($x>$y) { # swap min/max
    $min = $y;
    $max = $x;
    $a = $sy;
  }
  $res = ($xy-$min)/$max; # NCD definition.
    if ($MAXLEN<0 || $xy<$MAXLEN) {
    $aa= strlen(gzcompress($a.$a));
    $ref = ($aa-$min)/$min;
    $res = $res - $ref; # correction
  }
  return ($prec<0)? $res: 100*round($res,2+$prec);
}

looping over each file:

$totalScraped = 5000;
for($fileC=0;$fileC<$totalScraped;$fileC++)
{
    $f1 = file_get_contents($fileC.".txt");
    $stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
    $file1 = preg_replace($stripstr, '', $f1);

    // 0+fileC => exclude already compared files
    // eg. if fileC=10 , start loop 11 to 4999
    for($fileD=(0+$fileC);$fileD<$totalScraped;$fileD++)
    {
            $f2 = file_get_contents($fileD.".txt", FILE_USE_INCLUDE_PATH);
            $stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
            $file2 = preg_replace($stripstr, '', $f2);

            $total=ncd_new($file1,$file2);

            echo "$fileName1 vs $fileName2 is: $total%\n";
    }
}

Original Q&A

There are 2 best solutions below

Burki On 20 January 2015 at 13:48

You may want to find a way to distinguish likely candidates from unlikely ones. So, maybe there is a way that you can compute a value for each file (say: a word count, a count of sentences / paragraphs... maybe even a count of individual letters), to identify the unlikely candidates beforehand. If you could achieve this, you could reduce the amount of comparisons by ordering your arrays by this computed number.

**Sachin Upmanyu** · Accepted Answer

another process that i tried was:

strip html tags from page
replace \s{2,} with \s, \n{2,} with \n, so that text b/w each tag is presented in a single line(almost)
compare two such generated files by taking a line, preg_matching, if found -> duplicate, else break line into array of words, calculate array_intersect, if count is 70% or more of line length -> duplicate.

which was very efficient and i could compare 5000 files in ~10 minutes

but still slow for my requirements.

So i implemented the first logic "ncd algo" method in C language, and it completes the task with 5-10 seconds (depending on the average page size)

php: find duplicate content in files/nested looping

There are 2 best solutions below

Related Questions in PHP

Related Questions in NESTED-LOOPS

Related Questions in DUPLICATE-CONTENT

Trending Questions

Popular # Hahtags

Popular Questions