How to count char tuples efficiently in PHP

89 Views Asked by At

I need to fast count char tuples (or N-grams) in huge files/strings (from 10MB+ up to 1GB+) within a PHP project (a file classifier).

The current implementation is made for single characters count (N=1), and runs very fast in 0.x sec. on very large strings (up to 1 billion chars/bytes)

// returns an associative array [char => count]
function frequencies($txt) {
  $index = count_chars($txt, 1);
  foreach ($index as $code => $nb) {
    $count[chr($code)] = $nb;
  }
  return $count;
}

I'd like to modify it to run with bigrams (N=2 or more), so I wrote this

// returns an associative array [charTuple => count]
function frequencies($txt, $n) {
    $length = strlen($txt) - $n+1;
    for ($i = 0; $i < $length; $i++) {
        @$count [ substr($txt, $i, $n) ] ++;
    }
    return $count;
}
// NB: '@' is ugly but seems faster than if isset()

However this code is really slow: 45 sec. on the same file for N = 1 (about 100x slower) and more than a minute for N=2.

I have tried a direct access + concatenation alternative:

function twograms($txt) {
    $length = strlen($txt) - $n;
    for ($i = 0; $i < $length; $i++) {
        @$count [ $txt[$i] . $txt[$i+1] ] ++;
    }
    return $count;
}

And it runs slightly faster in 42 sec. (+/- error margin)

It is still way too slow and does not match the built-in function count_char() efficiency.

Is there an alternative ? Is there a way to count char tuples more efficiently ?

EDIT: I updated the online PHP link to test/benchmark to take into account the proposal with str_split and array_count_values and the associated memory usage.

1

There are 1 best solutions below

1
Rob Eyre On

I'm not sure about the performance, but you could include this approach when you're benchmarking alternatives. It uses a combination of str_split and array_count_values to count sequences of chars in the source string:

$str = 'The quick brown fox jumps over the lazy dog';

$len = 2;
$counts = [];
for ($n = 0; $n < $len; $n++) {
    $ngramCounts = array_count_values(str_split(substr($str, $n), $len));
    foreach ($ngramCounts as $ngram => $count) {
        $counts[$ngram] = ($counts[$ngram] ?? 0) + $count;
    }
}
print_r($counts);