I need to quickly count character tuples (N-grams) in huge files/strings (from 10 MB+ up to 1 GB+) within a PHP project (a file classifier).
The current implementation handles single-character counts (N=1) and runs very fast, in 0.x sec., on very large strings (up to 1 billion chars/bytes):
// returns an associative array [char => count]
function frequencies($txt) {
    $count = [];
    // count_chars() mode 1: only byte values with a frequency > 0
    $index = count_chars($txt, 1);
    foreach ($index as $code => $nb) {
        $count[chr($code)] = $nb;
    }
    return $count;
}
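For illustration, on a short string it returns (counting bytes, in ascending byte-code order):

frequencies("hello");
// ['e' => 1, 'h' => 1, 'l' => 2, 'o' => 1]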
I'd like to modify it to work with bigrams (N=2) or larger N, so I wrote this:
// returns an associative array [charTuple => count]
function frequencies($txt, $n) {
    $count = [];
    $length = strlen($txt) - $n + 1;
    for ($i = 0; $i < $length; $i++) {
        @$count[substr($txt, $i, $n)]++;
    }
    return $count;
}
// NB: '@' is ugly, but suppressing the undefined-index notice seems faster than an isset() check
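A cleaner notice-free variant I could also benchmark is the null coalescing operator (PHP 7+), which avoids both '@' and an explicit isset():

$gram = substr($txt, $i, $n);
$count[$gram] = ($count[$gram] ?? 0) + 1;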
However, this code is really slow: 45 sec. on the same file for N=1 (about 100x slower) and more than a minute for N=2.
I have tried a direct access + concatenation alternative:
function twograms($txt) {
    $count = [];
    $length = strlen($txt) - 1; // stop before the last char so $i+1 stays in range
    for ($i = 0; $i < $length; $i++) {
        @$count[$txt[$i] . $txt[$i + 1]]++;
    }
    return $count;
}
It runs slightly faster, at 42 sec. (within the error margin), but that is still way too slow and nowhere near the efficiency of the built-in count_chars().
Is there an alternative? Is there a way to count char tuples more efficiently?
EDIT: I updated the online PHP test/benchmark link to take into account the str_split + array_count_values proposal and the associated memory usage.
I'm not sure about the performance, but you could include this approach when you're benchmarking alternatives. It uses a combination of
str_split and array_count_values to count sequences of chars in the source string:
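A minimal sketch of that approach (the helper name ngrams_split is illustrative, not from the original): str_split yields non-overlapping chunks, so splitting the string once per start offset 0..N-1 covers every overlapping N-gram exactly once, and array_count_values tallies each batch:

// counts overlapping n-grams via str_split + array_count_values
function ngrams_split($txt, $n) {
    $count = [];
    for ($offset = 0; $offset < $n; $offset++) {
        // non-overlapping chunks of length $n, starting at $offset
        $chunks = str_split(substr($txt, $offset), $n);
        // drop a trailing chunk shorter than $n, if any
        $last = count($chunks) - 1;
        if ($last >= 0 && strlen($chunks[$last]) < $n) {
            unset($chunks[$last]);
        }
        // tally this batch and merge into the running totals
        foreach (array_count_values($chunks) as $gram => $nb) {
            $count[$gram] = ($count[$gram] ?? 0) + $nb;
        }
    }
    return $count;
}

Note that this materializes the chunk arrays in memory, which is why memory usage matters on 1 GB inputs (see the EDIT above).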