I've implemented the Adler32 rolling hash in PHP, but because ord
is so slow (about 1MB per second on my dev machine) to get the integer values of chanters in a string, this solution is unworkable for 100MB+ files.
PHP's mhash function can get a very quick calculation for the adler32 (120MB per second on my dev machine). However mhash doesn’t seem to support the rolling nature of adler32, so you have to calculate a whole new adler32 as the rolling window moves rather than just recalculate the hash for the two bytes which have actually changed.
I'm not tied to the adler32 algorithm, I just need a very fast rolling hash in PHP.
Call the low two bytes of the Adler-32 A and the high two bytes B, where that is the Adler-32 of the sequence {x1, x2, ..., xn}.
To get the Adler-32 of {x2, ..., xn}, subtract x1 from A, modulo 65521, and subtract n * x1 + 1 from B, again modulo 65521.
Note that if your window size n happens to be a multiple of 65521, then you can just subtract one from B (modulo 65521). So that might be a good window size to pick, if you can. Also note that if n is larger than 65521, then you can multiply x1 by (n modulo 65521) instead. If n is a constant, you can do that modulo operation ahead of time.
(Note that
%
operator in C and PHP is not the modulo operation, but rather the remainder operation. So you need to take care with negative numbers.)