I am currently reading about the Rabin Karp algorithm and as part of that I need to understand string polynomial hashing. From what I understand, the hash of a string is given by the following formula:
hash = ( char_0_val * p^0 + char_1_val * p^1 + ... + char_n_val ^ p^n ) mod m
Where:
- char_i_val: is the integer value of the character plus 1 given by
string[i]-'a' + 1 - p is a prime number larger than the character set
- m is a large prime number
The website cp-algorithms has the following entry on the subject. They say that the code to write the above is as follows:
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
p_pow = (p_pow * p) % m;
}
return hash_value;
}
I understand what the program is trying to do but I do not understand why it is correct.
My question
I am having trouble understanding why the above code is correct. It has been a long time since I have done any modular math. After searching online I see that we have the following formulas for modular addition and modular multiplication:
a+b (mod m) = (a%m + b%m)%m
a*b (mod m) = (a%m * b%m)%m
Based on the above shouldn't the code be as follows?
long long compute_hash(string const& s) {
const int p = 31;
const int m = 1e9 + 9;
long long hash_value = 0;
long long p_pow = 1;
for (char c : s) {
int char_value = (c - 'a' + 1);
hash_value = (hash_value%m + ((char_value%m * p_pow%m)%m)%m ) % m;
p_pow = (p_pow%m * p%m) % m;
}
return hash_value;
}
What am I missing? Ideally I am seeking a breakdown of the code and an explanation of why the first version is correct.
Mathematically, there is no reason to reduce intermediate results modulo
m.Operationally, there are a couple of very closely related reasons to do it:
So let's look at some quantities and see if they need to be reduced.
pwas defined as some value less thanm, sop % m == p.p_powandhash_valuehave already been reduced modulomwhen they were computed, reducing them modulomagain would do nothing.char_valueis at most 26, which is already less thanm.char_value * p_powis at most26 * (m - 1). That can be, and often will be, more thanm. So reducing it modulomwould do something. But it can still be delayed, because the next step is still "safe" (no overflow)char_value * p_pow + hash_valueis still at most27 * (m - 1)which is still much less than 263-1 (the maximum value for along long, see below why I assume that along longis 64-bit), so there is no problem yet. It's fine to reduce modulomafter the addition.As a bonus, the loop could actually do (263-1) / (27 * (m - 1)) iterations before it needs to reduce
hash_valuemodulom. That's over 341 million iterations! For most practical purposes you could therefore remove the first% mandreturn hash_value % m;instead.I used 263-1 in this calculation because
p_pow = (p_pow * p) % mrequireslong longto be a 64-bit type (or, hypothetically, an exotic size of 36 bits or higher). If it was a 32-bit type (which is technically allowed, but rare nowadays) then the multiplication could overflow, becausep_powcan be approximately a billion and a 32-bit type cannot hold 31 billion.BTW note that this hash function is specifically for strings that only contain lower-case letters and nothing else. Other characters could result in a negative value for
char_valuewhich is bad news because the remainder operator%in C++ works in a way such that for negative numbers it is not the "modulo operator" (misnomer, and the C++ specification does not call it that). A very similar function can be written that can take any string as input, and that would change the analysis above a little bit, but not qualitatively.