I am trying to find the CRC32 collision probability among all possible ASCII strings of variable length ranging from 1 to 7. Here an ASCII character can range from ASCII 32 to ASCII 126. I tried to run my program in my PC, but due to high requirement of CPU, the program would crash and never run to completion. Is there any other way to find this out.
my code is as below:
import binascii
from itertools import product
def crc32_all_ascii_strings(length):
strings = []
for chars in product(range(32, 127), repeat=length):
strings.append(''.join(chr(c) for c in chars))
crc32_dict = {}
for string in strings:
crc32 = binascii.crc32(string.encode()) & 0xFFFFFFFF
if crc32 not in crc32_dict:
crc32_dict[crc32] = []
crc32_dict[crc32].append(string)
return crc32_dict
if __name__ == "__main__":
for length in range(1, 8):
crc32_dict = crc32_all_ascii_strings(length)
for crc32, string_list in crc32_dict.items():
if len(string_list) > 1:
print(f"CRC32: {crc32:08X}")
for string in string_list:
print(f" {string}")
The probability of collision for strings of length 1-4 is exactly zero. A CRC-32 is a one-to-one and onto mapping of 32 bits to 32 bits.
For length 5, the probability of a collision is very close to 2-32. 955 is about double 232, so there is plenty of coverage of the range space. For length 6 or more, it is almost exactly 2-32, whether it's ASCII or all byte values. The domain constraint has negligible effect.
Update:
I ran the experiments, and for length 5, restricting the five bytes to be in 32..126, the probability of collision is about 2-32.2. That's 13% less than what the probability of collision would be if the bytes were not constrained (in 0..255), which is 2-32.
For length 6, bytes in 32..126, we indeed get a collision probability of 2-32. It will be the same for any greater length.
Update 2:
I was able to calculate the probability for length 5. It is: 4111856 divided by 20546909374090625 ≈ 2-32.2184. Here is the code in C++ that does the calculation: