What’s the best way to compute multiple hashes of small integer numbers for Bloom filter?

192 Views Asked by At

I’ve looked at Murmur3 and Meow but they both seem to be optimized for bandwidth when hashing long arrays. I don’t have any arrays, I only have uint32_t integers on input. My inputs are small non-negative numbers usually under a few millions, all divisible by 3. The probability density is uniform over the range [ 0 .. N*3 ] for some integer number N.

Is the following code good enough in terms of performance, and quality of distribution?

// Count of bits in the complete Bloom filter
// Using 32kb or memory to hopefully stay in L1D cache
constexpr size_t countBits = 1024 * 32 * 8;

// These bytes are from random.org
// Cryptographic strength is irrelevant, we only need performance here.
static const __m128i seed = _mm_setr_epi8( (char)0x8a, (char)0xf4, 0x30, (char)0x91, (char)0x07, 0x05, 0x45, (char)0x99, 0x2f, (char)0x95, 0x4a, (char)0xa2, (char)0x84, (char)0x88, (char)0xe6, (char)0x09 );

// Mask to extract lower bits from the hash
static const __m128i andMask = _mm_set1_epi32( countBits - 1 );

// Compute 16 bytes of hash function, mask upper bits in every 32-bit lane, computing 4 bit positions for the Bloom filter.
inline __m128i hash( uint32_t key )
{
    // Broadcast integer to 32-bit lanes
    __m128i a = _mm_set1_epi32( (int)key );
    // Abuse AES-NI decrypting instruction to make a 16-byte hash.
    // That thing only takes 4 cycles of latency to complete.
    a = _mm_aesdec_si128( a, seed );
    // Mask upper bits in the lanes so the _mm_extract_epi32 returns position of bits in the Bloom filter to set or test
    a = _mm_and_si128( a, andMask );
    return a;
}

Update: the question is not opinion based because count of hash collisions is trivially measurable. It’s for this very reason I’ve specified the distribution of my input values.

1

There are 1 best solutions below

8
On

It sounds like you might not need to use a hash function at all:

My inputs are small non-negative numbers usually under a few millions, all divisible by 3. The probability density is uniform over the range [ 0 .. N*3 ] for some integer number N.

The whole point of using a hash function for a bloom filter is that given an input whose probability density is not uniform, to make it uniform. Of course, even if statistically every value is equally likely to occur, it can still be that a given input is correlated with a previous input (say, values close together are likely to appear in groups), in which case you still might need to apply a hash function.

Is the following code good enough in terms of performance, and quality of distribution?

That is hard to answer, only you can tell if it meets the requirements of your application. A single round of AES should not be considered cryptographically secure, and even if it were, using a known fixed seed would allow an attacker that can control the inputs to your Bloom filter to distribute the inputs in such a way that your Bloom filter is very inefficient. But you said it yourself:

[...] count of hash collisions is trivially measurable.

You can measure the distribution of hash values before and after applying the AES decryption round on realistic input, and thus check if it is good enough for you.