Is the hash function of unordered_map deterministic?

414 Views Asked by At

I'm trying to debug a caching issue in our codebase, where we use std::unordered_map to hash a struct custom_key which is defined below, but I believe it should only be POD datatypes so I think whatever the C++ library implemented for the hash function will be used. I am wondering if this function is deterministic?

struct custom_key {
  // bunch of uint8_t variables
  // some int8_t variables
  // some boolean variables
  another_struct params;
}

struct another_struct {
  // an enum variable
  // some int variables
  int dimensions[5];
}

Our caching algorithm is

// some_input_key is coming in from a stream
std::unordered_map<custom_key, some_output_type, custom_hasher<custom_key>, custom_equal<custom_key>> ht;
if (ht.find(some_input_key) != ht.end()) {
  // do something
} else {
  ht[some_input_key] = ....
}

In our input stream, we have a set of custom_keys where there may be identical values, but we're unable to find some subsequent identical keys in the hash table.

e.g., our stream would be key1, key2, key3, key4, .... and suppose key1 and key4 are identical, sometimes it will not be able to find key4 in the hash table. I'm pretty sure the hashing function is deterministic, so I'm perplexed what could be causing the issue. I've printed the contents of both keys and they look identical.

Edit: it appears that we do have a custom hash function

template <typename T>
// based on Fowler–Noll–Vo hash function
struct custom_hasher {
  static_assert(std::is_pod<T>::value, "T is not POD");

  size_t operator()(const T& input) const {
    auto ptr = reinterpret_cast<const uint8_t*>(&input);
    uint32_t value = 0x811C9DC5;
    // we have an implementation for irange that does something similar to the one in python
    for (const auto i : irange((int)sizeof(T))) {
      value ^= ptr[i];
      value *= 0x01000193;
    }
    return (size_t)value;
  }
};

template <typename T>
struct custom_equal {
  static_assert(std::is_pod<T>::value, "T is not POD");

  bool operator()(const T& lhs, const T& rhs) const {
    auto ptr1 = reinterpret_cast<const uint8_t*>(&lhs);
    auto ptr2 = reinterpret_cast<const uint8_t*>(&rhs);
    return memcmp(ptr1, ptr2, sizeof(T)) == 0;
  }
};
1

There are 1 best solutions below

0
On

cppreference.com:

For two parameters k1 and k2 that are equal, std::hash()(k1) == std::hash()(k2). For two different parameters k1 and k2 that are not equal, the probability that std::hash()(k1) == std::hash()(k2) should be very small, approaching 1.0/std::numeric_limitsstd::size_t::max().

std::unordered_map will use hash on its keys. Objects with the same HASH go into the same buckets. However, if two things have the same VALUE, then the second pair is ignored. If key1 and key4 are identical, then key4 and its value will be discarded.

What you want to use is std::unordered_multimap (see here): cppreference.com:

Unordered multimap is an unordered associative container that supports equivalent keys (an unordered_multimap may contain multiple copies of each key value) (emphasis added).

EDIT: As the others are saying, padding might be messing with your hash result. An alternative is to individually hash each member of the struct and then combine the hashes.