Bucket count in unordered_map


In the sample program given below (source: http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/)

// unordered_map::rehash
#include <iostream>
#include <string>
#include <unordered_map>

int main ()
{
  std::unordered_map<std::string,std::string> mymap;

  mymap.rehash(20);

  mymap["house"] = "maison";
  mymap["apple"] = "pomme";
  mymap["tree"] = "arbre";
  mymap["book"] = "livre";
  mymap["door"] = "porte";
  mymap["grapefruit"] = "pamplemousse";

  std::cout << "current bucket_count: " << mymap.bucket_count() << std::endl;

  return 0;
}

the output is:

current bucket_count: 23

Why does the bucket count become 23? What's the impact on heap size? When is the heap allocation done: on the rehash or on the actual inserts? And when does dynamic deallocation happen: when clear() is used, when erase() is used, or both?


There are 2 answers below.

Answer 1:

The default rehash policy used by libstdc++ is to go up to the smallest prime number of buckets greater than or equal to the number requested. 23 is the smallest prime above 20.
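
As a quick illustration, here is a sketch that requests a few different bucket counts and prints what the container actually allocates. The exact numbers are implementation-defined; with libstdc++ each result is typically the smallest prime greater than or equal to the request.

// bucket_count after rehash requests
// (results shown assume libstdc++; other implementations may round differently)
#include <iostream>
#include <unordered_map>

int main ()
{
  std::unordered_map<int,int> m;

  for (std::size_t requested : {10, 20, 24, 50})
  {
    m.rehash(requested);
    std::cout << "requested " << requested
              << " -> bucket_count " << m.bucket_count() << std::endl;
  }
  // with libstdc++ this typically prints 11, 23, 29, 53:
  // the smallest prime >= each requested value

  return 0;
}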

Answer 2:

Hash tables are usually sized to be "comfortably" larger than the number of items to be stored in the table. This is because the probability of two different items mapping to the same bucket increases as a hash table fills.

As the following image from Wikipedia (image source) shows, for some methods of collision resolution the behavior of the hash table degrades dramatically if its "load factor" --- the fraction of buckets in use --- surpasses a certain threshold.

Cache misses per look-up in a hash table

Therefore, the number of buckets should always be larger than the number of elements in your hash table.
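
To see that in action, here is a small sketch (my own, not from the question) that inserts elements one at a time and prints whenever the container grows its bucket array. An unordered_map rehashes automatically when the load factor would exceed max_load_factor (1.0 by default); the exact growth points and bucket counts are implementation-defined.

// watching bucket_count grow as elements are inserted
// (exact bucket counts and growth points are implementation-defined)
#include <iostream>
#include <unordered_map>

int main ()
{
  std::unordered_map<int,int> m;
  std::size_t buckets = m.bucket_count();

  for (int i = 0; i < 100; ++i)
  {
    m[i] = i;
    if (m.bucket_count() != buckets)
    {
      buckets = m.bucket_count();
      std::cout << "after " << m.size() << " elements: "
                << buckets << " buckets, load_factor "
                << m.load_factor() << std::endl;
    }
  }

  return 0;
}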

Having the number of buckets be a prime number helps ensure that entries are distributed evenly across the table. In general, any hash value that shares a common factor with the number of buckets will land in a bucket that is a multiple of that factor. Therefore, if you set the number of buckets to 20 and your hash values happen to be even, you'll waste about 50% of the table's space; the situation is worse still if your hash values have factors like 4, 5, or 10. The sketch below illustrates this.
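
A small standalone sketch of that effect (plain modular arithmetic, not an actual unordered_map): hash values that are all multiples of 5 land in only 4 of 20 buckets, but spread over all 23 buckets when the count is prime.

// bucket = hash % bucket_count; count how many distinct buckets get used
// when every hash value is a multiple of 5
#include <iostream>
#include <set>

int main ()
{
  for (unsigned buckets : {20u, 23u})
  {
    std::set<unsigned> used;
    for (unsigned hash = 0; hash < 1000; hash += 5)
      used.insert(hash % buckets);
    std::cout << buckets << " buckets: " << used.size()
              << " distinct buckets used" << std::endl;
  }
  // 20 buckets: only 4 buckets (0, 5, 10, 15) are ever used
  // 23 buckets: all 23 buckets are used

  return 0;
}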

Knowing the above, you can see why the hash table might be larger than you specify: the extra space (usually) improves performance. You can also see why the number of buckets is a prime: that gives better usage of the space you do have. Combining these makes 23 seem like a reasonable choice.