Collisions in unordered_multimap when iterating through bucket via local_it

1.3k Views Asked by At

In the code below, I have a number of strings (DNA sequences) that I am storing in a vector. I have a struct, read_tag that I use to identify each string; read_tag.read_id is the string identifier. I take 30 character substrings of each string and use it as a key in an unordered_multimap with the read_tag as a value; the aim is to group strings that share a 30 character sequence. Naturally, identical string will hash to the same value, and end up in the same bucket in the multi map. Offset is used to give the "shift" from index zero of the 30 character tag.

However, when I run this code, iterating through each bucket; I find that there are multiple different sequences in the same bucket. I thought collisions were resolved in unordered_mutlimap, and therefore in a bucket, their should just be one key (string). I understand that collision can happen, but I thought that either chaining, probing etc was implemented in unordered_mutlimap. You should be able to run and check output to see where I am getting confused.

I also std::hash the each key, one in a bucket, and I find that the keys in the "collisions" have a different hash value.

So, it's as if collisions are happening, resulting in values with diff. keys in the same bucket, but in contradiction, the keys hashed to different vals. Is their a way to avoid this, and differentiate values based on keys within buckets? Or do I need to do implement this?

#include <iostream>                                                                                   
#include <string>                                                                                     
#include <unordered_map>                                                                              
#include <vector>                                                                                     
#include <functional>                                                                                 

using namespace std;                                                                                  


int main() {                                                                                          


  vector<string>  reads;                                                                              

  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("GGCAGGGTCATACCCGATTAACTTGTTATAGAGTATGGGGCATCAACTTGGGCAGCAATGGGGAACGGTGTCTCTGGAAG");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCGGGCGTGGTGGCGTGCACCTGTAATCCCAGCTACTTGGGATGTTCAGGCAGGAGACTCGCTTGATCCCCGGGGACGGA");
  reads.push_back("CCAGCTGCTCTCACCCTGGGCAGGGTCCCTGCACACACTGTATCTTTTGAGGTCCCTTCAGGACCCCGGTTTGCTGCCTC");

  struct read_tag{                                                                                    
    unsigned int read_id;    // unique string identifier                                                                          
    int offset;              // shift of 30 character substring represented by tag                                                                                                                                            
  };                                                                                                  

  unordered_multimap<string, read_tag> mutation_grouper;                                              

  for(int read_id=0; read_id < reads.size(); read_id++) {                                             
    string read = reads[read_id];                                                                                              
    for(int i=0; i < read.size()-30; i++) {                                                                                                                            
      string sub_read = read.substr(i, 30);                                                           
      read_tag next_tag;                                                                              
      pair<string, read_tag> key_val;                                                                 

      next_tag.read_id = read_id;                                                                     
      next_tag.offset = i;                                                                                                                                             

      key_val.first = sub_read;                                                                       
      key_val.second = next_tag;                                                                      

      mutation_grouper.insert(key_val);                                                               
    }                                                                                                 
  }                                                                                                   

  cout << "mutation_grouper buckets" << endl;                                                         
  std::hash<std::string> hash_er;                                                                     

  for(unsigned int bucket = 0;  bucket < mutation_grouper.bucket_count(); bucket++) {

    cout << "Bucket: " << bucket << endl;                                                    
    for( auto local_it = mutation_grouper.begin(bucket);                                     
     local_it != mutation_grouper.end(bucket); ++local_it) {                             

      cout << local_it->first << " : " << local_it->second.read_id                           
      << ", " << local_it->second.offset << ", " << endl;                                               

      cout << "hash value: " << local_it->first <<"::: " << hash_er(local_it->first) << endl;

     }                                                                                        
     cout << endl << endl;                                                                    
   }                                                                                          
 }     
2

There are 2 best solutions below

0
On BEST ANSWER

Yes, your are correct. It is not guaranteed, that two different items land in two different buckets. You know only, that two same items land in the same bucket.

The solution to your problem is simply to avoid buckets. The class unordered_multimap (as well as multimap) has the method equal_range, which gives you the range of elements with a specific key. So you only have to iterate over all keys, and use equal_range to iterate over all values. Sadly there is no method, that allows you to iterate over the keys, so you have to be a little tricky. The following code should give you the desired output:

// iterate through all elements in the multimap
// don't worry, we'll skip a bunch
for (auto it = mutation_grouper.begin(); it != mutation_grouper.end(); )
{
    // Get the range of the current key
    auto range = mutation_grouper.equal_range(it->first);

    // Print all elements of the range
    cout << it->first << endl;
    for (auto local_it = range.first; local_it != range.second; ++local_it)
        std::cout << "   " << local_it->second.read_id << " " << local_it->second.offset << '\n';

    // Step to the end of the range
    it = range.second;
}
0
On

So, for anyone whom was interested. I found this in the standard

[C++11: 23.2.5/5]: Two values k1 and k2 of type Key are considered equivalent if the container’s key_equal function object returns true when passed those values. If k1 and k2 are equivalent, the hash function shall return the same value for both. [..]

[C++11: 23.2.5/8]: The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket. [..]

So, two values with the same key will always end up in same bucket, but keys with different values could also end up in these buckets. So, I suppose the implementation may be more intelligent, and actually promote these situations; one reason I could think of it to keep the number of buckets down. You can see from the output that the filled buckets are sparse; and the closer we get to direct address tables (array of vectors, indexed by the hash) we'll end up with a giant universe of potential keys, with a giant number of empty slots, which hash tables protect against. So, it seems like a reasonable space optimisation.

Because of this, I have chosen to use multimap instead. The reasons are that, values in multimap are ordered based on key, so I can make a single pass through grouping values based on keys. In unordered_multimap once I reach a bucket (in O(1) because its a hash table) there is no key based ordering, so I cant take a linear pass through the bucket to group sequences.