Making an LSH implementation faster in C++11


I am implementing MinHash and LSH for similarity search over string elements in C++11. The MinHash sketch in my implementation is a vector of 200 64-bit integers, i.e. vector<uint64_t> MinHashSketch. I have more than 2 million entries. Sketch generation does not take much time, but the bucketing stage takes a long time. I would appreciate suggestions for making it faster. My LSH bucketing stage is described below.

I take consecutive elements of the sketch and hash them to produce a bucket id. With bsize = 5, elements 1-5, 6-10, 11-15, ..., 196-200 of MinHashSketch[i] (the sketch of the ith element) each form one bucket id. The following piece of code does that.

for (int p = 0; p < 200; p += bsize) {  //bsize = 5
  string s = ""; 
  for(int x = p; x < (p+bsize); x++){
    s = s + to_string(MinHashSketch[i].at(x)); // ith element 
  }       
  uint64_t hash1 = 0;  // bucket id
  Hash_function ((uint8_t*)s.c_str(), s.length(), (uint8_t *)&hash1, 0);
  ........
  ........
}
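
One possible direction, sketched here under assumptions rather than measured on the real data: the loop above builds a temporary string for every band (40 bands per entry, over 2 million entries), and each to_string and concatenation may allocate. Since the values of a band already sit next to each other in memory inside the vector, their raw bytes can be hashed directly. Concatenating decimal strings without a separator is also ambiguous (for example, 1 followed by 23 and 12 followed by 3 both give "123"), which hashing the fixed-width bytes avoids. In the sketch below, fnv1a64 and bandBucketIds are hypothetical names, and fnv1a64 is only a stand-in for Hash_function, whose implementation is not shown in the question. If Hash_function accepts any byte pointer and length, as the call above suggests, it could presumably be given the same pointer and length instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Hash_function: 64-bit FNV-1a over raw bytes.
inline uint64_t fnv1a64(const uint8_t* data, std::size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV-1a 64-bit prime
    }
    return h;
}

// Returns one bucket id per band of `bsize` consecutive sketch values.
// Assumes the sketch length is a multiple of bsize (200 and 5 in the question).
inline std::vector<uint64_t> bandBucketIds(const std::vector<uint64_t>& sketch,
                                           std::size_t bsize) {
    assert(bsize > 0 && sketch.size() % bsize == 0);
    std::vector<uint64_t> ids;
    ids.reserve(sketch.size() / bsize);   // one allocation per sketch, none per band
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(sketch.data());
    for (std::size_t p = 0; p < sketch.size(); p += bsize) {
        // Hash the bsize * 8 bytes of this band in place; no string is built.
        ids.push_back(fnv1a64(bytes + p * sizeof(uint64_t),
                              bsize * sizeof(uint64_t)));
    }
    return ids;
}
```

Calling bandBucketIds(MinHashSketch[i], 5) would replace the inner string-building loop for the ith element. The bucket ids produced this way differ from the string-based ones, so every entry has to be bucketed with the same function for the buckets to stay comparable.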