I am implementing MinHash and LSH for similarity search over string elements in C++11. The MinHash sketch in my implementation is a vector of 200 64-bit integers, i.e. vector<uint64_t> MinHashSketch. I have more than 2 million entries. Sketch generation does not take much time, but the bucketing stage takes a long time, and I would appreciate suggestions for making it faster. My LSH bucketing stage is described below.
I take consecutive elements of the sketch and hash them together; the resulting hash becomes the bucket id. If bsize = 5, then elements 1-5, 6-10, 11-15, ..., 196-200 of MinHashSketch[i] (the sketch of the ith entry) each form one bucket id. The following piece of code does that.
for (int p = 0; p < 200; p += bsize) { // bsize = 5
    string s = "";
    for (int x = p; x < (p + bsize); x++) {
        s = s + to_string(MinHashSketch[i].at(x)); // ith element
    }
    uint64_t hash1 = 0; // bucket id
    Hash_function((uint8_t*)s.c_str(), s.length(), (uint8_t*)&hash1, 0);
    ........
    ........
}
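
One direction that may be worth measuring is to skip the string entirely and hash the raw bytes of each band. The loop above builds a temporary string with to_string and repeated concatenation for every band of every entry, which means many small allocations. The following is only a minimal sketch under stated assumptions: fnv1a64 is a stand-in for Hash_function (whose implementation is not shown here), and bucket_ids is a hypothetical helper name. The resulting bucket ids will differ from the string-based ones, so all entries have to be bucketed with the same scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the question's Hash_function (an assumption): 64-bit FNV-1a
// over a raw byte range. Swap in the real hash function here.
inline uint64_t fnv1a64(const uint8_t* data, std::size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (std::size_t k = 0; k < len; ++k) {
        h ^= data[k];
        h *= 0x100000001b3ULL;           // FNV-1a 64-bit prime
    }
    return h;
}

// Hypothetical helper: returns one bucket id per band of `bsize`
// consecutive sketch values (40 ids for a 200-value sketch, bsize = 5).
std::vector<uint64_t> bucket_ids(const std::vector<uint64_t>& sketch,
                                 std::size_t bsize) {
    assert(bsize > 0 && sketch.size() % bsize == 0);
    std::vector<uint64_t> ids;
    ids.reserve(sketch.size() / bsize);
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(sketch.data());
    for (std::size_t p = 0; p < sketch.size(); p += bsize) {
        // Hash the band's bsize * 8 bytes in place; no string is built.
        ids.push_back(fnv1a64(bytes + p * sizeof(uint64_t),
                              bsize * sizeof(uint64_t)));
    }
    return ids;
}
```

With the layout from the question, this would be called as bucket_ids(MinHashSketch[i], bsize) for each entry i. Hashing fixed-width bytes also avoids an ambiguity of unseparated decimal concatenation, where the values 1, 23 and 12, 3 both produce the string "123". Whether this removes the bottleneck depends on what the elided part of the loop does, so profiling both versions is advisable.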