Questions about LSH (locality-sensitive hashing) and minhashing implementation


I'm trying to implement this paper

Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic

I have a couple of questions about the LSH algorithm in general and about the proposed implementation:

  • Is the LSH algorithm only used when you have a lot of documents to compare with each other (because, from what I understand, it puts the similar ones into the same bucket)? If, for example, I have a new document and want to compute its similarity with the others, do I have to rerun the LSH algorithm from scratch, including the new document?

  • In 'Mining of Massive Datasets, Ch. 3', it is said that for LSH we should use one hash function per band. Each hash function creates n buckets. So, for the first band, we are going to have n buckets. From the second band onward, am I supposed to keep using the same hash function (so that I keep using the same buckets as before) or a different one (ending up with m >> n buckets)?

  • This question is related to the previous one. If I use the same hash function for all the bands, then I'll have n buckets; no problem there. But if I have to use more hash functions (a different one per band), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I only have to use one hash function, this isn't a problem.)

  • In the paper, I understood most of the algorithm except for its end. Basically, two signature matrices are created via minhashing, one for stable features and one for unstable features (see the sketch after this list for how I understand that step). Then, they use LSH on the first matrix to obtain a list of candidate pairs. So far so good. What happens at the end? Do they perform LSH on the second matrix as well? How is the result of the first LSH pass used? I cannot see the relationship between the first and the second LSH pass.

  • The output of the final step is supposed to be a list of candidate pairs, right? And all I have to do is compute the Jaccard similarity on each of them and set a threshold, right?
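
To make the minhashing part of my questions concrete, here is a minimal sketch of how I currently understand the construction of one signature column per document. The hash family h(x) = (a*x + b) mod p and all parameters here are my own illustrative choices, not the paper's:

```python
import random
import zlib

P = 2**31 - 1  # a large prime for the hash family

def stable_hash(s):
    # Deterministic string hash (Python's built-in hash() is salted per process)
    return zlib.crc32(s.encode("utf-8"))

def make_hash(a, b):
    # One member of the family h(x) = (a*x + b) mod P
    return lambda x: (a * stable_hash(x) + b) % P

def minhash_signature(shingles, hash_funcs):
    # Signature = for each hash function, the minimum value over all shingles
    return [min(h(s) for s in shingles) for h in hash_funcs]

random.seed(42)
NUM_HASHES = 100
hash_funcs = [make_hash(random.randrange(1, P), random.randrange(P))
              for _ in range(NUM_HASHES)]

doc1 = {"ab", "bc", "cd"}  # toy shingle sets
doc2 = {"ab", "bc", "ce"}
sig1 = minhash_signature(doc1, hash_funcs)
sig2 = minhash_signature(doc2, hash_funcs)

# The fraction of matching positions estimates the Jaccard similarity
print(sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES)
```

If this is right, the signature matrix is just these columns side by side, one column per document, and the paper builds two such matrices (stable and unstable features).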

Thanks for your answers!




I got a partial answer to my questions (question 4 is still open):

  1. No. You would keep the bucket structure and hash the new document into it, then compare it only with the documents in the buckets it falls into (see the sketch after this list).

  2. No. You HAVE to use a different hash function, with its own set of buckets, for each band.

  3. This is irrelevant because of the answer to (2).
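
To illustrate (1) and (2), here is a minimal sketch of banded LSH in Python; all parameters are toy values of my own choosing, not from the paper. One hash table per band plays the role of "a different hash function with its own buckets per band" (the dict's internal hashing of the band tuple acts as that band's hash function), and a new document is simply hashed into the existing tables rather than triggering a rebuild:

```python
import random
from collections import defaultdict
from itertools import combinations

BANDS = 20  # b bands
ROWS = 5    # r rows per band; signature length = b * r

def lsh_index(signatures):
    # One table per band: equivalent to a different hash function
    # and a disjoint set of buckets for each band (answer 2).
    tables = [defaultdict(list) for _ in range(BANDS)]
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            bucket_key = tuple(sig[b * ROWS:(b + 1) * ROWS])
            tables[b][bucket_key].append(doc_id)
    return tables

def query(tables, sig):
    # Hash a *new* signature into the existing tables (answer 1):
    # no rebuild, just collect the docs it collides with in any band.
    candidates = set()
    for b in range(BANDS):
        bucket_key = tuple(sig[b * ROWS:(b + 1) * ROWS])
        candidates.update(tables[b].get(bucket_key, []))
    return candidates

def candidate_pairs(tables):
    # All pairs that share a bucket in at least one band.
    pairs = set()
    for table in tables:
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs

def jaccard(a, b):
    # Verification step from question 5: compare the original shingle
    # sets of each candidate pair against a threshold.
    return len(a & b) / len(a | b)

random.seed(0)
base = [random.randrange(1000) for _ in range(BANDS * ROWS)]
near = base.copy()
near[3] = (near[3] + 1) % 1000  # differs in a single row
other = [random.randrange(1000) for _ in range(BANDS * ROWS)]

tables = lsh_index({"doc_a": base, "doc_b": near, "doc_c": other})
print(candidate_pairs(tables))  # {('doc_a', 'doc_b')}
print(query(tables, base))      # finds doc_a and doc_b, no rebuild needed
```

The jaccard helper at the end corresponds to the verification step you describe in question 5: LSH only proposes candidate pairs, which you then filter by their true similarity against your threshold.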