I am trying to detect near-duplicates using the Elastiknn plugin.

I have created MinHashes of my text documents, with a MinHash set size of 100.

I want to apply LSH with Jaccard similarity using the Elastiknn plugin, because it has this type of index available.

As I understand the MinHash LSH duplicate-detection algorithm, for a required level of Jaccard similarity (say 0.8) we have to choose:

  1. number of buckets b and
  2. bucket size r

Elastiknn provides somewhat different parameters (https://elastiknn.com/api/#jaccard-lsh-mapping):

  1. L - Number of hash tables. Generally, increasing this value increases recall.
  2. k - Number of hash functions combined to form a single hash value.
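
For reference, here is a rough sketch of where these two parameters sit in the mapping body, based on my reading of the linked documentation page. The helper `jaccard_lsh_mapping` is a hypothetical name of my own, and the field name `my_vec` and the values for `dims`, `L`, and `k` are placeholders, not recommendations.

```python
import json


def jaccard_lsh_mapping(field: str, dims: int, L: int, k: int) -> dict:
    """Build a mapping body for a Jaccard LSH field (hypothetical helper).

    The structure follows the Jaccard LSH mapping example in the Elastiknn
    API docs; the concrete values passed in are assumptions.
    """
    return {
        "properties": {
            field: {
                "type": "elastiknn_sparse_bool_vector",
                "elastiknn": {
                    "dims": dims,             # total number of possible set elements
                    "model": "lsh",
                    "similarity": "jaccard",
                    "L": L,                   # number of hash tables
                    "k": k,                   # hash functions combined per table
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent with PUT /<index>/_mapping.
    print(json.dumps(jaccard_lsh_mapping("my_vec", 25000, 99, 1), indent=2))
```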

I am not sure if L and k are actually b and r.

Can anybody explain how to tune Elastiknn's L and k to get maximum accuracy at a required level of Jaccard similarity?

There are 1 best solutions below

"I am not sure if L and k are actually b and r."

Can you provide a more precise definition of b and r? For example, "size" is ambiguous, and "number of buckets" might mean the same thing as "number of hash tables", but maybe not. I looked briefly and didn't see any references to b and r in the context of MinHash.

"Can anybody explain how to tune Elastiknn's L and k to get maximum accuracy at a required level of Jaccard similarity?"

Parameter tuning is generally a process of trial-and-error. The general guidelines are as described in the docs:

  • Increasing L will generally increase recall. L represents the number of hash tables. A vector can only have one hash value per hash table. If you create more hash tables, you increase the probability that two vectors will share a hash value in one of those tables. This is otherwise known as "OR amplification".
  • Increasing k will generally increase precision. k represents the number of hashes concatenated together to create a single hash value for a single hash table. The more hashes you concatenate, the less likely it is that two vectors will have the same concatenated value. This is otherwise known as "AND amplification".
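
To get a feel for the trade-off before indexing anything, the two amplification steps can be sketched numerically. This is a minimal illustration, not Elastiknn code: it assumes the textbook MinHash property that a single hash function collides with probability equal to the Jaccard similarity s, so k concatenated hashes match with probability s^k, and at least one of L tables matches with probability 1 - (1 - s^k)^L. The function names and the (k, L) pairs are my own choices for illustration.

```python
def candidate_probability(s: float, k: int, L: int) -> float:
    """Probability that two sets with Jaccard similarity s share a hash value
    in at least one of L tables, each keyed by k concatenated MinHash values.

    AND amplification: all k hashes must agree  -> s ** k
    OR amplification:  any of the L tables hits -> 1 - (1 - s ** k) ** L
    """
    return 1.0 - (1.0 - s ** k) ** L


def approximate_threshold(k: int, L: int) -> float:
    """Rough similarity at which the S-curve rises most steeply: (1/L)^(1/k)."""
    return (1.0 / L) ** (1.0 / k)


if __name__ == "__main__":
    # Illustrative (k, L) pairs only; real settings should be validated on data.
    for k, L in [(1, 100), (5, 20), (10, 10), (20, 5)]:
        probs = ", ".join(
            f"s={s:.1f}: {candidate_probability(s, k, L):.3f}"
            for s in (0.5, 0.8, 0.9)
        )
        print(f"k={k:2d} L={L:3d} threshold~{approximate_threshold(k, L):.2f}  {probs}")
```

For a target similarity such as 0.8, one approach is to look for a (k, L) pair whose curve is high at 0.8 and low well below it, then confirm recall and precision empirically on a labeled sample.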

This pattern of OR and AND amplification applies to all of the LSH algorithms used in Elastiknn. LSH and amplification are covered more thoroughly here: https://elastiknn.com/posts/tour-de-elastiknn-august-2021/