What more advantageous minhash over simhash?

1.6k Views Asked by At

I am working with simhash but also see minhash is more effective.
But I don't understand.
Please explain for me: What more advantageous minhash over simhash ?

2

There are 2 best solutions below

0
On

In simhash, we do not need to store hyperplanes. It has slightly worse error bounds.Simhash lecture

3
On

Simhash is faster and typically has smaller memory requirements than minhash, but it is limited by the fact that it can only detect very close similarities. If two items differ more than a small amount, their similarity will not be detected. Minhash, on the other hand, can be used to detect even quite distant similarities, such as items that have only 5% similarity to each other. Simhash is also a little more complex to understand.

Minhash relies on generating multiple hashes per item, e.g. commonly somewhere between 20 and 400 64-bit hashes. These hashes all need to be stored, along with the ID of the item they belong to, indexed by hash. To find all items that have e.g. 50% estimated similarity to a given item, you must find all other items that share at least 50% of the given item's hashes. This may involve enumerating a fairly large number of hash-itemID pairs.

Simhash, on the other hand, only uses a single hash per item, e.g. a 64-bit hash; and this hash is generated such that very similar items will have hashes with very similar bit-patterns. This hash must be stored (along with the item's ID) in multiple tables (e.g. 8 different tables), each table permuting the hash's bits in different ways, and each table sorting the permuted hashes in numeric order. Using multiple tables enables a clever trick whereby you can rapidly find all hashes that differ by at most k bits from a given hash; the trouble is that k cannot be large: depending on how many items you expect to store, how many bits are in the entire hash, and how many tables you can keep in memory, k might be as low as 3 or possibly as high as 6 or 7. See this explanation of simhash.

Minhash and simhash both depend for their speed on having their tables held in main memory, though both are capable of being split across multiple machines if you need to overcome memory restrictions. The method of creating a simhash is covered by a patent held by Google, though they seem to permit at least non-commercial use of the algorithm.