SimHash-like algorithm to compare two text documents


The problem: I have a collection of text documents, and I want to pick the one most similar to an input document. The input may be an exact match for a document in the collection or a partly modified version of one. The algorithm must be very fast.

So far, I have found SimHash, which computes a fingerprint for each document in the collection. Are there other algorithms that do the same thing?


There are 2 best solutions below

0
On BEST ANSWER

Have you tried LSH (locality-sensitive hashing) techniques?

0
On

LSH (Locality Sensitive Hashing) techniques are general indexing methods. They are very efficient at finding approximate nearest neighbors.

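SimHash is one hashing scheme for LSH. It approximates cosine similarity over real-valued vectors: documents whose feature vectors have a small angle between them get fingerprints that differ in only a few bits.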
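
To make the fingerprint idea concrete, here is a minimal SimHash sketch in Python. It is only an illustration under some assumptions of mine: tokens are lowercased whitespace-separated words with equal weight, the per-token hash is the first 8 bytes of MD5 (chosen for determinism, not security), and the names `simhash` and `hamming_distance` are hypothetical. A production version would likely weight tokens (for example by TF-IDF) and use word n-grams.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 hash bytes used below


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of the text (unweighted tokens)."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Stable 64-bit hash of the token (assumption: MD5 prefix is good enough here).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the votes for it are positive.
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

With this sketch, you would precompute `simhash(doc)` for every document in the collection and pick the one whose fingerprint has the smallest `hamming_distance` to the fingerprint of the input. An exact copy always has distance 0; a partly modified copy tends to have a small distance.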

MinHash is another hashing scheme for LSH. It estimates resemblance (Jaccard) similarity over binary vectors, i.e. sets such as the set of shingles in a document.
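
As a rough illustration of MinHash, the sketch below builds a signature from character shingles and estimates resemblance as the fraction of matching signature positions. The specifics are my assumptions, not part of any particular library: 5-character shingles, 128 seeded hash functions derived from `hashlib.blake2b`, and the hypothetical helper names `shingles`, `minhash_signature`, `estimate_jaccard`, and `most_similar`.

```python
import hashlib

NUM_HASHES = 128  # more hash functions give a lower-variance estimate


def shingles(text: str, k: int = 5) -> set:
    """Set of overlapping k-character shingles of the normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    # Seeded 64-bit hash; each seed acts as a different hash function.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int = NUM_HASHES) -> list:
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def most_similar(query: str, documents: list) -> str:
    """Return the document whose estimated resemblance to the query is highest."""
    q = minhash_signature(shingles(query))
    return max(
        documents,
        key=lambda d: estimate_jaccard(q, minhash_signature(shingles(d))),
    )
```

In practice you would compute each document's signature once and store it rather than recomputing it per query as `most_similar` does here. For large collections, the signatures are usually split into bands and bucketed so that only likely candidates are compared.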

Chapter 3 of Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman is a good introduction to the problem space and to MinHash in particular.