How to compare the similarity of documents with the Simhash algorithm?


I'm currently creating a program that can compute a near-duplicate score within a corpus of text documents (5000+ docs). I'm using Simhash to generate a unique fingerprint for each document (thanks to this github repo).

My data is:

data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm',
    3: u'This is simhash test.',
}

and this gives me 3 hashes like these:

00100110101110100011111000100010010101011001000001110000111001011100110101001101111010100010001011001011000110000100110101100110

00001001110010000000011000001000110010001010000101010000001100000100100011100100110010100000010000000110001001010110000010000100

10001110101100000100101010000010010001011010001000000000101000101100001100100000110011000000011001000000000110000000100110000000
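
For reference, this is roughly how I understand a fingerprint like these gets built. This is just a from-scratch sketch; the repo I'm using may differ in tokenization and hashing details:

import hashlib

def simhash_128(text):
    # one vote counter per bit position of the 128-bit fingerprint
    bit_votes = [0] * 128
    for token in text.lower().split():
        # hash each token to 128 bits; the token votes +1/-1 on every bit position
        token_hash = int(hashlib.md5(token.encode('utf8')).hexdigest(), 16)
        for i in range(128):
            bit_votes[i] += 1 if (token_hash >> i) & 1 else -1
    # keep a 1 wherever the votes are positive, 0 otherwise
    return ''.join('1' if v > 0 else '0' for v in bit_votes)

print(simhash_128(u'Im testing simhash algorithm.'))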

And now, how do I compare those 3 hashes? I know that I have to split them into blocks, but I don't have the exact method.

What I want to do is output all the duplicated documents (>70% similarity) with their ID and the IDs of their duplicates.

Can someone help?


There are 2 best solutions below

BEST ANSWER

Before I answer your question, it is important to keep in mind:

  1. Simhash is useful because it detects near duplicates: near-duplicate documents end up with fingerprints that differ in only a few bit positions (a small Hamming distance; see the short sketch after this list).
  2. For exact duplicates you can simply use any one-way, consistent hashing mechanism (e.g. MD5).
  3. The examples you pasted here are too small, and given their size their differences are significant. The algorithm is tailored to work with large web documents, not small sentences.
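
To make point 1 concrete, here is a tiny illustrative sketch (not from any particular library) of how two fingerprints are compared; near duplicates differ in only a few bit positions:

def hamming_distance(a, b):
    # number of bit positions where two equal-length bit-string fingerprints differ
    return sum(x != y for x, y in zip(a, b))

# two hypothetical 16-bit fingerprints that differ in 2 positions
print(hamming_distance('1011001010110100', '1011001110110110'))  # -> 2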

Now, I have replied to your question on the Github issue that you raised here.

For reference though, here is some sample code you can use to print the final near duplicate documents after hashing them.

# assuming that you have a dictionary with document id as the key and the document as the value:
# documents = { doc_id: doc } you can do:

from simhash import simhash

def split_hash(hash_str, num):
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

hashes = {}
for doc_id, doc in documents.items():
    doc_hash = simhash(doc)

    # you can either use the whole hash for higher precision or split into chunks for higher recall
    hash_chunks = split_hash(doc_hash, 4)

    for chunk in hash_chunks:
        if chunk not in hashes:
            hashes[chunk] = []
        hashes[chunk].append(doc_id)

# now you can print the duplicate documents:
for chunk, doc_list in hashes.items():
    if len(doc_list) > 1:  # more than one document shares this chunk
        print("Duplicate documents: ", doc_list)

Please let me know if something is not clear.


Further to Memos' answer: if you want to detect >=70% similarity, you cannot use simhash. Simhash only allows quite small Hamming distances to be detected, up to around 6 or 7 bits of difference at a stretch, depending on the size of your corpus. For 70% similarity you would have to allow roughly 19 differing bits (30% of a standard 64-bit fingerprint), which is not feasible in any normal scenario. You should instead look into minhash.
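
As a concrete starting point for the minhash route, here is a minimal sketch. The library choice (datasketch) and the word-level tokenization are my own assumptions, not something prescribed above, and note that word-level Jaccard similarity on sentences as short as the question's examples may well stay below 0.7:

# pip install datasketch  (assumed library, not named above)
from datasketch import MinHash, MinHashLSH

data = {
    1: u'Im testing simhash algorithm.',
    2: u'test of simhash algorithm',
    3: u'This is simhash test.',
}

def make_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf8'))
    return m

# LSH index that returns candidates whose estimated Jaccard similarity is >= 0.7
lsh = MinHashLSH(threshold=0.7, num_perm=128)
minhashes = {doc_id: make_minhash(doc) for doc_id, doc in data.items()}
for doc_id, m in minhashes.items():
    lsh.insert(str(doc_id), m)

for doc_id, m in minhashes.items():
    matches = [other for other in lsh.query(m) if other != str(doc_id)]
    if matches:
        print(doc_id, "is similar to", matches)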

If you're interested, here's an extensive explanation of simhash.