Detect near-duplicate documents using simhash


I found this Python project on GitHub, but when I try to use it for my purpose, detecting near-duplicate documents (e.g. JSON), the README.md file doesn't give me enough information on how to do that. It only shows how to compute hashes and compare them:

import simhash

a = simhash.compute(...) 
b = simhash.compute(...)
simhash.num_differing_bits(a, b)

and how to find matches using:

import simhash
hashes = []
blocks = 4
distance = 3
matches = simhash.find_all(hashes, blocks, distance)

What I've tried so far: after cloning the repo, I installed all the requirements, but when I try to run setup.py or bench.py it shows:

ImportError: No module named simhash.simhash

This project is awesome, but I'm having difficulty because the README.md file doesn't describe how to create hashes of documents, how to pass those hashes in, or how to detect near-duplicates. How can I make hashes of my documents? Can anyone help me implement near-duplicate document detection with this simhash library in Python, or point me to a step-by-step tutorial? By the way, I've seen that, but it doesn't contain the full steps to implement it.

There is 1 best solution below

Try installing the package directly from GitHub with pip:

pip install git+https://github.com/seomoz/simhash-py.git

Also, dlecocq has posted a more detailed description in an issue. Here is the link:

https://github.com/seomoz/simhash-py/issues/47
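
To make the overall workflow concrete (turn each document into features, hash the features, fold them into one 64-bit fingerprint, then compare fingerprints by differing bits), here is a minimal pure-Python sketch. It does not use simhash-py at all: every function name below (canonical_text, shingles, hash64, simhash64, differing_bits, near_duplicates) is hypothetical and chosen only for illustration, and the choice of MD5 and 4-word shingles is an assumption, not something taken from the project. The pairwise comparison is a brute-force stand-in for what find_all does more efficiently.

```python
import hashlib
import json
from itertools import combinations


def canonical_text(doc):
    """Serialize a JSON-like object deterministically so key order does not matter."""
    return json.dumps(doc, sort_keys=True, ensure_ascii=False)


def shingles(text, width=4):
    """Split text into overlapping runs of `width` whitespace-separated tokens."""
    tokens = text.lower().split()
    if len(tokens) < width:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def hash64(feature):
    """Map one feature string to an unsigned 64-bit integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(features):
    """Fold feature hashes into one 64-bit fingerprint by per-bit majority vote."""
    counts = [0] * 64
    for feature in features:
        h = hash64(feature)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def differing_bits(a, b):
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")


def near_duplicates(fingerprints, distance=3):
    """Brute-force O(n^2) search for index pairs within `distance` differing bits."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(fingerprints), 2)
        if differing_bits(a, b) <= distance
    ]


if __name__ == "__main__":
    docs = [
        {"title": "Simhash intro", "body": "detecting near duplicate documents with simhash"},
        {"body": "detecting near duplicate documents with simhash", "title": "Simhash intro"},
        {"title": "Unrelated", "body": "a completely different piece of text about cooking pasta"},
    ]
    fingerprints = [simhash64(shingles(canonical_text(d))) for d in docs]
    print(near_duplicates(fingerprints, distance=3))
```

The first two sample documents differ only in key order, so they serialize to the same text and get identical fingerprints. For real data, the distance threshold and shingle width usually need tuning on a sample of known duplicates.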