I found this Python project on GitHub, but I can't work out from its README.md how to use it for my purpose, which is detecting near-duplicate documents (e.g. JSON). The README only shows how to compute hashes:
import simhash
a = simhash.compute(...)
b = simhash.compute(...)
simhash.num_differing_bits(a, b)
and how to find matches using:
import simhash
hashes = []
blocks = 4
distance = 3
matches = simhash.find_all(hashes, blocks, distance)
What I've tried so far: after cloning the repo I installed all the requirements, but when I try to run setup.py or bench.py, I get:
ImportError: No module named simhash.simhash
This project is awesome, but I'm stuck because the README.md does not describe how to create hashes of documents, how to pass those hashes in, or how to detect near-duplicates. How can I make hashes of my documents? Can anyone show me how to implement near-duplicate document detection with this simhash library in Python, or point me to a step-by-step tutorial? By the way, I've seen that, but it doesn't contain the full steps to implement it.
Try this
dlecocq has also posted a more detailed explanation in this issue:
https://github.com/seomoz/simhash-py/issues/47
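In case it helps to see the overall flow end to end, here is a minimal pure-Python sketch. It does not use simhash-py at all; it is a stand-in I wrote so the steps are visible: break each document into overlapping word shingles, hash each shingle to a 64-bit integer, combine those into one 64-bit simhash, then compare simhashes by the number of differing bits. All of the function names (`shingles`, `feature_hash`, `simhash64`, `document_hash`, `find_near_duplicates`), the shingle width of 4, and the use of MD5 for feature hashing are my own assumptions for illustration, not the library's API or internals. The brute-force pair comparison is only reasonable for small collections.

```python
import hashlib
import json
from itertools import combinations


def shingles(text, width=4):
    """Split text into overlapping word n-grams (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < width:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


def feature_hash(feature):
    """Map one shingle to a 64-bit unsigned integer (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(features):
    """Combine feature hashes into a 64-bit simhash by per-bit majority vote."""
    counts = [0] * 64
    for feature in features:
        h = feature_hash(feature)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit in range(64):
        if counts[bit] > 0:
            result |= 1 << bit
    return result


def num_differing_bits(a, b):
    """Hamming distance between two 64-bit hashes."""
    return bin(a ^ b).count("1")


def document_hash(doc, width=4):
    """Hash a string, or a JSON-like object serialized with sorted keys."""
    text = doc if isinstance(doc, str) else json.dumps(doc, sort_keys=True)
    return simhash64(shingles(text, width))


def find_near_duplicates(hashes, distance=3):
    """Brute-force: every pair of hashes within `distance` differing bits."""
    return [(a, b) for a, b in combinations(hashes, 2)
            if num_differing_bits(a, b) <= distance]
```

With something like this, you would call `document_hash` on each document, collect the results in a list, and pass that list to `find_near_duplicates`. My understanding is that the `hashes` list given to `simhash.find_all` in the README plays the same role, but check the linked issue for how the library itself expects the hashes to be built.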