How to compare hash values in Python


I want to know how to compare two hash values by some means other than Hamming distance.

Is there a way?

The final goal is to determine a Python dictionary key that similar images can share.

For example:

import imagehash
from PIL import Image

# img1, img2, and img3 show the same image
img1_hash = imagehash.average_hash(Image.open('data/image1.jpg'))
img2_hash = imagehash.average_hash(Image.open('data/image2.jpg'))
img3_hash = imagehash.average_hash(Image.open('data/image3.jpg'))
img4_hash = imagehash.average_hash(Image.open('data/image4.jpg'))
print(img1_hash, img2_hash, img3_hash, img4_hash)
# Output: 81c38181bf8781ff 81838181bf8781ff 81838181bf8781ff ff0000ff3f00e7ff

This is the result I want to print:

{common value1 : [81c38181bf8781ff, 81838181bf8781ff, 81838181bf8781ff], common value2: [ff0000ff3f00e7ff]}

I tried converting the images to hash values and comparing those, but please let me know if there is another way that does not involve converting to a hash value.


There is 1 best solution below


You could use any string distance metric, such as one from rapidfuzz, and feed it into a clustering algorithm.

Make sure to pip install rapidfuzz (and scikit-learn, which provides dbscan), then:

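from rapidfuzz import fuzz
import numpy as np
from sklearn.cluster import dbscan

hashes = ["81c38181bf8781ff", "81838181bf8781ff", "81838181bf8781ff", "ff0000ff3f00e7ff"]

# dbscan needs numeric input, so cluster the indices of the hashes
# and look the strings up inside the metric.
X = np.arange(len(hashes)).reshape(-1, 1)

def rapidfuzz_dist(x, y):
    i, j = int(x[0]), int(y[0])
    # fuzz.ratio is a similarity in [0, 100]; turn it into a distance in [0, 1]
    return 1 - (fuzz.ratio(hashes[i], hashes[j]) / 100)

# dbscan returns (core sample indices, cluster label per sample)
core_samples, clusters = dbscan(X, metric=rapidfuzz_dist, eps=.5, min_samples=1)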

This creates the clusters, and you can print them in roughly the format your question asks for with

for cluster in sorted(set(clusters)):
    print(f"cluster: {cluster}:")
    print([h for h, c in zip(hashes, clusters) if c == cluster])

to get

cluster: 0:
['81c38181bf8781ff', '81838181bf8781ff', '81838181bf8781ff']
cluster: 1:
['ff0000ff3f00e7ff']
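
If you want an actual dictionary like the one in the question, here is a minimal sketch of one way to build it. It is not the DBSCAN approach above: it swaps in a simple greedy grouping and uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so it runs without extra installs. The function name `group_similar`, the `max_dist` threshold of 0.5, and the choice of the first hash in each group as the "common value" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def group_similar(hashes, max_dist=0.5):
    """Greedily group hash strings (hypothetical helper).

    Each group's dictionary key is the first hash that started the group.
    A hash joins the first group whose key is within max_dist of it.
    """
    groups = {}
    for h in hashes:
        for key in groups:
            # ratio() is a similarity in [0, 1]; convert it to a distance
            dist = 1 - SequenceMatcher(None, key, h).ratio()
            if dist <= max_dist:
                groups[key].append(h)
                break
        else:
            # no existing group is close enough, so start a new one
            groups[h] = [h]
    return groups

hashes = ["81c38181bf8781ff", "81838181bf8781ff", "81838181bf8781ff", "ff0000ff3f00e7ff"]
groups = group_similar(hashes)
print(groups)
```

With the four hashes from the question this gives two keys, one per group of similar images. Because the grouping is greedy, the result can depend on the order of the input, which is something DBSCAN handles more robustly.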