I'm trying to group similar documents whose keyword field values match and whose related images have matching phashes. At the moment I have the following, which works well for exactly matching phashes:
    'duplicate_docs':
        A('terms',
          script={
              "lang": "painless",
              "inline": "def term = doc['make'] + '' +doc['model'] + doc['province'] + doc['mileage'];return term+''+doc['image_hash'];"
          }),
    }, {'dup_docs': A('top_hits', size=20)}):
However, some of the images are slightly different, and the whole point of phash is that you can use a Hamming distance to measure how different two images are. Is there a way to group on near-matching phashes as well?
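For reference, the Hamming distance between two phashes is just the number of bits in which they differ. A minimal Python sketch, assuming the hashes are stored as equal-length hex strings (the function name and the sample values are made up for illustration):

```python
def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical hashes that differ only in the last hex digit (f = 1111, c = 1100).
print(hamming_distance("a1b2c3d4e5f6071f", "a1b2c3d4e5f6071c"))  # prints 2
```

Two images would then count as "the same" when this distance is at or below some small threshold, which is exactly what a terms aggregation on the raw hash string cannot express.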
I realise that grouping by Hamming distance probably makes the calculation vastly more expensive, since it essentially means comparing every image against every other image, which seems excessive, but I'm unsure how else to go about this. Thanks.
You may want to try this out:
Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.
The FENSHSES method proposed in the paper above can efficiently find all r-neighbors in Hamming space without scanning all documents.
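To give a feel for how a search can avoid scanning every document, here is a rough sketch of a generic pigeonhole filter. It is not taken from the paper and may differ from what FENSHSES actually does. The idea: if two hashes differ in at most r bits and each hash is split into r + 1 chunks, at least one chunk must be identical, so exact chunk matches yield a small candidate set that is then verified with the real Hamming distance. All names below are hypothetical, and the hashes are assumed to be equal-length hex strings with at least r + 1 digits.

```python
from collections import defaultdict


def split_chunks(hash_hex, parts):
    """Split a hex hash into `parts` chunks; the last chunk takes any remainder."""
    size = len(hash_hex) // parts
    chunks = [hash_hex[i * size:(i + 1) * size] for i in range(parts - 1)]
    chunks.append(hash_hex[(parts - 1) * size:])
    return chunks


def hamming(a, b):
    """Number of differing bits between two equal-length hex hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def near_duplicates(hashes, r):
    """Return the set of (id, id) pairs whose hashes differ by at most r bits."""
    parts = r + 1  # at most r differing bits leave at least one chunk untouched
    buckets = defaultdict(list)
    for doc_id, h in hashes.items():
        for pos, chunk in enumerate(split_chunks(h, parts)):
            buckets[(pos, chunk)].append(doc_id)

    pairs = set()
    for ids in buckets.values():
        # Only documents sharing an identical chunk are compared in full.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if hamming(hashes[a], hashes[b]) <= r:
                    pairs.add((min(a, b), max(a, b)))
    return pairs


# Made-up sample data: doc2 is 2 bits away from doc1, doc3 is unrelated.
sample = {
    "doc1": "a1b2c3d4e5f6071f",
    "doc2": "a1b2c3d4e5f6071c",
    "doc3": "0000000000000000",
}
print(near_duplicates(sample, r=3))  # prints {('doc1', 'doc2')}
```

In Elasticsearch the chunks could presumably be indexed as separate keyword fields, so that candidate lookup becomes an exact-match query and only the candidates need the bit-level comparison. That mapping is an assumption on my part, so check the paper for the actual scheme.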