Elasticsearch Aggregation with hamming distance of a phash


I am trying to group together similar documents by matching keyword field values and the phashes of their related images. At the moment I have the following, which works well for exactly matching phashes:

    'duplicate_docs':
        A('terms',
          script={
              "lang": "painless",
              "inline": "def term = doc['make'] + '' +doc['model'] + doc['province'] + doc['mileage'];return term+''+doc['image_hash'];"
          }),
    }, {'dup_docs': A('top_hits', size=20)}):

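However, some of the images are slightly different, and the whole point of phash is that you can use a Hamming distance to figure out how different two images are. How can I group documents whose phashes are within a small Hamming distance of each other, rather than only exact matches?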
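For reference, the Hamming distance between two hashes is the number of bit positions in which they differ. Below is a minimal sketch, assuming `image_hash` is stored as a hex string of fixed length (the question does not say how it is encoded). The function name and sample values are hypothetical.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Hypothetical 64-bit phashes that differ only in the lowest bit.
a = "ffd8e0c0c0e0f0f8"
b = "ffd8e0c0c0e0f0f9"
print(hamming_distance(a, b))  # prints 1
```

Two images would then count as near-duplicates when this value is at or below a chosen threshold.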

I realise this probably makes the calculation vastly more expensive, as I would essentially need to compare every image against every other image, which seems excessive, but I am unsure how else I could go about this. Thanks

There is 1 solution below

You may want to try this out:

Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.

The FENSHSES method proposed in the above paper can efficiently find all r-neighbors in Hamming space without scanning all documents.
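To give a feel for how r-neighbors can be found without comparing every pair, here is a sketch of a general pigeonhole filter. It is not the paper's implementation, and it may differ from FENSHSES in its details. The idea: if two codes differ in at most r bits and each code is split into r + 1 sub-codes, at least one sub-code must match exactly. Only documents sharing a sub-code need a full comparison. The names `split_bits` and `near_pairs`, the 64-bit width, and the sample hashes are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def split_bits(code: int, n_bits: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split an n_bits-wide integer into n_chunks (position, value) sub-codes."""
    base, extra = divmod(n_bits, n_chunks)
    chunks = []
    shift = 0
    for i in range(n_chunks):
        width = base + (1 if i < extra else 0)
        chunks.append((i, (code >> shift) & ((1 << width) - 1)))
        shift += width
    return chunks


def near_pairs(hashes: Dict[str, int], r: int, n_bits: int = 64) -> Set[Tuple[str, str]]:
    """Return pairs of ids whose hashes differ in at most r bits."""
    # Index every document under each of its r + 1 sub-codes.
    index = defaultdict(list)
    for doc_id, code in hashes.items():
        for key in split_bits(code, n_bits, r + 1):
            index[key].append(doc_id)

    # Only documents sharing a sub-code are candidates; verify those exactly.
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if bin(hashes[a] ^ hashes[b]).count("1") <= r:
                pairs.add((a, b))
    return pairs


# Hypothetical sample data: "a" and "b" differ in one bit, "c" is unrelated.
hashes = {
    "a": 0xFFD8E0C0C0E0F0F8,
    "b": 0xFFD8E0C0C0E0F0F9,
    "c": 0x0123456789ABCDEF,
}
print(near_pairs(hashes, r=2))  # prints {('a', 'b')}
```

On the Elasticsearch side, the sub-codes could presumably be indexed as keyword fields so that an exact-match filter does the candidate lookup. That mapping is an assumption and is not shown here.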