How to always recommend different documents (files) in Elasticsearch


I have a service that recommends documents (files) relevant to the user's current context. It uses the Elasticsearch more_like_this query in combination with filters (see the query below). The documents are uploaded by users, and a public document can be recommended to other users. This works fine, except when two or more users upload the same file: Elasticsearch then holds several instances of the same document, and it is very likely that two or more of them will be recommended together.

Does anyone have an idea how I could force Elasticsearch to ignore these duplicates and return only one instance of each file?

POST _search
{
  "query": {
    "filtered": {
      "query": {
        "mlt": {
          "fields": ["file"],
          "like_text": "Some sample text here",
          "min_term_freq": 1,
          "max_query_terms": 1,
          "min_doc_freq": 1
        }
      },
      "filter": {
        "or": {
          "filters": [
            {"term": {"visibility": "public"}},
            {
              "and": {
                "filters": [
                  {"term": {"visibility": "private"}},
                  {"term": {"ownerId": 2}}
                ]
              }
            }
          ]
        }
      }
    }
  },
  "fields": ["id", "title", "visibility", "ownerId", "contentType", "dateCreated", "url"]
}

Edited:

I solved the first part of this problem. I use Tika to extract the content from a web page or text document, then pass it as the like text of a More Like This query to find the most similar documents. Those scoring higher than 0.9 are marked as duplicates. For this I use a new field, "uniqueness", which holds a UUID. If a new document to be indexed is a duplicate, I copy the "uniqueness" value of the document it duplicates; if there are no duplicates, I generate a new "uniqueness" value for it.
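The indexing-time rule described above could be sketched roughly as follows. This is only an illustration: the function name `assign_uniqueness`, the shape of `matches` (pairs of similarity value and existing "uniqueness" value), and the assumption that the similarity is comparable against 0.9 are all hypothetical and not taken from the actual service.

```python
import uuid
from typing import Sequence, Tuple

# Cut-off mentioned in the question; anything strictly above it counts as a duplicate.
DUPLICATE_THRESHOLD = 0.9


def assign_uniqueness(
    matches: Sequence[Tuple[float, str]],
    threshold: float = DUPLICATE_THRESHOLD,
) -> str:
    """Return the "uniqueness" value for a document about to be indexed.

    `matches` is assumed to hold (similarity, uniqueness) pairs for the
    already indexed documents returned by the More Like This query.
    """
    # Pick the most similar existing document, if there is one.
    best = max(matches, key=lambda m: m[0], default=None)
    if best is not None and best[0] > threshold:
        # Duplicate found: reuse the existing document's value.
        return best[1]
    # No duplicate: generate a fresh UUID for this document.
    return str(uuid.uuid4())
```

With this rule, every copy of the same file ends up with the same "uniqueness" value, which is what the query side can then group or de-duplicate on.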

However, I still haven't solved the second part of the problem: how to write a query that eliminates these duplicates. Basically, the query above needs an additional part that returns only one instance of the documents sharing the same "uniqueness" value.
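One possible workaround, not a server-side solution, is to request more hits than needed and drop repeats on the client. The sketch below assumes that "uniqueness" has been added to the "fields" list of the request and that each hit carries its values under a "fields" key, possibly wrapped in a list; the helper name `dedupe_hits` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def dedupe_hits(
    hits: Iterable[Dict[str, Any]],
    key: str = "uniqueness",
    limit: int = 10,
) -> List[Dict[str, Any]]:
    """Keep only the first (best-scored) hit for each value of `key`."""
    seen = set()
    unique: List[Dict[str, Any]] = []
    for hit in hits:
        if len(unique) >= limit:
            break
        value = hit.get("fields", {}).get(key)
        # Field values may come back wrapped in a list; unwrap the first item.
        if isinstance(value, list):
            value = value[0] if value else None
        if value is None:
            # No "uniqueness" value: cannot be compared, so keep the hit.
            unique.append(hit)
        elif value not in seen:
            seen.add(value)
            unique.append(hit)
    return unique


if __name__ == "__main__":
    sample = [
        {"_id": "1", "fields": {"uniqueness": ["u1"]}},
        {"_id": "2", "fields": {"uniqueness": ["u1"]}},
        {"_id": "3", "fields": {"uniqueness": ["u2"]}},
    ]
    print([h["_id"] for h in dedupe_hits(sample)])
```

Because hits arrive sorted by score, the copy that survives is the highest-scoring one. Depending on the Elasticsearch version, a server-side alternative may be available, such as a terms aggregation on "uniqueness" with a top_hits sub-aggregation, or field collapsing in later releases.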

Does anybody have an idea how to solve this?


There is 1 best solution below


You can define a "duplicate" field and, during indexing, set it either to "true" or to the id of the duplicate document. Then you can filter out these documents at query time.
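As a rough illustration of the filtering step, the existing visibility filter could be wrapped in an "and" filter, using the same filter syntax as the query in the question. This sketch assumes the boolean variant of the field, with every original indexed with "duplicate": false (a document that lacks the field would not match the term filter); the helper name `exclude_duplicates` is hypothetical.

```python
import json
from typing import Any, Dict


def exclude_duplicates(
    existing_filter: Dict[str, Any],
    flag_field: str = "duplicate",
) -> Dict[str, Any]:
    """Combine an existing filter with a term filter that keeps only originals."""
    return {
        "and": {
            "filters": [
                existing_filter,
                # Assumption: originals are indexed with duplicate = false.
                {"term": {flag_field: False}},
            ]
        }
    }


if __name__ == "__main__":
    visibility = {"term": {"visibility": "public"}}
    print(json.dumps(exclude_duplicates(visibility), indent=2))
```

One thing to check with this approach: if the only copy not flagged as a duplicate is private to another user, the filter would hide the file from everyone else, so the flag may need to take visibility into account.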