Fetching documents based on priority list in ElasticSearch

The docs in my index have the following fields:

{
  "weight" : int
  "tags" : string[]
}

tags is a list of strings, e.g. ["A", "B", "C", "D"]. Let's assume my index has the following data:

[
    {
        "weight": 1,
        "tags": [
            "B",
            "C"
        ]
    },
    {
        "weight": 2,
        "tags": [
            "A"
        ]
    },
    {
        "weight": 3,
        "tags": [
            "B"
        ]
    },
    {
        "weight": 4,
        "tags": [
            "A",
            "C"
        ]
    },
    {
        "weight": 5,
        "tags": [
            "C"
        ]
    }
]

I have a param priority = ["A", "C"] and want to fetch documents based on that priority list. Since "A" appears first in the list, the docs with tag "A" should appear first in the output. If two docs share the same tag, the one with the greater weight should appear first. So the output should be:

[
    {
        "weight": 4,
        "tags": [
            "A",
            "C"
        ]
    },
    {
        "weight": 2,
        "tags": [
            "A"
        ]
    },
    {
        "weight": 5,
        "tags": [
            "C"
        ]
    },
    {
        "weight": 1,
        "tags": [
            "B",
            "C"
        ]
    }
]
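
To make the rule concrete, here is a small Python sketch of the ordering I am after, applied client-side to the sample data. The function name order_by_priority is made up for illustration, and I am assuming that docs carrying none of the priority tags are dropped, as in the expected output above.

```python
def order_by_priority(docs, priority):
    """Keep docs with at least one priority tag; order them by their
    best-ranked tag first, then by weight in descending order."""
    rank = {tag: i for i, tag in enumerate(priority)}

    def best_rank(doc):
        # Position of the highest-priority tag this doc carries, or None.
        ranks = [rank[t] for t in doc["tags"] if t in rank]
        return min(ranks) if ranks else None

    matched = [d for d in docs if best_rank(d) is not None]
    return sorted(matched, key=lambda d: (best_rank(d), -d["weight"]))


docs = [
    {"weight": 1, "tags": ["B", "C"]},
    {"weight": 2, "tags": ["A"]},
    {"weight": 3, "tags": ["B"]},
    {"weight": 4, "tags": ["A", "C"]},
    {"weight": 5, "tags": ["C"]},
]

ordered = order_by_priority(docs, ["A", "C"])
print([d["weight"] for d in ordered])  # [4, 2, 5, 1]
```

What I am looking for is a way to get the same ordering from the search itself rather than sorting on the client.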

Can we achieve this in ElasticSearch? I have also heard about Painless scripts. If they can be used here, how?

There is 1 solution below

The first thing to know is that the values of the tags array are not necessarily indexed in the order in which they appear in the source. Usually lexical order prevails. That happens to work with simple letters like A, B and C, but your real tags might not be listed in lexical order. In short, you cannot rely on the order of the tags list to boost certain documents relative to others.

Similarly, if you listed the values in a terms clause in priority order (as in priority = ["A", "C"]) hoping to give A more importance than C, it would not help: ES does not take the order of the values into account when executing or scoring that query.

The solution below respects the conceptual ordering of your priority list by using a bool/should query in which the first element gets a bigger boost than the second, the second a bigger boost than the third, and so on. Here we want A to rank above C, so documents with tag A get a boost of 2 and documents with tag C a boost of 1. With three tags you would start at 3 instead. The boost multiplies each clause's relevance score rather than replacing it, so the scores you get back will not be exactly 2 and 1, but on your sample data this ranks the documents according to your priorities.
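
If the priority list is a runtime parameter, the should clauses can be generated rather than written by hand. The Python below is only a sketch of that boost assignment: build_priority_query is a made-up helper name, and it assumes the tags are indexed under a tags.keyword sub-field (what the default dynamic mapping creates for string values).

```python
import json


def build_priority_query(priority, field="tags.keyword"):
    """Build a bool/should query where the i-th tag of `priority`
    gets boost len(priority) - i, so earlier tags get larger boosts."""
    n = len(priority)
    should = [
        {"term": {field: {"value": tag, "boost": n - i}}}
        for i, tag in enumerate(priority)
    ]
    return {"bool": {"should": should}}


query = build_priority_query(["A", "C"])
print(json.dumps(query, indent=2))
```

The resulting dictionary is what goes under the "query" key of the search request body.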

The next part is to account for documents having equal score, and for this we can simply sort by descending weight:

GET prio/_search
{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "weight": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "tags.keyword": {
              "value": "A",
              "boost": 2
            }
          }
        },
        {
          "term": {
            "tags.keyword": {
              "value": "C",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

The above query, when executed over your sample set of documents, would yield the results you expect:

"hits": [
  {
    "_index": "prio",
    "_id": "4",
    "_score": 2.5930133,
    "_source": {
      "weight": 4,
      "tags": [
        "A",
        "C"
      ]
    },
    "sort": [
      2.5930133,
      4
    ]
  },
  {
    "_index": "prio",
    "_id": "2",
    "_score": 1.9826791,
    "_source": {
      "weight": 2,
      "tags": [
        "A"
      ]
    },
    "sort": [
      1.9826791,
      2
    ]
  },
  {
    "_index": "prio",
    "_id": "5",
    "_score": 0.6103343,
    "_source": {
      "weight": 5,
      "tags": [
        "C"
      ]
    },
    "sort": [
      0.6103343,
      5
    ]
  },
  {
    "_index": "prio",
    "_id": "1",
    "_score": 0.6103343,
    "_source": {
      "weight": 1,
      "tags": [
        "B",
        "C"
      ]
    },
    "sort": [
      0.6103343,
      1
    ]
  }
]
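
One thing to keep in mind: the boosts multiply a relevance score that also depends on how rare each tag is in the index. In the results above, A with boost 2 scores about 1.98 while C with boost 1 scores about 0.61, which is not a 2:1 ratio, so on a different data set the gap between priorities could shrink or grow. Also, with linear boosts such as 3, 2, 1, a document matching only the second and third tags adds up to the same total boost as one matching only the first.

If that matters to you, one possible variation (a suggestion of mine, not something I have run against your index) is to wrap each term in a constant_score query and use powers of two as boosts, so that a higher-priority tag always outweighs any combination of lower-priority ones. In the sketch below, build_strict_priority_query and simulate_score are made-up names, and simulate_score only mimics locally how a bool/should query adds up the constant scores of its matching clauses.

```python
def build_strict_priority_query(priority, field="tags.keyword"):
    """bool/should of constant_score clauses with boosts 2^(n-1), ..., 2, 1."""
    n = len(priority)
    should = [
        {"constant_score": {"filter": {"term": {field: tag}},
                            "boost": 2 ** (n - 1 - i)}}
        for i, tag in enumerate(priority)
    ]
    return {"bool": {"should": should}}


def simulate_score(tags, priority):
    """Local stand-in for the score: sum of the boosts of the matching clauses."""
    n = len(priority)
    return sum(2 ** (n - 1 - i) for i, tag in enumerate(priority) if tag in tags)


docs = [
    {"weight": 1, "tags": ["B", "C"]},
    {"weight": 2, "tags": ["A"]},
    {"weight": 3, "tags": ["B"]},
    {"weight": 4, "tags": ["A", "C"]},
    {"weight": 5, "tags": ["C"]},
]
priority = ["A", "C"]

# A should-only bool query needs at least one matching clause, so score 0 is dropped.
hits = [d for d in docs if simulate_score(d["tags"], priority) > 0]
# Same sort as in the request: score descending, then weight descending.
hits.sort(key=lambda d: (-simulate_score(d["tags"], priority), -d["weight"]))
print([d["weight"] for d in hits])  # [4, 2, 5, 1]
```

Note that, like the original query, this still ranks a document matching both A and C above one matching only A, whatever their weights.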