Elasticsearch ranking shorter/less relevant titles first


I'm working on a product search with Elasticsearch 7.3. The product titles are not formatted the same but there is nothing I can do about this.

Some titles might look like this:

Ford Hub Bearing

And others like this:

Hub bearing for a Chevrolet Z71 - model number 5528923-01

If someone searches for "Chevrolet Hub Bearing" the "Ford Hub Bearing" product ranks #1 and the Chevrolet part ranks #2. If I remove all the extra text (model number 5528923-01) from the product title, the Chevrolet part ranks #1 as desired.

Unfortunately I am unable to fix the product titles, so I need to be able to rank the Chevrolet part as #1 when someone searches Chevrolet Hub Bearing. I have simply set the type of name to text and applied the standard analyzer in my index. Here is my query code:

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["name"],
            "query": "Chevrolet Hub Bearing"
          }
        }
      ]
    }
  }
}

There are 3 best solutions below

0
On BEST ANSWER

Elasticsearch uses the field length in its scoring formula (the BM25 algorithm). That's why the longer document ends up in second position even though it matches more terms.

I recommend reading these two excellent blog posts about BM25: how-shards-affect-relevance-scoring-in-elasticsearch and the-bm25-algorithm-and-its-variables.

You can, however, tweak the BM25 algorithm to avoid this behavior. The Elasticsearch documentation on the BM25 similarity, and a post explaining how to do it, cover the details. The documentation describes BM25 like this:

TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:

k1 => Controls non-linear term frequency normalization (saturation). The default value is 1.2.

b => Controls to what degree document length normalizes tf values. The default value is 0.75.

discount_overlaps => Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
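To make the effect of b concrete, here is a minimal Python sketch of the term-frequency part of BM25 applied to the two titles from the question. It is not Elasticsearch's actual implementation: the tokenizer is a rough stand-in for the standard analyzer, the average field length is computed over these two titles only, and idf is fixed at 1.0 for every term so that only length normalization affects the order. The helper names tokenize and bm25_tf_score are made up for this illustration.

```python
import re


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_tf_score(query, doc, avg_len, k1=1.2, b=0.75, idf=1.0):
    """Sum the BM25 term-frequency component over the query terms.

    idf is fixed at 1.0 for every term (an assumption), so only the
    length normalization controlled by b can change the ranking.
    """
    doc_tokens = tokenize(doc)
    # b = 0 makes this factor 1.0 regardless of the field length.
    length_norm = 1.0 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in tokenize(query):
        tf = doc_tokens.count(term)
        if tf:  # terms missing from the document contribute nothing
            score += idf * tf / (tf + k1 * length_norm)
    return score


short_doc = "Ford Hub Bearing"
long_doc = "Hub bearing for a Chevrolet Z71 - model number 5528923-01"
query = "Chevrolet Hub Bearing"

# Average field length over this two-document toy corpus (an assumption).
avg_len = (len(tokenize(short_doc)) + len(tokenize(long_doc))) / 2

for b in (0.75, 0.0):
    ranked = sorted(
        [short_doc, long_doc],
        key=lambda d: bm25_tf_score(query, d, avg_len, b=b),
        reverse=True,
    )
    print(f"b={b}: top hit is {ranked[0]!r}")
```

Under these assumptions the short Ford title ranks first with the default b of 0.75 even though it matches only two of the three terms, and the Chevrolet title ranks first once b is 0. In a real index the idf of each term also contributes, so the exact scores will differ.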

So you should configure a new similarity in your index settings, like this:

PUT <index>
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "similarity": {
        "my_bm25_without_length_normalization": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "my_bm25_without_length_normalization"
      }
    }
  }
}

Then it will stop penalizing longer names in the scoring. Length normalization is kept for other fields.

2
On

I have just 2 recommendations at first glance:

1. Use the english analyzer on that field. I believed the distance between terms in your query impacts the scoring of documents, but I was wrong (edit: as pointed out by @Pierre Mallet, this is not the case in multi_match). The standard analyzer also keeps words like "for" and "a" as tokens, which probably lowers the score of the document.

2. If you have anything like a description or detail field, you could add it to your multi_match fields list and tweak the scoring of the fields using ^2 to boost one field over another (relevancy of the name is more important than relevancy of the description, but the content of the description could be a nice tie breaker on some results). See the following example:

"multi_match": {
  "query": "open source",
  "fields": [
    "title^2",
    "content"
  ]
}

You could also explore the type parameter of multi_match, which affects how the results are scored. See the multi_match documentation for more details.
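As a rough illustration of the first recommendation, the Python sketch below shows how dropping stop words shortens the long title before field length enters the score. The stop-word set is a small illustrative subset chosen for this example, not the analyzer's actual list, and the real english analyzer also applies stemming, which this sketch ignores. The helper names are hypothetical.

```python
import re

# Small illustrative subset of English stop words (an assumption, not the
# analyzer's real list).
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def drop_stop_words(tokens):
    """Remove stop words, roughly what a stop-word filter does at index time."""
    return [t for t in tokens if t not in STOP_WORDS]


title = "Hub bearing for a Chevrolet Z71 - model number 5528923-01"
tokens = tokenize(title)
kept = drop_stop_words(tokens)

print(f"{len(tokens)} tokens before, {len(kept)} after: {kept}")
```

The field is still much longer than "Ford Hub Bearing" after filtering, so this alone may not be enough to change the ranking.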

0
On

I would recommend setting the operator parameter of multi_match to and:

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["name"],
            "query": "Chevrolet Hub Bearing",
            "operator": "and"
          }
        }
      ]
    }
  }
}

The and operator requires every word of the search phrase to appear in a matching document, so "Ford Hub Bearing" no longer matches this query at all. This setting alone should give you the desired results.
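The difference between the default or behavior and and can be sketched in a few lines of Python. This only models which documents match, not how they are scored. The tokenizer is a rough stand-in for the standard analyzer and the matches helper is hypothetical.

```python
import re


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, doc, operator="or"):
    """Return True if doc would match query under the given operator."""
    query_terms = set(tokenize(query))
    doc_terms = set(tokenize(doc))
    if operator == "and":
        # Every query term must be present in the document.
        return query_terms <= doc_terms
    # Default "or": a single shared term is enough.
    return bool(query_terms & doc_terms)


query = "Chevrolet Hub Bearing"
docs = [
    "Ford Hub Bearing",
    "Hub bearing for a Chevrolet Z71 - model number 5528923-01",
]

for operator in ("or", "and"):
    hits = [d for d in docs if matches(query, d, operator)]
    print(f"{operator}: {hits}")
```

The trade-off is that a query containing a word that appears in no title returns nothing under and, whereas or would still return partial matches.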