I'm working on a product search with Elasticsearch 7.3. The product titles are not formatted the same but there is nothing I can do about this.
Some titles might look like this:
Ford Hub Bearing
And others like this:
Hub bearing for a Chevrolet Z71 - model number 5528923-01
If someone searches for "Chevrolet Hub Bearing" the "Ford Hub Bearing" product ranks #1 and the Chevrolet part ranks #2. If I remove all the extra text (model number 5528923-01) from the product title, the Chevrolet part ranks #1 as desired.
Unfortunately I am unable to fix the product titles, so I need to be able to rank the Chevrolet part as #1 when someone searches "Chevrolet Hub Bearing". I have simply set the type of the name field to text and applied the standard analyzer in my index. Here is my query code:
{
  query: {
    bool: {
      must: [
        {
          multi_match: {
            fields: ['name'],
            query: "Chevrolet Hub Bearing"
          }
        }
      ]
    }
  }
}
Elasticsearch scores matches with the BM25 algorithm, which takes the field length into account: a term match in a long field counts for less than the same match in a short field. That is why the longer document ends up in second position even though it matches more terms.
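To make the effect of field length concrete, here is the textbook form of the BM25 contribution of a single query term q to a document D. The symbols are the usual ones and are not taken from the question: f(q, D) is the term frequency, |D| the field length in terms, and avgdl the average field length in the index. Lucene's actual implementation may differ by constant factors, so treat this as a sketch of the shape rather than the exact code path.

```latex
\[
\operatorname{score}(q, D) =
  \operatorname{IDF}(q)\cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
\]
% Elasticsearch defaults: k_1 = 1.2, b = 0.75.
% With b = 0 the denominator reduces to f(q, D) + k_1,
% so the score no longer depends on the field length |D|.
```

The parameter b controls how strongly length is normalized: the larger |D| is relative to avgdl, the larger the denominator and the smaller the score, unless b is 0.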
I recommend reading these two blog posts about BM25: how-shards-affect-relevance-scoring-in-elasticsearch and the-bm25-algorithm-and-its-variables.
You can, however, tweak the BM25 parameters to avoid this behavior. The Elasticsearch documentation on the BM25 similarity describes the available parameters, and there is also a post explaining how to do it.
So you should configure a new similarity in your index settings, one that turns off length normalization, and assign it to the name field.
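A minimal sketch of such a configuration, assuming it is sent as the body of a create-index request (for example PUT /products). The index name products and the similarity name bm25_no_length_norm are hypothetical; only the name field, its text type, and the standard analyzer come from the question. Setting b to 0 is what disables length normalization; k1 is left at its default.

```json
{
  "settings": {
    "index": {
      "similarity": {
        "bm25_no_length_norm": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "similarity": "bm25_no_length_norm"
      }
    }
  }
}
```

The query itself does not need to change. Only fields that reference the custom similarity in their mapping are affected.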
Then it will stop penalizing longer names in the scoring. Length normalization will be kept for the other fields.