Elasticsearch - Search that returns too many bad results


I have an Elasticsearch setup that works, but its matching is far too broad: it returns many results for terms that have nothing to do with the query. I'm looking for a way to refine these results.

On a sample of fake text, when I search for the term music, the terms that come out in the highlights are:
must, much, alice, inside, patriotic, noticed

I suspect the ngram filter is what hurts me here, but I think I really need it for a better search.

Here is my configuration:

{
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "mySnowball", "myNgram"]
            },
            "default_search": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "mySnowball", "myNgram"]
            }
        },
        "filter": {
            "mySnowball": {
                "type": "snowball",
                "language": "English"
            },
            "myNgram": {
                "type": "ngram",
                "min_gram": 2,
                "max_gram": 6
            }
        }
    }
}

Here is my request:

{
    "query": {
        "bool": {
            "should": [{
                "match": {
                    "content": "music"
                }
            }, {
                "match": {
                    "url": "music"
                }
            }, {
                "match": {
                    "h1": "music"
                }
            }, {
                "match": {
                    "h2": "music"
                }
            }],
            "minimum_should_match": 1
        }
    },
    "min_score": 8
}

My document is quite simple:

content => text,
url => text,
h1 => text,
h2 => text,

And the mapping too:

$configMapping  = [
    'content' => ['type' => 'text', 'boost' => 6],
    'url'     => ['type' => 'text', 'boost' => 6],
    'h1'      => ['type' => 'text', 'boost' => 9],
    'h2'      => ['type' => 'text', 'boost' => 7]
];

I welcome any modification that will allow me to obtain only consistent results.

Answer by Shira Elitzur:

As you said yourself, analyzing with the 'ngram' filter is the reason you get all these unrelated results.

In every result you list, you can see a token that matched the query term 'music'. Two shared characters are enough, since 2 is the min_gram of your n-gram filter: must, much, alice, inside, patriotic, noticed
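To make the overlap concrete, here is a minimal Python sketch that imitates what an n-gram filter with min_gram 2 and max_gram 6 produces and lists the grams each highlighted word shares with 'music'. It is only an approximation: the helper name char_ngrams is made up for illustration, and the lowercase and snowball steps of the real analyzer are ignored.

```python
def char_ngrams(token, min_gram=2, max_gram=6):
    """Return the set of character n-grams of token, for lengths min_gram..max_gram.

    Hypothetical helper that only approximates the n-gram token filter;
    stemming and lowercasing are deliberately left out.
    """
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


query_grams = char_ngrams("music")

# For each highlighted word, collect the grams it has in common with the query.
shared = {}
for word in ["must", "much", "alice", "inside", "patriotic", "noticed"]:
    shared[word] = sorted(char_ngrams(word) & query_grams)
    print(word, shared[word])
```

Because the same filter runs on both the indexed text and the query, any single shared gram is enough for a match clause to hit, which is why such distant words show up.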

Start by removing this filter from your analyzer, then keep tuning the results from there.
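As a rough illustration of that first step, the Python sketch below takes the analysis settings from the question and strips 'myNgram' from every analyzer chain. The helper name without_filter is hypothetical, and the sketch only builds the settings body; it does not talk to a cluster.

```python
import copy
import json

# Analysis settings copied from the question.
settings = {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "mySnowball", "myNgram"],
            },
            "default_search": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "mySnowball", "myNgram"],
            },
        },
        "filter": {
            "mySnowball": {"type": "snowball", "language": "English"},
            "myNgram": {"type": "ngram", "min_gram": 2, "max_gram": 6},
        },
    },
}


def without_filter(settings, filter_name):
    """Return a copy of settings with filter_name dropped from every analyzer
    chain and from the filter definitions (hypothetical helper)."""
    result = copy.deepcopy(settings)
    analysis = result["analysis"]
    for analyzer in analysis["analyzer"].values():
        analyzer["filter"] = [f for f in analyzer["filter"] if f != filter_name]
    analysis["filter"].pop(filter_name, None)
    return result


tuned = without_filter(settings, "myNgram")
print(json.dumps(tuned, indent=4))
```

Documents that were already indexed keep their old n-gram tokens, so the index has to be rebuilt with the new settings before the change shows up in the results.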