Elasticsearch Phrase Suggestion problem with ngram indexed data


I need to implement a phrase suggester for spell-checking queries. My data is indexed with an analyzer that uses an edge_ngram tokenizer:

"suggestion_tokenizer": {
  "type": "edge_ngram",
  "min_gram": 2,
  "max_gram": 10,
  "token_chars": [
    "letter",
    "digit",
    "symbol"
  ]
}

I am using a phrase suggester with this configuration:

"suggest": {
  "text": "helo worl",
  "custom_suggester": {
    "phrase": {
      "field": "item.title",
      "max_errors": 3,
      "size": 5,
      "direct_generator": [{
        "field": "item.title",
        "prefix_length": 0,
        "size": 5,
        "max_edits": 1,
        "min_word_length": 3
      }]
    }
  }
}

When I request phrase suggestions, misspelled words are corrected as expected, e.g.:

"helo world" ---> "hello world"

The problem appears with a query like this one:

"helo worl" ---> "hello worl"

The phrase suggester correctly changes "helo" to "hello", but it leaves "worl" untouched even though the final "d" is missing. The reason is that "worl" exists as a term in the inverted index (the edge_ngram tokenizer generated it from "world" at index time), so ES finds a match for it and treats it as a correctly spelled word.
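
To illustrate how "worl" ends up in the index, the tokenizer can be tried out through the _analyze API. This is a minimal sketch, assuming the body below is sent as a POST request to the _analyze endpoint with the tokenizer defined inline using the same settings as above:

```json
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter", "digit", "symbol"]
  },
  "text": "world"
}
```

With these settings the response should list the tokens "wo", "wor", "worl" and "world", which is why the suggester sees "worl" as a term that already exists.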

How can I solve this issue?

There is 1 answer below


I solved this problem by using the trigram analyzer with a shingle filter, as explained here. The disadvantage is that I had to reindex all the data so that ES could build an inverted index of word pairs (e.g. "hello world").
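
For reference, here is a minimal sketch of such a setup, modeled on the example in the Elasticsearch phrase suggester documentation. It would be the body of a create-index request. The analyzer name `trigram`, the subfield name `trigram`, and the shingle sizes are assumptions, and the mapping assumes that `item` is an object field with a text `title` and that the cluster uses typeless mappings (ES 7 or later):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle"]
        }
      },
      "filter": {
        "shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "item": {
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "trigram": {
                "type": "text",
                "analyzer": "trigram"
              }
            }
          }
        }
      }
    }
  }
}
```

Because the standard tokenizer emits whole words here rather than edge n-grams, a partial word such as "worl" is only indexed in this subfield if it actually occurs in a title.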

Another thing that improves the results is adding:

"suggest_mode": "always"

With this setting, the direct generator offers the phrase suggester more candidates for each term, which it then scores with its n-gram language model. In my case, the results were better.
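
Put together, the request from the question might look like the sketch below. By default a direct generator uses `"suggest_mode": "missing"`, which only proposes candidates for terms that are not in the index; `"always"` proposes candidates for every term. The field name `item.title.trigram` is an assumption: it stands for a hypothetical shingled subfield of `item.title`, and the body would be sent to the index's _search endpoint:

```json
{
  "suggest": {
    "text": "helo worl",
    "custom_suggester": {
      "phrase": {
        "field": "item.title.trigram",
        "max_errors": 3,
        "size": 5,
        "direct_generator": [{
          "field": "item.title.trigram",
          "suggest_mode": "always",
          "prefix_length": 0,
          "size": 5,
          "max_edits": 1,
          "min_word_length": 3
        }]
      }
    }
  }
}
```

The other parameters are unchanged from the question; only the field and `suggest_mode` differ.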