Tokens at index time vs. query time are not the same when using the common_grams filter in Elasticsearch


I want to use the common_grams token filter based on this link. My Elasticsearch version is 7.17.8.

Here are the settings of my index in Elasticsearch. I have defined a filter named "common_grams" that uses "common_grams" as its type.

I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as a token filter.

I have just one field, named "title_fa", and I have applied my custom analyzer to it.

PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the","is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}

It works fine at index time, and the tokens are what I expect them to be. Here I get the tokens via the Kibana Dev Tools:

GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text" : "brown is the"
}

Here are the resulting tokens for that text:

{
  "tokens" : [
    {
      "token" : "brown",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "brown_is",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is_the",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "gram",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}

When I search for "brown is the", I expect these tokens to be searched:

["brown", "brown_is", "is", "is_the", "the" ]

But these are the tokens that will actually be searched:

["brown is the", "brown is_the", "brown_is the"]

Here you can see the details (screenshot: query-time tokens).
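(For anyone who cannot see the screenshot: I got the query-time tokens from the validate API with explain enabled. A request like the following shows the rewritten query in the "explanation" field of the response, which is where the ["brown is the", "brown is_the", "brown_is the"] variants appear.)

GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}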

UPDATE: I have added a sample document like this:

POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }

When I search "brown coat"

GET /my-index-000007/_search
{
  "query": {
    "query_string": {
      "query": "brown coat",
      "default_field": "title_fa"
    }
  }
}

it returns the document, because it searches for the tokens ["brown", "coat"].

When I search "brown is coat", however, it can't find the document, because it is searching for

["brown is coat", "brown_is coat", "brown is_coat"]

Clearly, when the query contains a common word, it behaves differently, and I guess this is because the index-time tokens and the query-time tokens do not match.

Do you know where I am going wrong? Why does it behave differently?
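For reference, the common_grams documentation also describes a query_mode option intended for search analyzers. I am not sure whether a separate search analyzer along these lines is the intended fix — the following is only a sketch based on the docs (the index name my-index-000008 and the analyzer/filter names search_grams and common_grams_query are my own), not something I have verified:

PUT /my-index-000008
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        },
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "the", "is" ],
          "query_mode": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "search_analyzer": "search_grams"
      }
    }
  }
}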
