Elasticsearch 6.8: match_phrase search does not work well with the N-gram tokenizer

I use the Elasticsearch N-gram tokenizer together with a match_phrase query for fuzzy matching. My index and test data are as follows:

PUT m8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "table": {
      "properties": {
        "dataSourceId": {
          "type": "long"
        },
        "dataSourceType": {
          "type": "integer"
        },
        "dbName": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

PUT /m8/table/1

PUT /m8/table/2
PUT /m8/table/3

Checking the tokenizer with _analyze:

POST m8/_analyze
{
  "tokenizer": "my_tokenizer",
  "text": "rm.rf"
}

_analyze result:

{
  "tokens" : [
    {
      "token" : "r",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rm",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "rm.",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "m",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "m.",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "m.r",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : ".",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : ".r",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : ".rf",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "r",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "rf",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "f",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 11
    }
  ]
}

When I search for 'rm', nothing is found:

GET /m8/table/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "dbName": "rm" } }
      ]
    }
  }
}

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

But '.rf' can be found:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.7260926,
    "hits" : [
      {
        "_index" : "m8",
        "_type" : "table",
        "_id" : "1",
        "_score" : 1.7260926,
        "_source" : {
          "dataSourceId" : 1,
          "dataSourceType" : 2,
          "dbName" : "rm.rf"
        }
      }
    ]
  }
}

My question: why can't 'rm' be found, even though _analyze shows that the text was split into these tokens?


There is 1 best solution below

  1. Because the mapping defines no search_analyzer, my_analyzer is used at search time as well. The mapping is therefore equivalent to:

     "dbName": {
       "type": "text",
       "analyzer": "my_analyzer",
       "search_analyzer": "my_analyzer"  // <==== implicit: without a search_analyzer, "analyzer" is also used at search time
     }
  2. A match_phrase query matches phrases by taking the positions of the analyzed tokens into account. For example, searching for "Kal ho" matches documents that have "Kal" at position X and "ho" at position X+1 in the analyzed text.

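  3. When you search for 'rm', the query text is itself analyzed with my_analyzer (see #1), which splits it into the n-grams "r", "rm" and "m" at positions 0, 1 and 2. The phrase query then requires those three tokens at consecutive positions in the document. In the indexed "rm.rf", "r" and "rm" are at positions 0 and 1, but position 2 holds "rm." and "m" only appears at position 3, so nothing matches. '.rf' is found because its six n-grams occur in the same order at positions 6 to 11 of the document.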
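
The position mismatch can be reproduced outside Elasticsearch. This is a minimal Python sketch, not Elasticsearch's actual implementation. `ngram_tokens` and `phrase_matches` are hypothetical helpers. They assume that the tokenizer emits grams in the order shown in the _analyze output, one position per gram, and that match_phrase with the default slop of 0 needs every query token to appear in the document at the same relative position.

```python
from collections import defaultdict


def ngram_tokens(text, min_gram=1, max_gram=3):
    """Return (token, position) pairs in the order shown by the _analyze output.

    Assumption: every gram gets its own, strictly increasing position.
    """
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break
            tokens.append((text[start:start + size], len(tokens)))
    return tokens


def phrase_matches(doc_text, query_text):
    """Simplified model of match_phrase with slop 0 (an assumption, see above)."""
    positions = defaultdict(set)
    for token, pos in ngram_tokens(doc_text):
        positions[token].add(pos)

    query = ngram_tokens(query_text)
    if not query:
        return False
    first_token, _ = query[0]
    # Try every document position of the first query token as the phrase start.
    for start in positions[first_token]:
        if all(start + offset in positions[token] for token, offset in query):
            return True
    return False


print(ngram_tokens("rm"))              # [('r', 0), ('rm', 1), ('m', 2)]
print(phrase_matches("rm.rf", "rm"))   # False
print(phrase_matches("rm.rf", ".rf"))  # True
```

In this model the query 'rm' fails only because "m" is expected at the position where the document holds "rm.", while the grams of '.rf' line up with the tail of the document.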


  1. Use the standard analyzer with a simple match query:

    GET /m8/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "dbName": { "query": "rm", "analyzer": "standard" } // <=========
              }
            }
          ]
        }
      }
    }

    Or define the search analyzer in the mapping and use a match query (not match_phrase):

          "dbName": {
            "type": "text",
            "analyzer": "my_analyzer",
            "search_analyzer": "standard" //<==========
          }
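
Why a non-n-gram search analyzer helps can be illustrated with a small Python sketch. It rests on two assumptions: the search-time analyzer leaves the query 'rm' as the single term "rm", and a match query only needs that term to exist among the indexed grams, regardless of position. `index_terms` and `match_query` are hypothetical names, not Elasticsearch APIs.

```python
def index_terms(text, min_gram=1, max_gram=3):
    """Set of grams an n-gram tokenizer (min_gram..max_gram) would index for text."""
    return {
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    }


def match_query(doc_text, query_terms):
    """Simplified match query: a hit if any query term is an indexed term."""
    terms = index_terms(doc_text)
    return any(term in terms for term in query_terms)


# Assumption: the search-time analyzer yields the single term "rm".
print(match_query("rm.rf", ["rm"]))    # True
# A term longer than max_gram was never indexed, so it cannot match.
print(match_query("rm.rf", ["rm.r"]))  # False
```

Under these assumptions, a search term longer than max_gram has no counterpart in the index, so max_gram bounds the substring length this approach can find.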

Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer?