I need to include special characters in Elasticsearch

I have created an index with this analyzer:

{
  "settings": {
    "analysis": {
      "filter": {
        "specialCharFilter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "specialChar": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "filter": [
            "lowercase",
            "specialCharFilter"
          ]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "keyword",
        "analyzer": "specialChar",
        "search_analyzer": "standard"
      }
    }
  }
}

And these are my sample documents:
[
  {
    "partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ."
  },
  {
    "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
  }
]

If I run a query with {"query": {"match": {"partyName": "L&T"}}}

I want the output to contain the object below: {"partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"}

Best answer:

First off, it makes no sense to have an ngram tokenizer AND an ngram token filter; that would generate way too many useless and duplicate tokens and needlessly increase your index size. Here is a gist showing what tokens are produced by your analyzer.
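
If you want to see this for yourself, the _analyze API prints the tokens an analyzer emits. A minimal sketch, assuming your index is named my_index (substitute your actual index name):

POST my_index/_analyze
{
  "analyzer": "specialChar",
  "text": "L&T GEOSTRUCTURE PRIVATE LIMITED"
}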

Next, the reason why searching for L&T doesn't yield anything is that the standard search-time analyzer removes the & sign and only searches for l and t, which won't match anything since you only index tokens with a minimum length of 2.
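
You can verify this by running the standard analyzer over the query text; it returns only the tokens l and t:

POST _analyze
{
  "analyzer": "standard",
  "text": "L&T"
}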

I suggest the following index-time analyzer: a whitespace tokenizer to simply split words on whitespace, followed by an edge-ngram token filter on each token, i.e. you can search for any prefix (of minimum length 2) of any indexed token. At search time, we use the same analyzer but without the edge-ngram token filter; we just split the query terms on whitespace and lowercase them. Also, the partyName field MUST be of type text (not keyword) if you want to analyze its content:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "specialCharFilter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "specialChar": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "specialCharFilter"
          ]
        },
        "searchSpecialChar": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "text",
        "analyzer": "specialChar",
        "search_analyzer": "searchSpecialChar"
      }
    }
  }
} 
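
Before indexing anything, you can double-check the tokens this new analyzer emits (this is just a sanity check, not required):

POST test/_analyze
{
  "analyzer": "specialChar",
  "text": "L&T GEOSTRUCTURE PRIVATE LIMITED"
}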

Then we can index your sample data:

PUT test/_doc/1
{
  "partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ."
}
  
PUT test/_doc/2
{
  "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
}

Then the query you provided returns the second document:

POST test/_search
{
  "query": {
    "match": {
      "partyName": "L&T"
    }
  }
}
=>

"hits": [
  {
    "_index": "test",
    "_id": "2",
    "_score": 1.0538965,
    "_source": {
      "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
    }
  }
]
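
As a side note, because of the edge-ngram filter, any prefix of at least two characters of any whitespace-separated token should match as well; for instance, searching for geostr would also return document 2:

POST test/_search
{
  "query": {
    "match": {
      "partyName": "geostr"
    }
  }
}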