I have created an index with this analyzer:
```json
{
  "settings": {
    "analysis": {
      "filter": {
        "specialCharFilter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "specialChar": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "filter": ["lowercase", "specialCharFilter"]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": ["letter", "digit", "symbol", "punctuation"]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "keyword",
        "analyzer": "specialChar",
        "search_analyzer": "standard"
      }
    }
  }
}
```
```json
[
  { "partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ." },
  { "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED" }
]
```
If I do a query with `{"query": {"match": {"partyName": "L&T"}}}`, I want it to return the following document: `{"partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"}`.
First off, it makes no sense to have an ngram tokenizer AND an ngram token filter: that would generate way too many useless and duplicate tokens and needlessly increase your index size. Here is a gist showing what tokens are produced using your analyzer.
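You can inspect that duplication yourself with the `_analyze` API. A minimal sketch (the index name `your_index` stands for whichever index you created with these settings):

```json
POST your_index/_analyze
{
  "analyzer": "specialChar",
  "text": "L&T"
}
```

Every token emitted by the ngram tokenizer is ngrammed a second time by the filter, so short inputs already produce many overlapping tokens.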
Next, the reason why searching for `L&T` doesn't yield anything is that the `standard` search-time analyzer removes the `&` sign and only searches for `l` and `t`, which won't match anything since you only index tokens of minimum length 2.

I suggest the following index-time analyzer: a `whitespace` tokenizer to simply split words at whitespaces, followed by an `edge_ngram` token filter on each token, i.e. you can search for any prefix (of minimum length 2) of any indexed token. At search time, we use the same analyzer but without the edge-ngram token filter: we just split the query terms on whitespace and lowercase them. Also, the `partyName` field MUST be of type `text` (not `keyword`) if you want to analyze its content. Then we can index your sample data.
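A sketch of that suggestion (the names `my_index`, `index_analyzer`, `edge_ngram_filter`, and `search_time_analyzer` are illustrative, not prescribed by the answer):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "index_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "edge_ngram_filter"]
        },
        "search_time_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "text",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_time_analyzer"
      }
    }
  }
}
```

The sample documents can then be indexed, e.g. with a bulk request:

```json
POST my_index/_bulk
{"index": {}}
{"partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ."}
{"index": {}}
{"partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"}
```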
Searching with the query you provided would then yield the second document.
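Because the whitespace-based search analyzer keeps the `&` sign, the query term `l&t` matches the indexed edge-ngram prefix of `l&t geostructure ...`. With the illustrative index name `my_index`, the request is the one from the question:

```json
POST my_index/_search
{
  "query": {
    "match": {
      "partyName": "L&T"
    }
  }
}
```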