Elasticsearch: merge tokens (terms) after tokenisation


I am trying to find a way to combine all tokens (terms) after tokenisation.

For example, this analyser (my-analyser) produces n tokens after applying the "custom_stop" filter. Is there any way to combine all of these tokens into one single token?

I have seen the 'fingerprint' filter, which combines all tokens, but it also sorts them, which I don't want. Please suggest a solution for this.


"analysis": {
  "analyzer": {
    "my-analyser": {
      "tokenizer": "standard",
      "filter": [ "custom_stop" ]
    }
  },
  "filter": {
    "custom_stop": {
      "type": "stop",
      "ignore_case": true,
      "stopwords": [ "elastic", "aws", "java" ]
    }
  }
}

For the input "The concepts in elastic aws java are discussed here" it would produce these tokens: ["concepts", "discussed", "here"].

I want to combine these three tokens into a single token, like ["concepts discussed here"].
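
To make the difference between the behaviour I want and a fingerprint-style filter concrete, here is a minimal Python sketch. It is only a model of the analysis chain, not Elasticsearch code: `tokenize` is a rough stand-in for the standard tokenizer, and all function names are made up for illustration.

```python
import re

def tokenize(text):
    # Rough stand-in for the standard tokenizer: runs of word characters.
    return re.findall(r"\w+", text)

def remove_stopwords(tokens, stopwords):
    # Case-insensitive comparison, like "ignore_case": true on the stop filter.
    lowered = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in lowered]

def join_in_order(tokens):
    # Desired behaviour: a single token, original order preserved.
    return [" ".join(tokens)] if tokens else []

def join_sorted_unique(tokens):
    # Fingerprint-style behaviour: sort and de-duplicate before joining.
    return [" ".join(sorted(set(tokens)))] if tokens else []

text = "The concepts in elastic aws java are discussed here"
custom = ["elastic", "aws", "java"]

tokens = remove_stopwords(tokenize(text), custom)
print(tokens)
print(join_in_order(tokens))
print(join_sorted_unique(tokens))
```

Note that with only the three custom stop words, "The", "in" and "are" survive in this model; getting exactly ["concepts discussed here"] assumes those words are removed as well, for example by adding them to the stop list.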


There are 3 best solutions below

13
Sander Toonen On
"analysis": {
  "analyzer": {
    "my-analyzer": {
      "tokenizer": "standard",
      "filter": [
        "custom_stop",
        "concatenate_tokens"
      ]
    }
  },
  "filter": {
    "custom_stop": {
      "type": "stop",
      "ignore_case": true,
      "stopwords": ["elastic", "aws", "java"]
    },
    "concatenate_tokens": {
      "type": "script",
      "script": "String.join(' ', tokens)"
    }
  }
}

Note: as far as I can tell, Elasticsearch does not ship a token filter of type "script", and the script-based filters that do exist (condition, predicate_token_filter) evaluate one token at a time, so this configuration is likely to be rejected when the index is created. Treat it as pseudocode for the intended behaviour.
0
G0l0s On

Answer #1. Removing Stop Words by Regular Expressions

A straightforward approach is to put all the stop words into a regular expression pattern. Because the keyword tokenizer is used, the source text is not split into tokens; the whole input stays a single token.

GET /_analyze
{
    "tokenizer": "keyword",
    "filter": [
        "lowercase",
        {
            "type": "pattern_replace",
            "pattern": "\\b(are|aws|elastic|in|java|the)\\b",
            "replacement": ""
        },
        {
            "type": "pattern_replace",
            "pattern": "(\\s){2,}",
            "replacement": "$1"
        },
        "trim"
    ],
    "text": "The concepts in elastic aws java are discussed here"
}

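The second pattern_replace filter collapses runs of whitespace left behind by the removed words into a single space.

The disadvantage of this method is that you cannot use the stop token filter, so every stop word has to be listed in the pattern by hand.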
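
For anyone who wants to check the two patterns outside Elasticsearch, here is a small Python sketch of the same four steps (lowercase, remove stop words, collapse whitespace, trim). It uses Python's `re` module rather than Java regex, which behaves the same for these particular patterns; the function name is made up for illustration.

```python
import re

# Same patterns as in the _analyze request above.
STOP_PATTERN = re.compile(r"\b(are|aws|elastic|in|java|the)\b")
MULTI_SPACE = re.compile(r"(\s){2,}")

def strip_stop_words(text):
    lowered = text.lower()                        # "lowercase" filter
    removed = STOP_PATTERN.sub("", lowered)       # first pattern_replace
    collapsed = MULTI_SPACE.sub(r"\1", removed)   # second pattern_replace
    return collapsed.strip()                      # "trim" filter

print(strip_stop_words("The concepts in elastic aws java are discussed here"))
```

The `\b` word boundaries matter: without them, "java" would also be cut out of a word such as "javascript".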

0
G0l0s On

Answer #2. Concatenating Terms in a Runtime Field

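You can use a runtime field to concatenate terms. The script reads the indexed terms through doc[...], so fielddata must be set to true on the text field. Be aware that fielddata exposes a document's terms sorted and de-duplicated, so the original token order is not guaranteed.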

Mapping

PUT /concatenated_terms
{
    "settings": {
        "analysis": {
            "analyzer": {
                "standard_custom_stop_list_filter_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "custom_stop_list_filter"
                    ]
                }
            },
            "filter": {
                "custom_stop_list_filter": {
                    "type": "stop",
                    "ignore_case": true,
                    "stopwords": [
                        "elastic",
                        "aws",
                        "java",
                        "_english_"
                    ]
                }
            }
        }
    },
    "mappings": {
        "runtime": {
            "text_concatenated_terms": {
                "type": "keyword",
                "script": {
                    "source": """
                        List terms = doc[params.field_name];
                        String concatenatedTerms = String.join(params.term_delimiter, terms);
                        emit(concatenatedTerms);
                    """,
                    "params": {
                        "term_delimiter": " ",
                        "field_name": "text"
                    }
                }
            }
        },
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard_custom_stop_list_filter_analyzer",
                "fielddata": true
            }
        }
    }
}

Document

PUT /concatenated_terms/_doc/1
{
    "text" : "The concepts in elastic aws java are discussed here"
}

Search query

GET /concatenated_terms/_search?filter_path=hits.hits
{
    "query": {
        "match_all": {}
    },
    "fields": [
        "*"
    ],
    "_source": false
}

Response

{
    "hits" : {
        "hits" : [
            {
                "_index" : "concatenated_terms",
                "_type" : "_doc",
                "_id" : "1",
                "_score" : 1.0,
                "fields" : {
                    "text" : [
                        "The concepts in elastic aws java are discussed here"
                    ],
                    "text_concatenated_terms" : [
                        "concepts discussed here"
                    ]
                }
            }
        ]
    }
}
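
The output matches the input order here, but that looks like a coincidence of this example. Below is a small Python model of what the runtime script sees, assuming (as I understand fielddata on text fields) that doc[...] returns each distinct term once, in sorted order. The function names are hypothetical and this is not Elasticsearch code.

```python
def fielddata_terms(tokens):
    # Assumption: doc values list each distinct term once, in sorted order.
    return sorted(set(tokens))

def runtime_field_value(tokens, delimiter=" "):
    # Mirrors String.join(params.term_delimiter, terms) in the runtime script.
    return delimiter.join(fielddata_terms(tokens))

# Tokens from the example document: already in alphabetical order.
print(runtime_field_value(["concepts", "discussed", "here"]))

# A different token order (with a repeated term) gives the same value.
print(runtime_field_value(["here", "discussed", "concepts", "here"]))
```

So if the original token order matters, this approach has the same limitation as the fingerprint filter, and the regular expression approach from Answer #1 is the safer choice.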