Force match_phrase to discard full email addresses when searching only for the domain


I'd like to find the string outlook.com inside a text field of my Elasticsearch index using a match_phrase query. However, I don't want results where it only appears as part of a full email address (such as something@outlook.com), which is what this query returns:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0
            }
          }
        }
      ]
    }
  }
}

I think these results are returned because the tokenizer of the standard analyzer splits something@outlook.com into [something], [outlook.com], using @ as a separator.

I tried the whitespace analyzer, so that the address is tokenized as [something@outlook.com] and full emails are no longer returned. But this query:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0,
              "analyzer": "whitespace"
            }
          }
        }
      ]
    }
  }
}

still returns results like something@outlook.com. What can I do?

UPDATE:

In my mapping, I set the standard analyzer some time ago. So my intuition is that even if I use the whitespace analyzer at search time, the documents were already tokenized with the standard one, and the tokenization can no longer be changed after indexing.
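To illustrate that intuition, here is a minimal Python sketch. It does not call Elasticsearch; `standard_like_tokens` is only a rough approximation of the standard analyzer (lowercasing, splitting on @ and whitespace, keeping dots between letters), and all function names and the sample text are hypothetical.

```python
import re

def standard_like_tokens(text: str) -> list[str]:
    # Rough stand-in for the standard analyzer (an assumption, not the real
    # implementation): lowercase, break on anything that is not a letter or
    # digit, but keep a dot that sits between letters/digits.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def whitespace_tokens(text: str) -> list[str]:
    # Stand-in for the whitespace analyzer: split on whitespace only.
    return text.split()

def phrase_matches(indexed: list[str], query: list[str]) -> bool:
    # True if the query tokens appear as a contiguous run (slop 0).
    n = len(query)
    return any(indexed[i:i + n] == query for i in range(len(indexed) - n + 1))

doc = "Contact something@outlook.com today"

# Index built with the standard-like analyzer, query analyzed with whitespace:
# the email was already split at index time, so the phrase still matches.
print(phrase_matches(standard_like_tokens(doc), whitespace_tokens("outlook.com")))

# Index built with the whitespace analyzer: the address stays one token,
# so the bare domain no longer matches it.
print(phrase_matches(whitespace_tokens(doc), whitespace_tokens("outlook.com")))
```

If this model is right, the search-time analyzer alone cannot help. The field (or a new sub-field) would need a different index-time analyzer, and the existing documents would have to be reindexed.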

I tried a Painless script to match a certain pattern, but my field is of type text, so the search takes too long.

Alternatively, a regexp query can do something similar:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": ".*[^A-Za-z0-9\\@]outlook.com[^A-Za-z0-9\\@].*"
          }
        }
      ]
    }
  }
}

Unfortunately, according to the regexp syntax documentation, only a limited set of operators is available. With the class [^A-Za-z0-9\\@] I mean any character that is neither alphanumeric nor @, placed before and after outlook.com (this simulates the word boundary that the match_phrase query would give me). My problem is that if the field starts or ends with outlook.com, the document is not retrieved, because the regex requires a character before and after it ([^A-Za-z0-9\\@] doesn't match the empty string).
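One possible way around the start/end problem is to make each boundary part optional with a group and `?`, both of which the Lucene regexp syntax supports. The sketch below checks the idea in Python, using `re.fullmatch` as a stand-in for Lucene's implicit anchoring. It assumes the pattern is applied to the whole field value (for example through a keyword sub-field) rather than to individual tokens; `mentions_bare_domain` and the sample strings are hypothetical.

```python
import re

# Character that may sit next to the domain: not alphanumeric and not "@".
BOUNDARY = r"[^A-Za-z0-9@]"

# Optional prefix ending in a boundary character, the literal domain
# (dot escaped), and an optional suffix starting with a boundary character.
PATTERN = re.compile(rf"(.*{BOUNDARY})?outlook\.com({BOUNDARY}.*)?", re.DOTALL)

def mentions_bare_domain(value: str) -> bool:
    # fullmatch mimics Lucene regexps, which must match the entire string.
    return PATTERN.fullmatch(value) is not None

print(mentions_bare_domain("outlook.com"))                   # whole value
print(mentions_bare_domain("outlook.com is down"))           # at the start
print(mentions_bare_domain("write to outlook.com"))          # at the end
print(mentions_bare_domain("mail something@outlook.com"))    # email only
```

Under the same assumption, the corresponding value in the regexp query would be "(.*[^A-Za-z0-9\\@])?outlook\\.com([^A-Za-z0-9\\@].*)?". Python's `re` and Lucene's regexp engine are different implementations, so this only checks the logic of the pattern, not its behavior inside Elasticsearch.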

Answered by Mouad Slimane

You can use a regexp query instead of match_phrase, like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": ".*[^@]outlook.com"
          }
        }
      ]
    }
  }
}