add elision filter to snowball


At first, I was using the language analyzer and everything seemed to work very well, until I realized that "a" is not part of the French stopwords list.

So I decided to test with snowball. It also seemed to work well, but in this case it does not remove elided short words like "l'", "d'", ...

Hence my question: how can I use snowball, keep its default filters, and add a list of stopwords and an elision filter?

Alternatively, how can I change the list of stopwords used by the language analyzer?

And one last question: is there really any benefit to using snowball rather than the language analyzer? Is it faster? More relevant?

Thank you.

Best answer:

Since an analyzer is simply the combination of a tokenizer and zero or more filters, you can build your own custom snowball analyzer that mimics the defaults and adds your own filters on top, such as an elision token filter.

As stated in the snowball analyzer documentation:

An analyzer of type snowball that uses the standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter.

So here is an example containing both implementations: a snowball analyzer with the default filters plus custom stopwords and elision, and a language analyzer with a custom list of stopwords:

{
  "settings": {
    "analysis": {
      "analyzer": {
       "custom_snowball_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "snowball",
            "custom_stop",
            "custom_elision"
          ]
        },
        "custom_language_analyzer": {
          "type": "french",
          "stopwords": ["a", "à", "t"]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": ["a", "à", "t"]
        },
        "custom_elision": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j"]
        }
      }
    }
  }
}
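
If you want to try this out, assuming you save the settings above in a file named settings.json (the file name and the testing index are just the ones used in the calls below), you can create the index like this (older curl style, matching the _analyze calls below; recent Elasticsearch versions also require a -H 'Content-Type: application/json' header):

curl -XPUT 'http://localhost:9200/testing' -d @settings.json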

Let's see the tokens produced by both analyzers, using the same testing sentence:

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_snowball_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token
  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amour",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_language_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token
  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amou",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

As you can see, both analyzers produce almost exactly the same tokens, except for "amour", which has not been stemmed by the snowball analyzer. I don't know why, to be honest, since the snowball filter uses a stemmer under the hood.
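
One likely explanation (take this as an assumption, I haven't re-run the example above with it): the built-in snowball token filter defaults to English, so French-specific stemming rules are never applied to "amour". You can define your own snowball filter with the language set explicitly (the name custom_snowball is just an example) and reference it instead of "snowball" in the analyzer's filter list:

"filter": {
  "custom_snowball": {
    "type": "snowball",
    "language": "French"
  }
}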

Regarding your question about performance: those filters mainly come into play at indexing time (during the tokenization step). I would say both implementations will perform almost equally (the language analyzer should be slightly faster since it only stems French words in this example), and the difference won't be noticeable unless you plan to index huge documents under heavy load.

Search response times should be similar because the tokens will be almost the same (if you index French documents only), so I think Lucene will deliver the same performance.

To conclude, I would choose the language analyzer if you are indexing French documents only, since it is far more compact in the mapping definition :-)
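
For completeness, here is roughly how the chosen analyzer would be wired to a field in the mapping. The type name "doc" and the field "description" are just placeholders, and this uses the pre-5.x "string" syntax to match the curl calls above; on newer versions you would use "text" and drop the mapping type:

{
  "mappings": {
    "doc": {
      "properties": {
        "description": {
          "type": "string",
          "analyzer": "custom_language_analyzer"
        }
      }
    }
  }
}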