recommended max number of tokens? (scalability)


I'm using the following ngram tokenizer to process 15,000 documents (and expect this to grow to up to a million documents), each with up to 6,000 characters of text (avg 100-200). I use 2-8 grams as a catch-all approach because the index needs to support all languages. 1 QPS should be sufficient (there aren't many concurrent users), so performance is not a priority as long as each search takes ~200 ms on average.

 "tokenizer": {
    "ngram_tokenizer": {
      "type": "ngram",
      "min_gram": 2,
      "max_gram": 8,
      "token_chars": [
        "letter",
        "digit"
      ]
    }
  }
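
For reference, this is roughly how that tokenizer is wired into the index settings (the index name, field name and lowercase filter here are just illustrative; note that on Elasticsearch 7.x+ the 2-8 gram spread also requires index.max_ngram_diff of at least 6):

    PUT /my_index
    {
      "settings": {
        "index": { "max_ngram_diff": 6 },
        "analysis": {
          "tokenizer": {
            "ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 8,
              "token_chars": ["letter", "digit"]
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "ngram_tokenizer",
              "filter": ["lowercase"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "text": { "type": "text", "analyzer": "ngram_analyzer" }
        }
      }
    }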

The tokenizer needs to work with all languages, including CJK, hence the need for ngrams. The alternative is to use analyzer plugins for the CJK languages (and maybe others), which would produce fewer tokens, but I'd prefer a one-size-fits-all approach if at all possible.

The largest sample document produces up to 10,000 tokens with the above ngram settings, a bit over a megabyte in total. If this becomes an issue, I can probably cap the amount of text per document that the tokens are based on (see the sketch below). While I only have around 15,000 documents and search is sufficiently fast, I don't know how this scales with the number of documents. Is this a reasonable amount? Does Elasticsearch have any documented recommendations or limits for the maximum number of tokens?
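
If the token count does become a problem, one option I'm considering (a sketch only, not tested at scale; token_cap is just a name I made up) is adding the built-in limit token filter to the analysis block above, which simply stops emitting tokens past a configured count per field:

    "filter": {
      "token_cap": {
        "type": "limit",
        "max_token_count": 10000
      }
    },
    "analyzer": {
      "ngram_analyzer": {
        "type": "custom",
        "tokenizer": "ngram_tokenizer",
        "filter": ["lowercase", "token_cap"]
      }
    }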

Some more info: memory-optimized deployment (Elastic Cloud), 2 zones, 4 GB storage and 2 GB RAM per zone, 162 shards. Memory pressure is around 30%.

There is 1 answer below.


At the beginning of the question you mention a million documents, but later you mention 15k, so please clarify this. Your question is very similar to what I answered in this Stack Overflow answer, but I would add a few more details.

It would help to know why you are using n-grams and what your use case is, so that we can suggest alternatives if possible. Apart from this, ngram is definitely costly: it takes more CPU, memory, disk, and infrastructure at both index and query time, and it is known to cause performance issues. There are also various other factors, such as your cluster size, index configuration (number of primary and replica shards), and how the shards are allocated in your Elasticsearch cluster.
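
For example, the cat shards API gives a quick view of how large your shards are and where they are allocated (the column selection here is just one I find useful):

    GET _cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=store:desc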

It's very difficult to provide a specific recommendation unless you provide more information, and you also need to do benchmark testing with your dataset on your own cluster, as every deployment is unique.

Bonus tip: you can use the Profile API to see the execution details and find the bottlenecks in your query.
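
For example (index and field names are placeholders), setting "profile": true on a search request returns a per-shard timing breakdown of every query component and collector:

    GET /my_index/_search
    {
      "profile": true,
      "query": {
        "match": {
          "text": "quick brown fox"
        }
      }
    }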

Hope this helps and let me know if you need more info.