We're using Elasticsearch to store much of our data, but we're running into trouble when we have to restrict a search to a large set of potential values, for instance because of permission controls we need to enforce.
When you build an Elasticsearch query, there's a limit of around 65k on the number of individual values that can be searched for in a given field, and that's causing problems for us.
Here's a more concrete example. This is a little contrived, but I think it gets the problem across. Imagine that we have an index called `people` with the fields `name`, `person_id`, and `age`. And we don't just have a few people; we have 1 billion people in the index.
Then we have another index called `stores` with the fields `store_id`, `name`, `address`, and `person_ids`, where `person_ids` represents the people who have bought something at a given store.
Now, we want to do a search for, say, people called "Emily" who are over 35 and have shopped at a given store. Most of the time when we want to do a query like this, there's no problem, because a typical store (in this hypothetical) has under 1,000 customers who've ever been there.
But imagine some of the stores are huge, and have 1 million people who've visited them. Searching against that subset of data is impossible the naïve way, because adding 1 million `person_id` values to the query against `people` far exceeds the ca. 65k item limit for a predicate.
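For concreteness, the naïve version would look roughly like the sketch below (the Python `elasticsearch` client, local cluster URL, and keyword-argument style of the 8.x client are assumptions; the index and field names are the ones from this example). It's exactly this terms list that blows past the limit:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Imagine these came from the store document -- for a huge store this is
# a list of ~1 million ids, far beyond index.max_terms_count (65,536).
person_ids = ["p-000001", "p-000002"]  # ... up to ~1,000,000 values

# Elasticsearch rejects the request once the terms list exceeds the limit.
es.search(
    index="people",
    query={
        "bool": {
            "must": [
                {"match": {"name": "Emily"}},
                {"range": {"age": {"gt": 35}}},
            ],
            "filter": [{"terms": {"person_id": person_ids}}],
        }
    },
)
```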
The naïve solution here would be to denormalize the data. We could insert a list of `store_ids` in each of the `people` documents, and then search for people called "Emily" over 35 with an added search term for the given `store_id`. We don't want to do this because you then have the obvious issue of keeping the denormalized data in sync as people shop at new stores.
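Under that (rejected) denormalized model, the query itself would be trivial: the huge terms filter collapses to a single term clause. A sketch, where the `store_ids` field on `people` and the id "store-123" are hypothetical:

```python
# Query body under the denormalized model: one term clause instead of
# a terms filter holding up to a million person_id values.
denormalized_query = {
    "bool": {
        "must": [
            {"match": {"name": "Emily"}},
            {"range": {"age": {"gt": 35}}},
        ],
        "filter": [{"term": {"store_ids": "store-123"}}],
    }
}
```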
We could also move the data to a relational database, which would solve the implicit join issue, but there we run into trouble because we want to execute queries with fuzzy text matching or other more complicated matching procedures.
Thus the question: is there a clever way to subset an Elasticsearch query against a big predicate like this?
I hope you have already checked and know that you can change the default limit of 65536 by changing the `index.max_terms_count` setting. I know this is not an ideal solution, but there is a trade-off in doing it: most of your queries will have only a small number of `person_id` values to search, and only the few queries that have to search a large number of `person_id` values from the `stores` index will take more time, which should be fine.
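If you want to try that, `index.max_terms_count` is an index-level setting, so raising it is a one-line settings update. A minimal sketch with the Python `elasticsearch` client (the client version, cluster URL, and the example value of 2,000,000 are assumptions; check the exact syntax for your setup):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Raise the cap on terms in a terms query for the `people` index.
# 2_000_000 is just an example value big enough for the 1M-customer store;
# elasticsearch-py 7.x takes body=... instead of settings=...
es.indices.put_settings(
    index="people",
    settings={"index": {"max_terms_count": 2_000_000}},
)
```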
You could also fetch the matching `person_ids` from the `people` index first; that way you already know how many of them you need to put into the query against the `stores` index. Let's suppose there are 1 million such persons: instead of searching for all 1 million persons in a single query, you can batch them into multiple queries in your application. This way you will not hit the default limit set by Elasticsearch, and it would not saturate Elasticsearch's resources either.
In short, you will have to try a few options and benchmark the performance of the various approaches. The first two (raising the limit, and simply accepting that the rare large queries will be slower) are easy to test without any change in the application.
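To make the batching idea concrete, here is a minimal sketch in Python, assuming the elasticsearch-py 8.x client, the index and field names from the question, and that store documents are keyed by their `store_id`. It batches in the opposite direction from the wording above (pulling the store's `person_ids` once and then filtering `people` in chunks), but the chunking mechanics are the same either way.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

BATCH_SIZE = 10_000  # stays well under the default 65,536 terms limit


def emilys_over_35_at_store(store_id: str) -> list[dict]:
    """Hypothetical helper: people named Emily, over 35, who shopped at store_id."""
    # Assumes the store document's _id is its store_id.
    store = es.get(index="stores", id=store_id)
    person_ids = store["_source"]["person_ids"]  # possibly ~1 million values

    people = []
    for start in range(0, len(person_ids), BATCH_SIZE):
        batch = person_ids[start:start + BATCH_SIZE]
        # elasticsearch-py 7.x expects body={"query": ...} instead of query=...
        resp = es.search(
            index="people",
            size=BATCH_SIZE,
            query={
                "bool": {
                    "must": [
                        {"match": {"name": "Emily"}},
                        {"range": {"age": {"gt": 35}}},
                    ],
                    "filter": [{"terms": {"person_id": batch}}],
                }
            },
        )
        people.extend(hit["_source"] for hit in resp["hits"]["hits"])
    return people
```

Each batched query stays under the terms limit, and spreading the work over several requests also avoids hammering the cluster with one enormous query.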