How can I subset or restrict an Elasticsearch query by a large number of potential values?


We're using Elasticsearch to store much of our data, but we're running into trouble when we have to restrict a search by a large number of potential values, for instance because of permission controls we want to enforce.

When you're building an Elasticsearch query, there's a limit of around 65k on the number of individual values that can be searched for in a given field, and that's causing problems for us.

Here's a more concrete example. It's a little contrived, but I think it gets the problem across. Imagine we have an index called people with the fields name, person_id, and age. And we don't just have a few people: we have 1 billion people in the index.

Then we have another index called stores with the fields store_id, name, address, and person_ids, where person_ids represents the people who have bought something at a given store.

Now, we want to do a search for, say, people called "Emily" who are over 35 and have shopped at a given store. Most of the time when we want to do a query like this, there's no problem because a typical store (in this hypothetical) has under 1000 customers who've ever been there.

But imagine some of the stores are huge and have had 1 million people visit them. Searching against that subset of data is impossible the naïve way, because adding 1 million person_id values to the query against people far exceeds the ~65k item limit for a predicate.
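Roughly, the naive query we'd like to run looks like this. (This is just a sketch with the elasticsearch-py 8.x client; the localhost URL and the "Emily"/35 values are illustrative, and person_ids stands in for the real list pulled from the store document.)

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # The full list of IDs pulled from the store document -- up to
    # ~1 million values for the big stores, which is where it breaks.
    person_ids = ["person_000001", "person_000002"]  # ...and so on

    resp = es.search(
        index="people",
        query={
            "bool": {
                "must": [
                    {"match": {"name": "Emily"}},
                    {"range": {"age": {"gt": 35}}},
                ],
                # This terms clause is the problem: the number of values
                # it accepts is capped (65,536 by default).
                "filter": [{"terms": {"person_id": person_ids}}],
            }
        },
    )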

The obvious workaround would be to denormalize the data: we could store a list of store_ids in each people document, and then search for people called "Emily" over 35 with an added term for the given store_id. We don't want to do this because of the familiar problem of keeping the denormalized data in sync as people shop at new stores.

We could also move the data to a relational database, which solves the implicit join issue, but then we lose the fuzzy text matching and other, more complicated matching procedures we want to run.

Thus the question: is there a clever way to subset an Elasticsearch query against a big predicate like this?

1 Answer

I hope you have already checked and know that you can change the default limit of 65536 by changing the index.max_terms_count setting. I know this is not an ideal solution, but it's a trade-off worth considering.
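For example, raising the limit on the people index looks roughly like this (a sketch assuming the elasticsearch-py 8.x client; on 7.x you would pass the same dict as body=, and the equivalent REST call is a PUT to /people/_settings). max_terms_count is documented as a dynamic index setting, so it should be changeable without reindexing:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Raise the terms limit on the people index from the default 65536.
    # Pick a value based on your real worst case and benchmark it; very
    # large terms lists cost memory and query time.
    es.indices.put_settings(
        index="people",
        settings={"index": {"max_terms_count": 200_000}},
    )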

  1. As you already mentioned, most of your queries won't include a large number of person_id values; only the few queries that do carry a very large list will take more time, which should be fine.
  2. You can also try the terms lookup optimization suggested by Elasticsearch for this case (a sketch follows after this list):

To reduce network traffic, a terms lookup will fetch the document’s values from a shard on a local data node if possible. If your terms data is not large, consider using an index with a single primary shard that’s fully replicated across all applicable data nodes to minimize network traffic.

  3. I assume you are getting the person_ids for the store first, so you already know how many of them you need to put into the query against people. Suppose there are 1 million such persons: instead of searching all 1 million in a single query, you can batch them into multiple queries in your application (a batching sketch appears at the end of this answer). This way you won't hit the default limit set by Elasticsearch, and you also won't saturate its resources.
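To illustrate point 2: a terms lookup tells Elasticsearch to fetch the list of values from an existing document instead of you sending them in the request, which is what the network-traffic advice above is about. A sketch, assuming the stores documents are indexed with the store ID as their _id (here "store_123" is a made-up example):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="people",
        query={
            "bool": {
                "must": [
                    {"match": {"name": "Emily"}},
                    {"range": {"age": {"gt": 35}}},
                ],
                "filter": [
                    {
                        "terms": {
                            # Terms lookup: read the person_ids array from
                            # the stores document whose _id is "store_123".
                            "person_id": {
                                "index": "stores",
                                "id": "store_123",
                                "path": "person_ids",
                            }
                        }
                    }
                ],
            }
        },
    )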

In short, you will have to try a few options and benchmark the performance of the various approaches; 1 and 2 are easy to test without any change to the application.
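For point 3, a minimal batching sketch (again elasticsearch-py 8.x; the batch size, page size, and function name are assumptions to adapt to your application):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    BATCH_SIZE = 10_000  # stay well under index.max_terms_count


    def search_emilys_for_store(person_ids):
        """Run one query per batch of person_ids and merge the hits."""
        hits = []
        for start in range(0, len(person_ids), BATCH_SIZE):
            batch = person_ids[start:start + BATCH_SIZE]
            resp = es.search(
                index="people",
                query={
                    "bool": {
                        "must": [
                            {"match": {"name": "Emily"}},
                            {"range": {"age": {"gt": 35}}},
                        ],
                        "filter": [{"terms": {"person_id": batch}}],
                    }
                },
                size=1000,  # per-batch page size; tune or paginate as needed
            )
            hits.extend(resp["hits"]["hits"])
        return hits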