Is there an efficient way to get unique terms from an Elasticsearch index?


My aim is to store all unique terms, along with their MD5 hashes, in a database. I have an index of 1 million documents containing ~400,000 unique terms. I got this figure using a cardinality aggregation in Elasticsearch:

GET /dt_index/document/_search
{
  "aggregations": {
    "my_agg": {
      "cardinality": {
        "field": "text"
      }
    }
  }
}
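
Note that the cardinality aggregation only returns an approximate count. Also, since I don't need the search hits themselves, I believe setting "size": 0 in the request body suppresses the 10 hits that would otherwise be returned:

GET /dt_index/document/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "cardinality": {
        "field": "text"
      }
    }
  }
}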

I can get the unique terms using the following:

GET /dt_matrix/document/_search
{
  "aggregations": {
    "my_agg": {
      "term": {
        "field": "text",
        "size": 100
      }
    }
  }
}

This gives me 10 search results along with a terms aggregation of 100 unique terms. But pulling a single JSON response containing all ~400,000 terms would take a lot of memory. For search results we can iterate batch by batch using scan/scroll; is there a similar way to iterate through all unique terms without loading them into memory at once?
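For reference, scrolling over the documents themselves looks something like this (it pages through hits, but not through aggregation buckets):

GET /dt_index/document/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}

Each response returns a _scroll_id, which is passed to the _search/scroll endpoint to fetch the next batch.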


2 Answers


You can't scan/scroll through aggregation results. Instead, you should index these unique terms into a separate index (or type) at indexing time, and then paginate over that index normally.
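If you are on a newer Elasticsearch version (the composite aggregation was added in 6.1), you can also page through all term buckets without maintaining a separate index. A sketch (the source name "text_value" is an arbitrary label I chose):

GET /dt_index/document/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "composite": {
        "size": 1000,
        "sources": [
          { "text_value": { "terms": { "field": "text" } } }
        ]
      }
    }
  }
}

Each response includes an after_key; pass it back as "after" inside the composite block to fetch the next page of buckets.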


Although you can't scroll through aggregations, you can retrieve smaller, more memory-manageable subsets by narrowing your query. For example, request all unique terms starting with the letter "a", then "b", and so on. Adjust the query until the largest subset is an acceptable size.
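For example, assuming the text field holds lowercase terms, the terms aggregation's include parameter accepts a regular expression that restricts the buckets to one prefix at a time:

GET /dt_matrix/document/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text",
        "include": "a.*",
        "size": 100000
      }
    }
  }
}

Repeat with "b.*", "c.*", and so on. Newer Elasticsearch versions also support splitting the buckets directly, e.g. "include": { "partition": 0, "num_partitions": 20 }, which avoids having to guess at prefix sizes.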