My aim is to store all unique terms along with their MD5 hashes in a database. I have an index of 1 million documents containing ~400,000 unique terms. I got this figure using a cardinality aggregation in Elasticsearch:
GET /dt_index/document/_search
{
  "aggregations": {
    "my_agg": {
      "cardinality": {
        "field": "text"
      }
    }
  }
}
I can get the unique terms using the following:
GET /dt_matrix/document/_search
{
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text",
        "size": 100
      }
    }
  }
}
This gives me 10 search hits along with a terms aggregation of 100 unique terms. But fetching a JSON response containing all ~400,000 terms at once would require a lot of memory. For iterating through all the search hits we can use scan/scroll; is there a similar way to iterate through all the unique terms without loading them into memory at once?
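(For reference, by scan/scroll I mean paging through ordinary hits like this; the page size and the 1m keep-alive are just example values:)

GET /dt_index/document/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}

GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}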
You can't scan/scroll through aggregation results. Instead, you should write these unique terms into a separate index or type at indexing time and then do normal pagination over it.
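A minimal sketch of that approach, assuming a hypothetical dt_terms index with a term type (all names and values here are illustrative; you'd likely want the term field mapped as not_analyzed so each term is stored verbatim). Each unique term becomes its own document together with its MD5 hash:

PUT /dt_terms/term/1
{
  "term": "example",
  "md5": "<md5 hash of the term>"
}

The terms can then be paged through with plain from/size:

GET /dt_terms/term/_search
{
  "from": 0,
  "size": 1000
}

Increase from by size on each page. And since the terms are now ordinary documents rather than aggregation buckets, you can also scan/scroll over dt_terms if you need to walk all ~400,000 of them without deep pagination.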