Elasticsearch significant terms aggregation

I've started using the significant terms aggregation to see which keywords are important in groups of documents as compared to the entire set of documents I've indexed.

It all works well until a lot of documents are indexed. Then the same query that used to work fails, and Elasticsearch only says:

 SearchPhaseExecutionException[Failed to execute phase [query], 
 all shards failed; shardFailures {[OIWBSjVzT1uxfxwizhS5eg][demo_paragraphs][0]:
 CircuitBreakingException[Data too large, data for field [text] 
 would be larger than limit of [633785548/604.4mb]];

My query looks like this:

 POST /demo_paragraphs/_search
 {
     "query": {
         "match": {
            "django_target_id": 1915661
         }
     },
     "aggregations" : {
         "signKeywords" : {
             "significant_terms" : {
                 "field" : "text"
             }
         }
     }
 }

And the document structure:

        "_source": {
           "django_ct": "citations.citation",
           "django_target_id": 1915661,
           "django_id": 3414077,
           "internal_citation_id": "CR7_151",
           "django_source_id": 1915654,
           "text": "Mucin 1 (MUC1) is a protein heterodimer that is overexpressed in lung cancers [6]. MUC1 consists of two subunits, an N-terminal extracellular subunit (MUC1-N) and a C-terminal transmembrane subunit (MUC1-C). Overexpression of MUC1 is sufficient for the induction of anchorage independent growth and tumorigenicity [7]. Other studies have shown that the MUC1-C cytoplasmic domain is responsible for the induction of the malignant phenotype and that MUC1-N is dispensable for transformation [8]. Overexpression of",
           "id": "citations.citation.3414077",
           "num_distinct_citations": 0
        }

The data that I index are paragraphs from scientific papers. No document is really large.

Any ideas on how to analyze or solve the problem?

There are 3 best solutions below

I think there is a simple solution: please give ES more RAM :D Aggregations require a lot of memory.
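
To check whether field data for the text field is what fills the heap, the requests below are one way to look. This is a sketch, assuming a version that exposes the cat fielddata API and the breaker section of the node stats. The 604.4mb limit in the error would be consistent with a fielddata circuit breaker set to 60% of a roughly 1 GB heap, but that is an inference from the number, not something the error message states.

 GET /_cat/fielddata?v&fields=text

 GET /_nodes/stats/breaker

The first request lists how much heap the field data of the text field occupies on each node; the second shows the configured limit and the estimated size for each circuit breaker.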

If the data set is too large to compute the result on one machine, you may need more than one node.

Be thoughtful when planning shard distribution. Make sure that shards are evenly distributed so that each node is equally stressed when computing heavy queries. A good topology for large data sets is a Master-Data-Search configuration: one node acts as master (no data, no queries running on this node), a few nodes are dedicated to holding data (shards), and some nodes are dedicated to executing queries (they hold no data; they use the data nodes for partial query execution and combine the results). For example, Netflix uses this topology; see Netflix Raigad.
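
To see how the shards of the index are actually spread across the nodes, the cat APIs can help. A minimal sketch, assuming the index name demo_paragraphs from the question:

 GET /_cat/shards/demo_paragraphs?v

 GET /_cat/allocation?v

The first request lists every shard of the index together with the node it lives on; the second shows the number of shards and the disk usage per node, which makes an uneven distribution easy to spot.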

Paweł Róg is right: you will need much more RAM. As a start, increase the Java heap size available to each node; see the Elasticsearch configuration documentation for details. You will have to research how much RAM is enough. Sometimes too much RAM actually slows ES down (unless that was fixed in one of the recent versions).

Note that Elasticsearch 6.0 introduces the new significant_text aggregation, which doesn't require field data. See https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-significanttext-aggregation.html
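
A sketch of how the query from the question might look with significant_text, assuming Elasticsearch 6.0 or later and that text is mapped as a text field. The documentation recommends nesting the aggregation under a sampler so that only the top-matching documents per shard are re-analyzed; the shard_size of 200 and the use of filter_duplicate_text here are illustrative choices, not values taken from the question.

 POST /demo_paragraphs/_search
 {
     "size": 0,
     "query": {
         "match": {
            "django_target_id": 1915661
         }
     },
     "aggregations" : {
         "sample" : {
             "sampler" : {
                 "shard_size" : 200
             },
             "aggregations" : {
                 "signKeywords" : {
                     "significant_text" : {
                         "field" : "text",
                         "filter_duplicate_text" : true
                     }
                 }
             }
         }
     }
 }

Because the text is re-analyzed on the fly from the sampled documents instead of being loaded into field data, the heap cost no longer grows with the number of terms in the whole index.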