ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

Question

53.4k Views Asked by esp At 13 May 2016 at 17:45

The query:

{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}

Error:

{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}

Index size is 5G. How much memory does the cluster need to execute this query?

There are 3 best solutions below

**Val** · Answer 1 · 2016-05-18T08:49:34.367000

You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:

indices.breaker.request.limit: 41%

Or if you prefer to not restart your cluster you can change the setting dynamically using:

curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'

Judging by the numbers in the error ("bytes_limit": 6844055552, "bytes_wanted": 6844240272), you are only about 185 KB short (6844240272 - 6844055552 = 184720 bytes). Since the limit is 40% of the heap, your total heap is about 17 GB, so raising the limit by one percentage point to 41% gives the request breaker roughly 170 MB of additional headroom, which should be sufficient.

Just make sure not to raise this value too far: the request circuit breaker shares the heap with the fielddata circuit breaker and other components, so you run the risk of going OOM.

**chetan varma** · Answer 2 · 2016-05-18T11:36:09.383000

Circuit breakers are designed to deal with situations where processing a request needs more memory than is available. You can set the limit with the following request:

PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%" 
  }
}

You can find more information here:

https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

**Andrei Stefan** · Answer 3 · 2016-05-19T11:13:41.407000

I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.

The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).

Once that is done for all the terms, you want a second significant_terms aggregation that repeats the first step, but now for each term individually, running another significant_terms aggregation on it. That's gonna be painful. Basically, you are computing number_of_terms * number_of_terms significant_terms calculations.

The big question is what are you trying to do?

If you want to see a relationship between all the terms in that field, that's gonna be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so, and then run a second query with another significant_terms aggregation, limiting the terms by putting it under a parent terms aggregation that includes only those 10 terms from the first query.
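A rough sketch of that second query, assuming the first query was the original request without the nested `aggs` block. The aggregation name `topTerms` is made up for illustration, and the `PLACEHOLDER_TERM_*` values stand in for whatever terms the first query actually returned; the field name and the range query are taken from the question.

```json
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now/d",
        "lt": "now+1d/d"
      }
    }
  },
  "aggregations": {
    "topTerms": {
      "terms": {
        "field": "translatedTitle",
        "include": ["PLACEHOLDER_TERM_1", "PLACEHOLDER_TERM_2"],
        "size": 10
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle",
            "size": 10
          }
        }
      }
    }
  }
}
```

With the exact-value `include` list, the parent terms aggregation only creates buckets for the listed terms, so the nested significant_terms work is bounded by the length of that list rather than by the cardinality of the field.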

You can also take a look at the sampler aggregation and use it as the parent of a single significant_terms aggregation.
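A minimal sketch of the sampler variant, assuming an Elasticsearch version that ships the sampler aggregation (2.0 or later). The aggregation name `sample` and the `shard_size` of 200 are illustrative choices, not values from the question; tune `shard_size` to whatever sample size gives useful results.

```json
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now/d",
        "lt": "now+1d/d"
      }
    }
  },
  "aggregations": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "sigTerms": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  }
}
```

The sampler keeps only the top-scoring documents on each shard, so the significant_terms aggregation underneath it sees at most `shard_size` documents per shard. With a plain range query the documents are unlikely to differ in score, so the sample is probably close to arbitrary rather than "best matching".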

Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase the limit and maybe it will work, but it should make you ask yourself whether that's the right query for your use case (it doesn't sound like it is). The value reported in the exception might not be the final one either: reused_arrays refers to a resizable array class in Elasticsearch, so if more elements are needed the array grows, and you may hit the circuit breaker again at a higher value.
