Improve performance of a nested terms aggregation?

Is there a way to improve the performance of a nested terms aggregation without sampling?

Terms query:

GET <INDEX>/_search?pretty&request_cache=false
{
    "_source": false,
    "sort": [
        "_doc"
    ],
    "size": 0,
    "track_total_hits": false,
    "aggregations": {
        "nested_suggestions": {
            "nested": {
                "path": "measurement"
            },
            "aggs": {
                "suggestions": {
                    "terms": {
                        "field": "measurement.description.label",
                        "size": 1
                    }
                }
            }
        }
    }
}
...
{
  "took" : 8239,
  "timed_out" : false,
  ...
  "aggregations" : {
    "nested_suggestions" : {
      "doc_count" : 226139234,
      "suggestions" : {
        "doc_count_error_upper_bound" : 7445607,
        "sum_other_doc_count" : 214543500,
        "buckets" : [
          {
            "key" : "xxx",
            "doc_count" : 11635382
          }
        ]
      }
    }
  }
}

Cardinality query:

GET <INDEX>/_search?pretty&request_cache=false
{
    "_source": false,
    "sort": [
        "_doc"
    ],
    "size": 0,
    "track_total_hits": false,
    "aggregations": {
        "nested_suggestions": {
            "nested": {
                "path": "measurement"
            },
            "aggs": {
                "suggestions": {
                    "cardinality": {
                        "field": "measurement.description.label"
                    }
                }
            }
        }
    }
}
...
{
  "took" : 5688,
  "timed_out" : false,
  ...
  "aggregations" : {
    "nested_suggestions" : {
      "doc_count" : 226139234,
      "suggestions" : {
        "value" : 1379
      }
    }
  }
}
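
For scale, the two responses above can be turned into a rough throughput figure. This is only a back-of-the-envelope sketch: the helper name is hypothetical (not part of any Elasticsearch API), and it treats `took` as if all of it were spent visiting nested documents.

```python
# Figures copied from the two responses above; `took` is reported in milliseconds.
NESTED_DOC_COUNT = 226_139_234
TERMS_TOOK_MS = 8239
CARDINALITY_TOOK_MS = 5688


def nested_docs_per_second(doc_count: int, took_ms: int) -> float:
    """Hypothetical helper: nested documents visited per second of `took` time."""
    return doc_count / (took_ms / 1000)


terms_rate = nested_docs_per_second(NESTED_DOC_COUNT, TERMS_TOOK_MS)
cardinality_rate = nested_docs_per_second(NESTED_DOC_COUNT, CARDINALITY_TOOK_MS)

print(f"terms:       {terms_rate / 1e6:.1f}M nested docs/s")
print(f"cardinality: {cardinality_rate / 1e6:.1f}M nested docs/s")
```

That works out to roughly 27 million nested documents per second for the terms aggregation and roughly 40 million per second for cardinality, summed over all shards.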

Minimal mapping:

{
    "settings": {
        "number_of_replicas": "0",
        "number_of_shards": "10",
        "analysis": {
            "normalizer": {
                "raw_clean": {
                    "type": "custom",
                    "filter": [
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "dynamic": "strict",
            "properties": {
                "id": {
                    "type": "keyword"
                },
                "measurement": {
                    "type": "nested",
                    "dynamic": "strict",
                    "properties": {
                        "id": {
                            "type": "keyword"
                        },
                        "description": {
                            "type": "text",
                            "norms": false,
                            "fields": {
                                "label": {
                                    "type": "keyword",
                                    "normalizer": "raw_clean",
                                    "ignore_above": 255,
                                    "eager_global_ordinals": true
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

I've verified via /_cat/fielddata?v that the global ordinals for this field are populated.

Is this kind of performance (roughly 8 seconds for terms, 6 seconds for cardinality) expected with nested terms aggregations?

Environment:

  • Elasticsearch 6.8.3
  • index size: ~200 GB (with the full mapping)
  • documents: ~1 million
  • nested documents: ~225 million
  • hardware: 4 CPUs, 16 GB RAM, 500 GB SSD
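
The approximate figures in this list imply the following averages. This is a sketch only: it uses the rounded numbers above and assumes the data is spread evenly across the 10 shards from the mapping, which is an assumption and not something measured here.

```python
# Rounded figures from the environment list and the index settings above.
INDEX_SIZE_GB = 200            # approximate, with the full mapping
PARENT_DOCS = 1_000_000        # approximate
NESTED_DOCS = 225_000_000      # approximate
NUMBER_OF_SHARDS = 10          # from "number_of_shards" in the settings

# Average number of nested "measurement" documents per parent document.
nested_per_parent = NESTED_DOCS / PARENT_DOCS

# Per-shard averages, ASSUMING an even distribution across shards.
shard_size_gb = INDEX_SIZE_GB / NUMBER_OF_SHARDS
nested_per_shard = NESTED_DOCS / NUMBER_OF_SHARDS

print(f"nested docs per parent: {nested_per_parent:.0f}")
print(f"size per shard:         {shard_size_gb:.0f} GB")
print(f"nested docs per shard:  {nested_per_shard / 1e6:.1f}M")
```

So each parent document carries about 225 nested measurements on average, and every shard holds on the order of 20 GB and 22.5 million nested documents that the aggregation has to visit.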