Different results between date histogram and date range on Elastic Search

3.6k Views Asked by At

I would like to analyse my logs data with Elastic Search/Kibana and count unique customer by month. Results are different when I use a date histogram aggregation and date range aggregation.

Here is the date histogram query :

"query": {
    "query_string": {
      "query": "_type:logs AND created_at:[2015-04-01 TO now]",
      "analyze_wildcard": true
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "created_at",
        "interval": "1M",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "customer.id"
          }
        }
      }
    }
  }

And results :

"aggregations": {
    "2": {
      "buckets": [
        {
          "1": {
            "value": 595805
          },
          "key_as_string": "2015-04-01T00:00:00.000Z",
          "key": 1427839200000,
          "doc_count": 6410438
        },
        {
          "1": {
            "value": 647788
          },
          "key_as_string": "2015-05-01T00:00:00.000Z",
          "key": 1430431200000,
          "doc_count": 6669555
        },...

Here is the date range query :

"query": {
    "query_string": {
      "query": "_type:logs AND created_at:[2015-04-01 TO now]",
      "analyze_wildcard": true
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "date_range": {
        "field": "created_at",
        "ranges": [
          {
            "from": "2015-04-01",
            "to": "2015-05-01"
          },
          {
            "from": "2015-05-01",
            "to": "2015-06-01"
          }
        ]
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "customer.id"
          }
        }
      }
    }
  }

And the response :

"aggregations": {
    "2": {
      "buckets": [
        {
          "1": {
            "value": 592179
          },
          "key": "2015-04-01T00:00:00.000Z-2015-05-01T00:00:00.000Z",
          "from": 1427846400000,
          "from_as_string": "2015-04-01T00:00:00.000Z",
          "to": 1430438400000,
          "to_as_string": "2015-05-01T00:00:00.000Z",
          "doc_count": 6411884
        },
        {
          "1": {
            "value": 616995
          },
          "key": "2015-05-01T00:00:00.000Z-2015-06-01T00:00:00.000Z",
          "from": 1430438400000,
          "from_as_string": "2015-05-01T00:00:00.000Z",
          "to": 1433116800000,
          "to_as_string": "2015-06-01T00:00:00.000Z",
          "doc_count": 6668060
        }
      ]
    }
  }

In the first case, I have 595,805 for April and 647,788 for May In the second case, I have 592,179 for April and 616,995 for May

Someone could explain me why I have these differences between these use cases ?

Thank you

I update my first post to add another example

I add another example with fewer data (on 1 day) but with the same issue. Here is the first request with date histogram :

{
  "size": 0,
  "query": {
    "query_string": {
      "query": "_type:logs AND logs.created_at:[2015-04-01 TO 2015-04-01]",
      "analyze_wildcard": true
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "created_at",
        "interval": "1h",
        "pre_zone": "00:00",
        "pre_zone_adjust_large_interval": true,
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "customer.id"
          }
        }
      }
    }
  }
}

And we can see 660 unique count with 1717 doc count for the first hour :

{  
   "hits":{  
      "total":203961,
      "max_score":0,
      "hits":[  

      ]
   },
   "aggregations":{  
      "2":{  
         "buckets":[  
            {  
               "1":{  
                  "value":660
               },
               "key_as_string":"2015-04-01T00:00:00.000Z",
               "key":1427846400000,
               "doc_count":1717
            },
            {  
               "1":{  
                  "value":324
               },
               "key_as_string":"2015-04-01T01:00:00.000Z",
               "key":1427850000000,
               "doc_count":776
            },
            {  
               "1":{  
                  "value":190
               },
               "key_as_string":"2015-04-01T02:00:00.000Z",
               "key":1427853600000,
               "doc_count":481
            }
         ]
      }
   }
}

But on the second request with the date range :

{
  "size": 0,
  "query": {
    "query_string": {
      "query": "_type:logs AND logs.created_at:[2015-04-01 TO 2015-04-01]",
      "analyze_wildcard": true
    }
  },
  "aggs": {
    "2": {
      "date_range": {
        "field": "created_at",
        "ranges": [
          {
            "from": "2015-04-01T00:00:00",
            "to": "2015-04-01T01:00:00"
          },
          {
            "from": "2015-04-01T01:00:00",
            "to": "2015-04-01T02:00:00"
          }
        ]
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "customer.id"
          }
        }
      }
    }
  }
}

We can see only 633 unique count with 1717 doc count :

{  
   "hits":{  
      "total":203961,
      "max_score":0,
      "hits":[  

      ]
   },
   "aggregations":{  
      "2":{  
         "buckets":[  
            {  
               "1":{  
                  "value":633
               },
               "key":"2015-04-01T00:00:00.000Z-2015-04-01T01:00:00.000Z",
               "from":1427846400000,
               "from_as_string":"2015-04-01T00:00:00.000Z",
               "to":1427850000000,
               "to_as_string":"2015-04-01T01:00:00.000Z",
               "doc_count":1717
            },
            {  
               "1":{  
                  "value":328
               },
               "key":"2015-04-01T01:00:00.000Z-2015-04-01T02:00:00.000Z",
               "from":1427850000000,
               "from_as_string":"2015-04-01T01:00:00.000Z",
               "to":1427853600000,
               "to_as_string":"2015-04-01T02:00:00.000Z",
               "doc_count":776
            }
         ]
      }
   }
}

Please someone could tell me why ? Thank you

1

There are 1 best solutions below

1
On

When using the date_histogram aggregation you need to take into account the timezone, which date_range doesn't as it's always using the GMT timezone.

If you look at the long millisecond values in your results, you'll see the following:

For your date histogram, from: 1427839200000 is actually equal to 2015-03-31T22:00:00.000Z which differs from the key_as_string value (i.e. 2015-04-01T00:00:00.000Z) that is formatted according to the GMT timezone.

In your first aggregation, try explicitly specifying the time_zone parameter to be your current timezone (apparently GMT+2) and you should get the same results:

  "date_histogram": {
    "field": "created_at",
    "interval": "1M",
    "min_doc_count": 1,
    "time_zone": -2
  },