Compute percentile with collapsing by user

Let's say I have an index holding a million tweets (the original objects). I want to get the 90th percentile of users based on their number of followers. I know the "percentiles" aggregation does this, but my problem is that Elasticsearch computes it over all documents, so users who tweet a lot add noise to my calculation. I want to isolate all unique users and then compute the 90th percentile. The other constraint is that I want to do this in only one or two requests, to keep the response time under 500 ms.
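To make the skew concrete, here is a minimal Python sketch, not Elasticsearch code. The sample data and the `nearest_rank_percentile` helper are hypothetical, and nearest-rank is a simplification of whatever approximation the percentiles aggregation actually uses. It compares a percentile computed over every tweet with one computed over unique users.

```python
# Hypothetical sample: one (user_id, followers_count) pair per tweet.
# Ten ordinary users tweet once; user "z" tweets 30 times.
tweets = [(f"u{i}", 10 * i) for i in range(1, 11)] + [("z", 5000)] * 30


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct / 100 * n)
    return ordered[max(rank, 1) - 1]


# Per-document: every tweet contributes its author's follower count.
p90_per_tweet = nearest_rank_percentile([f for _, f in tweets], 90)

# Per-user: keep one follower count per user ID, then take the percentile.
followers_by_user = dict(tweets)
p90_per_user = nearest_rank_percentile(followers_by_user.values(), 90)

print(p90_per_tweet, p90_per_user)  # 5000 100
```

The single prolific account dominates the per-tweet result, while the per-user result reflects the distribution of accounts, which is the effect I am after.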

I have tried a lot of things and managed to do this with "scripted_metric", but once my dataset exceeds 100k tweets the performance degrades critically.

Any advice?

Additional info:

  • My index stores original tweets & retweets based on user search queries
  • The index is mapped with a dynamic template mapping (no problem with this)
  • The index contains approximately 100M
  • Unfortunately, the "top_hits" aggregation doesn't accept sub-aggregations.

The request I am trying to achieve is:

{
  "collapse": {
    "field": "user.id"    <--- I want this effect on aggregation
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "metadatas.clientId": {
              "value": projectId
            }
          }
        },
        {
          "match": {
            "metadatas.blacklisted": false
          }
        }
      ],
      "filter": [
        {
          "range": {
            "publishedAt": {
              "gte": "now-90d/d"
            }
          }
        }
      ]
    }
  },
  "aggs":{
    "twitter": {
      "percentiles": {
        "field": "user.followers_count",
        "percents": [90]
      }
    }
  },
  "size": 0
}

There are 1 best solutions below

I finally found a workaround.

The percentiles aggregation accepts a script. I use the params variable to hold the user IDs already seen: the first time a user appears, the script returns that user's follower count; for every later document from the same user, it returns _score instead.

Without a complete explanation of how the computation works, I cannot fine-tune the behavior of my script, but the result is good enough for me.

"aggs": {
  "unique": {
    "cardinality": {
      "field": "collapse_profile"
    }
  },
  "thresholds": {
    "percentiles": {
      "field": "user.followers_count",
      "percents": [90],
      "script": {
        "source": """
          // Map of the user IDs this script has already seen
          if (params.keys == null) {
            params.keys = new HashMap();
          }

          def key = doc['user.id'].value;
          def value = doc['user.followers_count'].value;

          // First document for this user: return the follower count
          if (params.keys[key] == null) {
            params.keys[key] = _score;
            return value;
          }
          // Later documents for the same user: return _score instead
          return _score;
        """,
        "lang": "painless"
      }
    }
  }
}
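To see roughly what the script hands to the percentile computation, here is a small Python simulation. It is only a sketch under assumptions: the sample data, the constant score of 1.0 and the `nearest_rank_percentile` helper are all hypothetical, real `_score` values vary per document, and Elasticsearch uses an approximate percentile algorithm instead of nearest-rank. Whether the `params.keys` map is shared across shards is also not something I have verified.

```python
def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct / 100 * n)
    return ordered[max(rank, 1) - 1]


def script_values(docs, score=1.0):
    """Mimic the script: follower count for a user's first document,
    the (assumed constant) score for every later document."""
    seen = set()
    out = []
    for user_id, followers in docs:
        if user_id not in seen:
            seen.add(user_id)
            out.append(followers)
        else:
            out.append(score)
    return out


# Hypothetical sample: ten users tweet once, user "z" tweets 30 times.
docs = [(f"u{i}", 10 * i) for i in range(1, 11)] + [("z", 5000)] * 30

values = script_values(docs)
p90_script = nearest_rank_percentile(values, 90)
p90_unique = nearest_rank_percentile(dict(docs).values(), 90)

print(len(values), p90_script, p90_unique)  # 40 70 100
```

In this toy case the duplicates are not dropped: each one still contributes a value (the score), so the percentile is taken over 40 values instead of 11 and lands lower than the strictly per-user figure. That may be why the result is "good enough" but hard to fine-tune.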