Let's say I have an index holding a million tweets (the original objects). I want the 90th percentile of users by number of followers. I know the "percentiles" aggregation does this, but my problem is that Elasticsearch uses all documents, so users who tweet a lot add noise to my calculation. I want to isolate the unique users first, then compute the 90th percentile. The other constraint is that I want to do this in only one or two requests, to keep the response time under 500 ms.
I have tried a lot of things, and I was able to do this with "scripted_metric", but when my dataset exceeds 100k tweets the performance drops critically.
Any advice?
Additional info:
- My index stores original tweets & retweets based on user search queries
- The index is mapped with a dynamic template mapping (no problem with this)
- The index contains approximately 100M
- Unfortunately, the "top_hits" aggregation doesn't accept sub-aggregations.
The request I am trying to achieve is:
{
  "collapse": {
    "field": "user.id" <--- I want this effect on aggregation
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "metadatas.clientId": {
              "value": projectId
            }
          }
        },
        {
          "match": {
            "metadatas.blacklisted": false
          }
        }
      ],
      "filter": [
        {
          "range": {
            "publishedAt": {
              "gte": "now-90d/d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "twitter": {
      "percentiles": {
        "field": "user.followers_count",
        "percents": [95]
      }
    }
  },
  "size": 0
}
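To make the effect I am after concrete, here is a minimal client-side sketch in Python. It is not an Elasticsearch request: the sample data is made up, the helper names (percentile, followers_per_unique_user) are hypothetical, and it uses an exact nearest-rank percentile, whereas the "percentiles" aggregation is approximate. It only shows how one prolific user skews the result when every document is counted, and what collapsing on user.id before the percentile would give.

```python
import math


def percentile(values, pct):
    """Exact nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def followers_per_unique_user(tweets):
    """Keep a single followers_count per user.id (the last one seen)."""
    by_user = {}
    for tweet in tweets:
        user = tweet["user"]
        by_user[user["id"]] = user["followers_count"]
    return list(by_user.values())


# Made-up sample: user 1 tweets 50 times, users 2..10 tweet once each.
tweets = [{"user": {"id": 1, "followers_count": 1_000_000}} for _ in range(50)]
tweets += [{"user": {"id": i, "followers_count": i * 100}} for i in range(2, 11)]

# What the plain aggregation sees: one value per document.
all_docs = [t["user"]["followers_count"] for t in tweets]
# What I want it to see: one value per unique user.
unique = followers_per_unique_user(tweets)

p90_all = percentile(all_docs, 90)
p90_unique = percentile(unique, 90)
print(p90_all, p90_unique)
```

With this sample, the per-document percentile is dominated by the one account that tweets constantly, while the per-user percentile reflects the actual distribution of users.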
Finally, I found a workaround.
The percentiles aggregation accepts a script. I use the params variable to hold the unique keys already seen, then return the preceding _score.
Without a complete explanation of how the computation works, I cannot fine-tune the behavior of my script, but the result is good enough for me.