Elasticsearch filter the maximum value document

I am trying to get the document with the maximum value among records that share the same name. For example, I have 3 users, 2 of whom have the same name but different follower counts. I want to return only 1 of the 2 documents with the same name, chosen by the maximum follower_count.

{ id: 1, name: "John Greenwood", follower_count: 100 }
{ id: 2, name: "John Greenwood", follower_count: 200 }
{ id: 3, name: "John Underwood", follower_count: 300 }

So the result would be,

{ id: 2, name: "John Greenwood", follower_count: 200 }
{ id: 3, name: "John Underwood", follower_count: 300 }

Of the 2 documents with the same name, the one with the most followers wins, and the document with the unique name is returned as well.

My mapping is as follows:

"users-development" : {
    "mappings" : {
      "user" : {
        "dynamic" : "false",
        "properties" : {
          "follower_count" : {
            "type" : "integer"
          },
          "name" : {
            "type" : "string",
            "fields" : {
              "exact" : {
                "type" : "string",
                "index" : "not_analyzed"
              }
            }
          }
        }
      }
    }
  }

This is where I have been stuck for a long time:

         {
            query: {
              filtered: {
                filter: {
                  bool: {
                    must: [
                      { terms: { "name.exact": [ "John Greenwood", "John Underwood" ] } },
                    ]
                  }
                }
              }
            },

            aggs: {
              max_follower_count: { max: { field: 'follower_count' } }
            },

            size: 1000,
          }

Any suggestions, please?

There are 2 best solutions below

The Elastic Stack has a tool made for exactly this kind of question, like a hammer for a nail: aggregations. See the examples below. First of all, in your case you need to aggregate by the full name, including spaces, so your name field needs to be not_analyzed, like this:

PUT /index
{
  "mappings": {
    "users": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

Now your query will be like this one:

POST /index/users/_search
{
   "aggs": {
      "users": {
         "terms": {
            "field": "name"
         },
         "aggs": {
            "followers": {
               "max": {
                  "field": "follower_count"
               }
            }
         }
      }
   }
}

I just aggregated by name and used a max metric to get the highest follower count.

The response will be like this:

"aggregations": {
   "users": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
         {
            "key": "John Greenwood",
            "doc_count": 2,
            "followers": {
               "value": 200
            }
         },
         {
            "key": "John Underwood",
            "doc_count": 1,
            "followers": {
               "value": 300
            }
         }
      ]
   }
}

Hope that helps. Use aggregations whenever you need to group data and compute metrics such as sums or maximums over the values.
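
Note that the max metric returns only the value (200), not the document that holds it. If the whole document is needed, as in the question, one option is a top_hits sub-aggregation sorted by follower_count with a size of 1. Below is a minimal sketch in Python that only builds the request body. The helper name build_top_follower_query and the aggregation names by_name and top_user are hypothetical, the field names are taken from the mapping in the question, and the body has not been run against the asker's cluster.

```python
import json


def build_top_follower_query(names):
    """Build a search body that keeps, per name, the document with the most followers.

    Hypothetical helper: field names follow the mapping shown in the question.
    """
    return {
        "size": 0,  # skip the regular hits; only the aggregation output is needed
        "query": {"terms": {"name.exact": names}},
        "aggs": {
            "by_name": {
                # one bucket per distinct name; ask for as many buckets as names
                "terms": {"field": "name.exact", "size": len(names)},
                "aggs": {
                    "top_user": {
                        # keep only the single best document in each bucket
                        "top_hits": {
                            "size": 1,
                            "sort": [{"follower_count": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = build_top_follower_query(["John Greenwood", "John Underwood"])
print(json.dumps(body, indent=2))
```

With a body like this, each bucket should carry the winning document under top_user.hits.hits[0]._source, so the client can collect one document per name.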

Ok, I think you are looking for something along these lines, using the terms aggregation:

{
   "query": {
      "terms": { "name.exact": [ "John Greenwood", "John Underwood" ] }
   },
   "aggs": {
      "max_follower_count": {
         "terms": {
            "field":"name.exact"
         },
         "aggs":{
             "max_follow" : { "max" : { "field" : "follower_count" } }
         }
      }
   },
   "size": 1000
}

The terms aggregation makes a bucket for each unique value of name.exact, which here will only be the names specified in your terms query. So we now have a bucket for each John, and we can use the max aggregation to find the highest follower count in each. The max aggregation operates on each bucket of its parent aggregation.

Each of these unique terms will then have its max value of follower_count computed, and displayed in the bucket. Results look as follows:

... //query results of just the terms query up here
"aggregations": {
  "max_follower_count": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 0,
     "buckets": [
        {
           "key": "John Greenwood",
           "doc_count": 2,
           "max_follow": {
              "value": 200
           }
        },
        {
           "key": "John Underwood",
           "doc_count": 1,
           "max_follow": {
              "value": 300
           }
        }
     ]
  }
}
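
The bucket-then-max logic described above can be mimicked in plain Python on the three sample documents from the question. This is only an illustration of what the aggregation computes, not how Elasticsearch implements it, and the function name max_followers_per_name is hypothetical.

```python
from collections import defaultdict

# The three sample documents from the question.
docs = [
    {"id": 1, "name": "John Greenwood", "follower_count": 100},
    {"id": 2, "name": "John Greenwood", "follower_count": 200},
    {"id": 3, "name": "John Underwood", "follower_count": 300},
]


def max_followers_per_name(documents):
    """Group documents by name, then take the max follower_count per group."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["name"]].append(doc)  # the "terms" step: one bucket per name
    return [
        {
            "key": name,
            "doc_count": len(group),
            # the "max" step runs inside each bucket separately
            "max_follow": {"value": max(d["follower_count"] for d in group)},
        }
        for name, group in groups.items()
    ]


buckets = max_followers_per_name(docs)
for bucket in buckets:
    print(bucket)
```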

The terms aggregation comes with a few caveats about how it does the counting, and the linked documentation should be pretty clear on that.