ElasticSearch query with conditions on multiple documents


I have data in this format in Elasticsearch; each record is in a separate document:

{ 'pid': 1, 'nm': 'tom' }, { 'pid': 1, 'nm': 'dick' }, { 'pid': 1, 'nm': 'harry' }, { 'pid': 2, 'nm': 'tom' }, { 'pid': 2, 'nm': 'harry' }, { 'pid': 3, 'nm': 'dick' }, { 'pid': 3, 'nm': 'harry' }, { 'pid': 4, 'nm': 'harry' }

    {
       "took": 137,
       "timed_out": false,
       "_shards": {
          "total": 5,
          "successful": 5,
          "failed": 0
       },
       "hits": {
          "total": 8,
          "max_score": null,
          "hits": [
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KS86AaDUbQTYUmwY",
                "_score": null,
                "_source": {
                   "pid": 1,
                   "nm": "Harry"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KJ9BAaDUbQTYUmwW",
                "_score": null,
                "_source": {
                   "pid": 1,
                   "nm": "Tom"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KRlbAaDUbQTYUmwX",
                "_score": null,
                "_source": {
                   "pid": 1,
                   "nm": "Dick"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KYnKAaDUbQTYUmwa",
                "_score": null,
                "_source": {
                   "pid": 2,
                   "nm": "Harry"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KXL5AaDUbQTYUmwZ",
                "_score": null,
                "_source": {
                   "pid": 2,
                   "nm": "Tom"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KbcpAaDUbQTYUmwb",
                "_score": null,
                "_source": {
                   "pid": 3,
                   "nm": "Dick"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9Kdy5AaDUbQTYUmwc",
                "_score": null,
                "_source": {
                   "pid": 3,
                   "nm": "Harry"
                }
             },
             {
                "_index": "query_test",
                "_type": "user",
                "_id": "AVj9KetLAaDUbQTYUmwd",
                "_score": null,
                "_source": {
                   "pid": 4,
                   "nm": "Harry"
                }
             }
          ]
       }
    }

I need to find the pids which have 'harry' and do not have 'tom', which in the above example are 3 and 4. Essentially, this means looking for groups of documents sharing the same pid where none of them has nm with value 'tom' but at least one of them has nm with value 'harry'.

How do I query that?

EDIT: Using Elasticsearch version 5


There are 2 best solutions below


I am fairly new to Elasticsearch, so I might be wrong, but I have never seen a query like this. Simple filters cannot be used here, because they are applied to individual documents rather than to aggregations, which is not what you want. What you want is a "GROUP BY" query with a "HAVING" clause, in SQL terms. But GROUP BY queries normally involve a metric aggregation (such as avg, max, or min of some field), which is then used in the HAVING clause; in other words, you post-process the aggregation results with a reducer. For queries like that, the Bucket Selector Aggregation can be used.
Your case is different, though. You do not want to apply a HAVING clause to a metric aggregation; you want to check whether some value is present in a field (or column) of your grouped data. In SQL terms, you want a "WHERE" condition inside a "GROUP BY". This is what I have never seen.
However, at the application level you can easily do this by splitting the query in two. First, find the unique pids where nm = harry using a terms aggregation. Then, among those pids, find the ones that also have a document with nm = tom and exclude them.
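The application-level approach can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the documents are held in memory instead of being fetched from Elasticsearch, the helper names (`terms_agg_body`, `pids_with_name`, `pids_with_but_without`) are made up for this example, and the second step collects the pids that have a 'tom' document and subtracts them from the 'harry' pids.

```python
# Sample documents from the question (names lowercased for simplicity).
DOCS = [
    {"pid": 1, "nm": "tom"}, {"pid": 1, "nm": "dick"}, {"pid": 1, "nm": "harry"},
    {"pid": 2, "nm": "tom"}, {"pid": 2, "nm": "harry"},
    {"pid": 3, "nm": "dick"}, {"pid": 3, "nm": "harry"},
    {"pid": 4, "nm": "harry"},
]


def terms_agg_body(name, max_pids=1000):
    """Hypothetical request body: unique pids among documents with nm == name.

    Assumes nm is indexed so that a term query on the lowercase name matches,
    and that max_pids is large enough to return every pid bucket.
    """
    return {
        "size": 0,
        "query": {"term": {"nm": name}},
        "aggs": {"pids": {"terms": {"field": "pid", "size": max_pids}}},
    }


def pids_with_name(docs, name):
    """In-memory stand-in for running terms_agg_body(name) against the index."""
    return {d["pid"] for d in docs if d["nm"] == name}


def pids_with_but_without(docs, wanted, unwanted):
    with_wanted = pids_with_name(docs, wanted)      # step 1: pids having 'harry'
    with_unwanted = pids_with_name(docs, unwanted)  # step 2: pids having 'tom'
    return sorted(with_wanted - with_unwanted)      # keep only pids without 'tom'


result = pids_with_but_without(DOCS, "harry", "tom")
print(result)  # [3, 4]
```

Against a real cluster, `pids_with_name` would instead send the body from `terms_agg_body` and read the bucket keys from the response; the set difference stays the same.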

P.S. I am very new to ES, and I would be very happy if anyone contradicts me and shows a way to do this in one query. I would learn from that too.


What if you use a POST request body like the one below, with a bool query?

    POST _search
    {
      "query": {
        "bool" : {
          "must" : {
            "term" : { "nm" : "harry" }
          },
          "must_not" : {
            "term" : { "nm" : "tom" }
          }
        }
      }
    }
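Note that a bool query like this is evaluated per document, so as far as I can tell it matches every document whose nm is 'harry' (no single document holds both names), and that would include pids 1 and 2. A per-pid variant might be sketched with a terms aggregation on pid, two filter sub-aggregations, and a bucket_selector over their counts. The Python below only builds such a request body. The function name and the aggregation names (`by_pid`, `wanted`, `unwanted`, `keep`) are invented for this example, the `size` of 1000 is an arbitrary assumption, and the body has not been run against the asker's index.

```python
import json


def build_per_pid_query(wanted, unwanted, max_pids=1000):
    """Hypothetical request body: keep pid buckets with `wanted` and no `unwanted`."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "by_pid": {
                "terms": {"field": "pid", "size": max_pids},
                "aggs": {
                    # Per-bucket document counts for each name.
                    "wanted": {"filter": {"term": {"nm": wanted}}},
                    "unwanted": {"filter": {"term": {"nm": unwanted}}},
                    # Drop buckets that fail the "HAVING"-style condition.
                    "keep": {
                        "bucket_selector": {
                            "buckets_path": {
                                "w": "wanted>_count",
                                "u": "unwanted>_count",
                            },
                            "script": "params.w > 0 && params.u == 0",
                        }
                    },
                },
            }
        },
    }


body = build_per_pid_query("harry", "tom")
print(json.dumps(body, indent=2))
```

If this works as intended, the remaining `by_pid` buckets would carry the matching pids as their keys. With many distinct pids, the terms aggregation's `size` limit becomes the main thing to watch.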