Intersect two queries with different filters

1.2k Views Asked by At

I use Druid for monitoring events in my website. The data can be represented as follows:

event_id   |  country  |  user_id  |  event_type  
================================================
1          |  USA      |  id1      |  visit
2          |  USA      |  id2      |  visit
1          |  Canada   |  id3      |  visit
3          |  USA      |  id1      |  click
1          |  Canada   |  id4      |  visit
3          |  Canada   |  id3      |  click
3          |  USA      |  id2      |  click

I also defined an aggregation for counting events. I made queries to Druid to present data for event_id=3 as follows:

Pay attention that the visits are not related to the event_id.

country   |  visits    | clicks  
===============================    
USA       |  4         |  2
Canada    |  3         |  2

Currently I use two queries of topNResults with 2 different filters:

  1. event_type = visit -> to count visits per country regardless of the event id.
  2. event_id = 3

Of course my data is much larger than that and contains many countries. topNResults api must have threshold param that represents the max amount of results that I want to get as a response.

The problem is if my threshold is smaller than actual results, those two queries might not have the same countries results.

Currently, I merge the overlapping results in my server but I loose some countries results and I display less than my threshold although there are more results.

What can I do to optimize that I will always have the same countries for my threshold (without sending list of countries returned from the first query to the second filter - I tried it and it was very slow)?

1

There are 1 best solutions below

0
On

Sounds Filtered Aggregator will save you all queries.
Filtered Aggregator aggregate only the values which match the dimension filter.
The following query will do the trick in your case: After druid will group all events under the countries (because the dimension is country) the aggregator filter will filter all events which it's event ids in (e1,e2) and perform count aggregator on the filtered results.

{
  ...
  "dimension":"country",
  ...,
   "aggregations": [
      {
        "type" : "filtered",
        "filter" : {
           "type" : "selector",
           "dimension" : "event_id",
           "value" : ["1","2"]
           "type": "in"
         }
        "aggregator" : {
          "type" : "count", 
          "name" : "count_countries" }
        }
      }
   ]
 }

Let's take your table.

event_id   |  country  |  user_id  |  event_type  
================================================
1          |  USA      |  id1      |  visit
2          |  USA      |  id2      |  visit
1          |  Canada   |  id3      |  visit
3          |  USA      |  id1      |  click
1          |  Canada   |  id4      |  visit
3          |  Canada   |  id3      |  click
3          |  USA      |  id2      |  click

Druid will group the results by country.

  country  |  user_id  |  event_type | event_id    
================================================
  USA      |  id1      |  visit  |  1
  USA      |  id2      |  visit  |  2
  USA      |  id1      |  click  |  1
  USA      |  id2      |  click  |  3
  Canada   |  id3      |  visit  |  1 
  Canada   |  id4      |  visit  |  3
  Canada   |  id3      |  click  |  3

The aggregator filter will remove all event_id=3 because of our filter ("value" : ["1","2"])

  country  |  user_id  |  event_type | event_id    
================================================
  USA      |  id1      |  visit  |  1
  USA      |  id2      |  visit  |  2
  USA      |  id1      |  click  |  1
  Canada   |  id3      |  visit  |  1 

And return the following result (our aggregator is simple count)

  country  |  count   
===================
  USA      |  3   
  Canada   |  1   

Enjoy!