Apache Druid: count outliers

210 Views Asked by At

I prepared an installation of Apache Druid that takes data from a Kafka topic. It works very smoothly and efficiently.

I'm currently trying to implement some queries and I'm stuck in the count of rows (grouped by some fields) for which a column value is an outlier. In the normal SQL world, I will essentially compute the first and third quartiles (q1 and q3) and then use something like (I'm interested only in "right" outliers):

SUM(IF(column_value > q3 + 1.5*(q3-q1), 1, 0))

This approach makes use of cte and joins: I compute the quartiles in a cte with grouping and then I join it with the original table.

I was able to easily compute the quartiles and the outlier threshold with datasketch extension using a groupBy query, but I'm not realizing how to perform a postAggregation that can perform the count.

In theory, I may implement a second query using the thresholds obtained in the first. Unfortunately, I can get hundreds of thousands of different values. That makes this approach unfeasible.

Do you have any suggestion on how to tackle this problem?

1

There are 1 best solutions below

0
On BEST ANSWER

As of version 0.18.0, Apache Druid supports joins. This solves the problem.