I have set up an Apache Druid installation that ingests data from a Kafka topic. It works very smoothly and efficiently.
I'm currently trying to implement some queries and I'm stuck on counting the rows (grouped by some fields) whose column value is an outlier. In the normal SQL world, I would essentially compute the first and third quartiles (q1 and q3) and then use something like this (I'm interested only in "right" outliers):
SUM(IF(column_value > q3 + 1.5*(q3-q1), 1, 0))
This approach relies on a CTE and a join: I compute the quartiles in a grouped CTE and then join it back to the original table.
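In standard SQL the pattern would look something like the following sketch, using `PERCENTILE_CONT` for the quartiles (the table `events` and the columns `grp` and `val` are hypothetical placeholder names):

```sql
-- Compute per-group quartiles in a CTE, then join back to the
-- original table and count values above q3 + 1.5 * IQR.
WITH quartiles AS (
  SELECT
    grp,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY val) AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY val) AS q3
  FROM events
  GROUP BY grp
)
SELECT
  e.grp,
  SUM(CASE WHEN e.val > q.q3 + 1.5 * (q.q3 - q.q1) THEN 1 ELSE 0 END)
    AS right_outliers
FROM events e
JOIN quartiles q ON e.grp = q.grp
GROUP BY e.grp;
```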
I was easily able to compute the quartiles and the outlier threshold with the DataSketches extension using a groupBy query, but I can't figure out how to write a postAggregation that performs the count.
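For reference, the per-group quartiles and the threshold can be expressed in Druid SQL with the DataSketches extension's `APPROX_QUANTILE_DS` function, roughly as below (again, `events`, `grp`, and `val` are placeholder names; the quantile expressions must be repeated because SQL select-list aliases can't be referenced in the same level):

```sql
-- Per-group approximate quartiles and the right-outlier threshold,
-- computed with DataSketches quantile sketches.
SELECT
  grp,
  APPROX_QUANTILE_DS(val, 0.25) AS q1,
  APPROX_QUANTILE_DS(val, 0.75) AS q3,
  APPROX_QUANTILE_DS(val, 0.75)
    + 1.5 * (APPROX_QUANTILE_DS(val, 0.75) - APPROX_QUANTILE_DS(val, 0.25))
    AS threshold
FROM events
GROUP BY grp;
```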
In theory, I could run a second query using the thresholds obtained from the first. Unfortunately, there can be hundreds of thousands of distinct group values, which makes this approach unfeasible.
Do you have any suggestions on how to tackle this problem?
As of version 0.18.0, Apache Druid supports joins, which solves the problem: the per-group thresholds can be computed in a subquery and joined back to the original datasource in a single query.
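Putting it together, the whole computation can be sketched as one Druid SQL query: the inner query builds the per-group thresholds with `APPROX_QUANTILE_DS`, and the outer query joins them back to count the right outliers (table and column names are placeholders, as above):

```sql
-- Single-pass outlier count: join each row to its group's
-- sketch-derived quartiles and count values above the fence.
SELECT
  e.grp,
  SUM(CASE WHEN e.val > q.q3 + 1.5 * (q.q3 - q.q1) THEN 1 ELSE 0 END)
    AS right_outliers
FROM events e
JOIN (
  SELECT
    grp,
    APPROX_QUANTILE_DS(val, 0.25) AS q1,
    APPROX_QUANTILE_DS(val, 0.75) AS q3
  FROM events
  GROUP BY grp
) q ON e.grp = q.grp
GROUP BY e.grp;
```

Note that in Druid the right-hand side of a join is materialized and broadcast, so with hundreds of thousands of groups it's worth checking the subquery result stays within the configured subquery row limits.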