Is there a way to calculate moving median for an attribute in Spark DataFrame?
I was hoping that it is possible to calculate moving median using a window function (by defining a window using rowsBetween(0,10)), but there no functionality to calculate it (similar to average or mean).
I think you've got few options here.
ntile window function
I think
ntile(2)(over a window of rows) would give you two "segments" that in turn you could use to calculate the median over the window.Quoting the scaladoc:
If the number of rows in one group is bigger than in the other, pick the largest from the bigger group.
If the number of rows in the groups is even, take the maximum and the minimum in each group and calculate the median.
I found it quite nicely described in Calculating median using the NTILE function.
percent_rank window function
I think
percent_rankmight also be an option to calculate the median over a window of rows.Quoting the scaladoc:
User-Defined Aggregate Function (UDAF)
You could write a user-defined aggregate function (UDAF) to calculate median over a window.
A UDAF extends org.apache.spark.sql.expressions.UserDefinedAggregateFunction which is (quoting the scaladoc):
Luckily there is an sample implementation of a custom UDAF in UserDefinedUntypedAggregation example.