Spark Imputer for filling up missing values


Requirement -

In the attached picture, consider the first 3 columns as my raw data. Some rows have a NULL value in the quantity column, which is exactly what I want to fill. Ideally, I would fill each NULL value with the previous known value.
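For reference, a minimal sketch of that "previous known value" fill using a window function rather than Imputer. The sample rows and the column names `item` and `date` are assumptions made for illustration (only `quantity` appears in my actual code), and the dates are ISO-formatted strings so that sorting them as text matches chronological order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("forward-fill-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; "item" and "date" are assumed column names.
val df = Seq(
  ("A", "2020-09-24", Some(10.0)),
  ("A", "2020-09-25", Some(12.0)),
  ("A", "2020-09-26", None),
  ("A", "2020-09-27", None),
  ("A", "2020-09-28", Some(8.0))
).toDF("item", "date", "quantity")

// All rows from the start of the partition up to and including the current row.
val upToCurrent = Window
  .partitionBy("item")
  .orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// last(..., true) skips NULLs, so each row gets the most recent non-null quantity.
val filled = df.withColumn(
  "quantity_filled",
  last(col("quantity"), true).over(upToCurrent)
)

filled.orderBy("item", "date").show()
```

Rows that come before the first known value in a partition stay NULL, since there is nothing earlier to carry forward.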

Spark Imputer seemed like an easy-to-use library for filling missing values. The issue is that Spark Imputer is limited to the mean or median computed over all non-null values present in the DataFrame, so I don't get the desired result (4th column in the picture).

Logic -

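import org.apache.spark.ml.feature.Imputer

// Fill NULLs in "quantity" with the mean of all non-null values in that column
val imputer = new Imputer()
  .setInputCols(Array("quantity"))
  .setOutputCols(Array("quantity_imputed"))
  .setStrategy("mean")

val model = imputer.fit(new_combinedDf)
model.transform(new_combinedDf).show()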

Result -

Result

Is it possible to limit the mean calculation for each null value to the mean of the last n values? For example, for 2020-09-26, where the first null value appears, can Spark Imputer be tweaked to compute the mean over only the last n values instead of over all non-null values in the DataFrame?
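As far as I can tell, Imputer does not expose a windowed strategy, so the closest thing I can sketch is a rolling mean built from window functions. The example below is only an illustration under assumptions: the sample rows, the column names `item` and `date`, and `n = 3` are all hypothetical. It averages the non-null values found in the previous n rows (`avg` skips NULLs) and uses that only where `quantity` is NULL.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, coalesce, col}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rolling-mean-impute-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; "item" and "date" are assumed column names.
val df = Seq(
  ("A", "2020-09-24", Some(10.0)),
  ("A", "2020-09-25", Some(12.0)),
  ("A", "2020-09-26", None),
  ("A", "2020-09-27", None),
  ("A", "2020-09-28", Some(8.0))
).toDF("item", "date", "quantity")

// Window size; an assumed value for illustration.
val n = 3

// The n rows immediately before the current row, within the same item.
val previousN = Window
  .partitionBy("item")
  .orderBy("date")
  .rowsBetween(-n, -1)

// Keep the original value when present; otherwise use the rolling mean.
val imputed = df.withColumn(
  "quantity_imputed",
  coalesce(col("quantity"), avg(col("quantity")).over(previousN))
)

imputed.orderBy("item", "date").show()
```

Two caveats with this approach: the window counts rows, not known values, so a long run of NULLs can leave fewer than n values (or none) to average, and imputed values are not fed back into later windows.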
