Requirement -
In the attached picture, consider the first three columns as my raw data. Some rows have NULL in the quantity column, and those are exactly the values I want to fill. Ideally, I would fill each NULL with the previous known value.
Spark Imputer looked like an easy-to-use library for filling missing values. The issue is that Spark Imputer is limited to a mean or median computed over all non-NULL values in the data frame, so I don't get the desired result (4th column in the picture).
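To make the "previous known value" requirement concrete, here is a minimal sketch of a forward fill using a window function rather than Imputer. The column name `date`, the helper `forwardFill`, and the sample rows are assumptions for illustration only; my real data frame is `new_combinedDf`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

object ForwardFillSketch {
  // Hypothetical helper: fills each NULL in valueCol with the most recent
  // non-NULL value, with rows ordered by orderCol.
  def forwardFill(df: DataFrame, orderCol: String, valueCol: String): DataFrame = {
    // Everything from the first row up to and including the current row.
    // No partitionBy here, so Spark moves all rows into a single partition.
    val w = Window
      .orderBy(orderCol)
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    // last(..., ignoreNulls = true) returns the latest non-NULL value in the frame.
    df.withColumn(s"${valueCol}_filled", last(col(valueCol), ignoreNulls = true).over(w))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ffill").getOrCreate()
    import spark.implicits._

    // Made-up sample rows; "date" is an assumed column name.
    val df = Seq[(String, Option[Double])](
      ("2020-09-24", Some(10.0)),
      ("2020-09-25", Some(12.0)),
      ("2020-09-26", None),
      ("2020-09-27", None),
      ("2020-09-28", Some(8.0))
    ).toDF("date", "quantity")

    forwardFill(df, "date", "quantity").show()
    spark.stop()
  }
}
```

If the data had a natural grouping key, adding `partitionBy` to the window would avoid pulling every row into one partition.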
Logic -
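import org.apache.spark.ml.feature.Imputer

// Replace NULLs in "quantity" with the mean of all non-NULL values in the column
val imputer = new Imputer()
  .setInputCols(Array("quantity"))
  .setOutputCols(Array("quantity_imputed"))
  .setStrategy("mean")
val model = imputer.fit(new_combinedDf)
model.transform(new_combinedDf).show()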
Result -
Is it possible to limit the mean calculation for each NULL value to the mean of the last n values? For example, for 2020-09-26, where the first NULL occurs, can Spark Imputer be tweaked to compute the mean over only the last n values instead of over all non-NULL values in the dataframe?
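To show the behaviour I am after, here is a rough sketch that bypasses Imputer and uses a trailing window instead. The column name `date`, the helper `fillWithTrailingMean`, and the sample rows are assumptions for illustration. Note that the window spans the previous n rows, not the previous n known values: NULLs inside the window are skipped by `avg`, and imputed values are not fed back into later windows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, coalesce, col}

object RollingMeanFillSketch {
  // Hypothetical helper: keeps known values and replaces each NULL with the
  // mean of the non-NULL values found in the n rows before it.
  def fillWithTrailingMean(df: DataFrame, orderCol: String, valueCol: String, n: Int): DataFrame = {
    // Frame covers the n rows preceding the current row (current row excluded).
    // No partitionBy here, so Spark moves all rows into a single partition.
    val w = Window.orderBy(orderCol).rowsBetween(-n, -1)
    // avg ignores NULLs; coalesce keeps the original value when it is present.
    df.withColumn(
      s"${valueCol}_imputed",
      coalesce(col(valueCol), avg(col(valueCol)).over(w))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rolling").getOrCreate()
    import spark.implicits._

    // Made-up sample rows; "date" is an assumed column name.
    val df = Seq[(String, Option[Double])](
      ("2020-09-24", Some(10.0)),
      ("2020-09-25", Some(12.0)),
      ("2020-09-26", None),
      ("2020-09-27", None),
      ("2020-09-28", Some(8.0))
    ).toDF("date", "quantity")

    fillWithTrailingMean(df, "date", "quantity", n = 2).show()
    spark.stop()
  }
}
```

If every one of the previous n rows is NULL (or there are no previous rows), the result stays NULL, so a long gap would still need a fallback such as a forward fill.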