I am looking at the window function for a Spark DataFrame in .NET (C#).
I have a DataFrame df with columns Year, Month, Day, Hour, Minute, ID, Type and Value:
| 2021 | 3 | 4 | 8 | 9 | 87 | Type1 | 380.5 |
| 2021 | 3 | 4 | 8 | 10 | null | null | null |
| 2021 | 3 | 4 | 8 | 11 | null | null | null |
| 2021 | 3 | 4 | 8 | 12 | null | null | null |
| 2021 | 3 | 4 | 8 | 13 | 87 | Type1 | 0.0 |
| 2021 | 3 | 4 | 8 | 14 | 87 | Type1 | 0.0 |
I would like to fill empty rows (nulls) with values from previous row based on year, month, day, hour, minute as shown below:
| 2021 | 3 | 4 | 8 | 9 | 87 | Type1 | 380.5 |
| 2021 | 3 | 4 | 8 | 10 | 87 | Type1 | 380.5 |
| 2021 | 3 | 4 | 8 | 11 | 87 | Type1 | 380.5 |
| 2021 | 3 | 4 | 8 | 12 | 87 | Type1 | 380.5 |
| 2021 | 3 | 4 | 8 | 13 | 87 | Type1 | 0.0 |
| 2021 | 3 | 4 | 8 | 14 | 87 | Type1 | 0.0 |
So far, I found solutions using Windows and Lag function in scala, but I am not sure how to do it in C#. In scala the window would be defined as something like:
val window = Window.orderBy("Year", "Month", "Day", "Hour", "Minute")
I would like to add a newValue column using
var filledDataFrame = df.WithColumn("newValue", Functions.When(df["Value"].IsNull(), Functions.Lag(df["Value"], 1).Over(window)).Otherwise(df["Value"])
How to define a window in .NET for Spark and use Lag function for forward-filling null values?
To use Lag and a Window with .NET for Apache Spark you are very close and would need:
This would result in:
but you probably want
Lastinstead ofLagas you can skip nulls:Which results in:
hope it helps!
ed
(the using statements needed to make this work)