I have a DataFrame like the one below:
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |null|
|1   |4   |32  |
|2   |2   |56  |
+----+----+----+
I apply the following to build a running sequence of the values in column C:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD",
  collect_list("colC").over(Window.partitionBy("colA").orderBy("colB")))
The result is as follows: column D is created and collects the values of column C into a sequence, but the null value has been removed:
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |null|[23, 63]    |
|1   |4   |32  |[23, 63, 32]|
|2   |2   |56  |[56]        |
+----+----+----+------------+
However, I would like to keep the null values in the new column and get the result below:
+----+----+----+------------------+
|colA|colB|colC|colD              |
+----+----+----+------------------+
|1   |1   |23  |[23]              |
|1   |2   |63  |[23, 63]          |
|1   |3   |null|[23, 63, null]    |
|1   |4   |32  |[23, 63, null, 32]|
|2   |2   |56  |[56]              |
+----+----+----+------------------+
As you can see, the null values are retained in the result. Do you know how I can do this?
Since collect_list automatically removes all nulls, one approach would be to temporarily replace null with a designated sentinel number, say Int.MinValue, before applying the method, and then use a UDF to restore those numbers back to null afterward: