Spark 1.6.2 and Scala 2.10 here.
I want to filter a Spark DataFrame column using an array of strings.
val df1 = sc.parallelize(Seq((1, "L-00417"), (3, "L-00645"), (4, "L-99999"),(5, "L-00623"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|L-00417|
| 3|L-00645|
| 4|L-99999|
| 5|L-00623|
+---+-------+
val df2 = sc.parallelize(Seq((1, "L-1"), (3, "L-2"), (4, "L-3"),(5, "L-00623"))).toDF("c3","c4")
+---+-------+
| c3| c4|
+---+-------+
| 1| L-1|
| 3| L-2|
| 4| L-3|
| 5|L-00623|
+---+-------+
val c2List = df1.select("c2").as[String].collect()
df2.filter(not($"c4").contains(c2List)).show()
I am getting the error below.
Unsupported literal type class [Ljava.lang.String; [Ljava.lang.String;@5ce1739c
Can anyone please help to fix this?
First, contains isn't suitable because you're looking for the opposite relationship - you want to check whether c2List contains c4's value, not the other way around.

You can use isin for that - it takes a "repeated argument" (similar to Java's varargs) of the values to match, so you'd want to "expand" c2List into a repeated argument, which can be done with the : _* operator.
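Something along these lines should work (a sketch, reusing your c2List):

// keep only the rows whose c4 value is NOT one of the collected c2 values
df2.filter(not($"c4".isin(c2List: _*))).show()

With your data this should leave (1, L-1), (3, L-2) and (4, L-3), since L-00623 does appear in c2List.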
Alternatively, you can use a "left anti join" to join the two dataframes and get only the values in df2 that did NOT match any value in df1.
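A sketch of that (note: the leftanti join type only arrived in Spark 2.0, so this won't work as-is on 1.6.2):

// keep the rows of df2 whose c4 has no matching c2 in df1
df2.join(df1, $"c4" === $"c2", "leftanti").show()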
Unlike the previous option, this one is not limited to the case where df1 is small enough to be collected.

Lastly, if you're on an earlier Spark version (such as your 1.6.2, which predates leftanti), you can imitate it using a left join and a filter.
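A sketch: after a left outer join, the df2 rows without a match in df1 end up with a null c2, so keep exactly those and then drop df1's columns:

df2.join(df1, $"c4" === $"c2", "left_outer")
  .filter($"c2".isNull)   // null c2 means no match in df1
  .select("c3", "c4")     // drop df1's columns again
  .show()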