Spark SQL 1.5.2: left excluding join

Given DataFrames df_a and df_b, how can I achieve the same result as this left excluding join (i.e., keep only the rows of df_a that have no match in df_b):

SELECT df_a.*
FROM df_a
  LEFT JOIN df_b
    ON df_a.id = df_b.id
WHERE df_b.id is NULL

I've tried:

df_a.join(df_b, df_a("id")===df_b("id"), "left")
  .select($"df_a.*")
  .where(df_b.col("id").isNull)

I get an exception from the above:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
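
One workaround often suggested for this pattern is to give each DataFrame an explicit alias, so that the star can be resolved against a named relation. A minimal sketch, untested on 1.5.2 (the alias names a and b are illustrative, not from the original post):

import org.apache.spark.sql.functions.col

// Alias both frames, join, keep only unmatched left rows, then
// select only the left frame's columns via its alias.
val result = df_a.as("a")
  .join(df_b.as("b"), col("a.id") === col("b.id"), "left")
  .where(col("b.id").isNull)
  .select("a.*")
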
There are 2 best solutions below

Answer from Pushkr:

You can try executing the SQL query itself, keeping it simple:

df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")

// SELECT A.* (rather than SELECT *) keeps only df_a's columns,
// matching the original query.
val result = sqlContext.sql(
  """SELECT A.*
    |FROM TableA A
    |LEFT JOIN TableB B ON A.id = B.id
    |WHERE B.id IS NULL""".stripMargin)
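
As a quick sanity check, here is a self-contained sketch of the same approach with made-up data (a single id column, values 1..3 on the left and 1..2 on the right; only the query and table names are taken from the answer above):

import sqlContext.implicits._

// Toy inputs: only id = 3 has no match on the right.
val df_a = sc.parallelize(Seq(1, 2, 3)).toDF("id")
val df_b = sc.parallelize(Seq(1, 2)).toDF("id")

df_a.registerTempTable("TableA")
df_b.registerTempTable("TableB")

sqlContext.sql(
  """SELECT A.*
    |FROM TableA A
    |LEFT JOIN TableB B ON A.id = B.id
    |WHERE B.id IS NULL""".stripMargin).show()
// +---+
// | id|
// +---+
// |  3|
// +---+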

Answer from Sanchit Grover:

If you wish to do it through DataFrames, try the example below:

  import sqlContext.implicits._
  val df1 = sc.parallelize(List("a", "b", "c")).toDF("key1")
  val df2 = sc.parallelize(List("a", "b")).toDF("key2")

  import org.apache.spark.sql.functions._

  // Left join on a null-safe equality (<=>), then keep only the
  // left-side rows that found no match on the right.
  df1.join(df2,
    df1.col("key1") <=> df2.col("key2"),
    "left")
    .filter(col("key2").isNull)
    .show

You would get this output:

+----+----+
|key1|key2|
+----+----+
|   c|null|
+----+----+
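
For readers on newer releases: Spark 2.0 added a dedicated anti-join type to the DataFrame API (not available in 1.5.2), which expresses the whole pattern in one step. A sketch reusing df1 and df2 from above:

// "left_anti" keeps the rows of df1 with no match in df2 and
// returns only df1's columns (no trailing null key2 column).
df1.join(df2, df1.col("key1") === df2.col("key2"), "left_anti").show()
// +----+
// |key1|
// +----+
// |   c|
// +----+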