Spark Dataset when to use Except vs Left Anti Join

10.1k Views Asked by At

I was wondering if there are performance difference between calling except (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#except(org.apache.spark.sql.Dataset) and using a left anti-join. So far, the only difference I can see is that with the left anti-join, the 2 datasets can have different columns.

1

There are 1 best solutions below

2
On BEST ANSWER

Your title vs. explanation differ, actually.

Feb '24: Your comment is noted and clarifies: Let's say I have 2 datasets with the same schema, Dataset A and Dataset B. My goal is to find all the rows in Dataset A that are not present in Dataset B; should I do that with an EXCEPT or a LEFT ANTI JOIN?

If you have the same structure in Datasets A & B, you would simply use EXCEPT. Using LEFT ANTI JOIN would be convoluted coding (as aluded to in Comments), but is technically possible.

EXCEPT

is a specific implementation that enforces same structure and is a subtract operation, whereas

LEFT ANTI JOIN

allows different structures to be compared and where clause is needed..

Use cases differ: 1) Left Anti Join can apply to many situations pertaining to missing data - customers with no orders (yet), orphans in a database. 2) Except is for subtracting things, e.g. Machine Learning splitting data into test- and training sets, or your use case "...new DataFrame containing only rows present in the DS A but not present in DS B ..."

Performance should not be a real deal breaker as they are different use cases in general and therefore difficult to compare. Except will involve the same data source whereas LAJ will involve different data sources.

So, you need to use EXCEPT.