Which Join Condition Order is More Efficient in Spark SQL?


I have a PySpark DataFrame dups that I want to self-join under two aliases, a and b, using multiple conditions. Two of the conditions are equalities and one is an inequality. Here are the two join statements:

from pyspark.sql.functions import col

a = dups.alias('a')
b = dups.alias('b')

# Statement 1: equality conditions first
df = a.join(b, (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')) & (col('a.id') < col('b.id')), 'inner')

# Statement 2: inequality condition first
df = a.join(b, (col('a.id') < col('b.id')) & (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')), 'inner')

In Statement 1 the equality conditions come first; in Statement 2 the inequality comes first.

MTH has only 6 distinct values (six months of data).

There are millions of distinct values for id.

The overall objective is to find the ids that share the same HASH_KEY within each month.

I am running this on a large EMR cluster on AWS (16 m5.16xlarge nodes).

Which of these join condition orders is likely to be more efficient?

I saw a lot of shuffling when I executed Statement 2; I have not yet run Statement 1.
