I have a PySpark DataFrame dups that I self-join through two aliases a and b, using multiple conditions: two equality conditions and one inequality. Here are the two join statements:
from pyspark.sql.functions import col

a = dups.alias('a')
b = dups.alias('b')

# Statement 1
df = a.join(b, (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')) & (col('a.id') < col('b.id')), 'inner')

# Statement 2
df = a.join(b, (col('a.id') < col('b.id')) & (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')), 'inner')
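Before running the second heavy job, I was planning to compare the physical plans with explain(), on the assumption that if Catalyst produces identical plans for both orderings, the condition order makes no difference. A minimal sketch (same dups, a, and b as above):

# explain() only prints the plan; it does not execute the job.
df1 = a.join(b, (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')) & (col('a.id') < col('b.id')), 'inner')
df2 = a.join(b, (col('a.id') < col('b.id')) & (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')), 'inner')

df1.explain()  # physical plan for Statement 1
df2.explain()  # physical plan for Statement 2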
In Statement 1, the equality conditions come first; in Statement 2, the inequality condition comes first.
There are only 6 distinct values for MTH, since the data covers six months.
There are millions of distinct values for id.
The overall objective is to find the ids that share the same HASH_KEY within each month.
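To make that concrete, here is a toy version of dups (column names as in the statements above; the month and hash values are made up) and the pairs I expect the self-join to return:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for dups; the real data has 6 months and millions of ids.
dups = spark.createDataFrame(
    [('2023-01', 'h1', 1), ('2023-01', 'h1', 2), ('2023-01', 'h2', 3),
     ('2023-02', 'h1', 4), ('2023-02', 'h3', 5), ('2023-02', 'h3', 6)],
    ['MTH', 'HASH_KEY', 'id'])

a = dups.alias('a')
b = dups.alias('b')
pairs = (a.join(b, (col('a.MTH') == col('b.MTH')) & (col('a.HASH_KEY') == col('b.HASH_KEY')) & (col('a.id') < col('b.id')), 'inner')
          .select(col('a.MTH'), col('a.HASH_KEY'), col('a.id').alias('id_1'), col('b.id').alias('id_2')))
pairs.show()
# Expected rows (order may vary): (2023-01, h1, 1, 2) and (2023-02, h3, 5, 6).
# The a.id < b.id condition keeps each pair once and drops self-matches.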
I am running this on a large AWS EMR cluster (16 m5.16xlarge nodes).
Which of these join condition orders is likely to be more efficient?
I saw a lot of shuffling when I executed Statement 2; I have not yet executed Statement 1.