Shuffle Partition vs Join Keys

52 Views Asked by At

What happens when # of unique join keys less than shuffle partitions?

Are we going to end up with lots of empty partitions?

If yes,is there any point to have shuffle partitions bigger than # of unique join keys?

1

There are 1 best solutions below

0
canaytore On

Simple answer is yes. If the number of unique join keys is less than the number of shuffle partitions, some of the partitions may end up empty or with some minimal data, and associated overhead. The empty partitions do not contribute to the final result but still need to be processed, which can impact the overall performance of the join operation.

Shuffle partition has been a problematic parameter to optimize for a long time, but I think this has been overcome with Adaptive Query Execution. With AQE, the mismatch can be mitigated by dynamically adjusting the number of partitions to match the number of unique join keys, improving performance by optimizing data distribution. Because AQE dynamically (hence adaptive) adjusts the execution plan based on data statistics and characteristics of the job.