Why does Spark shuffle the data while joining two partitioned & bucketed tables

84 Views Asked by At

I am trying to create a view on top of two tables.

Table 1: Partitioned by col1 Bucketed by col2 (no of buckets: 3600)

Table 2: Partitioned by col1 Bucketed by col2 ( no of buckets:3600)

View: Table1 Join Table2 On col1=col1 And col2=col2

Here,

When I run query on top of this view, data for both tables are shuffled by hashpartitioning(col1,col2)

I am wondering why this should happen. My understanding is since data is already partitioned and bucketed in equal no of partitions/buckets, spark should know where data exists and it should just do sort-merge-join without shuffling.

Can anyone help me understand why this happens?

I tried to set various spark properties related bucketing, but still it shuffles data.

0

There are 0 best solutions below