I am trying to create a view on top of two tables.
Table 1: partitioned by col1, bucketed by col2 (number of buckets: 3600)
Table 2: partitioned by col1, bucketed by col2 (number of buckets: 3600)
View: Table1 JOIN Table2 ON col1 = col1 AND col2 = col2
When I run a query on top of this view, the data for both tables is shuffled by hashpartitioning(col1,col2).
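For concreteness, the setup and the plan check can be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the names t1, t2 and joined_view, the column types, the extra val column, and the parquet data source format are placeholders, not the real schema.

```python
# Sketch only: t1, t2, joined_view, the STRING types, the `val` column and the
# parquet format are hypothetical placeholders, not the real tables.
NUM_BUCKETS = 3600  # bucket count stated in the question


def table_ddl(name: str) -> str:
    """Build DDL for a table partitioned by col1 and bucketed by col2."""
    return (
        f"CREATE TABLE IF NOT EXISTS {name} "
        "(col1 STRING, col2 STRING, val STRING) "
        "USING parquet "
        "PARTITIONED BY (col1) "
        f"CLUSTERED BY (col2) INTO {NUM_BUCKETS} BUCKETS"
    )


# View joining the two tables on both the partition and the bucket column.
VIEW_DDL = (
    "CREATE OR REPLACE VIEW joined_view AS "
    "SELECT a.col1, a.col2, a.val AS val1, b.val AS val2 "
    "FROM t1 a JOIN t2 b "
    "ON a.col1 = b.col1 AND a.col2 = b.col2"
)


def show_plan() -> None:
    """Create the tables and the view, then print the physical plan."""
    # Imported here so the DDL helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    for stmt in (table_ddl("t1"), table_ddl("t2"), VIEW_DDL):
        spark.sql(stmt)
    # Any Exchange hashpartitioning(...) node in the printed plan is a shuffle.
    spark.table("joined_view").explain()
```

Calling show_plan() prints the physical plan for a query on the view. The plan chosen for small or empty placeholder tables may differ from the one seen on the real data.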
I am wondering why this happens. My understanding is that, since the data is already partitioned and bucketed into an equal number of partitions/buckets, Spark should know where the data lives and should simply do a sort-merge join without shuffling.
Can anyone help me understand why this happens?
I tried setting various Spark properties related to bucketing, but it still shuffles the data.