Why does Spark shuffle the data while joining two partitioned & bucketed tables?

99 Views Asked by user2417458 At 01 December 2023 at 05:25

I am trying to create a view on top of two tables.

Table 1: partitioned by col1, bucketed by col2 (number of buckets: 3600)

Table 2: partitioned by col1, bucketed by col2 (number of buckets: 3600)

View: Table1 JOIN Table2 ON Table1.col1 = Table2.col1 AND Table1.col2 = Table2.col2

When I run a query on top of this view, the data for both tables is shuffled by hashpartitioning(col1, col2).

I am wondering why this happens. My understanding is that, since both tables are already partitioned on col1 and bucketed on col2 into the same number of buckets, Spark should know where each row lives and should be able to do a sort-merge join without shuffling.
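The expectation above can be sketched in plain Python. This is not Spark code and says nothing about what Spark's planner actually does: `bucket_id`, `write_bucketed`, `bucketwise_join`, and the sample rows are hypothetical names and data made up for illustration, and Python's built-in `hash` stands in for whatever hash function the engine uses. It only shows why rows with equal col1 and col2 end up in the same partition/bucket pair when both tables use the same bucket count, so in principle a join could proceed pair by pair without moving data.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # the question uses 3600; a small value keeps the demo readable


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Assumption: a stand-in for the engine's bucket hash, not Spark's real one.
    return hash(key) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Lay out (col1, col2, value) rows by partition col1, then bucket of col2."""
    layout = defaultdict(lambda: defaultdict(list))
    for col1, col2, value in rows:
        layout[col1][bucket_id(col2, num_buckets)].append((col2, value))
    return layout


def bucketwise_join(left, right):
    """Join on col1 and col2 by pairing only matching partition/bucket groups."""
    out = []
    for col1 in left.keys() & right.keys():
        for b in left[col1].keys() & right[col1].keys():
            for l_col2, l_val in left[col1][b]:
                for r_col2, r_val in right[col1][b]:
                    if l_col2 == r_col2:
                        out.append((col1, l_col2, l_val, r_val))
    return sorted(out)


def naive_join(rows1, rows2):
    """Reference result: compare every row of one table with every row of the other."""
    return sorted(
        (a1, a2, av, bv)
        for a1, a2, av in rows1
        for b1, b2, bv in rows2
        if a1 == b1 and a2 == b2
    )


# Hypothetical sample data: (col1, col2, value)
table1 = [("2023-12-01", 1, "a"), ("2023-12-01", 6, "b"), ("2023-12-02", 1, "c")]
table2 = [("2023-12-01", 1, "x"), ("2023-12-01", 5, "y"), ("2023-12-02", 1, "z")]

result = bucketwise_join(write_bucketed(table1), write_bucketed(table2))
print(result)
```

The bucket-wise result matches the naive join on this sample, which is the behaviour I expected Spark to exploit for tables laid out this way.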

Can anyone help me understand why this happens?

I tried setting various Spark properties related to bucketing, but it still shuffles the data.
