Spark SQL repartition before insert operation

44 Views Asked by aaa At 28 July 2025 at 04:18

Suppose we are using Spark on top of Hive, specifically the SQL API, and we have a table A with two partition columns, part1 and part2. We are running an INSERT OVERWRITE into A with dynamic partitions from a SELECT statement. It would look something like this:

INSERT OVERWRITE A
PARTITION (part1, part2)
-- hint here
SELECT -- /*+ REPARTITION(part1, part2) */
    col1
    ,col2
    ,part1
    ,part2
FROM tmp_tbl
-- or repartition here
-- DISTRIBUTE BY part1, part2;
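
To make the two commented-out options concrete, here they are written out as separate statements. This is only a sketch: it assumes Spark 3.x (where the REPARTITION hint accepts column names) and reuses the table and column names from the snippet above. The two statements are alternatives to compare, not something to run back to back.

```sql
-- Variant 1: REPARTITION hint, placed directly after SELECT.
-- Assumes Spark 3.x, where the hint can take partitioning columns.
INSERT OVERWRITE TABLE A
PARTITION (part1, part2)
SELECT /*+ REPARTITION(part1, part2) */
    col1
    ,col2
    ,part1
    ,part2
FROM tmp_tbl;

-- Variant 2: explicit DISTRIBUTE BY clause at the end of the query.
INSERT OVERWRITE TABLE A
PARTITION (part1, part2)
SELECT
    col1
    ,col2
    ,part1
    ,part2
FROM tmp_tbl
DISTRIBUTE BY part1, part2;
```

As far as I understand, both forms ask Spark to shuffle the rows by (part1, part2) before the write, so running EXPLAIN on each should show whether the resulting plans actually differ.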

Now my question is:

Are there any benefits to using DISTRIBUTE BY / ORDER BY / CLUSTER BY / SORT BY on the select statement, either as the explicit clause or as the hint? Would the write operation be more efficient somehow? Or does it make no difference at all, so that it only adds complexity and causes an unnecessary shuffle?

I searched the web and SO but didn't find anything on this specific topic.

Sorry for any mistakes, I wrote this on my phone.
