Spark SQL repartition before insert operation


Suppose we are using Spark on top of Hive, specifically the SQL API. Now suppose we have a table A with two partition columns, part1 and part2, and that we run an INSERT OVERWRITE into A with dynamic partitions from a SELECT statement. It would look something like this:

INSERT OVERWRITE A
PARTITION (part1, part2)
-- hint here
SELECT -- /*+ REPARTITION(part1, part2) */
    col1
    ,col2
    ,part1
    ,part2
FROM tmp_tbl
-- or repartition here
-- DISTRIBUTE BY part1, part2;

Now my question is:

Are there any benefits to using DISTRIBUTE BY / ORDER BY / CLUSTER BY / SORT BY on the SELECT statement, either as an explicit clause or as a hint? Would the write operation be more efficient? Or does it make no difference at all, so that it only adds complexity and causes an unnecessary shuffle?
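To make the trade-off concrete, here is a toy model in plain Python (not Spark itself) of how many output files a dynamic-partition write could produce. It assumes that each write task emits one file per distinct (part1, part2) value it holds, and that distributing by the partition columns sends all rows with the same key to the same task. The function names, the row layout, and the task count of 5 are made up for illustration.

```python
# Toy model only: plain Python, no Spark involved.
# Assumption: each write task emits one file per distinct
# (part1, part2) combination found in the rows it holds.

def count_files(tasks):
    """Total files written across all tasks under the assumption above."""
    return sum(len({(r["part1"], r["part2"]) for r in rows}) for rows in tasks)

def round_robin(rows, num_tasks):
    """Rows spread over tasks with no regard to the partition columns."""
    tasks = [[] for _ in range(num_tasks)]
    for i, row in enumerate(rows):
        tasks[i % num_tasks].append(row)
    return tasks

def distribute_by(rows, num_tasks):
    """Rows with the same (part1, part2) always land in the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for row in rows:
        key = (row["part1"], row["part2"])
        tasks[hash(key) % num_tasks].append(row)
    return tasks

# Hypothetical data: 120 rows covering 3 x 4 = 12 partition combinations.
rows = [{"col1": i, "part1": i % 3, "part2": i % 4} for i in range(120)]
NUM_TASKS = 5

files_without = count_files(round_robin(rows, NUM_TASKS))
files_with = count_files(distribute_by(rows, NUM_TASKS))

print("files without redistribution:", files_without)
print("files with DISTRIBUTE BY part1, part2:", files_with)
```

Under these assumptions, the unredistributed write can produce up to (number of tasks) x (number of partitions) small files, while distributing by the partition columns gives one file per partition. The price is the extra shuffle, and every row of a given partition passing through a single task, which can hurt when the partitions are skewed.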

I searched the web and SO but didn't find anything on this specific topic.

Sorry for any mistakes, I wrote this on my phone.
