Hive difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS and insert overwrite with PARTITIONED and CLUSTER BY?

2.4k Views Asked by chuckfinley At 30 January 2014 at 01:18

I've seen some good explanations of creating a table with partitions which are CLUSTERED BY and SORTED BY. How does this compare with creating a table with partitions, then populating the table (with INSERT OVERWRITE for instance) using CLUSTER BY? Is the CLUSTER BY a persistent sort within the table?

Original Q&A

There are 1 best solutions below

leftjoin On 06 February 2014 at 15:24 BEST ANSWER

Even if INSERT OVERWRITE + CLUSTER BY would produce table with persistently sorted data there is no way to tell hive that data is already sorted other than create CLUSTERED BY table. you can benefit from sorted data (sort-merge-join for example) only when the Hive knows about it and therefore can optimize the query. The data is not necessarily written to the disk in the same order it was produced or passed to the writer unless you specified that table is clustered(sorted). Usual (heap) tables are not sorted in theory. Writer process does not write data in the same order that the input because it is buffered (deferred write) and parallel.

Hive difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS and insert overwrite with PARTITIONED and CLUSTER BY?

There are 1 best solutions below

Related Questions in SORTING

Related Questions in HIVE

Related Questions in HIVEDDL

Trending Questions

Popular # Hahtags

Popular Questions