I've seen some good explanations of creating a table with partitions which are CLUSTERED BY and SORTED BY. How does this compare with creating a table with partitions, then populating the table (with INSERT OVERWRITE for instance) using CLUSTER BY? Is the CLUSTER BY a persistent sort within the table?
Hive difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS and insert overwrite with PARTITIONED and CLUSTER BY?
2.4k Views Asked by chuckfinley At
Even if INSERT OVERWRITE with CLUSTER BY did produce a table whose data is persistently sorted, there is no way to tell Hive that the data is already sorted other than declaring the table CLUSTERED BY (with SORTED BY). You can benefit from sorted data (a sort-merge join, for example) only when Hive knows about it and can therefore optimize the query. Unless the table is declared as clustered (sorted), the data is not necessarily written to disk in the order in which it was produced or passed to the writer. Ordinary (heap) tables are, in theory, unsorted: the writer process does not preserve the input order, because writes are buffered (deferred) and parallel.
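To make the contrast concrete, here is a minimal HiveQL sketch of the two approaches from the question. The table names (`events_bucketed`, `events_plain`, `events_raw`), the columns, the partition value, and the bucket count are all hypothetical and chosen only for illustration; this is a sketch under those assumptions, not a tuned setup.

```sql
-- All table names, columns, and values below are hypothetical.

-- (a) Bucketing and sort order are declared in the table definition,
--     so they are recorded in the metastore and visible to the optimizer.
CREATE TABLE events_bucketed (
  user_id    BIGINT,
  event_time STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS
STORED AS ORC;

-- Note (assumption about your Hive version): on releases before 2.0,
-- hive.enforce.bucketing and hive.enforce.sorting typically had to be
-- set to true before an insert like this would honor the declaration.
INSERT OVERWRITE TABLE events_bucketed PARTITION (dt = '2020-01-01')
SELECT user_id, event_time
FROM events_raw
WHERE dt = '2020-01-01';

-- (b) Plain partitioned table; the sort exists only in the query.
CREATE TABLE events_plain (
  user_id    BIGINT,
  event_time STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- CLUSTER BY is shorthand for DISTRIBUTE BY + SORT BY on the same column.
-- It shapes this one job's output; nothing about it is stored as
-- table metadata.
INSERT OVERWRITE TABLE events_plain PARTITION (dt = '2020-01-01')
SELECT user_id, event_time
FROM events_raw
WHERE dt = '2020-01-01'
CLUSTER BY user_id;

-- Compare what Hive has recorded about each table.
DESCRIBE FORMATTED events_bucketed;
DESCRIBE FORMATTED events_plain;
```

In the `DESCRIBE FORMATTED` output, the bucket and sort columns should be listed only for `events_bucketed`. For `events_plain`, Hive has no record of any ordering, which is why it cannot plan a sort-merge join on the strength of the `CLUSTER BY` used at insert time.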