INSERT OVERWRITE on a just-created table


I have to replicate a process for a client. I have never worked with Hive, so I am trying to understand what they were doing in other cases.

The Hive script I am trying to understand is this one:

DROP TABLE IF EXISTS distribution.030601_TI11;

CREATE EXTERNAL TABLE IF NOT EXISTS distribution.030601_TI11(
    mygroup STRING, year STRING, type1 STRING, type2 STRING,
    type3 STRING, myvalue INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/warehouse/distribution/030601_TI11';

INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT *
FROM develop.030601_TI11;

What are they doing?

As far as I have read about Hive, a DROP TABLE IF EXISTS statement on an external table only deletes the table metadata, not the table data. But I would like to know: does that INSERT OVERWRITE statement drop the entries previously stored in the table and insert only the new rows into the specified location?

And also, how is the LOCATION managed? I want to create the table from a single .csv file. Can I write something like LOCATION '/warehouse/develop/myfile.csv', or can I only provide an HDFS directory as the location?

There are 2 answers below.

Answer by leftjoin (accepted):

INSERT OVERWRITE TABLE removes all files inside the table location and moves the new files in. This happens at the very end: once the query has executed successfully and the result files have been created in a temporary location, the load task removes all files in the table location and moves the files from the temp location into it. See also this answer: https://stackoverflow.com/a/63378038/2700344
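
As a minimal sketch (the staging table names below are made up for illustration), two consecutive overwrites show that nothing accumulates in the table:

INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT * FROM develop.staging_a;   -- table location now holds only the result of this query

INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT * FROM develop.staging_b;   -- files written by the previous statement are removed first,
                                   -- so only the staging_b rows remain in the table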

If you want to create a table on top of a single file, put it in some folder, make sure there are no other files in that folder, and specify that folder as the location in the CREATE TABLE DDL. You can also put the file into an existing table location using the hdfs dfs -put command, using the LOAD DATA command, or by some other means. The main point is that the table should have its own location; it does not matter how many files are in that location (a single file or many), the location is a folder (directory), not a file. Even if it were possible to create a table on top of a single file instead of a folder, it would be unsafe, because an overwrite can create other files and the table would end up with a location pointing to a non-existing file. Carefully read the answers on this question: How to point to a single file with external table
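
For example (the directory and file names here are hypothetical), getting a single csv into its own folder could look like this:

hdfs dfs -mkdir -p /warehouse/develop/myfile_dir
hdfs dfs -put myfile.csv /warehouse/develop/myfile_dir/
hdfs dfs -ls /warehouse/develop/myfile_dir    # the folder should contain only myfile.csv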

Answer by Koushik Roy:

You are right, the location for an external table will remain as is. So, with the drop-and-create statements they are making sure the table does not already exist before creating it. The table also seems to be dynamic in nature, which can be another reason for the drop-and-create pattern.

Please notice that you are using CREATE EXTERNAL TABLE IF NOT EXISTS, which means that if the table already exists it will not be recreated. The storage is then cleaned and reloaded by the INSERT OVERWRITE.
Now, if you want to create a table on top of a csv file, just use LOCATION '/warehouse/develop/myfile'. You don't have to use .csv in the location.
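
A minimal sketch of that DDL (the column names are placeholders, and it assumes the csv has already been copied into the /warehouse/develop/myfile directory):

CREATE EXTERNAL TABLE IF NOT EXISTS develop.myfile_table(
    col1 STRING, col2 STRING, col3 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/warehouse/develop/myfile';  -- a directory containing the csv, not the file itself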