HIVE Query Deleting source data blob

163 Views Asked by At

I am using a Azure HdInsight(3.1.3.577).

running the HIVEQL statement: LOAD DATA INPATH '/myData/employee.txt' INTO TABLE employee;

loads the data correctly but also has the side effect of removing the source text file. This behavior is puzzling to me.

In the documentation (https://cwiki.apache.org/confluence/display/Hive/GettingStarted) there is the following "loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous."

My confusion is why this would be efficient, given that the HDFS (Azure blobstore) has to be loaded afresh with the source data for each run.

2

There are 2 best solutions below

0
On

Hive uses HDFS to store it's tables data under default location "/user/hive/warehouse".

When the data is already exists in HDFS, we create a External table and provides the HDFS path using LOCATION keyword. This will not result in moving the file / directory to default location.

By doing this Hive assumes that the data exists in the given path and it doesn't owns the data.

Even if you drop the table, the data specified in the path still exists.

0
On

try this,

create external table myTable (Userid string, name string)
row format delimited
fields terminated by '\t'
LOCATION '/myData/employee.txt' ; 

Location should be in hdfs;