I am using a Azure HdInsight(3.1.3.577).
running the HIVEQL statement: LOAD DATA INPATH '/myData/employee.txt' INTO TABLE employee;
loads the data correctly but also has the side effect of removing the source text file. This behavior is puzzling to me.
In the documentation (https://cwiki.apache.org/confluence/display/Hive/GettingStarted) there is the following "loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous."
My confusion is why this would be efficient, given that the HDFS (Azure blobstore) has to be loaded afresh with the source data for each run.
Hive uses HDFS to store it's tables data under default location "/user/hive/warehouse".
When the data is already exists in HDFS, we create a External table and provides the HDFS path using LOCATION keyword. This will not result in moving the file / directory to default location.
By doing this Hive assumes that the data exists in the given path and it doesn't owns the data.
Even if you drop the table, the data specified in the path still exists.