I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried several different approaches and searched online for help, but regardless of what I try, the insert job won't start.
I can get the text file into HDFS and I can load it into a Hive text table, but I cannot convert that table to ORC.
I tried many different variations, including this one that can be used as a reference to this question:
I have a single-node HDP cluster (used for development), version HDP-2.3.2.0 (2.3.2.0-2950).
And here are the relevant service versions:
Service     Version    Status     Description
HDFS        2.7.1.2.3  Installed  Apache Hadoop Distributed File System
MapReduce2  2.7.1.2.3  Installed  Apache Hadoop NextGen MapReduce (YARN)
YARN        2.7.1.2.3  Installed  Apache Hadoop NextGen MapReduce (YARN)
Tez         0.7.0.2.3  Installed  Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive        1.2.1.2.3  Installed  Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
Here is what happens when I run a statement like this (again, I've tried many variations, including ones taken directly from online tutorials):
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
My job stays like this:
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                  Application-Name                           Application-Type  User  Queue      State     Final-State  Progress  Tracking-URL
application_1455989658079_0002  HIVE-3f41161c-b806-4e7d-974e-c18e028d683f  TEZ               hive  root.hive  ACCEPTED  UNDEFINED    0%        N/A
And it just hangs there. I've even tried a 20-row sample table and let it run for hours before killing it.
I am by no means a Hadoop expert (yet) and I'm sure it's probably a config issue, but I have been unable to figure it out.
All other Hive operations I've tried, such as creating and dropping tables, loading a file into a text table, and running selects, work fine. It's only when I try to create an ORC table that this happens, and I need an ORC table for my requirement.
Any advice would be helpful.
A job that stays in the ACCEPTED state is waiting for YARN to allocate resources to it. Most of the time the fix is to increase your YARN scheduling capacity, but if your resources are already capped you can instead reduce the amount of memory requested by individual Tez tasks by adjusting the following property in the Tez configuration:
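The property name is not shown above. As a sketch only, the memory settings commonly tuned for Hive on Tez can be overridden for a single run with the Hive CLI's `--hiveconf` option. The choice of properties and the 512 MB values are assumptions to adapt to your node, not necessarily the setting the answer originally referred to:

```shell
# Assumed values: size these to what your single node can actually grant.
#   tez.am.resource.memory.mb    memory (MB) for the Tez application master
#   hive.tez.container.size      memory (MB) for each Tez task container launched by Hive
#   tez.task.resource.memory.mb  Tez's own per-task memory setting (MB)
hive --hiveconf tez.am.resource.memory.mb=512 \
     --hiveconf hive.tez.container.size=512 \
     --hiveconf tez.task.resource.memory.mb=512 \
     -e "INSERT OVERWRITE TABLE mycars SELECT * FROM cars;"
```

If you lower the container size, the JVM heap configured in `hive.tez.java.opts` probably needs to stay below it as well. Once a combination works, the same values can be made permanent in the Tez and Hive configuration.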
You can increase the cluster's capacity in the YARN configuration settings, either by editing them directly or through Ambari or Cloudera Manager.
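Before changing anything, it can help to see which memory limits YARN is currently working with. A minimal sketch, assuming the configuration lives in `/etc/hadoop/conf` and that each property's value sits on the line after its name in the XML (adjust both assumptions for your install):

```shell
# Assumed config directory; adjust if your distribution keeps it elsewhere.
CONF_DIR=/etc/hadoop/conf

# Total memory a NodeManager offers to containers, plus the per-container
# minimum and maximum that the scheduler will grant.
grep -A1 -E 'yarn\.nodemanager\.resource\.memory-mb|yarn\.scheduler\.(minimum|maximum)-allocation-mb' \
  "$CONF_DIR/yarn-site.xml"

# Only relevant if the Capacity Scheduler is in use: the share of resources
# that application masters may occupy. A low value can leave a job in ACCEPTED.
grep -A1 'yarn.scheduler.capacity.maximum-am-resource-percent' \
  "$CONF_DIR/capacity-scheduler.xml"
```

If the node's total is smaller than what the application master plus at least one task container request, the job cannot start until either the total is raised or the requests are reduced.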
To see what is happening under the hood, open the YARN ResourceManager UI and check the Diagnostics tab of the specific application. It shows explicit messages about resource allocation, which are especially useful when a job has been accepted but stays pending.
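Much of the same information should also be available from the command line. A sketch using the standard `yarn` CLI and the application ID from the question:

```shell
# List applications that are still waiting or running.
yarn application -list -appStates SUBMITTED,ACCEPTED,RUNNING

# Print the report for the stuck application, including its Diagnostics field.
yarn application -status application_1455989658079_0002

# List all nodes with their state and running container count,
# to confirm that a NodeManager is actually up and registered.
yarn node -list -all
```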