Hive create table with inputs from nested sub-directories

5.1k Views Asked by At

I have data in Avro format in HDFS in file paths like: /data/logs/[foldername]/[filename].avro. I want to create a Hive table over all these log files, i.e. all files of the form /data/logs/*/*. (They're all based on the same Avro schema.)

I'm running the below query with flag mapred.input.dir.recursive=true:

CREATE EXTERNAL TABLE default.testtable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://.../data/*/*'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://.../schema.avsc') 

The table ends up being empty unless I change LOCATION to be less nested, i.e. to be 'hdfs://.../data/[foldername]/' with a certain foldername. This worked no-problem with a less nested path for LOCATION.

I'd like to be able to source data from all these different [foldername] folders. How do I make the recursive input selection go further in my nested directories?

2

There are 2 best solutions below

0
On

One thing that would solve your problem is adding the folder name as a partition column to the external table. Then you can create the table as you're creating just on the data directory. Or you can take these nested files and flatten them in a single directory.

I don't think you'll be able to ask hive to have input of all these folders considered as 1 table otherwise.

This questions seems to be addressing a similar issue: when creating an external table in hive can I point the location to specific files in a direcotry?

There is an open jira issue on the same context: https://issues.apache.org/jira/browse/HIVE-951

Browsing more I saw this post suggesting you use SimlinkInputTextFormat as an alternative. I am not sure how well this would fly with your Avro format. https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html

4
On

Use this Hive settings to enable recursive directories:

set hive.mapred.supports.subdirectories=TRUE;
set mapred.input.dir.recursive=TRUE;

Create external table and specify root directory as a location:

LOCATION 'hdfs://.../data'

You will be able to query data from table location and all subdirectories