Time based directory structure Apache Drill

253 Views Asked by At

I have CSV files organized by date and time as follows

logs/YYYY/MM/DD/CSV files...

I have setup Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files; the organization of the files can be utilized to optimize the performance. For example,

SELECT * from data where trans>='20170101' AND trans<'20170102';

In this SQL, the directory logs/2017/01/01 should be scanned for data. Is there a way to let Apache Drill do optimization based on this directory structure? Is it possible to do this in Hive, Impala or any other tool?

Please note:

  • SQL queries will almost always contain the time frame.
  • Number of CSV files in a given directory is not huge. Combined all years worth of data, it will be huge
  • There is a field called 'trans' in every CSV file, which contains the date and time.
  • The CSV file is put under appropriate directory based on the value of 'trans' field.
  • CSV files do not follow any schema. Columns may or may not be different.
1

There are 1 best solutions below

2
On

Querying using column inside the data file would not help in partition pruning.

You can use dir* variables in Drill to refer to partitions in table.

create view trans_logs_view as 
select
 `dir0` as `tran_year`,
 `dir1` as `trans_month`,
 `dir2` as `tran_date`, * from dfs.`/data/logs`;

You can query using tran_year,tran_month and tran_date columns for partition pruning.

Also see if below query helps for pruning.

select count(1)  from dfs.`/data/logs` 
where concat(`dir0`,`dir1`,`dir2`) between '20170101' AND '20170102';

If so , you can define view by aliasing concat(dir0,dir1,dir2) to trans column name and query.

See below for more details.

https://drill.apache.org/docs/how-to-partition-data/