I have CSV files organized by date and time as follows:
logs/YYYY/MM/DD/CSV files...
I have set up Apache Drill to execute SQL queries on top of these CSV files. Since there are many CSV files, the directory organization could be used to optimize performance. For example:
SELECT * from data where trans>='20170101' AND trans<'20170102';
For this query, only the directory logs/2017/01/01 needs to be scanned. Is there a way to let Apache Drill optimize based on this directory structure? Is it possible to do this in Hive, Impala, or any other tool?
Please note:
- SQL queries will almost always contain a time frame.
- The number of CSV files in a given directory is not huge, but combined across all years the data is huge.
- Every CSV file has a field called 'trans', which contains the date and time.
- Each CSV file is placed under the appropriate directory based on the value of its 'trans' field.
- The CSV files do not follow a fixed schema; columns may differ between files.
Querying on a column inside the data files will not help with partition pruning. In Drill, you can use the dir* variables (dir0, dir1, dir2) to refer to the partitions of a table, and you can query on tran_year, tran_month, and tran_date columns (defined as aliases of those directory levels) to get partition pruning.
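As a minimal sketch of filtering on the dir* variables, assuming the `dfs` storage plugin's root points at the parent of the `logs` directory (the workspace and path names are assumptions):

```sql
-- dir0/dir1/dir2 map to the logs/YYYY/MM/DD directory levels.
-- Filtering on them lets Drill scan only the matching directories
-- (partition pruning) instead of reading every CSV file.
SELECT *
FROM dfs.`logs`
WHERE dir0 = '2017' AND dir1 = '01' AND dir2 = '01';
```

Filtering directly on the directory variables like this is the most reliable way to get pruning, since no expression has to be pushed down.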
Also check whether a query filtering on concat(dir0, dir1, dir2) is pruned. If so, you can define a view that aliases

concat(dir0, dir1, dir2)

as a trans column and query against that. See the following for more details:
https://drill.apache.org/docs/how-to-partition-data/
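A minimal sketch of such a view, assuming the `dfs.tmp` workspace is writable and the view name `trans_view` (both names are assumptions, not from the original answer):

```sql
-- Expose the directory levels as a single 'trans' column.
-- Headerless CSV files appear in Drill as the `columns` array.
CREATE VIEW dfs.tmp.trans_view AS
SELECT
  CONCAT(dir0, dir1, dir2) AS trans,  -- 'YYYY' || 'MM' || 'DD' = 'YYYYMMDD'
  columns
FROM dfs.`logs`;

-- The original range query can then filter on the aliased column:
SELECT *
FROM dfs.tmp.trans_view
WHERE trans >= '20170101' AND trans < '20170102';
```

Note that pruning here depends on Drill pushing the filter on the CONCAT expression down to the underlying directory variables; if it does not, filtering on dir0, dir1, and dir2 directly is the safer option.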