I have an Airflow DAG which reads some JSON data, splits it into different Parquet files, and uploads them to AWS S3.
I have the S3 path partitioned by five values:
- type
- year
- month
- day
- hour
Then I do a SELECT statement such as:
SELECT fields
FROM table
WHERE type='A'
AND year='2020'
AND month='11'
AND day='25'
AND hour='20'
But it takes around 2 minutes per type (i.e., per S3 path), and I have 30 types and still growing.
I read that to speed it up you have to do the following:
- Parquet - Done
- Fewest columns possible - Done (I have a lot, but they are the minimum I need)
- Use fewer Parquet files, around 64 MB each - I had 168 files of a few KB each before merging them into a single file of around 1 MB.
This last option didn't speed it up, so I don't know how to improve the process. It is an hourly job, and it takes 50 minutes to insert the 30 folders' data into Redshift tables.
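For reference, the hourly load is essentially one such query per type. Table and column names below are placeholders, not my real schema:

-- hypothetical example of the hourly load for one type value
INSERT INTO public.events_local (field1, field2)
SELECT field1, field2
FROM spectrum.events
WHERE type = 'A'
  AND year = '2020'
  AND month = '11'
  AND day = '25'
  AND hour = '20';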
The data is already partitioned. If I use:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename='table' and schemaname='spectrum'
and values='["A","2020","11","25","20"]';
It returns a row:
spectrum,table,"[""A"",""2020"",""11"",""25"",""20""]",s3://parquet/account/A/2020/11/25/20/,org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,{},0,{}
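To check whether that partition actually limits the scan at query time, the Spectrum scan statistics can be inspected (a sketch assuming the standard SVL_S3PARTITION system view; 12345 stands for the query id taken from STL_QUERY or the console):

-- compare how many partitions exist versus how many the WHERE clause qualified
SELECT query, total_partitions, qualified_partitions
FROM svl_s3partition
WHERE query = 12345;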
I expect you need to use the PARTITIONED BY option in the external table definition. Just having keyword-like values in the S3 key path doesn't tell Redshift Spectrum what data is in each folder, so Redshift has to read all the files looking for the data you want. You may also need to change the key paths to your files so that they conform to the S3 partition syntax (key=value folder names, e.g. type=A/year=2020/...).
A good example of doing this is "Example 2: Partitioning with a multiple partition key" in https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
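For reference, a minimal sketch of such a definition for this layout (spectrum.events, the columns, and their types are placeholders; the exact DDL is in the linked docs):

-- placeholder data columns; the partition keys are declared separately below
CREATE EXTERNAL TABLE spectrum.events (
  field1 varchar(64),
  field2 bigint
)
PARTITIONED BY (type varchar(8), year varchar(4), month varchar(2), day varchar(2), hour varchar(2))
STORED AS PARQUET
LOCATION 's3://parquet/account/';

-- each hourly folder then has to be registered as a partition,
-- e.g. from the Airflow DAG right after the upload finishes
ALTER TABLE spectrum.events
ADD IF NOT EXISTS PARTITION (type='A', year='2020', month='11', day='25', hour='20')
LOCATION 's3://parquet/account/A/2020/11/25/20/';

With the partition columns declared, the WHERE clause in the original query can prune to the single matching folder instead of scanning every file.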