Speed up AWS Redshift Spectrum SELECT


I have an Airflow DAG which reads some JSON data and splits it into different Parquet files that are uploaded to AWS S3.

I have the S3 path partitioned by 5 values (an example path is shown after the list), which are:

  1. type
  2. year
  3. month
  4. day
  5. hour
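For reference, a concrete partition location under this layout (the same path shows up in the SVV_EXTERNAL_PARTITIONS output further down):

s3://parquet/account/A/2020/11/25/20/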

Then I do a SELECT statement such as:

SELECT fields 
FROM table
WHERE type='A' 
AND year='2020'
AND month='11'
AND day='25'
AND hour='20'

But it takes around 2 minutes per type's S3 path, and I have 30 types and the number is still growing.

I read that to speed it up you have to do the following:

  • Parquet - Done
  • Fewest columns possible - Done (I have a lot, but they are the minimum I need)
  • Use fewer Parquet files, around 64 MB each. I had 168 files of a few KB each before merging them into a single one of around 1 MB.

This last option didn't speed it up, so I don't know how to improve the process. It is an hourly job, and it takes 50 minutes to insert the data from the 30 folders into Redshift tables.

The data is already partitioned by type, year, month, day and hour. If I use:

select *
from SVV_EXTERNAL_PARTITIONS
where tablename='table' and schemaname='spectrum'
and values='["A","2020","11","25","20"]';

It returns a row:

spectrum,table,"[""A"",""2020"",""11"",""25"",""20""]",s3://parquet/account/A/2020/11/25/20/,org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,{},0,{}

There are 2 answers below.

Answer 1:

I expect you need to use the PARTITIONED BY clause in the external table definition. Just having keywords in the S3 key path doesn't tell Redshift Spectrum what data is in each folder, so Redshift has to read all the files looking for the data you want. You may also need to change the key paths of your files so that they conform to the S3 partition syntax.

A good example of doing this is at "Example 2: Partitioning with a multiple partition key" in https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
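As a rough sketch only, the definition could look something like the following. The external schema name spectrum and the base S3 prefix are taken from the question; the table name and the columns are placeholders you would replace with your own:

CREATE EXTERNAL TABLE spectrum.account_data (
    -- placeholder columns: use your actual fields and types
    account_id VARCHAR(64),
    payload    VARCHAR(4096)
)
PARTITIONED BY (
    type  VARCHAR(8),
    year  VARCHAR(4),
    month VARCHAR(2),
    day   VARCHAR(2),
    hour  VARCHAR(2)
)
STORED AS PARQUET
LOCATION 's3://parquet/account/';

With the partition columns declared, the WHERE clause on type, year, month, day and hour lets Spectrum prune down to a single folder instead of scanning every file under the prefix.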

Answer 2:

There are several things to do. Here is the checklist:

  1. Ensure you are using Hive partitioning. The year=2020 format is not required, contrary to what the other answer suggests: Hive partitioning also works with just 2020 in the path (see the sketch after this list). If you don't use the year=2020 naming, the partition is simply named partition_n, where n is a number.
  2. Check the data volume per partition. It's better to have fewer, bigger Parquet files than many small ones.
  3. Split the data into meaningful buckets. In my case I work with 4 providers, and each of them is expected to grow in data volume, so I created 4 buckets instead of one, with the provider as the first partition.
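As a quick illustration of point 1: when you add a partition, its values are declared explicitly, so the S3 location itself does not need the key=value naming. Using a placeholder table name (the same one as in the sketch in the previous answer) and the path from the question's SVV_EXTERNAL_PARTITIONS output:

ALTER TABLE spectrum.account_data
ADD IF NOT EXISTS PARTITION (type='A', year='2020', month='11', day='25', hour='20')
LOCATION 's3://parquet/account/A/2020/11/25/20/';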