Speed up AWS Redshift Spectrum SELECT


I have an Airflow DAG which reads some JSON data and splits it into different Parquet files that are uploaded to AWS S3.

I have the S3 path partitioned by 5 values (an example path is shown after the list), which are:

  1. type
  2. year
  3. month
  4. day
  5. hour
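For reference, a concrete partition location under this layout (the same path shows up in the SVV_EXTERNAL_PARTITIONS output further down):

s3://parquet/account/A/2020/11/25/20/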

Then I do a SELECT statement such as:

SELECT fields 
FROM table
WHERE type='A' 
AND year='2020'
AND month='11'
AND day='25'
AND hour='20'

But it takes around 2 minutes per type's S3 path, and I have 30 types and the number is still growing.

I read that to speed it up you have to do the following:

  • Parquet - Done
  • Fewest columns possible - Done (I have a lot, but they are the minimum I need)
  • Use fewer Parquet files, around 64 MB each. I had 168 files of a few KB each before merging them into a single one of around 1 MB.

This last option didn't speed it up, so I don't know how to improve the process. It is an hourly job, and it takes 50 minutes to insert the data from the 30 folders into Redshift tables.

The data is already partitioned by type, year, month, day and hour. If I use:

select *
from SVV_EXTERNAL_PARTITIONS
where tablename='table' and schemaname='spectrum'
and values='["A","2020","11","25","20"]';

It returns a row:

spectrum,table,"[""A"",""2020"",""11"",""25"",""20""]",s3://parquet/account/A/2020/11/25/20/,org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe,{},0,{}

There are 2 answers below.

Answer 1:

I expect you need to use the PARTITIONED BY clause in the external table definition. Just having keywords in the S3 key path doesn't tell Redshift Spectrum what data is in each folder, so Redshift has to read all the files looking for the data you want. You may also need to change the key paths of your files so that they conform to the S3 partition syntax.

A good example of doing this is at "Example 2: Partitioning with a multiple partition key" in https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
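As a rough sketch only, the definition could look something like the following. The external schema name spectrum and the base S3 prefix are taken from the question; the table name and the columns are placeholders you would replace with your own:

CREATE EXTERNAL TABLE spectrum.account_data (
    -- placeholder columns: use your actual fields and types
    account_id VARCHAR(64),
    payload    VARCHAR(4096)
)
PARTITIONED BY (
    type  VARCHAR(8),
    year  VARCHAR(4),
    month VARCHAR(2),
    day   VARCHAR(2),
    hour  VARCHAR(2)
)
STORED AS PARQUET
LOCATION 's3://parquet/account/';

With the partition columns declared, the WHERE clause on type, year, month, day and hour lets Spectrum prune down to a single folder instead of scanning every file under the prefix.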

Answer 2:

There are several things to do. Here is the checklist:

  1. Ensure you are using Hive partitioning. The year=2020 format is not required, contrary to what the other answer suggests: Hive partitioning also works with just 2020 in the path (see the sketch after this list). If you don't use the year=2020 naming, the partition is simply named partition_n, where n is a number.
  2. Check the data volume per partition. It's better to have fewer, bigger Parquet files than many small ones.
  3. Split the data into meaningful buckets. In my case I work with 4 providers, and each of them is expected to grow in data volume, so I created 4 buckets instead of one, with the provider as the first partition.
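As a quick illustration of point 1: when you add a partition, its values are declared explicitly, so the S3 location itself does not need the key=value naming. Using a placeholder table name (the same one as in the sketch in the previous answer) and the path from the question's SVV_EXTERNAL_PARTITIONS output:

ALTER TABLE spectrum.account_data
ADD IF NOT EXISTS PARTITION (type='A', year='2020', month='11', day='25', hour='20')
LOCATION 's3://parquet/account/A/2020/11/25/20/';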