How to optimize ETL data pipeline for fault tolerance when using Spark and Redshift?

Question

How to optimize ETL data pipeline for fault tolerance when using Spark and Redshift?

378 Views Asked by snackbar At 28 July 2025 at 02:51

I'm writing a big batch job using PySpark that ETLs 200 tables and loads into Amazon Redshift. These 200 tables are created from one input datasource. So the batch job is successful only when data is loaded into ALL 200 tables successfully. The batch job runs everyday while appending the data into tables for each date.

For fault tolerance, reliability and idempotency my current workflow follows:

Using staging tables. Create temporary Redshift tables with CREATE TEMP TABLE LIKE <target_table>
Transform and Load data into staging table.
Repeat 1-2 for 200 other tables.
Start BEGIN transaction.
Copy staging table data into target table using INSERT INTO <taget_table> SELECT * FROM <staging_table>
END Transaction
DROP all staging tables.

This way I can guarantee that if Step 3 fails (which is more probable), I don't have to worry about removing partial data from original tables. Rather, I'll simply re-run entire batch job since temporary tables are discarded after the JDBC disconnection.

While it solved most of the problems it's not elegant, hackish and consumes extra time. I would like to if Spark and/or Redshift provides standard tools to solve this very common problem in ETL world.

Thanks

Original Q&A

There are 1 best solutions below

**Bill Weiner** · Accepted Answer

COPY commands can be in a transaction block. You just NEED:

BEGIN
COPY data to all the tables
COMMIT (if successful)

Redshift will maintain the previous version of the tables for all other viewers and their view of the tables won't change until COMMIT.

The upside of the process you laid out is that during the time that the transaction is running other processes cannot get an exclusive locks on the tables (ALTER TABLE etc). Your inserts will run fast than the COPYs so the transaction will be open less time. This is only an issue if other processes are modifying the table at the same time the ETL is running which is generally not a good idea.

How to optimize ETL data pipeline for fault tolerance when using Spark and Redshift?

There are 1 best solutions below

Related Questions in APACHE-SPARK

Related Questions in AMAZON-REDSHIFT

Related Questions in SPARK-REDSHIFT

Trending Questions

Popular # Hahtags

Popular Questions