Updating Tables in an AWS Redshift Data Warehouse


Background: We use Fivetran to ingest data from multiple sources into AWS Redshift. We've got ETL scripts which we run on top of these tables to create other, more relevant tables. Furthermore, we've got final views/tables which can be built on top of these second-layer tables.

All these new tables we create can either be independent (just read from root tables) or rely on each other.

This is where the problem arises: too many interconnections between views/tables.

Lastly, I've got a basic Flask app which uses rq-scheduler to queue jobs and run the ETL scripts at given times.

Problem: As we grow, we are seeing a drop in database performance (increasing execution times) and serialization errors on different tables.

Question: What's the best way to structure our infrastructure?

Answer:

You need to step back and look at the dependencies in your updates. If you are getting serialization errors, you have some circular dependencies that haven't been thought through (very common). Let's define some phases of data update within Redshift for each update cycle. (Your phase definitions may vary, but what is important is to think through these dependencies.)

First, phase 0 - the data load for this cycle (N). Load data (COPY) into staging tables. Staging tables are phase 0 tables: they depend only on external data (and in reality they may be temp tables).
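
For illustration, here is a minimal sketch of such a phase 0 load in Python with psycopg2. The staging table, S3 prefix, IAM role and connection string are all placeholders, not anything from your setup:

```python
# Hypothetical phase 0 load: COPY external files into a staging table.
import psycopg2

PHASE0_SQL = """
    TRUNCATE stg_orders;   -- staging only; note TRUNCATE commits immediately in Redshift
    COPY stg_orders
    FROM 's3://my-bucket/orders/'                             -- placeholder S3 prefix
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'   -- placeholder role
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

conn = psycopg2.connect("host=example.redshift.amazonaws.com port=5439 dbname=dw user=etl password=***")
with conn.cursor() as cur:
    cur.execute(PHASE0_SQL)   # staging table now holds this cycle's raw data
conn.commit()
conn.close()
```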

Phase 1 - process & sanitize the incoming data and apply this information to phase 1 tables. If any of these phase 1 tables need information from other phase 1, 2, 3 (etc.) tables, then that information needs to be defined as coming from the N - 1 (previous) cycle of the data update.
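
As a sketch (all table and column names are placeholders): Redshift has no ON CONFLICT upsert, so a common way to apply sanitized staging rows to a phase 1 table is delete-then-insert inside one transaction:

```python
# Hypothetical phase 1 step: sanitize staged rows and apply them to a phase 1 table.
PHASE1_SQL = """
    DELETE FROM orders_clean
    USING stg_orders
    WHERE orders_clean.order_id = stg_orders.order_id;

    INSERT INTO orders_clean (order_id, customer_id, amount, loaded_at)
    SELECT DISTINCT order_id, customer_id, amount, GETDATE()
    FROM stg_orders
    WHERE order_id IS NOT NULL;    -- drop obviously bad rows
"""

def run_phase1(conn) -> None:
    """Run the phase 1 update on an open psycopg2 connection, then commit."""
    with conn.cursor() as cur:
        cur.execute(PHASE1_SQL)
    conn.commit()
```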

Phase 2 - the update of fact tables (phase 2 tables). These can depend only on phase 0 or 1 tables from this cycle; if these updates need phase 2 or 3 information, it has to be the N - 1 version of that data.
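
A sketch with placeholder names: the fact update reads only phase 0/1 tables from this cycle, and where it needs fact-level history it reads a previous-cycle snapshot (a hypothetical fact_orders_prev here) rather than the table it is writing to:

```python
# Hypothetical phase 2 step: update a fact table from this cycle's phase 1 data
# plus the N - 1 snapshot of the fact data itself.
PHASE2_SQL = """
    INSERT INTO fact_orders (order_id, customer_id, amount, first_order_flag, cycle_date)
    SELECT o.order_id,
           o.customer_id,
           o.amount,
           CASE WHEN p.customer_id IS NULL THEN 1 ELSE 0 END,   -- judged against N - 1 history
           CURRENT_DATE
    FROM orders_clean o
    LEFT JOIN (SELECT DISTINCT customer_id FROM fact_orders_prev) p
           ON p.customer_id = o.customer_id;
"""

def run_phase2(conn) -> None:
    """Run the phase 2 update on an open psycopg2 connection, then commit."""
    with conn.cursor() as cur:
        cur.execute(PHASE2_SQL)
    conn.commit()
```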

Phase 3 - the update of derived / summary tables (phase 3 tables). These can depend on any earlier phase of this cycle.
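
A sketch with placeholder names: because every input is already up to date for this cycle, a full rebuild of the summary is often the simplest correct option:

```python
# Hypothetical phase 3 step: rebuild a summary table from the phase 2 fact table.
PHASE3_SQL = """
    DELETE FROM daily_revenue;   -- DELETE rather than TRUNCATE keeps the rebuild in one transaction
    INSERT INTO daily_revenue (cycle_date, customer_id, total_amount, order_count)
    SELECT cycle_date, customer_id, SUM(amount), COUNT(*)
    FROM fact_orders
    GROUP BY cycle_date, customer_id;
"""

def run_phase3(conn) -> None:
    """Run the phase 3 rebuild on an open psycopg2 connection, then commit."""
    with conn.cursor() as cur:
        cur.execute(PHASE3_SQL)
    conn.commit()
```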

Again, your definition of the phases can be different. What is important is that you define a linear flow of information for each update cycle. No loops! If you need to use some later-phase information, it needs to be defined as N - 1 (previous data update) information. If something cannot be defined as an N - 1 version, then you may need to add a phase to the flow.
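
One way to materialise an N - 1 version (my assumption, not the only way) is to snapshot the table at the start of each cycle, before any of this cycle's phases touch it; the fact_orders_prev table used in the phase 2 sketch above would be produced like this:

```python
# Hypothetical start-of-cycle step: keep the previous cycle's fact data so
# later phases can read the N - 1 version without creating a loop.
SNAPSHOT_SQL = """
    DROP TABLE IF EXISTS fact_orders_prev;
    CREATE TABLE fact_orders_prev AS SELECT * FROM fact_orders;
"""

def snapshot_previous_cycle(conn) -> None:
    """Run on an open psycopg2 connection before phase 0 of the new cycle."""
    with conn.cursor() as cur:
        cur.execute(SNAPSHOT_SQL)
    conn.commit()
```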

Removing the loops will remove the serialization errors, as long as each phase only runs AFTER the previous phase is complete (and committed). This will also address any performance issues arising from locks and stalls. Every update SQL statement within a phase can run in parallel, because all of its input tables have already been updated (either the N or the N - 1 versions).
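
A sketch of that ordering in Python (the DSN and the SQL lists are placeholders - the real statements would be your phase updates): every statement runs on its own connection and commits, statements within a phase run in parallel, and the next phase starts only after every statement in the previous one has finished:

```python
# Hypothetical orchestration: parallel within a phase, strict barrier between phases.
from concurrent.futures import ThreadPoolExecutor

import psycopg2

DSN = "host=example.redshift.amazonaws.com port=5439 dbname=dw user=etl password=***"

# One list of independent update statements per phase (placeholders shown).
PHASES = [
    ["/* phase 0: COPY into staging */ SELECT 1"],
    ["/* phase 1: clean and apply   */ SELECT 1"],
    ["/* phase 2: fact table A      */ SELECT 1", "/* phase 2: fact table B */ SELECT 1"],
    ["/* phase 3: summaries         */ SELECT 1"],
]

def run_statement(sql: str) -> None:
    """Run one update on its own connection and commit it."""
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.commit()
    finally:
        conn.close()

for number, statements in enumerate(PHASES):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_statement, sql) for sql in statements]
        for future in futures:
            future.result()   # raise on failure so the next phase never starts early
    print(f"phase {number} complete and committed")
```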

Once you have this linear flow defined, you can see which SQL statements are the long tent poles in each phase and attack any performance issues in those statements. As data grows, the impact of inefficiencies grows, and new issues will arise.
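
If it helps, the slowest statements can be pulled from Redshift's STL_QUERY system table, e.g. (connection details are placeholders):

```python
# Hypothetical check: list the slowest statements of the last day.
import psycopg2

SLOWEST_SQL = """
    SELECT query,
           DATEDIFF(second, starttime, endtime) AS seconds,
           TRIM(querytxt) AS sql_text
    FROM stl_query
    WHERE starttime > DATEADD(day, -1, GETDATE())
    ORDER BY seconds DESC
    LIMIT 20;
"""

conn = psycopg2.connect("host=example.redshift.amazonaws.com port=5439 dbname=dw user=etl password=***")
with conn.cursor() as cur:
    cur.execute(SLOWEST_SQL)
    for query_id, seconds, sql_text in cur.fetchall():
        print(query_id, seconds, sql_text[:80])
conn.close()
```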

Now, if your database has multiple independent data domains (dependency trees), you can break these apart and run them independently, but be careful to ensure that they really are independent. This can also be done for only the later phases if there is an advantage to that - a common phase 0 and 1, but split phase 2 and 3.

So I'd say that, firstly, you have a data lineage issue, and this needs to be addressed before looking at infrastructure. The infrastructure / code may (will) need to be addressed as well, but first things first.