Changed Data Capture (CDC) - periodically Sync datasets between S3 Staging File & Snowflake Tables

597 Views Asked by At

For ex:

Source : S3

Scheduled update: 2 times/Day

Target : Snowflake Table

Streaming : NO

On Day-0, I have a customer table created & loaded in Snowflake DB. S3 Files are updated 2 times a day and they need to reflected in Snowflake table as well.

I want to come up with a solution that so that following 3 cases are taken care 2 times/day without streaming or messaging technlogy

  1. Insert - new records
  2. Update - existing records
  3. Delete - existing records
1

There are 1 best solutions below

2
On

If you want them to stay in sync at all times, you could use EXTERNAL TABLES in Snowflake to make sure that tables are always kept in sync with Snowflake. Lay a MATERIALIZED VIEW over your external table, and now you essentially have a Snowflake table that is in sync with your S3 files. In this case, every time an S3 file is modified or added, the file will be loaded to the MV.

https://docs.snowflake.com/en/user-guide/tables-external.html

If the files that are coming have updated records and deletes marked in the data as CDC data, then you'd need to leverage streams and tasks, instead. First, load the file into a staging table using either a COPY INTO or Snowpipe (which requires SQS be setup), then put a stream over the staging table, and then finally create a task that checks the stream for new records and executes a MERGE against your final table.

https://docs.snowflake.com/en/user-guide/data-pipelines.html