For ex:
Source : S3
Scheduled update: 2 times/Day
Target : Snowflake Table
Streaming : NO
On Day-0, I have a customer table created & loaded in Snowflake DB. S3 Files are updated 2 times a day and they need to reflected in Snowflake table as well.
I want to come up with a solution that so that following 3 cases are taken care 2 times/day without streaming or messaging technlogy
- Insert - new records
- Update - existing records
- Delete - existing records
If you want them to stay in sync at all times, you could use
EXTERNAL TABLES
in Snowflake to make sure that tables are always kept in sync with Snowflake. Lay aMATERIALIZED VIEW
over your external table, and now you essentially have a Snowflake table that is in sync with your S3 files. In this case, every time an S3 file is modified or added, the file will be loaded to the MV.https://docs.snowflake.com/en/user-guide/tables-external.html
If the files that are coming have updated records and deletes marked in the data as CDC data, then you'd need to leverage streams and tasks, instead. First, load the file into a staging table using either a
COPY INTO
or Snowpipe (which requires SQS be setup), then put a stream over the staging table, and then finally create a task that checks the stream for new records and executes aMERGE
against your final table.https://docs.snowflake.com/en/user-guide/data-pipelines.html