How to capture data change in aws glue?

4.8k Views Asked by At

We have source data in on premise sql-server. We are using AWS glue to fetch data from sql-server and place it to the S3. Could anyone please help how can we implement change data capture in AWS Glue?

Note- We don't want to use AWS DMS.

3

There are 3 best solutions below

0
On

It's not possible to implement a change data capture through direct glue data extraction. While a Job bookmark can help you identify inserts and updates if your table contains an update_at timestamp column, it won't cover delete cases. You actually need a CDC solution.

While AWS glue direct connection to a database source is a great solution, I strongly discourage using it for incremental data extraction due to the cost implication. It's like using a Truck to ship one bottle of table water.

As you already commented, I am not also a fan of AWS DMS, but for a robust CDC solution, a tool like Debezium could be a perfect solution. It integrates with kafka and Kinesis. You can easily sink the stream to s3 directly. Debezium gives you the possibility to capture deletes and append a special boolean __delete column to your data, so your glue etl can manage the removal of these deleted records with this field.

0
On

I'm only aware of Glue Bookmarks. They will help you with the new records (Inserts), but won't help you with the Updates and Deletes that you typically get with a true CDC solution.

Not sure of your use case, but you could check out the following project. It has a pretty efficient diff feature and, with the right options, can give you a CDC-like output

https://github.com/G-Research/spark-extension/blob/master/DIFF.md

1
On

You can leverage AWS DMS for CDC and then use the Apache IceBerg connections with Glue Data Catalog to achieve this: https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/