SCD2 Implementation in Redshift using AWS Glue PySpark


I have a requirement to move data from S3 to Redshift. Currently I am using Glue for this work. Current requirement:

  1. Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close the old record (update its end date from the high date to the current date) and insert the new one.
  2. If no primary key match is found, insert the new record.

Implementation: I have implemented this in Glue using PySpark. I created dataframes covering three scenarios:

  1. Where a match is found, update the existing record's end date to the current date.
  2. Insert the new record into the Redshift table where a PPK match is found.
  3. Insert the new record into the Redshift table where a PPK match is not found.

Finally, union these three dataframes into one and write the result to the Redshift table.

With this approach, both the old record (which still has the high date value) and the new record (updated with the current date value) end up present in the table.
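For reference, a rough sketch of the dataframe logic described above; the column names (business_key, start_date, end_date), the high-date value, and the catalog database/table names are assumptions, not the actual schema:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())

HIGH_DATE = "9999-12-31"  # assumed "high date" that marks the open record

# Assumed sources: the current Redshift dimension table and the incoming S3
# feed, both registered in the Glue Data Catalog (names are illustrative).
existing_df = glueContext.create_dynamic_frame.from_catalog(
    database="dw", table_name="dim_customer",
    redshift_tmp_dir=args["TempDir"]).toDF()
incoming_df = glueContext.create_dynamic_frame.from_catalog(
    database="staging", table_name="s3_customer_feed").toDF()

# Keys present in both the existing table and the incoming file
matched_keys = (existing_df.select("business_key")
                .join(incoming_df.select("business_key"), "business_key")
                .distinct())

# 1. Existing open records whose key appears in the incoming file: close them
expired_df = (existing_df
              .join(matched_keys, "business_key", "left_semi")
              .where(F.col("end_date") == F.lit(HIGH_DATE))
              .withColumn("end_date", F.current_date().cast("string")))

# 2. Incoming records whose key already exists: the new open version
new_version_df = (incoming_df
                  .join(matched_keys, "business_key", "left_semi")
                  .withColumn("start_date", F.current_date().cast("string"))
                  .withColumn("end_date", F.lit(HIGH_DATE)))

# 3. Incoming records with no existing match: brand-new open records
brand_new_df = (incoming_df
                .join(matched_keys, "business_key", "left_anti")
                .withColumn("start_date", F.current_date().cast("string"))
                .withColumn("end_date", F.lit(HIGH_DATE)))

# Union the three frames (schemas are assumed to line up) and write back
final_df = expired_df.unionByName(new_version_df).unionByName(brand_new_df)
```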

Is there a way to delete the old record with the high date value using PySpark? Please advise.

There is 1 answer below.

We have successfully implemented similar functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion: instead of computing the delta in Spark dataframes, it is far easier and more elegant to create stored procedures and call them from the PySpark Glue job (for example: S3 bucket -> staging table -> target table).
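For the Redshift setup in the question, that pattern could look roughly like the sketch below: land the incoming file in a staging table and let the connector's preactions/postactions drive the merge. The connection name, table names, and the scd2_merge_customer stored procedure are illustrative; the procedure would be created in Redshift beforehand and hold the expire/insert logic from staging into the target table.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Incoming S3 data, assumed to be registered in the Glue Data Catalog
incoming_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="staging", table_name="s3_customer_feed")

# Land the file in a staging table, then let a postaction call the stored
# procedure (scd2_merge_customer is hypothetical) that expires the matched
# rows in the target table and inserts the new versions from staging.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=incoming_dyf,
    catalog_connection="redshift-connection",      # assumed Glue connection
    connection_options={
        "database": "dev",                          # assumed database name
        "dbtable": "staging.customer_feed",         # assumed staging table
        "preactions": "TRUNCATE TABLE staging.customer_feed;",
        "postactions": "CALL scd2_merge_customer();",
    },
    redshift_tmp_dir=args["TempDir"],
)
```

Because the merge then runs inside the database, the old open row is updated in place rather than a duplicate being written, which also addresses the delete question above.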

In addition, if your logic executes in under 10 minutes, I would suggest using a Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations.
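A minimal sketch of that Python shell variant with psycopg2 (the environment-variable names and the scd2_merge_customer procedure are placeholders, not part of the original setup):

```python
import os
import psycopg2

# Connection details are placeholders; in a real job they would come from
# AWS Secrets Manager or a Glue connection rather than environment variables.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=int(os.environ.get("DB_PORT", "5439")),
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

try:
    with conn.cursor() as cur:
        # The staging table is assumed to be loaded already (e.g. via a COPY
        # from S3); the hypothetical procedure holds the SCD2 merge logic.
        cur.execute("CALL scd2_merge_customer();")
    conn.commit()
finally:
    conn.close()
```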