Quite a niche question, I think, but my organisation has started using Delta Live Tables in Databricks for data modelling, recently. One of the dimensions I am trying to model takes data from 3 existing tables in our data lake. It also needs to be a type 2 slowly changing dimension (SCD).
This appears to be quite simple if you have just one table, according to the databricks documentation (https://docs.databricks.com/en/delta-live-tables/cdc.html), but I'm struggling when trying to make the final dimension a composite of 3 source tables.
For example, let's say the tables are: person, personDetail, and job.
person:
PersonID | Name |
---|---|
1 | Name1 |
2 | Name2 |
personDetail:
PersonID | JobID | detail |
---|---|---|
1 | 1 | detail1 |
2 | 2 | detail2 |
job:
jobID | jobName |
---|---|
1 | job1 |
2 | job2 |
The final dimension needs to have personID, detail, and jobName, and show any changes to any of the 3 tables.
Other than making a really complicated union in pyspark at the beginning of the code, I can't think of a way to make this work, perhaps creating each table as a separate streaming table and then joining them? Am I missing something obvious?
Any help much appreciated.
You need to independently apply SCD Type 2 to each of the three tables, and then create a final view that joins the three tables together.