I have implemented SCD Type 1 using a MERGE INTO statement in Azure Databricks.
On the first load, it loaded the expected data, say 5000 rows, with ingest date 28 Nov 2023.
Today I loaded only the 29 Nov 2023 data from the source. My SCD Type 1 merge loaded the correct row count of 1300, which matches the source, but the count for 28 Nov 2023 changed from 5000 to 4500.
This is unexpected, as the merge should not change any of my older-date records.
Can anyone please help me understand this issue?
I am using a simple MERGE INTO statement that updates rows when a match is found on my key columns and inserts records when no match is found.
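If the WHEN MATCHED branch of that merge also overwrites the ingest-date column, then any 28 Nov row that matches a 29 Nov source key silently moves to the new date, which would explain the drop from 5000 to 4500. A minimal sketch of that effect, with plain Python dicts standing in for the Delta table (all keys, columns, and counts here are hypothetical):

```python
# Sketch of SCD Type 1 merge semantics: matched keys are overwritten
# (including ingest_date), unmatched keys are inserted.
def scd1_merge(target, source):
    for key, row in source.items():
        target[key] = row  # update on match, insert on no match
    return target

# Day 1: 5 rows land with ingest_date 2023-11-28
target = {k: {"val": k, "ingest_date": "2023-11-28"} for k in range(5)}

# Day 2: 3 source rows with ingest_date 2023-11-29; keys 0 and 1 already exist
source = {k: {"val": k * 10, "ingest_date": "2023-11-29"} for k in (0, 1, 5)}
scd1_merge(target, source)

day1 = sum(1 for r in target.values() if r["ingest_date"] == "2023-11-28")
day2 = sum(1 for r in target.values() if r["ingest_date"] == "2023-11-29")
print(day1, day2)  # 3 3 -> the 28 Nov count dropped because 2 rows were updated
```

The 28 Nov count shrinks by exactly the number of matched (updated) keys, which is the behaviour described in the question.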
As @NotFound mentioned, when you update the table and then filter by time, you get a count smaller than the original, because some rows have been updated. However, the count may be even smaller than you would expect from the updates alone.
In the example below, I have 30 rows and update 6 of them. When filtering for rows before this timestamp, the expected count is 24, but it shows 21.
Source
After performing the merge:
The expected count is 24, but it shows 21. This is because the Delta merge rewrites some rows without modifying them.
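The pure row-level arithmetic of that example can be sketched as follows (column names and dates are hypothetical); note this sketch only reproduces the expected count of 24, since the further drop to 21 comes from Delta's file rewrites during the merge, which happen at the storage layer rather than in the row values:

```python
import datetime

# 30 rows, all last updated on 28 Nov; the merge then touches 6 of them.
cutoff = datetime.datetime(2023, 11, 29)
rows = [{"id": i, "updated_at": datetime.datetime(2023, 11, 28)} for i in range(30)]

for r in rows[:6]:  # the 6 rows updated by the merge get a new timestamp
    r["updated_at"] = datetime.datetime(2023, 11, 30)

before = sum(1 for r in rows if r["updated_at"] < cutoff)
print(before)  # 24 -> row-level expectation; the table actually showed 21
```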
In the image below, you can see `numTargetRowsCopied` is 3 and `numOutputRows` is 9. Three records are rewritten without being modified, which results in a different count. You can refer to this solution for more information.
My suggestion is to use Delta versioning (time travel). You can query the previous records' data using either `version` or `timestamp`, and then you will get the correct results.
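In Databricks SQL, time travel looks like `SELECT * FROM my_table VERSION AS OF 0` or `SELECT * FROM my_table TIMESTAMP AS OF '2023-11-28'` (the table name is hypothetical). Conceptually, each write produces an immutable snapshot you can read back, as in this toy model (not a real Delta API):

```python
import copy

# Toy model of Delta versioning: every write appends an immutable snapshot,
# and you can read any earlier version back. All names are illustrative.
class VersionedTable:
    def __init__(self):
        self.versions = []  # index = version number

    def write(self, rows):
        self.versions.append(copy.deepcopy(rows))

    def read(self, version=None):
        return self.versions[-1 if version is None else version]

t = VersionedTable()
v0 = [{"id": i, "ingest_date": "2023-11-28"} for i in range(5)]
t.write(v0)                              # version 0: initial load

v1 = copy.deepcopy(v0)
for r in v1[:2]:
    r["ingest_date"] = "2023-11-29"      # the merge updated 2 rows
t.write(v1)                              # version 1: after the merge

count = lambda rows, d: sum(1 for r in rows if r["ingest_date"] == d)
print(count(t.read(1), "2023-11-28"))    # 3 -> current version, count "lost" rows
print(count(t.read(0), "2023-11-28"))    # 5 -> time travel to version 0 recovers it
```

Reading the pre-merge version gives you the original 28 Nov counts, which is why time travel is useful for auditing what the merge changed.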