I have implemented SCD Type 1 using a MERGE INTO statement in Azure Databricks.
On the first load, it loaded the expected data, say 5000 rows, with ingest date 28 Nov 2023.
Today I loaded only the 29 Nov 2023 data from the source. My SCD Type 1 merge loaded the correct row count of 1300, which matches the source, but the count for 28 Nov 2023 changed from 5000 to 4500.
This is unexpected, as the merge should not change any of my older-date records.
Can anyone please help me understand this issue?
I am using a simple MERGE INTO statement that updates rows when a match is found on my key columns and inserts records when no match is found.
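If the WHEN MATCHED branch of that merge also overwrites the ingest-date column, then any 28 Nov row that matches a 29 Nov source key silently moves to the new date, which would explain the drop from 5000 to 4500. A minimal sketch of that effect, with plain Python dicts standing in for the Delta table (all keys, columns, and counts here are hypothetical):

```python
# Sketch of SCD Type 1 merge semantics: matched keys are overwritten
# (including ingest_date), unmatched keys are inserted.
def scd1_merge(target, source):
    for key, row in source.items():
        target[key] = row  # update on match, insert on no match
    return target

# Day 1: 5 rows land with ingest_date 2023-11-28
target = {k: {"val": k, "ingest_date": "2023-11-28"} for k in range(5)}

# Day 2: 3 source rows with ingest_date 2023-11-29; keys 0 and 1 already exist
source = {k: {"val": k * 10, "ingest_date": "2023-11-29"} for k in (0, 1, 5)}
scd1_merge(target, source)

day1 = sum(1 for r in target.values() if r["ingest_date"] == "2023-11-28")
day2 = sum(1 for r in target.values() if r["ingest_date"] == "2023-11-29")
print(day1, day2)  # 3 3 -> the 28 Nov count dropped because 2 rows were updated
```

The 28 Nov count shrinks by exactly the number of matched (updated) keys, which is the behaviour described in the question.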
As @NotFound mentioned, when you update the table and then filter by time, you get a count smaller than the original, because some rows have been updated. However, the count may be even smaller than you would expect from the updates alone.
In the example below, I have 30 rows and update 6 of them. When filtering for rows before this timestamp, the expected count is 24, but it shows 21.
Source
After performing the merge:
The expected count is 24, but it shows 21. This is because the Delta merge rewrites some rows without modifying them.
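The pure row-level arithmetic of that example can be sketched as follows (column names and dates are hypothetical); note this sketch only reproduces the expected count of 24, since the further drop to 21 comes from Delta's file rewrites during the merge, which happen at the storage layer rather than in the row values:

```python
import datetime

# 30 rows, all last updated on 28 Nov; the merge then touches 6 of them.
cutoff = datetime.datetime(2023, 11, 29)
rows = [{"id": i, "updated_at": datetime.datetime(2023, 11, 28)} for i in range(30)]

for r in rows[:6]:  # the 6 rows updated by the merge get a new timestamp
    r["updated_at"] = datetime.datetime(2023, 11, 30)

before = sum(1 for r in rows if r["updated_at"] < cutoff)
print(before)  # 24 -> row-level expectation; the table actually showed 21
```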
In the image below, you can see `numTargetRowsCopied` is 3 and `numOutputRows` is 9. Three records are rewritten without being modified, which results in a different count. You can refer to this solution for more information.
My suggestion is to use Delta versioning (time travel). You can query the previous records' data using either `version` or `timestamp`, and then you will get the correct results.
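In Databricks SQL, time travel looks like `SELECT * FROM my_table VERSION AS OF 0` or `SELECT * FROM my_table TIMESTAMP AS OF '2023-11-28'` (the table name is hypothetical). Conceptually, each write produces an immutable snapshot you can read back, as in this toy model (not a real Delta API):

```python
import copy

# Toy model of Delta versioning: every write appends an immutable snapshot,
# and you can read any earlier version back. All names are illustrative.
class VersionedTable:
    def __init__(self):
        self.versions = []  # index = version number

    def write(self, rows):
        self.versions.append(copy.deepcopy(rows))

    def read(self, version=None):
        return self.versions[-1 if version is None else version]

t = VersionedTable()
v0 = [{"id": i, "ingest_date": "2023-11-28"} for i in range(5)]
t.write(v0)                              # version 0: initial load

v1 = copy.deepcopy(v0)
for r in v1[:2]:
    r["ingest_date"] = "2023-11-29"      # the merge updated 2 rows
t.write(v1)                              # version 1: after the merge

count = lambda rows, d: sum(1 for r in rows if r["ingest_date"] == d)
print(count(t.read(1), "2023-11-28"))    # 3 -> current version, count "lost" rows
print(count(t.read(0), "2023-11-28"))    # 5 -> time travel to version 0 recovers it
```

Reading the pre-merge version gives you the original 28 Nov counts, which is why time travel is useful for auditing what the merge changed.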