I am working on logic for merging duplicate rows from one table (1000+ rows) into another, empty table, based on product_id and extraction_date. The extraction_date determines which data is the latest. When the extraction_date is the same in multiple rows, my logic fails. Here's an example:
create or replace table source(v variant);
INSERT INTO source SELECT parse_json('{
  "pd": {
    "extraction_date": "1644471240",
    "product_id": "357946",
    "retailerName": "retailer",
    "productName": "product"
  }
}');
INSERT INTO source SELECT parse_json('{
  "pd": {
    "extraction_date": "1644471240",
    "product_id": "357946",
    "retailerName": "retailer2",
    "productName": "product2"
  }
}');
// Merge logic:
create or replace table target as
select *
from source
where v:pd:extraction_date not in (
    select c1.v:pd:extraction_date
    from source c1, source c2
    where c1.v:pd:product_id = c2.v:pd:product_id
      and c1.v:pd:extraction_date > c2.v:pd:extraction_date
);
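Querying target after the merge shows the problem (the values follow from the two sample rows above, which share product_id and extraction_date):

select v:pd:retailerName::string as retailer,
       v:pd:extraction_date::string as extraction_date
from target;
// returns both rows:
// retailer    1644471240
// retailer2   1644471240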
In the given example the target table ends up with two duplicate rows because of the identical extraction_date. Please suggest a modification that selects only one row when the extraction_date is the same.
Don't self-join; just use QUALIFY with ROW_NUMBER to force a single row per product_id.
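A minimal sketch of that (the paths match your sample data; the ::number cast assumes extraction_date is an epoch-seconds string):

create or replace table target as
select *
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by v:pd:extraction_date::number desc   // newest first
) = 1;   // keep exactly one row per product_id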
If there are 2+ rows with the same "latest" date, this will arbitrarily pick one and only one. If you want to use extra details like retailerName or productName to break the tie, put those in the ORDER BY and the winner becomes more "predictable", which is something you normally want. Or even hash the whole JSON blob, so the same row is picked across many runs; this can help with later debugging.
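For example (the extra ORDER BY keys are just one reasonable choice; hash() is Snowflake's built-in non-cryptographic hash):

select *
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by
        v:pd:extraction_date::number desc,   // latest extraction wins
        v:pd:retailerName::string,           // business tiebreaker
        hash(v)                              // stable fallback: same inputs, same winner
) = 1;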
Also, while VARIANT data performs "almost" the same as native columns, if your rows all share the same shape, promoting the fields you actually use to native columns can improve performance; keep the raw JSON column alongside (and mostly never access it) so you can still handle the day the data changes shape and you need to deal with the old and new shapes at once.
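A sketch of that layout (the column names and casts are my guesses from your sample data):

create or replace table target as
select
    v:pd:product_id::number       as product_id,
    v:pd:extraction_date::number  as extraction_date,
    v:pd:retailerName::string     as retailer_name,
    v:pd:productName::string      as product_name,
    v                             as raw_json    // keep the original variant for schema drift
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by v:pd:extraction_date::number desc
) = 1;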