I am working on logic for merging duplicate rows from one table (1000+ rows) into another, empty table, based on product_id and extraction_date. The extraction_date determines which data is the latest. When the extraction_date is the same in multiple rows, my logic fails. Here's an example:
create or replace table source(v variant);
INSERT INTO source SELECT parse_json('{
  "pd": {
    "extraction_date": "1644471240",
    "product_id": "357946",
    "retailerName": "retailer",
    "productName": "product"
  }
}');
INSERT INTO source SELECT parse_json('{
  "pd": {
    "extraction_date": "1644471240",
    "product_id": "357946",
    "retailerName": "retailer2",
    "productName": "product2"
  }
}');
// Merge logic:
create or replace table target as
select *
from source
where v:pd:extraction_date not in (
    select c1.v:pd:extraction_date
    from source c1, source c2
    where c1.v:pd:product_id = c2.v:pd:product_id
      and c1.v:pd:extraction_date > c2.v:pd:extraction_date
);
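Querying target after the merge shows the problem (the values follow from the two sample rows above, which share product_id and extraction_date):

select v:pd:retailerName::string as retailer,
       v:pd:extraction_date::string as extraction_date
from target;
// returns both rows:
// retailer    1644471240
// retailer2   1644471240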
In the given example the target table ends up with two duplicate rows because of the identical extraction_date. Please suggest a modification that selects only one row when the extraction_date is the same.
Don't self-join; just use QUALIFY with ROW_NUMBER to force a single row per product_id.
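A minimal sketch of that (the paths match your sample data; the ::number cast assumes extraction_date is an epoch-seconds string):

create or replace table target as
select *
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by v:pd:extraction_date::number desc   // newest first
) = 1;   // keep exactly one row per product_id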
If there are 2+ rows with the same "latest" date, this will arbitrarily pick one and only one. If you want to use extra details like retailerName or productName to break the tie, put those in the ORDER BY and the winner becomes more "predictable", which is something you normally want. Or even hash the whole JSON blob, so the same row is picked across many runs; this can help with later debugging.
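For example (the extra ORDER BY keys are just one reasonable choice; hash() is Snowflake's built-in non-cryptographic hash):

select *
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by
        v:pd:extraction_date::number desc,   // latest extraction wins
        v:pd:retailerName::string,           // business tiebreaker
        hash(v)                              // stable fallback: same inputs, same winner
) = 1;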
Also, while VARIANT data performs "almost" the same as native columns, if your rows all share the same shape, promoting the fields you actually use to native columns can improve performance; keep the raw JSON column alongside (and mostly never access it) so you can still handle the day the data changes shape and you need to deal with the old and new shapes at once.
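A sketch of that layout (the column names and casts are my guesses from your sample data):

create or replace table target as
select
    v:pd:product_id::number       as product_id,
    v:pd:extraction_date::number  as extraction_date,
    v:pd:retailerName::string     as retailer_name,
    v:pd:productName::string      as product_name,
    v                             as raw_json    // keep the original variant for schema drift
from source
qualify row_number() over (
    partition by v:pd:product_id
    order by v:pd:extraction_date::number desc
) = 1;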