I have seen a issue with the data ,i am trying to find the duplicated data which got inserted on different day partition, now that the problem i want retain the data for older date and delete the latest ingested data.
In the screenshot as you see data is same in two rows but due to some issues same data got inserted into 2021-04-28 as well, Now i want to delete all such records in 2021-04-28 partition. And retain the 04-27 data and actual rows of 04-28.
Could someone please share the query to find these duplicates and query to get rid of the duplicates that are in 2021-04-28 partition.
Below is the query i have written to identify the duplicates but looks it is not perfectly correct.
select grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_dte from
(select grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_sdte, row_number() over(partition by grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_dte order by trvl_dte desc)rn from table_name)a where rn=1
Your query is correct. In a more readable way it can be:
EDIT: the trvl_dte column should not be included in the PARTITION BY statement. Also since you want to keep the earliest trvl_dte, you need to ORDER BY trvl_dte ASC and not DESC.