I have an S3 folder that keeps receiving new files. These files can also contain duplicates based on the url column.
s3file1.csv - last modified 2022-03-01 at 10:00 UTC
url name
http://a/ jai
http://b/ nitu
s3file2.csv - last modified 2022-03-01 at 12:00 UTC
url name
http://a/ aron
http://b/ max
I create my external table as:
create external table test
(
  url VARCHAR as (nullif(value:c1, '')::VARCHAR),
  refreshed_on TIMESTAMP_LTZ(9) as CURRENT_TIMESTAMP()
)
with location = @test_stage
file_format = test_format
auto_refresh = true
pattern = '.*s3file[0-9][.]csv';
The issue is that the table test contains duplicates based on url, and the refreshed_on value is the same for all rows. How can I remove the duplicates and keep only the entry from the file with the latest last-modified date, so that url is unique?
The final table test should contain only the data from s3file2.csv, but it currently has the data from both files.
You will need to add an ETL/ELT step to deduplicate the data. It is an external table: Snowflake just reads the files as they are, so if the files contain duplicates, the query results will contain duplicates.
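Here is a minimal sketch of one way to do it while keeping the external table as is. It assumes the name column is the second CSV column (value:c2) and that the newer file always sorts after the older one by file name (true for s3file1.csv vs s3file2.csv), since the files' actual last-modified timestamps are not exposed as a row-level column on the external table. METADATA$FILENAME is a pseudo-column that every external table provides.

-- Deduplicating view over the external table:
-- keep exactly one row per url, taken from the file whose name sorts last.
create or replace view test_dedup as
select
    url,
    nullif(value:c2, '')::VARCHAR as name,  -- assumption: name is the 2nd CSV column
    metadata$filename as source_file
from test
qualify row_number() over (partition by url order by metadata$filename desc) = 1;

If the file names cannot be relied on to reflect load order, a more robust route is a scheduled ELT step that MERGEs (keyed on url) or INSERT OVERWRITEs the external table's rows into a permanent table, where the dedup logic can use whatever ordering information you do have.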