Snowflake external table primary key


I have an S3 folder that keeps getting new files. These files can also contain duplicates based on the url column.

s3file1.csv - last modified 2022-03-01 at 10:00 UTC

url          name
http://a/    jai
http://b/    nitu

s3file2.csv - last modified 2022-03-01 at 12:00 UTC

url          name
http://a/    aron
http://b/    max

I create my external table as:

create external table test
(
  url VARCHAR as (nullif(value:c1, '')::VARCHAR),
  refreshed_on TIMESTAMP_LTZ(9) as CURRENT_TIMESTAMP()
)
with location = @test_stage
file_format = (format_name = 'test_format')
auto_refresh = true
pattern = '.*s3file[0-9][.]csv';

The issue is that I have duplicates in the table test based on url, and the refreshed_on date is also the same for all rows. How can I remove the duplicates and keep only the entry with the latest last-modified date, unique on url?

The final table test should contain only the data from s3file2.csv, but it has data from both files.

2 Answers

Answer 1:

You will need to add an ETL/ELT process to dedupe your data. It is an external table, so Snowflake just reads the files as they are: if the files contain duplicates, the result will contain duplicates.
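
A minimal sketch of that idea in Snowflake SQL, reusing the stage, file format, and file pattern from the question; the table names test_raw and test_dedup are hypothetical, and METADATA$FILE_LAST_MODIFIED assumes your account exposes that staged-file metadata column:

-- Regular (non-external) table that also records where each row came from
create or replace table test_raw (
  url           varchar,
  name          varchar,
  file_name     varchar,
  file_modified timestamp_ltz
);

-- Load the raw rows plus file metadata from the stage
copy into test_raw
from (
  select t.$1, t.$2, metadata$filename, metadata$file_last_modified
  from @test_stage t
)
file_format = (format_name = 'test_format')
pattern = '.*s3file[0-9][.]csv';

-- Keep only the newest row per url
create or replace table test_dedup as
select url, name, file_name, file_modified
from test_raw
qualify row_number() over (partition by url order by file_modified desc) = 1;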

Answer 2:

If you capture the file name in your external table, you could add a view over that table with a window function that keeps only the latest record per url. However, this reads from the external table every time you query it, so performance will not be as good. It would be better to ingest the data and process it incrementally as it arrives, updating records accordingly. It doesn't sound like the underlying data is being managed in a way that makes for a good external table use case.
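
A sketch of that approach, again reusing the question's stage and file format; the name column and the [0-9] in the pattern are assumptions, and ordering by file name only works here because s3file1.csv happens to sort before s3file2.csv (a real pipeline would usually parse a timestamp out of the file name or key instead):

create or replace external table test
(
  url       VARCHAR as (nullif(value:c1, '')::VARCHAR),
  name      VARCHAR as (nullif(value:c2, '')::VARCHAR),
  file_name VARCHAR as (metadata$filename)
)
with location = @test_stage
file_format = (format_name = 'test_format')
auto_refresh = true
pattern = '.*s3file[0-9][.]csv';

-- One row per url, preferring the most recent file
create or replace view test_latest as
select url, name, file_name
from test
qualify row_number() over (partition by url order by file_name desc) = 1;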