How to clean up the S3 files used by AWS Firehose after they have been loaded?


AWS Firehose uses S3 as intermediate storage before the data is copied to Redshift. Once the data has been transferred to Redshift, how can those files be cleaned up automatically if the load succeeds?

When I deleted those files manually, the delivery stream went into an error state complaining that the files had been deleted, and I had to delete and recreate the Firehose stream to resume.

Will deleting those files after 7 days with an S3 lifecycle rule work? Or is there an automated way for Firehose to delete the files that were successfully copied to Redshift?


There are 2 best solutions below

BEST ANSWER

After discussing with AWS Support, they confirmed that it is safe to delete those intermediate files once the 24-hour period (or the maximum retry duration configured for the stream) has elapsed.

A lifecycle rule that automatically deletes objects from the S3 bucket should fix the issue.
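
For reference, a minimal sketch of such a lifecycle rule using the AWS CLI; the bucket name, prefix, and 7-day expiration below are placeholders to adapt to your stream's S3 configuration:

    # Expire Firehose staging objects automatically after 7 days
    # (bucket name and prefix are hypothetical examples)
    aws s3api put-bucket-lifecycle-configuration \
        --bucket my-firehose-staging-bucket \
        --lifecycle-configuration '{
            "Rules": [{
                "ID": "expire-firehose-staging",
                "Filter": {"Prefix": "firehose/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7}
            }]
        }'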

Hope it helps.


Once you're done loading your destination table, execute something similar to the following (the snippet below is typical of a shell script):

    # Check whether the staging file still exists on S3
    aws s3 ls "$aws_bucket/$table_name.txt.gz"
    if [ "$?" = "0" ]
    then
        # The file is there; remove it now that the table has been loaded
        aws s3 rm "$aws_bucket/$table_name.txt.gz"
    fi

This checks whether the file for the table you've just loaded still exists on S3 and, if it does, removes it. Execute it as part of a cron job.
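
A typical crontab entry for that, assuming the snippet above lives in a script whose path here is purely illustrative:

    # Run the S3 staging cleanup every day at 02:00 (script path is hypothetical)
    0 2 * * * /opt/etl/cleanup_s3_staging.sh >> /var/log/etl/cleanup.log 2>&1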

If your ETL/ELT is not recursive, you can place this snippet towards the end of the script. It will delete the file on S3 after your table has been populated. However, before executing this part, make sure that your target table has actually been populated.
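
One way to do that check, sketched here under the assumption that the cluster is reachable with psql and that the connection details are placeholders (the password would normally come from PGPASSWORD or ~/.pgpass):

    # Count rows in the freshly loaded table; connection details are hypothetical
    row_count=$(psql -h my-cluster.example.us-east-1.redshift.amazonaws.com -p 5439 \
        -U etl_user -d analytics -t -A \
        -c "SELECT COUNT(*) FROM $table_name;")

    # Only remove the staging file if the load actually landed rows
    if [ "$row_count" -gt 0 ]
    then
        aws s3 rm "$aws_bucket/$table_name.txt.gz"
    fi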

If your ETL/ELT is recursive, you can put this somewhere at the beginning of the script to check for and remove the files created in the previous run. This retains the files created in the current run until the next run, and is the preferred approach, since the file acts as a backup in case the last load fails (or you need a flat file of the last load for any other purpose).
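
A minimal skeleton of that ordering, with the step names purely illustrative:

    #!/bin/bash
    # Start of each run: drop the previous run's staging file if it is still around
    if aws s3 ls "$aws_bucket/$table_name.txt.gz"
    then
        aws s3 rm "$aws_bucket/$table_name.txt.gz"
    fi

    # ... extract, transform, and COPY into Redshift for the current run ...
    # The file written by this run stays on S3 until the next run and acts as a backup.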