I wanted to delete large number of S3 files (may be few 100K or 1000K, which I do not have control) in a bulk async process. I tried to look into multiple blogs and collated below strategies:
Leverage AWS S3 REST API from the async thread of custom application Here the drawbacks are:
- I will have to make huge number of S3 API calls as 1 request is limited for 1000 S3 objects and I may not know the exact S3 object.
- Even if I identify the S3 objects to delete, I will have to first GET and then DELETE which will make the solution costly.
- Here I will have to keep track of deleted chunks and in case of any failure in middle of operation, I will have to build a mechanism to re-trigger the chunks which failed to be deleted.
Leveraging S3 lifecycle policy Here the drawbacks are:
- We are storing multiple customer data into same bucket segregated by customer-id in prefix. With growing number of customers, we foresee that the 1000 rules per bucket hard limit may hit us.
- To surpass above drawback, we can delete the rule and free-up the quota for next requests. But we were looking for any event based notification which can tell us back that the bulk delete operation is complete.
- Again with growing number of customers, here we may loose predictability of the bulk delete operation. This is because of accumulated jobs due to reached quota limit and a submitted bulk delete job may have to wait for days to be completed.
Create only 1 rule with a special bulk delete tag and use it to set 1 S3 lifecycle policy With this approach, we believe we will not hit the limit issue as we are expecting in above approach. And as we understood that these S3 lifecycle rules gets executed once a day (though we don't know exactly when), so we are assured that in max next 24h, the rule will get triggered and then it will take some time to actually complete the bulk delete operation (may be few mins or hours, we don't know). Here also we have the open question as: Is there a notification event after completion of 1 execution of S3 lifecycle rule which we can listen and update the status of all submitted bulk delete jobs as DONE? In lack of such notification event, it becomes difficult to let transparently communicate it back to the end-user who triggered the bulk delete async operation.
Any comments/advice on below strategies will be helpful. Also if you can help me with the answer for the last strategy which I guess is the most preferable choice I have as of now.
I tried all the above stated strategies and got stuck at the mentioned problem for each. Any inputs/advice on above will be of great help.
After all evaluations, we have finalized to go with codeful delete relevant data for specific time-range as an async java process leveraging S3 bulk delete SDK (DeleteObjectsRequest).