S3DistCp (AWS-EMR) - deleteOnSuccess option creates file on source bucket

I'm working on an AWS EMR cluster and added a step that runs S3DistCp (https://docs.aws.amazon.com/es_es/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html) to copy objects from one S3 bucket to another S3 bucket.

Objects are copied correctly to the destination bucket, and with the --deleteOnSuccess option the copied objects are deleted from the source bucket as expected. The problem is that, for every folder on the source bucket that contained a copied object, a new file is created at the root of the source bucket. This only happens with the --deleteOnSuccess option.

The arguments I use are:

s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://MY_SOURCE_BUCKET/ --dest=s3://MY_DESTINATION_BUCKET/ --srcPrefixesFile=s3://ANOTHER_BUCKET/objects_list.txt --deleteOnSuccess

In this case, if s3://MY_SOURCE_BUCKET/ contains:

s3://MY_SOURCE_BUCKET/
     |--folder_a/
     |      |------ a.txt
     |      |------ b.txt
     |      |------ c.txt
     |--folder_b/
            |------ d.txt

and I want to copy and delete only s3://MY_SOURCE_BUCKET/folder_a/b.txt, then once the S3DistCp run completes, the source bucket looks like:

s3://MY_SOURCE_BUCKET/
     |--folder_a_$folder$    <-- This is the new file created with `_$folder$` suffix
     |--folder_a/
     |      |------ a.txt
     |      |------ c.txt
     |--folder_b/
            |------ d.txt

Is there a way to prevent S3DistCp from creating these new files on the source bucket?

There are 1 best solutions below

The "_$folder$" files are placeholders. Apache Hadoop creates these files when you use the -mkdir command to create a folder in an S3 bucket. Hadoop doesn't create the folder until you PUT the first object. If you delete the "_$folder$" files before you PUT at least one object, Hadoop can't create the folder, which results in a "No such file or directory" error. As of now, there is no way to prevent these files from being generated when you are working with EMR.

It is safe to delete these files. You can either delete them manually with a command like the one shown below, or create a Lambda function with an S3 trigger that looks for these files and deletes them periodically. However, deleting them while you are copying data may cause issues. See https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/ for more details.

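# Read the bucket name (and optional prefix) from stdin, e.g. MY_SOURCE_BUCKET
read -r s3path
# --dryrun only prints what would be deleted; remove it to actually delete
aws s3 rm --dryrun "s3://$s3path/" \
  --recursive \
  --exclude '*' \
  --include '*_$folder$'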
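
If you would rather review the placeholder keys before deleting anything, a small filter along these lines may help. This is only a sketch: the function name filter_folder_markers is made up for illustration, and it assumes object keys arrive one per line on stdin.

```shell
# Hypothetical helper (not part of the AWS CLI): reads object keys from
# stdin, one per line, and prints only those ending in the literal
# suffix _$folder$.
filter_folder_markers() {
  while IFS= read -r key; do
    case "$key" in
      # Single quotes keep $folder$ literal instead of expanding a variable
      *'_$folder$') printf '%s\n' "$key" ;;
    esac
  done
}

# Possible usage, assuming the bucket name from the question:
#   aws s3api list-objects-v2 --bucket MY_SOURCE_BUCKET \
#     --query 'Contents[].Key' --output text | tr '\t' '\n' | filter_folder_markers
```

The suffix is matched with a quoted shell pattern rather than a regular expression, so the dollar signs need no escaping and ordinary keys such as folder_a/a.txt are never printed.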