EMR fails while uploading very large file

I have a use case where I have to upload thousands of 20 GB files from EMR to S3.

While uploading files using the FileSystem.moveFromLocalFile API, the job fails with the following error:

16/12/23 07:25:04 WARN TaskSetManager: Lost task 107.0 in stage 16.0 (TID 94646, ip-172-31-3-153.ec2.internal): java.io.IOException: Error closing multipart upload
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadMultiParts(MultipartUploadOutputStream.java:377)
    at com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:394)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:61)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:356)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2017)
    at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1985)
    at org.apache.hadoop.fs.FileSystem.moveFromLocalFile(FileSystem.java:1972)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.moveFromLocalFile(EmrFileSystem.java:419)

Note that this error occurs frequently when the number of 20 GB files is in the thousands, and less frequently when it is in the hundreds.

I need some guidance on how to go about debugging this.

There is 1 solution below

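S3 limits a single PUT upload to 5 GB. Larger objects have to go through multipart upload, which is the step failing in the stack trace above (MultipartUploadOutputStream.close). You could either compress your files before uploading them or split them into multiple smaller files.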
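
The splitting option could look roughly like the following. This is a minimal sketch, not something taken from Hadoop or the EMR file system: the `FileSplitter` class, its `split` method, and the `.partNNNNN` naming scheme are all hypothetical, and the maximum part size is left to the caller.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or EMR.
public class FileSplitter {

    /**
     * Splits source into consecutive part files of at most maxPartBytes each,
     * written into outDir. Returns the parts in order. An empty source yields
     * an empty list.
     */
    public static List<Path> split(Path source, Path outDir, long maxPartBytes) throws IOException {
        if (maxPartBytes <= 0) {
            throw new IllegalArgumentException("maxPartBytes must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> parts = new ArrayList<>();
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            int index = 0;
            while (position < size) {
                long remainingInPart = Math.min(maxPartBytes, size - position);
                // Assumed naming scheme: <original name>.part00000, .part00001, ...
                Path part = outDir.resolve(
                        String.format("%s.part%05d", source.getFileName(), index++));
                try (FileChannel out = FileChannel.open(part,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE)) {
                    // transferTo may copy fewer bytes than requested, so loop
                    // until this part has received all of its bytes.
                    while (remainingInPart > 0) {
                        long copied = in.transferTo(position, remainingInPart, out);
                        if (copied <= 0) {
                            throw new IOException("Unexpected end of file in " + source);
                        }
                        position += copied;
                        remainingInPart -= copied;
                    }
                }
                parts.add(part);
            }
        }
        return parts;
    }
}
```

Each returned part could then be uploaded on its own, for example with the same moveFromLocalFile call used in the question. Whatever reads the data back from S3 would need to reassemble the parts in order, so this only fits if downstream consumers can handle split files.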