How to merge the small files on S3 generated by EMR with thousands of reducers

1.3k Views Asked by rninja At 27 July 2025 at 20:11

My Cascalog EMR job generated thousands of small files in an S3 bucket. It generates one file per reducer, so the number of files equals the number of reducers I used. Dumping all these tiny files takes minutes. Is there a way to concatenate them on S3 so that I can dump them quickly?

Thanks

Kang


There is 1 best solution below

hiroprotagonist On 24 April 2013 at 05:40

There are a few solutions to this problem -- here is the one I use:

https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/Consolidator.java
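For readers who want the general idea without reading the linked class, here is a minimal sketch of consolidation: group small files into batches of roughly a target size, then concatenate each batch into one larger file. This is not the Consolidator's API. The function names `plan_batches` and `merge_files` are hypothetical, and the sketch assumes the files are reachable as ordinary local paths and that plain byte concatenation is valid for the output format (true for line-oriented text, not necessarily for other formats). Against S3 you would swap the file I/O for an S3 client or a Hadoop job.

```python
import shutil


def plan_batches(sizes, target_bytes):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def merge_files(paths, dest):
    """Concatenate the files at `paths`, in order, into `dest`."""
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, `plan_batches([("part-00000", 4), ("part-00001", 4), ("part-00002", 4)], 8)` groups the first two files together and leaves the third in its own batch; each batch is then passed to `merge_files`.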
