My cascalog EMR job generated thousands of small files on S3 buckets. It generate the same number of files as the number of reducers I used. Dumping all these tiny files take minutes. I wonder if there is a way to concat them on S3 so that I can dump them quickly?
Thanks
Kang
There are a few solutions to this problem -- here is the one I use:
https://github.com/nathanmarz/dfs-datastores/blob/develop/dfs-datastores/src/main/java/com/backtype/hadoop/Consolidator.java