I have some log files compressed with LZO at compression level 7 and with gzip at the default level. My results are as follows:
Runtime of the same MapReduce job over each file:
- 1GB .gz file - 340 seconds
- 1GB .lzo file un-indexed - 410 seconds
- 1GB .lzo file indexed - 380 seconds
The only difference in the MapReduce job is that it uses the Hadoop-LZO library's LzoTextInputFormat class instead of the usual TextInputFormat class.
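For reference, here is roughly what that swap looks like in the job driver. This is a sketch, not my exact code: it assumes the twitter/hadoop-lzo jar is on the classpath (providing `com.hadoop.mapreduce.LzoTextInputFormat`), and the job name, mapper-free setup, and paths are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat; // from hadoop-lzo

public class LzoLogJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lzo-log-job"); // illustrative name

        // The only change from the gzip version of the job: with a matching
        // .index file present, this input format makes .lzo files splittable,
        // so multiple map tasks can read one file in parallel.
        job.setInputFormatClass(LzoTextInputFormat.class); // was TextInputFormat.class

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```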
With the indexed file I see 37 map tasks, so the job is being split and the .index file is being used, but the performance still leaves a lot to be desired. Any ideas?