Can a reduce task accept compressed data in Hadoop?


We see that a map task can accept and output both compressed and uncompressed data. I was going through Cloudera training and the instructor mentioned that a reduce task's input has to be in the form of key-value pairs, and therefore it can't work on compressed data.

Is that right? If it is, how can I handle network latency when transferring big data from the shuffle/partition phase to the reduce task?

Thanks for your help.


2 Answers


If the Mapper outputs compressed data, the Reducer can of course accept it. Compression of intermediate output is transparent to both of them: the map output is compressed on write and decompressed automatically before the reduce function sees it.

I think the instructor must have meant that Hadoop decompresses that compressed input for you, since the Reducer is not expecting compressed data that it has to decompress itself.

Reducers can also output compressed data, and that's controlled separately; you can choose the codec there too. Likewise, compressed files are decompressed automatically when read as input to a Mapper.
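
For example, here is a minimal sketch of turning on compressed reducer (job) output in a driver, using the new org.apache.hadoop.mapreduce API; the job name and surrounding setup are hypothetical, the FileOutputFormat calls are the real API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  Job job = Job.getInstance(new Configuration(), "wordcount");  // hypothetical job setup
  // Compress the final (reducer) output and pick the codec explicitly.
  FileOutputFormat.setCompressOutput(job, true);
  FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);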

There are some catches, though: for example, gzip-compressed files can't be split, so a single Mapper has to read the whole file, which is bad for parallelism. A bzip2-compressed file, by contrast, can be split.
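
If you want to check programmatically whether an input file will be splittable, here is a small sketch that mirrors how the input format decides, using the real CompressionCodecFactory and SplittableCompressionCodec types (the class name SplitCheck and reading the path from args are just for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;
  import org.apache.hadoop.io.compress.SplittableCompressionCodec;

  public class SplitCheck {
      public static void main(String[] args) {
          CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
          // Codecs are looked up by file extension, e.g. ".gz" or ".bz2".
          CompressionCodec codec = factory.getCodec(new Path(args[0]));
          if (codec == null) {
              System.out.println("uncompressed: splittable");
          } else if (codec instanceof SplittableCompressionCodec) {
              System.out.println("splittable (e.g. bzip2)");  // BZip2Codec implements it
          } else {
              System.out.println("not splittable (e.g. gzip): one mapper per file");
          }
      }
  }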


Yes, it can. Just add this in your driver class's main method:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.SnappyCodec;

  Configuration conf = new Configuration();
  // Compress intermediate map output with Snappy. In Hadoop 2+ these
  // properties are named "mapreduce.map.output.compress" and
  // "mapreduce.map.output.compress.codec"; the old names still work.
  conf.setBoolean("mapred.compress.map.output", true);
  conf.setClass("mapred.map.output.compression.codec", SnappyCodec.class, CompressionCodec.class);
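
Snappy is a common choice for intermediate map output because it trades compression ratio for very fast compression and decompression, which is exactly what you want for shuffle traffic.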