We can see that a map task can accept and output both compressed and uncompressed data. I was going through Cloudera training, and the instructor mentioned that a reduce task's input has to be in the form of key-value pairs, and thus it can't work on compressed data.
Is that right? If it is, how can I handle network latency when transferring big data from the shuffler/partitioner to the reduce task?
Thanks for your help.
If the `Mapper` can output compressed data then, of course, the `Reducer` can accept compressed data. This is transparent to both of them: the output is compressed and decompressed automatically. I think the instructor must have been saying that Hadoop decompresses that compressed input for you, since the `Reducer` is not expecting compressed data that it has to decompress itself.
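That transparency is also the answer to your network-latency concern: you can compress the intermediate map output that gets shuffled to the reducers. A minimal sketch, assuming Hadoop 2.x property names (`mapreduce.map.output.compress`; older releases used `mapred.compress.map.output`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle traffic;
        // the framework decompresses it before the reducer sees it.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // DefaultCodec (zlib) works everywhere; SnappyCodec is a common
        // faster choice but needs the native Snappy libraries installed.
        conf.setClass("mapreduce.map.output.compress.codec",
                DefaultCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed-shuffle");
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}
```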
`Reducer`s can also output compressed data, and that's controlled separately; you can choose the codec. You can also read compressed data as input to a `Mapper` automatically.
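Final output compression is enabled per job through `FileOutputFormat`. A sketch (the output path is made up for illustration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutput {
    // Compress what the reducers write, and pick the codec per job.
    static void enableOutputCompression(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // Hypothetical output path, just for illustration.
        FileOutputFormat.setOutputPath(job, new Path("/user/me/out"));
    }
}
```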
There are some catches, though: for example, `gzip`-compressed files can't be split by a `Mapper`, and that's bad for parallelism. But a `bzip2`-compressed file can be split in some cases.
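If you want to check how Hadoop will treat a given input file, you can ask `CompressionCodecFactory` which codec matches the file extension and whether it implements `SplittableCompressionCodec` (`BZip2Codec` does, `GzipCodec` doesn't). A small sketch with made-up file names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
        // Hypothetical file names; the codec is inferred from the extension.
        for (String name : new String[] {"logs.gz", "logs.bz2", "logs.txt"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            if (codec == null) {
                System.out.println(name + ": no codec (plain file, splittable)");
            } else {
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(name + ": " + codec.getClass().getSimpleName()
                        + ", splittable=" + splittable);
            }
        }
    }
}
```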