We see that a map task can accept and output both compressed and uncompressed data. I was going through Cloudera training, and the instructor mentioned that a reduce task's input has to be in the form of key/value pairs and thus can't work on compressed data.
Is that right? If so, how can I handle network latency when transferring big data from the shuffle/partition phase to the reduce tasks?
Thanks for your help.
If the Mapper can output compressed data, then of course the Reducer can accept compressed data. This is transparent to both of them: the intermediate output is compressed and decompressed automatically. I think the instructor must have been saying that Hadoop decompresses that compressed input for you, since the Reducer does not expect compressed data that it has to decompress itself. Compressing the map output is, in fact, the usual way to reduce how much data has to cross the network from the shuffle to the reduce tasks.
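For example, here is a minimal driver sketch that turns on map-output compression (assuming the Hadoop 2.x property names and the org.apache.hadoop.mapreduce API; the class name and the Snappy codec choice are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class CompressedShuffleDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress the intermediate map output that is shuffled to reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Snappy trades some compression ratio for speed; any installed codec works.
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compressed-shuffle-example");
        // ... configure mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Neither the Mapper nor the Reducer code changes at all; the framework compresses and decompresses the shuffled bytes for you.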
Reducers can also output compressed data, and that's controlled separately; you can choose the codec. You can also read compressed data as input to a Mapper automatically.
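Continuing the sketch above, final-output compression is set on the job through FileOutputFormat (again assuming the org.apache.hadoop.mapreduce.lib.output API):

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Compress the job's final output; independent of map-output compression.
    FileOutputFormat.setCompressOutput(job, true);
    // Gzip is fine for output that is read as a whole; prefer a splittable codec
    // such as BZip2Codec if a later MapReduce job needs to split these files.
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

For compressed input there is nothing to set: TextInputFormat and friends detect the codec from the file extension and decompress automatically.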
There are some catches, though. For example, gzip-compressed files can't be split across Mappers, and that's bad for parallelism. But a bzip2-compressed file can be split in some cases.
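You can see that distinction in the codec classes themselves: BZip2Codec implements Hadoop's SplittableCompressionCodec interface, while GzipCodec does not. A quick probe (a sketch, assuming the standard org.apache.hadoop.io.compress classes are on the classpath):

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
      public static void main(String[] args) {
        CompressionCodec[] codecs = { new GzipCodec(), new BZip2Codec() };
        for (CompressionCodec codec : codecs) {
          // Only codecs implementing SplittableCompressionCodec allow one file
          // to become several input splits, i.e. several parallel Mappers.
          System.out.println(codec.getClass().getSimpleName() + " splittable: "
              + (codec instanceof SplittableCompressionCodec));
        }
      }
    }

This prints false for gzip and true for bzip2, which is exactly why a single large gzip file ends up on one Mapper.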