Normally, a Hadoop map/reduce job produces a list of key-value pairs that are written to the job's output file (using an OutputFormat class). Rarely are both the keys and the values useful; usually either the keys or the values contain the required information.
Is there an option (on the client side) to suppress either the keys or the values in the output file?
If I wanted to do this for just one particular job, I could create a new OutputFormat implementation that ignores keys or values. But I need a generic solution that is reusable across jobs.
EDIT: It might be unclear what I mean by "I need a generic solution that is reusable for more jobs." Let me explain with an example:
Let's say I have many prepared Mapper, Reducer, and OutputFormat classes. I want to combine them into different jobs and run those jobs on different input files to produce various output files. In some cases (for some jobs) I need to suppress the keys so they are not written to the output file. I do not want to change the code of my mappers, reducers, or output formats - there are just too many of them. I need a generic solution that does not require changing the code of the given mappers, reducers, or output formats. How do I do that?
There's no reason why the final step in a Hadoop flow can't be configured to write a NullWritable as either the key or the value. You just shouldn't expect that file to be of much use in any subsequent map/reduce steps.