How to filter keys or values in Hadoop map/reduce job output file?


Normally, a Hadoop MapReduce job produces a list of key-value pairs that are written to the job's output file (via its OutputFormat class). Only rarely are both keys and values useful; usually either the keys or the values carry the required information.

Is there a client-side option to suppress keys in the output file, or to suppress values? If I wanted this for just one particular job, I could write a new OutputFormat implementation that ignores keys or values. But I need a generic solution that is reusable across many jobs.

EDIT: It might be unclear what I mean by "a generic solution that is reusable across many jobs." Let me explain with an example:

Let's say I have many prepared Mapper, Reducer, and OutputFormat classes. I want to combine them into different jobs and run those jobs on different input files to produce various output files. In some cases (for some jobs) I need to suppress the keys so they are not written to the output file. I do not want to change the code of my mappers, reducers, or output formats; there are simply too many of them. I need a generic solution that does not require changing the code of any given mapper, reducer, or output format. How do I do that?
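One way to get this kind of reusability is a decorator OutputFormat that wraps whatever real OutputFormat a job would otherwise use and discards the key before delegating. The sketch below is my own illustration, not part of any Hadoop release: the class name `KeySuppressingOutputFormat` and the configuration key `keysuppressing.wrapped.outputformat` are invented, and it assumes the new (`org.apache.hadoop.mapreduce`) API.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

// Generic decorator: delegates everything to a configured "real" OutputFormat,
// but replaces every key with NullWritable before it is written. No mapper,
// reducer, or existing OutputFormat has to change.
public class KeySuppressingOutputFormat<K, V> extends OutputFormat<K, V> {

    // Config property naming the wrapped OutputFormat class (illustrative name).
    public static final String WRAPPED_FORMAT = "keysuppressing.wrapped.outputformat";

    @SuppressWarnings("unchecked")
    private OutputFormat<NullWritable, V> wrapped(JobContext context) {
        Class<?> cls = context.getConfiguration().getClass(
                WRAPPED_FORMAT, TextOutputFormat.class, OutputFormat.class);
        return (OutputFormat<NullWritable, V>)
                ReflectionUtils.newInstance(cls, context.getConfiguration());
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        final RecordWriter<NullWritable, V> delegate =
                wrapped(context).getRecordWriter(context);
        return new RecordWriter<K, V>() {
            @Override
            public void write(K key, V value)
                    throws IOException, InterruptedException {
                // Drop the key: TextOutputFormat writes neither a NullWritable
                // key nor the key-value separator, so only the value appears.
                delegate.write(NullWritable.get(), value);
            }

            @Override
            public void close(TaskAttemptContext c)
                    throws IOException, InterruptedException {
                delegate.close(c);
            }
        };
    }

    @Override
    public void checkOutputSpecs(JobContext context)
            throws IOException, InterruptedException {
        wrapped(context).checkOutputSpecs(context);
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        return wrapped(context).getOutputCommitter(context);
    }
}
```

In the driver, a job that should suppress keys would then set `job.setOutputFormatClass(KeySuppressingOutputFormat.class)` and, if the real format is not TextOutputFormat, configure it via `job.getConfiguration().setClass(KeySuppressingOutputFormat.WRAPPED_FORMAT, ..., OutputFormat.class)`; all other jobs stay untouched.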

1 Answer

There's no reason why the final step in a Hadoop flow can't be configured to write a NullWritable as either the key or the value. Just don't expect that file to be of much use in any subsequent MapReduce steps.
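A minimal sketch of that NullWritable approach (the reducer name is illustrative): the final reducer emits `NullWritable.get()` as the key, and with the default TextOutputFormat each output line then contains only the value, with no key and no tab separator.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Final-step reducer that suppresses keys in the job output.
// The driver must match it: job.setOutputKeyClass(NullWritable.class);
//                           job.setOutputValueClass(Text.class);
public class ValueOnlyReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // TextOutputFormat omits NullWritable keys entirely,
            // so only the value is written on each line.
            context.write(NullWritable.get(), value);
        }
    }
}
```

Swapping the type parameters (`Reducer<Text, Text, Text, NullWritable>`) and writing `context.write(key, NullWritable.get())` would suppress the values instead.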