Change the default delimiter of the mapreduce


Hi, I am a beginner with MapReduce, and I want to write a WordCount program that outputs K/V pairs. However, I don't want to use the tab character as the key/value delimiter in the output file. How can I change it?

The code I use is slightly different from the example one. Here is the driver class.

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Job1");
    job.setJarByClass(Simpletask.class);
    job.setMapperClass(TokenizerMapper.class);
    //job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Since I want each output file name to correspond to the reducer's partition, I call write() on a MultipleOutputs instance (mos) in the reduce function, so the code is slightly different.

    public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text val : values) {
            // Each value is assumed to be a comma-separated record
            String[] entry = val.toString().split(",");
            // Assume the MBR is entry 1. It can be replaced by invoking a function to calculate MBR([coordinates])
            String MBR = entry[1];
            String mes_line = entry[0] + ",MBR" + MBR + " ";
            result.set(mes_line);
            // Write to a file whose name is derived from the key
            mos.write(key, result, generateFileName(key));
        }
    }

Any help will be appreciated! Thank you!


There is 1 best solution below


Since you are using FileInputFormat, the key passed to the mapper is the offset of the line in the file, and the value is the line itself. It is up to the mapper to split the input line on whatever delimiter you want, which you can do in the map method when you read the record. The default behavior comes from the specific input format in use, such as TextInputFormat.
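
To make the splitting step concrete, here is a minimal plain-Java sketch of what a map method could do with each line, with no Hadoop dependency so it can be run on its own. The class name DelimiterSketch and the helpers splitRecord and formatPair are hypothetical names made up for illustration; they are not part of Hadoop. For the output side that the question asks about, TextOutputFormat reads its key/value separator from the mapreduce.output.textoutputformat.separator property in Hadoop 2 and later, so setting that on the Configuration before calling Job.getInstance is worth trying; formatPair below only imitates the resulting line layout.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Hadoop.
final class DelimiterSketch {

    // Split one input line on a caller-chosen delimiter, as a map() method might.
    // Pattern.quote stops characters such as '|' or '.' from being treated as regex syntax;
    // the -1 limit keeps trailing empty fields instead of silently dropping them.
    public static String[] splitRecord(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    // Join a key and a value with a chosen separator, imitating the
    // "key<separator>value" layout of a text output line.
    public static String formatPair(String key, String value, String separator) {
        return key + separator + value;
    }

    public static void main(String[] args) {
        String[] fields = splitRecord("42|alpha|beta", "|");
        // Emit the first field as the key and the second as the value, comma-separated.
        System.out.println(formatPair(fields[0], fields[1], ","));
    }
}
```

In a real mapper the same split would be applied to value.toString(), and the delimiter could be read from the job Configuration rather than hard-coded.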