Part of key changes when iterating through values when using composite key - Hadoop

157 Views Asked by At

I have implemented Secondary sort on Hadoop and I don't really understand the behavior of the framework.

I have created a composite key which contains original key and part of value, that is used for sorting.

To achieve this I have implemented my own partitioner

public class CustomPartitioner extends Partitioner<CoupleAsKey, LongWritable>{

@Override
public int getPartition(CoupleAsKey couple, LongWritable value, int numPartitions) {

    return Long.hashCode(couple.getKey1()) % numPartitions;
}

My own group comparator

public class GroupComparator extends WritableComparator {

protected GroupComparator()
{
    super(CoupleAsKey.class, true);
}

@Override
public int compare(WritableComparable w1, WritableComparable w2) {

    CoupleAsKey c1 = (CoupleAsKey)w1;
    CoupleAsKey c2 = (CoupleAsKey)w2;

    return Long.compare(c1.getKey1(), c2.getKey1());
}

}

And defined the couple in the following way

public class CoupleAsKey implements WritableComparable<CoupleAsKey>{

private long key1;
private long key2;

public CoupleAsKey() {
}

public CoupleAsKey(long key1, long key2) {
    this.key1 = key1;
    this.key2 = key2;
}

public long getKey1() {
    return key1;
}

public void setKey1(long key1) {
    this.key1 = key1;
}

public long getKey2() {
    return key2;
}

public void setKey2(long key2) {
    this.key2 = key2;
}

@Override
public void write(DataOutput output) throws IOException {

    output.writeLong(key1);
    output.writeLong(key2);

}

@Override
public void readFields(DataInput input) throws IOException {

    key1 = input.readLong();
    key2 = input.readLong();
}

@Override
public int compareTo(CoupleAsKey o2) {

    int cmp = Long.compare(key1, o2.getKey1());

    if(cmp != 0)
        return cmp;

    return Long.compare(key2, o2.getKey2());
}

@Override
public String toString() {
    return key1 + ","  + key2 + ",";
}

}

And here is the driver

Configuration conf = new Configuration();
    Job job = new Job(conf);

    job.setJarByClass(SSDriver.class);

    job.setMapperClass(SSMapper.class);
    job.setReducerClass(SSReducer.class);

    job.setMapOutputKeyClass(CoupleAsKey.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setPartitionerClass(CustomPartitioner.class);
    job.setGroupingComparatorClass(GroupComparator.class);

    FileInputFormat.addInputPath(job, new Path("/home/marko/WORK/Whirlpool/input.csv"));
    FileOutputFormat.setOutputPath(job, new Path("/home/marko/WORK/Whirlpool/output"));

    job.waitForCompletion(true);

Now, this works, but what is really strange is that while iterating in reducer for a key, second part of the key (the value part) changes in each iteration. Why and how?

 @Override
protected void reduce(CoupleAsKey key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

    for (LongWritable value : values) {

        //key.key2 changes during iterations, why?
        context.write(key, value);
    }

}
1

There are 1 best solutions below

0
On

Definition says that "if you want all your relevant rows within a partition of data sent to a single reducer you must implement a grouping comparator". This only ensures that those set of keys will be sent to a single reduce call, and not that the key will change from composite (or whatever) to something that only contains that part of key on which grouping was done.

However, when you iterate over values, the corresponding keys will also change. We normally do not observe this happening, as by default the values are grouped on the same (non-composite) key, and thus, even when the value changes, the (value of-) key remains the same.

You can try printing the object reference of the key, and you shall notice that with every iteration, the object reference of the key is also changing (like this:)

IntWritable@1235ft
IntWritable@6635gh
IntWritable@9804as

Alternatively, you can also try applying a group-comparator on an IntWritable in a following way (you will have to write your own logic to do so):

Group1:    
1        a    
1        b    
2        c

Group2:
3        c
3        d
4        a

and you shall see that with every iteration of value, your key is also changing.