Multiple-column sorting in Hadoop Streaming (EMR)


I'm trying to sort the mapper output with a different sort order for each column. My output looks like this:

xx yy 2 4
xx yy 1 5
xx yy 5 39
xx yy 8 3

So the first 2 columns are text and the last 2 columns are numbers.

These are the options I'm passing:

-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D "mapreduce.partition.keycomparator.options=-k1,2 -k3,3nr -k4,4nr"

With these options the output is not sorted numerically, only alphabetically.
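
To make the expected result concrete, here is a minimal sketch in plain Python of the ordering those options seem intended to produce on the sample rows: columns 1-2 ascending as text, then columns 3 and 4 descending as numbers. This is only an illustration of the target order, not Hadoop code. The names `intended_key` and `sort_intended` are hypothetical, and it assumes whitespace-separated fields with integer values in the last two columns.

```python
# Sample mapper output from the question (whitespace-separated fields).
rows = [
    "xx yy 2 4",
    "xx yy 1 5",
    "xx yy 5 39",
    "xx yy 8 3",
]


def intended_key(line):
    """Build a sort key: columns 1-2 as text, columns 3-4 as negated integers."""
    fields = line.split()
    # Negating the integers makes an ascending sort behave like a
    # descending numeric sort on columns 3 and 4.
    return (fields[0], fields[1], -int(fields[2]), -int(fields[3]))


def sort_intended(lines):
    """Return the lines in the order the question is asking for."""
    return sorted(lines, key=intended_key)


if __name__ == "__main__":
    for line in sort_intended(rows):
        print(line)
```

A purely alphabetical comparison would place a value such as 9 ahead of 10 in descending order, which is the kind of difference that shows up once the numeric columns have more than one digit.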

I also tried:

-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr'

but got an error that -k3,3nr is not a valid parameter.

Ideas?
