I am fairly new to Hadoop and MapReduce programming. I want to know whether it is possible to group by another value (not the key) after joining two files.
I have two files with the following data:
File1
name gender
A Male
B Male
C Female
File2
name marks
A 25
B 28
A 30
C 22
Is there any way to find the percentage of marks for each gender? I am trying to get the following output:
Male percentage_of_marks_of_male_students
Female percentage_of_marks_of_female_students
Is there any way to do this in a single job? I've tried using two jobs for this, but couldn't make any headway.
Any tips would be appreciated.
Edit:
After joining the files on name, I get something like this:
{name1 - ["gender","marks1","marks2",...]}
{name2 - ["gender","marks1","marks2",...]}
{name3 - ["gender","marks1","marks2",...]}
...
I'm currently stuck on finding the sum of marks for males and females separately in the reduce phase.
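To make the step I'm stuck on concrete, here is a minimal sketch in plain Python rather than actual Hadoop code. It assumes each joined record has the gender as its first element followed by the marks as strings (in a real reduce-side join the value order is not guaranteed, so the values would need tagging). The function name `rekey_by_gender` and the sample records are made up for illustration.

```python
from collections import defaultdict

def rekey_by_gender(joined):
    """Turn {name: [gender, mark1, mark2, ...]} into {gender: total_marks}."""
    totals = defaultdict(int)
    for name, values in joined.items():
        # Assumption: the first value is the gender, the rest are marks.
        gender, marks = values[0], values[1:]
        # Drop the name and accumulate this student's marks under the gender.
        totals[gender] += sum(int(m) for m in marks)
    return dict(totals)

# Joined records built from the sample files above.
joined = {
    "A": ["Male", "25", "30"],
    "B": ["Male", "28"],
    "C": ["Female", "22"],
}
print(rekey_by_gender(joined))
```

The point is that the name stops being the key once the join is done: the gender becomes the new key and the marks are summed under it.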
Edit:
I have solved the problem using two jobs. The first job joins the two files and outputs one record per student:
[gender, sum of marks of that student]
I pass that output file as input to the second job, which computes the percentage of marks by gender.
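For anyone with the same problem, this is roughly the logic of the two jobs, sketched in plain Python instead of Hadoop code. The function names `job1_join` and `job2_percentage` are hypothetical, and the sketch assumes "percentage of marks" means each gender's share of the total marks across all students.

```python
from collections import defaultdict

def job1_join(genders, marks):
    """Job 1: join on name, emit one (gender, student_total) pair per student."""
    per_student = defaultdict(int)
    for name, mark in marks:
        per_student[name] += mark
    return [(gender, per_student[name]) for name, gender in genders]

def job2_percentage(pairs):
    """Job 2: sum the pairs by gender, then take each sum as a share of all marks."""
    totals = defaultdict(int)
    for gender, total in pairs:
        totals[gender] += total
    # Assumption: at least one non-zero mark, so the grand total is not zero.
    grand_total = sum(totals.values())
    return {g: 100.0 * t / grand_total for g, t in totals.items()}

# Sample data from File1 and File2.
genders = [("A", "Male"), ("B", "Male"), ("C", "Female")]
marks = [("A", 25), ("B", 28), ("A", 30), ("C", 22)]

pairs = job1_join(genders, marks)
for gender, pct in job2_percentage(pairs).items():
    print(f"{gender} {pct:.2f}")
```

With the sample data, the first job yields 55 and 28 for the two male students and 22 for the female student, so the shares are 83/105 and 22/105 of the total.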