I am trying to group by two columns in Spark, using reduceByKey as follows:
pairsWithOnes = (rdd.map(lambda input: (input.column1,input.column2, 1)))
print pairsWithOnes.take(20)
The above map command works fine and produces three columns, the third being all ones. I then tried summing the third column, grouped by the first two, as follows:
reduced = pairsWithOnes.reduceByKey(lambda a,b,c : a+b+c)
print reduced.take(20)
However, running the last print command throws a "too many values to unpack" error. Could someone guide me on the right way to reduce by two columns?
As far as I understand, your goal is to count (column1, column2) pairs. The problem with your code is that reduceByKey operates on an RDD of (key, value) pairs, and the function you pass it receives two values at a time (never the key), so a three-argument lambda cannot work and a flat three-element tuple is not a valid (key, value) pair.

First of all, you have to key each record by the composite (column1, column2) tuple, with 1 as the value. All that is left then is a simple reduceByKey that sums the ones.