reduceByKey with two columns in Spark


I am trying to group by two columns in Spark and am using reduceByKey as follows:

pairsWithOnes = rdd.map(lambda input: (input.column1, input.column2, 1))
print pairsWithOnes.take(20)

The above map command works fine and produces an RDD of three-element tuples, with the third element always 1. I tried summing the third element, grouped by the first two, as follows:

reduced = pairsWithOnes.reduceByKey(lambda a, b, c: a + b + c)
print reduced.take(20)

However, running the last print statement raises a "too many values to unpack" error. Could someone guide me on the right way to reduce by two columns?

Best answer:


As far as I understand, your goal is to count (column1, column2) pairs. The error comes from the fact that reduceByKey expects an RDD of (key, value) 2-tuples: Spark tries to unpack each of your 3-tuples into a key and a value, which fails with "too many values to unpack". The reducing function must also take exactly two arguments (two values for the same key), not three. Your input looks more or less like this:

from numpy.random import randint, seed
from pyspark.sql import Row

seed(323)

# sc is an existing SparkContext
rdd = sc.parallelize(
    Row(column1=randint(0, 5), column2=randint(0, 5)) for _ in range(1000))
rdd.take(3)

Result:

[Row(column1=0, column2=0),
 Row(column1=4, column2=4),
 Row(column1=3, column2=2)]

First of all, you have to key each record by the (column1, column2) pair, keeping 1 as the value:

pairsWithOnes = rdd.map(lambda input: ((input.column1, input.column2), 1))
pairsWithOnes.take(3)

Result:

[((0, 0), 1), ((4, 4), 1), ((3, 2), 1)]

All that's left is a simple reduceByKey, which sums the ones per key:

pairsWithOnes.reduceByKey(lambda x, y: x + y).take(3)

Result:

[((1, 3), 37), ((3, 0), 43), ((2, 1), 40)]
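
For comparison, here is a minimal sketch of the same count using the DataFrame API. It assumes the same rdd of Row objects and an active SQLContext/SparkSession so that toDF() is available:

# Same aggregation via the DataFrame API (sketch): groupBy builds the
# composite key and count() replaces the manual (key, 1) pairs.
df = rdd.toDF()
df.groupBy("column1", "column2").count().show(3)

This version also avoids shipping a Python lambda to the executors, since the aggregation runs inside Spark's SQL engine.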