reduceByKey with two columns in Spark


I am trying to group by two columns in Spark and am using reduceByKey as follows:

pairsWithOnes = rdd.map(lambda input: (input.column1, input.column2, 1))
print pairsWithOnes.take(20)

The above map command works fine and produces an RDD of three-element tuples, with the third element always 1. I tried summing the third element, grouped by the first two, as follows:

reduced = pairsWithOnes.reduceByKey(lambda a, b, c: a + b + c)
print reduced.take(20)

However, running the last print statement raises a "too many values to unpack" error. Could someone guide me on the right way to reduce by two columns?

Best answer:


As far as I understand, your goal is to count (column1, column2) pairs. The error comes from the fact that reduceByKey expects an RDD of (key, value) 2-tuples: Spark tries to unpack each of your 3-tuples into a key and a value, which fails with "too many values to unpack". The reducing function must also take exactly two arguments (two values for the same key), not three. Your input looks more or less like this:

from numpy.random import randint, seed
from pyspark.sql import Row

seed(323)

# sc is an existing SparkContext
rdd = sc.parallelize(
    Row(column1=randint(0, 5), column2=randint(0, 5)) for _ in range(1000))
rdd.take(3)

Result:

[Row(column1=0, column2=0),
 Row(column1=4, column2=4),
 Row(column1=3, column2=2)]

First of all, you have to key each record by the (column1, column2) pair, keeping 1 as the value:

pairsWithOnes = rdd.map(lambda input: ((input.column1, input.column2), 1))
pairsWithOnes.take(3)

Result:

[((0, 0), 1), ((4, 4), 1), ((3, 2), 1)]

All that's left is a simple reduceByKey, which sums the ones per key:

pairsWithOnes.reduceByKey(lambda x, y: x + y).take(3)

Result:

[((1, 3), 37), ((3, 0), 43), ((2, 1), 40)]
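
For comparison, here is a minimal sketch of the same count using the DataFrame API. It assumes the same rdd of Row objects and an active SQLContext/SparkSession so that toDF() is available:

# Same aggregation via the DataFrame API (sketch): groupBy builds the
# composite key and count() replaces the manual (key, 1) pairs.
df = rdd.toDF()
df.groupBy("column1", "column2").count().show(3)

This version also avoids shipping a Python lambda to the executors, since the aggregation runs inside Spark's SQL engine.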