Confused about the behavior of Reduce function in map reduce


I'm having problems with the following map-reduce exercise in Spark with Python. My map function returns the following RDD.

rdd = [(3, ({0: [2], 1: [5], 3: [1]}, set([2]))),
       (3, ({0: [4], 1: [3], 3: [5]}, set([1]))),
       (1, ({0: [4, 5], 1: [2]}, set([3])))]

I wrote a reducer function that is supposed to do some computations on tuples with the same key (in the example above, the first two tuples have key 3 and the last one has key 1).

def Reducer(k, v):
    cluster = k[0]
    rows = [k[1], v[1]]
    g_p = {}
    I_p = set()
    for g, I in rows:
        g_p = CombineStatistics(g_p, g)
        I_p = I_p.union(I)
    return (cluster, [g_p, I_p])

The problem is that I expect k and v to always have the same key (i.e. k[0] == v[0]), but that is not the case with this code.

I'm working on the Databricks platform, and honestly it is a nightmare not being able to debug; sometimes not even 'print' works. It's really frustrating to work in this environment.

BEST ANSWER

If you want to reduce an RDD based on the same key, you should use the reduceByKey transformation instead of reduce. After replacing the function name, keep in mind that the arguments passed to the reduceByKey function are two values that share the same key (k[1] and v[1] in your case), not whole RDD rows.
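As a rough sketch (not part of your original code), assuming CombineStatistics is the merge helper you already have defined elsewhere, the reducer could operate on two value tuples like this:

def combine_values(a, b):
    # a and b are the value parts only, e.g. ({0: [2], 1: [5], 3: [1]}, set([2]))
    g_a, I_a = a
    g_b, I_b = b
    # merge the dicts with your existing helper and union the sets
    return (CombineStatistics(g_a, g_b), I_a.union(I_b))

reduced = rdd.reduceByKey(combine_values)

Spark then takes care of grouping by key (3 and 1 in your example), so the function never sees mismatched keys.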

Prints inside the reducer function will not work in a distributed environment on Databricks, because this function is evaluated on the executors (inside the Amazon cloud). If you start Spark in local mode, all Python prints will work (but I'm not sure if local mode is available on Databricks).
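A possible workaround (my suggestion, not something specific to Databricks): instead of printing inside the reducer, bring a small sample of the result back to the driver and print it there, since driver-side output does show up in the notebook:

reduced = rdd.reduceByKey(combine_values)
print(reduced.take(5))  # take() returns results to the driver, so this print is visible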