How does the lambda function in takeOrdered work in PySpark?


I can't quite get the behavior of the lambda in the following code:

rdd = sc.parallelize([5,3,1,2])
rdd.takeOrdered(3,lambda s: -1*s)

From what I have understood, a lambda applies an operation to all elements in a list, so I expected the above code to return

[-1,-2,-3]

But it returned

[5,3,2]

What am I missing here?

There are 7 answers below

BEST ANSWER

https://spark.apache.org/docs/1.1.1/api/python/pyspark.rdd.RDD-class.html

takeOrdered(self, num, key=None) Get the N elements from a RDD ordered in ascending order or as specified by the optional key function.

So in your example you are providing an ordering (key) function, not a transformation of the elements.
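A minimal sketch of the difference, assuming a SparkContext sc is available as in the question:

rdd = sc.parallelize([5, 3, 1, 2])

rdd.takeOrdered(3)                     # [1, 2, 3]  -- default: ascending by value
rdd.takeOrdered(3, key=lambda s: -s)   # [5, 3, 2]  -- ordered by the key -s, i.e. descending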

ANSWER

rdd.takeOrdered actually accepts a key function as its second parameter.

What you want to do is this:

rdd.map(lambda s: -1*s).takeOrdered(3)

That will map your values and then take the first 3 in ascending order.

I'm not sure what Spark is doing with the lambda you're passing it, to be honest.
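For reference, a quick sketch of what that pipeline returns, assuming the same sc.parallelize([5, 3, 1, 2]) RDD from the question:

rdd = sc.parallelize([5, 3, 1, 2])

# map negates every element; takeOrdered(3) then returns the 3 smallest results
rdd.map(lambda s: -1 * s).takeOrdered(3)   # [-5, -3, -2]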

ANSWER

Probably you want to do this:

rdd.takeOrdered(3, key = lambda s: (-1*s))

ANSWER

Try mapping first:

rdd = sc.parallelize([5,3,1,2])
newRDD = rdd.map(lambda s: -1*s)

Then call an action to return or print the result (map is only a transformation), e.g.:

newRDD.collect()   # [-5, -3, -1, -2]

Then, if you want the numbers or items in a specific order (ascending or descending), you can use takeOrdered(n, key), where n is the number of items you want and key is the function that decides the order in which they are taken (a key like -1*s reverses the order).

or

newRDD = (rdd
           .map(lambda s: -1*s)
           .takeOrdered(3, lambda s: -1*s))
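
A quick sanity check of what each step produces (a sketch, assuming the same [5, 3, 1, 2] input and using a throwaway name mapped for the intermediate RDD):

mapped = rdd.map(lambda s: -1*s)        # values become [-5, -3, -1, -2]
mapped.takeOrdered(3)                   # [-5, -3, -2]  (ascending by value)
mapped.takeOrdered(3, lambda s: -1*s)   # [-1, -2, -3]  (ordered by the key -1*s)
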
ANSWER

The following means "get the first 3 elements in descending order"; the lambda is applied to the ordering key, not to the final result.

rdd.takeOrdered(3, key = lambda s: -s)

The following means "get the first 3 elements in ascending order":

rdd.takeOrdered(3, key = lambda s: s)

What you want to do is use the map function before takeOrdered. The map function is what is actually applied to each element in the list, i.e. map is what modifies each value, producing the desired output of [-1, -2, -3]:

rdd = sc.parallelize([5,3,1,2])
rdd.map(lambda s: -s).takeOrdered(3, key = lambda s: -s)
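
For reference, a sketch of what each of the snippets above returns, assuming the same [5, 3, 1, 2] data:

rdd = sc.parallelize([5, 3, 1, 2])

rdd.takeOrdered(3, key=lambda s: -s)                      # [5, 3, 2]   (descending)
rdd.takeOrdered(3, key=lambda s: s)                       # [1, 2, 3]   (ascending)
rdd.map(lambda s: -s).takeOrdered(3, key=lambda s: -s)    # [-1, -2, -3]
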
ANSWER

It might be easier to think of the second parameter to takeOrdered, the lambda, as a "key extractor" since it doesn't do any transformation on the underlying data.

In the simple case where we've got this array of numbers, the key is just the value

rdd = sc.parallelize([5,3,1,2])
rdd.takeOrdered(3, lambda x: x)  # [1,2,3]

Or, in the code you submitted, the items are sorted by the negation of the value (-5 < -3 < -2 ...).

rdd.takeOrdered(3, lambda x: -x) #[5,3,2]

All you're doing when you give the lambda to takeOrdered is telling it what you'd like it ordered by. If you want additional transformations, they must happen in another step.

To return the output you wanted, you could map the items to their inverse and then take them sorted by the original value (inverse of the inverse):

(rdd.map(lambda x: -x)              # [-5, -3, -1, -2]
    .takeOrdered(3, lambda x: -x))  # [-1, -2, -3]
ANSWER

It's very similar to Python's built-in sorted function. Check out the examples under "Key Functions" on this page: https://wiki.python.org/moin/HowTo/Sorting

You started with [5, 3, 1, 2].

Imagine that the keys are attached as [(5, -5), (3, -3), (1, -1), (2, -2)].

Then, you sort it by keys in ascending order so you get: [(5, -5), (3, -3), (2, -2), (1, -1)].

Now, ignore the second element (the key) from each pair: [5, 3, 2, 1]

Then, select the first 3 items: [5, 3, 2]
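
The same decorate-sort-take steps can be reproduced with plain Python's sorted, which accepts the same kind of key function (a sketch, no Spark needed):

data = [5, 3, 1, 2]

# Attach the key -x to each value, as described above
decorated = [(x, -x) for x in data]        # [(5, -5), (3, -3), (1, -1), (2, -2)]

# Sort by the key (the second element of each pair), ascending
decorated.sort(key=lambda pair: pair[1])   # [(5, -5), (3, -3), (2, -2), (1, -1)]

# Drop the keys and take the first 3 items
[x for x, _ in decorated][:3]              # [5, 3, 2]

# Which is what takeOrdered(3, lambda x: -x) does in one step:
sorted(data, key=lambda x: -x)[:3]         # [5, 3, 2]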