When does an action not run on the driver in Apache Spark?


I have just started with Spark and am struggling with the concept of tasks.

Can anyone please help me understand when an action (say, reduce) does not run in the driver program?

From the Spark tutorial:

"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. "

I'm currently experimenting with an application which reads a directory of 'n' files and counts the number of words.
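Roughly, the application looks like this (a minimal sketch; the directory path and setup are illustrative, not my exact code):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // One partition (and hence one task) is usually created per input file
    val lines = sc.textFile("data/*")

    // Transformations: lazy, nothing is executed yet
    val ones = lines.flatMap(_.split("\\s+")).map(_ => 1)

    // Action: reduce triggers the actual job
    val totalWords = ones.reduce(_ + _)
    println(totalWords)

    sc.stop()
  }
}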

From the web UI, the number of tasks is equal to the number of files, and all the reduce functions appear to be taking place on the driver node.

Can you please describe a scenario where the reduce function won't execute at the driver? Does a task always include "transformation + action", or only "transformation"?


There are 2 best solutions below


I'll take a stab at this, although I may be missing part of the question. A task is indeed always transformation(s) plus an action. The transformations are lazy and do not submit anything on their own, hence the need for an action. You can always call .toDebugString on your RDD to see where each job will be split; each level of indentation marks a new stage. I think the reduce function showing up on the driver is a bit of a misnomer: it runs in parallel on the workers first and then merges the partial results. So I would expect that the task does indeed run on the workers as far as it can.
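For example, a minimal sketch of inspecting the lineage with .toDebugString (the RDD here is illustrative, not the asker's actual code):

val words  = sc.textFile("data/*").flatMap(_.split("\\s+"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Each extra level of indentation in the printed lineage marks a new stage,
// i.e. a shuffle boundary.
println(counts.toDebugString)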


All the actions are performed on the cluster, and the results of the actions may end up on the driver (depending on the action).

Generally speaking, the Spark code you write around your business logic is not the program that actually runs - rather, Spark uses it to create a plan that will execute your code on the cluster. The plan groups into a single task all the operations that can be done on a partition without the need to shuffle data around. Every time Spark needs the data arranged differently (e.g. after sorting), it will create a new task, with a shuffle between the former and the latter tasks.
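As a rough sketch (illustrative code, assuming a SparkContext sc and a directory data/), the following job is split into two stages because sortByKey needs the data arranged differently:

// The map side runs as one set of tasks, one per partition.
val pairs  = sc.textFile("data/*").map(line => (line.length, line))

// sortByKey forces a shuffle, so Spark creates a new stage whose tasks
// run only after the shuffled data is in place.
val sorted = pairs.sortByKey()

// The action triggers the whole plan; only the small collected result
// is sent back to the driver.
sorted.take(5).foreach(println)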