Do Spark's stages within a job run in parallel?



I know that within a Spark job, multiple stages can run in parallel, but when I checked, it looked like the executors were just context switching between stages.

For example, the Spark UI reports that two stages are running in parallel, but they don't actually look parallel: the executors just seem to keep changing which stage they are working on, as if they were context switching.

Sorry for not attaching the screenshots.

Thanks.

My question is: is there any way to make the executors work across stages in parallel? For example, if I have 8 executors, is there a way to put 4 executors on Stage 2 and 4 executors on Stage 3 at the same time?

2

There are 2 best solutions below

4
thebluephantom

In general, Stages run sequentially. One Stage must complete before the other starts. That is the Spark paradigm.

See https://queirozf.com/entries/apache-spark-architecture-overview-clusters-jobs-stages-tasks for an overview of clusters, jobs, stages and tasks. Noting the OP's comment.

The exception is when a Spark App contains more than one completely unrelated set of transformation paths, which allows parallel Stage execution. In all honesty I have never done that; I used N separate Apps instead.
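
To make that concrete, here is a minimal PySpark sketch of the "unrelated transformation paths" case: two independent lineages whose actions are triggered from separate threads, so their jobs (and therefore their stages) can be scheduled at the same time. The app name and the function names are illustrative, not anything Spark requires.

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unrelated-lineages-sketch").getOrCreate()

# Two completely unrelated transformation paths (no shared lineage).
df_a = spark.range(0, 50_000_000)
df_b = spark.range(0, 50_000_000)

def job_a():
    # Action on the first lineage -> its own job and stages.
    return df_a.filter(df_a.id % 2 == 0).count()

def job_b():
    # Action on the second lineage -> a separate, independent job.
    return df_b.filter(df_b.id % 3 == 0).count()

# Triggering the two actions from separate threads lets the scheduler
# run their stages at the same time instead of one after the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(job_a), pool.submit(job_b)]
    print([f.result() for f in futures])

spark.stop()
```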

0
Ziya Mert Karakas

Once the lazily evaluated plan is turned into a DAG, Spark's scheduler can submit multiple stages of a job for execution concurrently if they have no dependencies on each other. This means that if Stage 2 and Stage 3 have no data dependencies or shuffling requirements between them, they could potentially run concurrently. If there are data dependencies between stages, such as shuffling data between partitions, the execution of stages will not be entirely parallel: a stage that depends on the output of a previous stage has to wait for that stage to complete before it can start.
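
As a rough illustration of independent stages inside a single job, here is a sketch where two aggregation branches only meet at a final join. The shapes and column names ("k") are made up for the example; the point is that the two shuffle stages feeding the join have no dependency on each other, so the scheduler may run their tasks concurrently, resource permitting.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sibling-stages-sketch").getOrCreate()

# Two independent aggregation branches that only meet at the final join.
branch_a = (spark.range(0, 10_000_000)
            .withColumn("k", col("id") % 100)
            .groupBy("k").count())

branch_b = (spark.range(0, 10_000_000)
            .withColumn("k", col("id") % 100)
            .groupBy("k").sum("id"))

# A single action -> one job; the two shuffle stages that feed the join have
# no dependency on each other, so both can be submitted at the same time.
print(branch_a.join(branch_b, "k").count())

spark.stop()
```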

Another factor to consider is that Executors in Spark are separate JVM processes, and they don't context switch like threads within a single process. However, there might be contention for resources like CPU, memory, and I/O, which can lead to varying degrees of concurrency.

From the documentation:

"Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings."
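
Building on that quote, here is a hedged sketch of what enabling fair sharing might look like in PySpark: the scheduler mode is switched to FAIR, and each thread tags its jobs with a scheduler pool via sc.setLocalProperty. The pool names ("pool_a", "pool_b") and data sizes are illustrative only; without an allocation file, pools are created on demand with default settings.

```python
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

# Assumption: fair scheduling is wanted so concurrent jobs share executors
# instead of the first job taking every available slot (the FIFO default).
spark = (SparkSession.builder
         .appName("fair-scheduling-sketch")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
sc = spark.sparkContext

def run_in_pool(pool_name, df):
    # setLocalProperty is thread-local, so each thread's jobs land in its own pool.
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    return df.selectExpr("sum(id)").collect()

df1 = spark.range(0, 50_000_000)
df2 = spark.range(0, 50_000_000)

with ThreadPoolExecutor(max_workers=2) as ex:
    f1 = ex.submit(run_in_pool, "pool_a", df1)
    f2 = ex.submit(run_in_pool, "pool_b", df2)
    print(f1.result(), f2.result())

spark.stop()
```

With FAIR mode, tasks from the two concurrent jobs are interleaved across the executors rather than one job monopolising the cluster, which is the closest Spark gets to "splitting" executors between stages; there is no way to pin specific executors to a specific stage.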


Sources:

https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application