Understanding the shuffle in Spark

265 Views Asked by figs_and_nuts At 30 October 2025 at 23:10

Shuffling in Spark is (as per my understanding):

1. Identify the partition that each record has to go to (hashing and modulo)
2. Serialize the data that needs to go to the same partition
3. Transmit the data
4. The data gets deserialized and read by the executors on the other end
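The steps above can be sketched in plain Python. This is only an illustration of the idea, not Spark's code: `partition_for`, `shuffle_write` and `shuffle_read` are hypothetical names, `zlib.crc32` stands in for the key hash (Spark's `HashPartitioner` uses the key's hash code modulo the number of partitions), and `pickle` stands in for whichever serializer is configured.

```python
import pickle
import zlib


def partition_for(key, num_partitions):
    # Step 1: hash the key and take it modulo the partition count.
    # crc32 is an assumption chosen because it is stable across runs.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def shuffle_write(records, num_partitions):
    # Group records by target partition, then serialize each group (step 2).
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return {p: pickle.dumps(rows) for p, rows in buckets.items()}


def shuffle_read(blocks):
    # Step 4: the receiving side deserializes every block it was sent.
    rows = []
    for block in blocks:
        rows.extend(pickle.loads(block))
    return rows


records = [("a", 1), ("b", 2), ("a", 3)]
blocks = shuffle_write(records, num_partitions=4)
# Step 3 (transmission) is omitted: here the bytes are simply handed over.
received = shuffle_read(blocks.values())
```

Records with the same key always land in the same block, which is what lets a single downstream task see all values for that key.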

I have a question about this:

How is the data transmitted between the executors, even when there is enough memory available? Assume each executor has 50 GiB of execution memory and the entire data to be shuffled is just 100 MB. Does the data go from storage memory (executor 1) to storage memory (executor 2), or are disk writes involved as intermediate steps?

Abdennacer Lachiheb On 13 December 2022 at 14:51 BEST ANSWER

Spark shuffle outputs are always written to disk.

Why? Because you cannot send data directly from one executor's memory to another executor's memory. It has to be written locally and then loaded into the receiving executor's memory. That's why serialization and deserialization happen during shuffling, and why good-quality disks (SSDs) also matter for Spark.

From blog.scottlogic.com:

During a shuffle, data is written to disk and transferred across the network, halting Spark’s ability to do processing in-memory and causing a performance bottleneck.
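To make the "written locally, then loaded" step concrete, here is a rough Python sketch of a map task writing its shuffle output to a local file and a reader later loading one partition from it. The layout (one data file per map task plus a list of byte offsets) is modeled loosely on Spark's sort-based shuffle. The function names, the file name, and the use of `pickle` are assumptions for illustration, and the network fetch between executors is left out.

```python
import os
import pickle
import tempfile


def write_map_output(buckets, directory, map_id):
    # buckets[i] holds the rows destined for reduce partition i.
    # All partitions go into one local file; offsets mark the boundaries.
    data_path = os.path.join(directory, f"shuffle_map_{map_id}.data")
    offsets = [0]
    with open(data_path, "wb") as f:
        for rows in buckets:
            f.write(pickle.dumps(rows))
            offsets.append(f.tell())
    return data_path, offsets


def read_partition(data_path, offsets, partition):
    # The reading side loads only the byte range for its own partition.
    start, end = offsets[partition], offsets[partition + 1]
    with open(data_path, "rb") as f:
        f.seek(start)
        return pickle.loads(f.read(end - start))


with tempfile.TemporaryDirectory() as tmp:
    buckets = [[("a", 1)], [], [("b", 2), ("c", 3)]]
    path, offsets = write_map_output(buckets, tmp, map_id=0)
    part2 = read_partition(path, offsets, 2)
    part1 = read_partition(path, offsets, 1)
```

Even for a 100 MB shuffle, the data takes this file-based route. In practice the operating system's page cache may keep small files in RAM, so the physical disk is not necessarily the bottleneck.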
