Does Spark shuffle write all intermediate data to disk, or only that which will not fit in memory ("spill")?
In particular, if the intermediate data is small, will anything be written to disk, or will the shuffle be performed entirely using memory without writing anything to disk?
I've checked the docs and related StackOverflow questions, but they weren't clear on this precise question.
From an AWS guide (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/#:~:text=In%20Apache%20Spark%2C%20shuffling%20happens,which%20can%20cause%20straggling%20executors), though the same point is made elsewhere:
Apache Spark utilizes in-memory caching and optimized query execution for fast analytic queries against your datasets, which are split into multiple Spark partitions on different nodes so that you can process a large amount of data in parallel.
In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. During a shuffle, data is written to local disk and transferred across the network. The shuffle operation is often constrained by the available local disk capacity, or data skew, which can cause straggling executors.
That is to say, Spark's architecture writes mapper output to local disk for the reduce-phase tasks to consume. The size of the data does not matter: even a small shuffle is written to disk, not held purely in memory. "Spill" is a separate mechanism, referring to map-side or reduce-side buffers overflowing to disk *before* the shuffle write or during aggregation. I agree the docs are not clear on this precise point.