As per the documentation:
Shuffle spill (memory) is the size of the deserialized form of the shuffled data in memory.
Shuffle spill (disk) is the size of the serialized form of the data on disk.
My understanding of shuffle is this:
- Every executor takes all the partitions on it and hash-partitions them into 200 new partitions (this 200 can be changed). Each new partition is associated with an executor that it will later go to. For example, for each record in an existing partition: `new_partition = hash(partitioning_id) % 200; target_executor = new_partition % num_executors`, where `%` is the modulo operator and `num_executors` is the number of executors on the cluster.
- These new partitions are dumped onto the disk of their initial executors' nodes. Each new partition will later be read by its target_executor.
- Target executors pick up their respective new partitions (out of the 200 generated)
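The arithmetic in the steps above can be sketched as follows. This is only an illustration of the question's formula, not Spark's actual partitioner: the partition count of 200, `NUM_EXECUTORS`, and the use of Python's built-in `hash` are all placeholder assumptions.

```python
NUM_SHUFFLE_PARTITIONS = 200  # Spark's default shuffle partition count (configurable)
NUM_EXECUTORS = 8             # hypothetical cluster size

def route(partitioning_id):
    """Map a record's partitioning key to a shuffle partition and a target executor,
    following the formula from the question above."""
    new_partition = hash(partitioning_id) % NUM_SHUFFLE_PARTITIONS
    target_executor = new_partition % NUM_EXECUTORS
    return new_partition, target_executor

# Within one run, every record with the same key lands in the same shuffle
# partition, and therefore is read by the same target executor.
p1, e1 = route("user_42")
p2, e2 = route("user_42")
assert (p1, e1) == (p2, e2)
```

The point of the sketch is only that the mapping is deterministic per key, so all records sharing a key end up co-located after the shuffle.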
Is my understanding of the shuffle operation correct?
Can you help me put the definitions of shuffle spill (memory) and shuffle spill (disk) in the context of the shuffle mechanism (the one described above, if it is correct)? For example (maybe): "shuffle spill (disk) is what happens in point 2 above, where the 200 partitions are dumped to the disk of their respective nodes" (I do not know whether that is correct to say; it is just an example).
Let's take a look at the documentation, where we can find this:
Shuffle read: this is what your executor loads into memory when stage processing starts; you can think of it as the shuffle files prepared in the previous stage by other executors.
Shuffle write: this is the size of your stage's output, which may be picked up by the next stage for processing; in other words, it is the size of the shuffle files this stage created.
And now, what is shuffle spill?
Shuffle spill happens when your executor is reading shuffle files but they cannot fit into the executor's execution memory. When this happens, some chunk of data is removed from memory and written to disk (it is spilled to disk, in other words).
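As a toy illustration of that idea (not Spark's actual implementation), spilling can be modeled as a buffer with a record cap: when the cap is reached, the buffer's contents are serialized to disk and the in-memory copy is freed. The class name and the cap are invented for this sketch.

```python
import os
import pickle
import tempfile

class SpillingBuffer:
    """Toy in-memory buffer that spills to disk when it holds too many records.
    This only mimics the idea of spilling; Spark's real mechanism differs."""

    def __init__(self, max_records):
        self.max_records = max_records
        self.records = []       # deserialized form, lives in memory
        self.spill_files = []   # serialized form, lives on disk

    def add(self, record):
        self.records.append(record)
        if len(self.records) >= self.max_records:
            self._spill()

    def _spill(self):
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.records, f)   # serialized form goes to disk
        self.spill_files.append(path)
        self.records = []                  # free the deserialized form in memory

buf = SpillingBuffer(max_records=100)
for i in range(250):
    buf.add(("key%d" % i, i))
# 250 records with a cap of 100 -> two spill files, 50 records still in memory
```

The same pattern, scaled up, is why a stage can finish even when its shuffle data exceeds execution memory: progress continues at the cost of extra disk I/O.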
Moving back to your question: what is the difference between spill (memory) and spill (disk)? They describe exactly the same chunk of data. The first metric describes the space that the spilled data occupied in memory before it was moved to disk; the second describes its size once written to disk. The two metrics may differ because data may be represented differently on disk; for example, it may be compressed.
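A small demonstration of why the two numbers differ, using pickle and zlib as stand-ins for Spark's serializer and compression codec (which are different in practice): the same records take more space as live in-memory objects than as a compressed serialized blob.

```python
import sys
import pickle
import zlib

# Some repetitive shuffle-like (key, value) records
records = [("key_%d" % (i % 10), i % 10) for i in range(10_000)]

# Rough "spill (memory)" analogue: size of the deserialized objects in memory
in_memory_bytes = sys.getsizeof(records) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in records
)

# Rough "spill (disk)" analogue: serialized and compressed, as written to disk
on_disk_bytes = len(zlib.compress(pickle.dumps(records)))

print(in_memory_bytes, on_disk_bytes)
assert on_disk_bytes < in_memory_bytes  # compressed serialized form is smaller
```

This is why it is normal to see spill (disk) report a much smaller number than spill (memory) for the same spill event.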
If you want to read more:
- Cloudera questions
- Medium 1
- Medium 2