When upgrading from Spark 2.3 to Spark 2.4.3, I saw a 20-30% increase in the amount of shuffle disk spill generated by one of my stages. The same code is being executed in both environments, and all configurations are identical between them.
Run .explain(false) on both 2.4.3 and 2.3.0, and dump the configs used in both. There have been changes to the optimization rules between those releases. Also, where are you running Spark? A dirty secret is that many Spark providers have been customizing and improving Spark under the hood, so I suspect there is more going on than you realize.
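To make that comparison concrete, here is a minimal pure-Python sketch for diffing the two captures. It assumes you have already saved the output of df.explain(false) (and of spark.sparkContext.getConf().getAll()) as text on each cluster; the plan strings and file labels below are illustrative stand-ins, not real output from your job.

```python
# Sketch: diff physical plans (or config dumps) captured from two clusters.
# Assumes you saved the output of df.explain(false) and
# spark.sparkContext.getConf().getAll() to text on each cluster;
# the labels and sample plan strings here are illustrative.
import difflib


def diff_captures(old_text: str, new_text: str, label: str) -> list:
    """Return unified-diff lines between two captured text dumps."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"2.3.0/{label}",
        tofile=f"2.4.3/{label}",
        lineterm="",
    ))


# Inline stand-ins for the two saved plan files:
plan_230 = ("*(2) HashAggregate(keys=[k], functions=[sum(v)])\n"
            "+- Exchange hashpartitioning(k, 200)")
plan_243 = ("*(2) HashAggregate(keys=[k], functions=[sum(v)])\n"
            "+- Exchange hashpartitioning(k, 2000)")

for line in diff_captures(plan_230, plan_243, "plan.txt"):
    print(line)
```

Any line prefixed with `+` or `-` in the output marks a plan node or config key that changed between the two versions and is worth investigating first.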