How to clear Spark temporary shuffle files between stages to avoid "no space left on device" error?


I am running a Spark job on an AWS EMR 6.6 cluster (Spark 3.2.0), and it seems that Spark is writing a lot of data to disk. I always thought Spark was all in-memory, but it appears it writes temporary files to disk each time there is a wide transformation, i.e. a shuffle between stages (I am not sure why). However, this is really only an issue because these temp files don't get deleted between stages.

From my understanding, the temp files from one stage are read by the next stage, but I don't think they should be needed by the stage after that. So if my job has 3 stages, I should be able to delete the temporary shuffle files created by stage 1 once stage 2 has finished reading them, before stage 3 runs.
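For concreteness, here is a stripped-down sketch of the shape of job I mean (the bucket paths and column names are placeholders, not my real job); each wide operation introduces a stage boundary and its own set of shuffle temp files:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-temp-example").getOrCreate()

# read + aggregate: the groupBy is wide shuffle #1 (first stage boundary)
events = spark.read.parquet("s3://my-bucket/events/")
per_user = events.groupBy("user_id").agg(F.count("*").alias("event_count"))

# join against another table: wide shuffle #2 (second stage boundary)
users = spark.read.parquet("s3://my-bucket/users/")
joined = per_user.join(users, "user_id")

# final stage: write the result
joined.write.mode("overwrite").parquet("s3://my-bucket/output/")
```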

I believe this would resolve my problem, since it means my local storage would hold at most 2 (sequential) stages' worth of shuffle temp data at any one time. However, I can't seem to find any way to do this.
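For what it's worth, the only lever I've come across is cutting lineage so the driver's ContextCleaner can reclaim old shuffle files, roughly like the sketch below (the checkpoint path is a placeholder, and I'm not sure this actually frees the files from already-completed stages):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-cleanup-attempt")
    # make the driver trigger GC (and therefore ContextCleaner runs) more often;
    # the default interval is 30min
    .config("spark.cleaner.periodicGC.interval", "5min")
    .getOrCreate()
)

# checkpointing needs a reliable directory (placeholder path here)
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints/")

# ...build per_user as in the sketch above, then cut the lineage after the
# first wide shuffle:
# per_user = per_user.checkpoint()  # materializes the data; once nothing on the
#                                   # driver references the old shuffle, the
#                                   # ContextCleaner should be able to drop its files
```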

I know I can just increase the EBS storage or use AWS Glue, but I'd like to avoid that.
