I am trying to convert a Spark DataFrame into a pandas DataFrame. I have a sufficiently large driver. I am trying to set the spark.driver.maxResultSize value like this:
spark = (
    SparkSession
    .builder
    .appName('test')
    .enableHiveSupport()
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.driver.maxResultSize", "0")
    .getOrCreate()
)
But the job is failing with the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of XXXX tasks (1026.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
In Spark 3.0.0, your error is triggered by this piece of code:
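(A sketch from memory of TaskSetManager.canFetchMoreResults in the Spark 3.0.0 source, lightly abridged; check the actual file for the exact wording. Note how it builds exactly the message you see:)

def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
  totalResultSize += size
  calculatedTasks += 1
  // The limit is only enforced when maxResultSize is strictly positive.
  if (maxResultSize > 0 && totalResultSize > maxResultSize) {
    val msg = s"Total size of serialized results of ${calculatedTasks} tasks " +
      s"(${Utils.bytesToString(totalResultSize)}) is bigger than " +
      s"${config.MAX_RESULT_SIZE.key} (${Utils.bytesToString(maxResultSize)})"
    logError(msg)
    abort(msg)
    false
  } else {
    true
  }
}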
As you can see, if maxResultSize == 0 you will never get the error you're getting. A bit higher up, you see that maxResultSize comes from config.MAX_RESULT_SIZE. And that one is finally defined by spark.driver.maxResultSize in this piece of code:
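(Again a sketch from memory, abridged from the config definitions in org.apache.spark.internal.config for Spark 3.0.0; the doc string is truncated here:)

private[spark] val MAX_RESULT_SIZE = ConfigBuilder("spark.driver.maxResultSize")
  .doc("Size limit for total result size of all partitions in each action " +
    "(e.g. collect) ...")
  .bytesConf(ByteUnit.BYTE)
  // This default is where your 1024.0 MiB comes from.
  .createWithDefaultString("1g")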
Conclusion

You are trying the correct thing! Having spark.driver.maxResultSize equal to 0 also works in Spark 3.0. As you can see in your error message, though, your config.MAX_RESULT_SIZE is still equal to the default value of 1024 MiB. This means that your configuration is probably not coming through. I would investigate your whole setup: how are you submitting your application? What is your master? Is your spark.sql.execution.arrow.pyspark.enabled config coming through?
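One common cause: if a SparkSession (and its SparkContext) already exists when your code runs, for example in a notebook, getOrCreate() returns the existing one and driver settings like spark.driver.maxResultSize are silently ignored. A minimal PySpark sketch to check what the driver actually received, assuming you can run this in the same session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the setting back from the driver's SparkConf.
# "0" means your config came through; if this prints "not set" (or some
# value from spark-defaults.conf), your .config(...) never reached the driver.
print(spark.sparkContext.getConf().get("spark.driver.maxResultSize", "not set"))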