I have a Spark DataFrame with 5 million rows and 250 columns. When I run toPandas() on this DataFrame with "spark.sql.execution.arrow.enabled" set to "true", it returns an empty pandas DataFrame with just the column names.
With PyArrow disabled I get the error below:
Py4JJavaError: An error occurred while calling o124.collectToPython. : java.lang.OutOfMemoryError: GC overhead limit exceeded
Is there any way to perform this operation, perhaps by increasing some memory allocation?
I couldn't find any online resources for this except https://issues.apache.org/jira/browse/SPARK-28881, which is not that helpful.
OK, this problem is memory-related: converting a Spark DataFrame to a pandas DataFrame forces Spark (via Py4J) to collect() all rows to the driver, which consumes a lot of memory. So what I advise you to do is reconfigure the memory when creating the Spark session.
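A minimal sketch of that reconfiguration, assuming `spark_df` is your existing DataFrame; the sizes (8g, 4g) are placeholders you should tune to your machine and cluster:

```python
from pyspark.sql import SparkSession

# Raise driver memory (collect() materializes all rows on the driver)
# and the cap on the collected result size before converting to pandas.
spark = (
    SparkSession.builder
    .appName("to_pandas_conversion")
    .config("spark.driver.memory", "8g")         # driver heap, where collected rows land
    .config("spark.executor.memory", "4g")       # heap per executor
    .config("spark.driver.maxResultSize", "4g")  # limit on total collected result size
    .getOrCreate()
)

pdf = spark_df.toPandas()  # spark_df: your existing Spark DataFrame
```

Note that these settings must be in place before the JVM starts, so set them when the session is first created (or via spark-submit flags), not on an already-running session.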
Proceed with sc (SparkContext) or with a SparkSession, as you like. If this does not work, it may be a version conflict, so please check these options:
- Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment where you are running Python.
- Downgrade to pyarrow < 0.15.0 for now.
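For the first option, one way (a sketch, not the only one) is to set the variable from Python itself, before any Spark or Arrow work happens in the process:

```python
import os

# Work around the Arrow 0.15 IPC format change (see SPARK-28881):
# this must be set before Spark starts serializing data with Arrow.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```

If your job runs Python code on executors as well (pandas UDFs, for example), the variable also needs to reach the executor processes, e.g. via `spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1` or in `conf/spark-env.sh`.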
If you can share your script, it will be clearer.