toPandas() using PyArrow returns an empty DataFrame


I have a Spark DataFrame with 5 million rows and 250 columns. When I convert it with toPandas() and "spark.sql.execution.arrow.enabled" set to "true", it returns an empty pandas DataFrame with just the column names.
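
For reference, the conversion is essentially this (a sketch; spark and sdf stand in for my actual session and DataFrame):

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = sdf.toPandas()  # comes back with all 250 columns but zero rows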

With PyArrow disabled, I get the error below:

Py4JJavaError: An error occurred while calling o124.collectToPython. : java.lang.OutOfMemoryError: GC overhead limit exceeded

Is there any way to perform this operation by increasing some sort of memory allocation?

I couldn't find any online resources for this except https://issues.apache.org/jira/browse/SPARK-28881, which is not that helpful.

1 Answer

OK, this problem is related to memory: to convert a Spark DataFrame to a pandas DataFrame, Spark (via Py4J) has to collect() all the rows to the driver, which consumes a lot of memory. So what I advise is to reconfigure the memory when you create the Spark context. Here is an example:

from pyspark import SparkContext

# Raise the memory available to Spark; this must run before the context
# (and its JVM) is created, or the setting is ignored.
SparkContext.setSystemProperty('spark.executor.memory', '16g')
sc = SparkContext("local", "stack_over_flow")
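
Below is a sketch of the same idea with the SparkSession API (the app name and table are illustrative). One extra note: toPandas() materializes all rows in the driver JVM, so in practice spark.driver.memory is often the setting that relieves this particular "GC overhead limit exceeded" error; like above, it only takes effect if applied before the JVM starts.

from pyspark.sql import SparkSession

# Sketch: equivalent configuration via SparkSession. Raising
# spark.driver.memory helps because collectToPython runs on the driver.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("stack_over_flow")
    .config("spark.driver.memory", "16g")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)

pdf = spark.table("my_table").toPandas()  # illustrative usage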

Proceed with sc (the SparkContext) or with a SparkSession, as you like. If this does not work, it may be a version conflict between Spark and pyarrow, so please check these options:
-Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 in the environment where you run Python (see the sketch after this list).
-Or downgrade to pyarrow < 0.15.0 for now.
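
For the first option, here is one way to set that variable from Python (just a sketch; depending on your deployment you may need to put it in conf/spark-env.sh instead so the executors pick it up):

import os
from pyspark.sql import SparkSession

# Workaround for the known pyarrow >= 0.15.0 incompatibility with Spark 2.x:
# force the pre-0.15 Arrow IPC format on the driver and the executors.
# Must be set before the session is created.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)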
If you can share your script, it will be clearer.