Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks-connect


I'm trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). This works fine (about 11 seconds) in an Azure Databricks notebook, but I run into a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code via databricks-connect (the databricks-connect version and the Databricks Runtime version match; both are 7.1).
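
Roughly, the code is just this (a minimal sketch; the table name is a placeholder, not my actual source):

```python
from pyspark.sql import SparkSession

# With databricks-connect this returns a session backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

# Placeholder source; the real DataFrame has ~2,734,984 rows and 11 columns.
df = spark.table("my_database.my_table")

# Finishes in ~11 s inside a Databricks notebook, but throws the heap-space
# error below when run through databricks-connect.
pdf = df.toPandas()
```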

I already increased the Spark driver memory (spark.driver.memory, 100g) and spark.driver.maxResultSize (15g). I suspect the error lies somewhere in databricks-connect, because I cannot reproduce it in the notebooks.
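
For reference, these are the two properties I raised (shown in spark-defaults.conf syntax; in my setup they were applied to the cluster's Spark config):

```
spark.driver.memory 100g
spark.driver.maxResultSize 15g
```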

Any hint what's going on here?

The error is the following:

```
Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
    at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
    at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
    at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
    at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
    at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
    at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
    at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
    at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
    at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
    at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
    at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
    at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
```

There is 1 best solution below

BEST ANSWER

This is likely because databricks-connect executes toPandas() on the client machine, which can then run out of memory. You can increase the local driver memory by setting spark.driver.memory in the (local) config file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained with databricks-connect get-spark-home.
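
For example, roughly (the 8g value is only illustrative; size it to what toPandas() actually has to hold on the client):

```
# Locate the local Spark home that databricks-connect uses
databricks-connect get-spark-home

# Then add (or raise) this line in ${spark_home}/conf/spark-defaults.conf
spark.driver.memory 8g
```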