Collecting a table using sparklyr in Databricks


I have a Parquet table with approximately 5 billion rows. After all manipulations with sparklyr it is reduced to 1,880,573 rows and 629 columns. When I try to collect it for factor analysis using sdf_collect(), I get this memory error:

Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 5.0 KB

Is 1,880,573 rows x 629 columns too big for sparklyr to collect? Further, checking the number of rows using data %>% dplyr::count() took 9 minutes - how do I reduce this time?

2 Answers

Answer 1:

Yes, 1,880,573 rows x 629 columns is too big to collect. This isn't just a sparklyr problem: at the roughly 5 KB per row reported in the error, the collected result is around 9 GB, well over the 4 GB spark.driver.maxResultSize, and your local R session would also struggle to hold that much in memory.

As for the count(), 9 minutes isn't that long when you're working with data of this size. One thing you could try is restricting the count to a single column: data %>% select(one_var) %>% count(). Beyond that, I don't believe there is a way to dramatically speed it up other than increasing your spark session parameters (e.g. the number of executors); a sketch of both follows below.
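A minimal sketch of what that could look like; the memory sizes, executor count and DBFS path are illustrative, not recommendations, and on Databricks the driver settings normally have to be set in the cluster's Spark config rather than at connect time:

    # Illustrative values only. On Databricks, driver settings are usually set in
    # the cluster's Spark config UI; spark_config() applies when you create the
    # connection yourself.
    library(sparklyr)
    library(dplyr)

    conf <- spark_config()
    conf$`spark.driver.maxResultSize` <- "8g"   # lift the 4 GB cap from the error
    conf$`spark.driver.memory`        <- "16g"  # driver has to hold the collected rows
    conf$`spark.executor.instances`   <- 8      # more executors for the scans

    sc <- spark_connect(method = "databricks", config = conf)

    # Counting a single column instead of the whole row set
    data <- spark_read_parquet(sc, name = "data", path = "dbfs:/path/to/table")  # hypothetical path
    data %>% select(one_var) %>% count()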

I'd suggest doing the factor analysis in spark if you can, or using a smaller sample of your data, as sketched below.
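A sketch of the sampling route; the fraction, seed and factor count are illustrative, and it assumes your 629 columns are numeric. factanal() here is base R, since as far as I know Spark MLlib has no factor analysis routine, only related decompositions such as ml_pca():

    # Minimal sketch, assuming a 10% sample is acceptable for the factor analysis
    # and that the columns are numeric. Fraction, seed and factor count are illustrative.
    library(sparklyr)
    library(dplyr)

    sampled_local <- data %>%
      sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%
      collect()   # ~188k rows x 629 cols is far easier to bring into R

    fa <- factanal(sampled_local, factors = 5)

    # To stay distributed instead, a related (but not identical) decomposition:
    # pca <- ml_pca(data, k = 5)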

Answer 2:

Export the spark dataframe to disk using sparklyr::spark_write_* and then read it into your R session.

Parquet is a good choice because it reads and writes quickly and compactly. Repartitioning the spark dataframe into one partition with sparklyr::sdf_repartition() (or sdf_coalesce()) before the write produces a single file, which is easier to read into R than multiple files followed by a row-binding step.
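A sketch of that route; the DBFS path is hypothetical, and it assumes the /dbfs/ FUSE mount that Databricks exposes to the local R process. arrow::read_parquet() reads the file without going through Spark:

    # Minimal sketch; the DBFS path is hypothetical.
    library(sparklyr)
    library(dplyr)

    data %>%
      sdf_repartition(partitions = 1) %>%   # sdf_coalesce(1) avoids a full shuffle
      spark_write_parquet(path = "dbfs:/tmp/reduced_table", mode = "overwrite")

    # Read back into the local R session via the /dbfs FUSE mount.
    part_file <- list.files("/dbfs/tmp/reduced_table",
                            pattern = "\\.parquet$", full.names = TRUE)
    local_df  <- arrow::read_parquet(part_file[[1]])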

It's advisable not to collect a 'large' dataframe (what counts as large depends on your spark configuration and RAM) with the collect function, since it brings all of the data to the driver node.