Collecting a table using sparklyr in Databricks


I have a Parquet table with approximately 5 billion rows. After all manipulations with sparklyr it is reduced to 1,880,573 rows and 629 columns. When I try to collect it for factor analysis using sdf_collect(), I get this memory error:

Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 5.0 KB

Is 1,880,573 rows x 629 columns too big for sparklyr to collect? Further, checking the number of rows using data %>% dplyr::count() took 9 minutes - how do I reduce this time?

2 Answers

Answer 1:

Yes, 1,880,573 rows x 629 columns is too big to collect. This isn't just a sparklyr problem: at the roughly 5 KB per row reported in the error, the collected result is around 9 GB, well over the 4 GB spark.driver.maxResultSize, and your local R session would also struggle to hold that much in memory.

As for the count(), 9 minutes isn't that long when you're working with data of this size. One thing you could try is restricting the count to a single column: data %>% select(one_var) %>% count(). Beyond that, I don't believe there is a way to dramatically speed it up other than increasing your spark session parameters (e.g. the number of executors); a sketch of both follows below.
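A minimal sketch of what that could look like; the memory sizes, executor count and DBFS path are illustrative, not recommendations, and on Databricks the driver settings normally have to be set in the cluster's Spark config rather than at connect time:

    # Illustrative values only. On Databricks, driver settings are usually set in
    # the cluster's Spark config UI; spark_config() applies when you create the
    # connection yourself.
    library(sparklyr)
    library(dplyr)

    conf <- spark_config()
    conf$`spark.driver.maxResultSize` <- "8g"   # lift the 4 GB cap from the error
    conf$`spark.driver.memory`        <- "16g"  # driver has to hold the collected rows
    conf$`spark.executor.instances`   <- 8      # more executors for the scans

    sc <- spark_connect(method = "databricks", config = conf)

    # Counting a single column instead of the whole row set
    data <- spark_read_parquet(sc, name = "data", path = "dbfs:/path/to/table")  # hypothetical path
    data %>% select(one_var) %>% count()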

I'd suggest doing the factor analysis in spark if you can, or using a smaller sample of your data, as sketched below.
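A sketch of the sampling route; the fraction, seed and factor count are illustrative, and it assumes your 629 columns are numeric. factanal() here is base R, since as far as I know Spark MLlib has no factor analysis routine, only related decompositions such as ml_pca():

    # Minimal sketch, assuming a 10% sample is acceptable for the factor analysis
    # and that the columns are numeric. Fraction, seed and factor count are illustrative.
    library(sparklyr)
    library(dplyr)

    sampled_local <- data %>%
      sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%
      collect()   # ~188k rows x 629 cols is far easier to bring into R

    fa <- factanal(sampled_local, factors = 5)

    # To stay distributed instead, a related (but not identical) decomposition:
    # pca <- ml_pca(data, k = 5)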

Answer 2:

Export the spark dataframe to disk using sparklyr::spark_write_* and then read it into your R session.

Parquet is a good choice because it reads and writes quickly and compactly. Repartitioning the spark dataframe into one partition with sparklyr::sdf_repartition() (or sdf_coalesce()) before the write produces a single file, which is easier to read into R than multiple files followed by a row-binding step.
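A sketch of that route; the DBFS path is hypothetical, and it assumes the /dbfs/ FUSE mount that Databricks exposes to the local R process. arrow::read_parquet() reads the file without going through Spark:

    # Minimal sketch; the DBFS path is hypothetical.
    library(sparklyr)
    library(dplyr)

    data %>%
      sdf_repartition(partitions = 1) %>%   # sdf_coalesce(1) avoids a full shuffle
      spark_write_parquet(path = "dbfs:/tmp/reduced_table", mode = "overwrite")

    # Read back into the local R session via the /dbfs FUSE mount.
    part_file <- list.files("/dbfs/tmp/reduced_table",
                            pattern = "\\.parquet$", full.names = TRUE)
    local_df  <- arrow::read_parquet(part_file[[1]])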

It's advisable not to collect a 'large' dataframe (what counts as large depends on your spark configuration and RAM) with the collect function, since it brings all of the data to the driver node.