I have a parquet table with approximately 5 billion rows. After all manipulations using sparklyr it is reduced to 1,880,573 rows and 629 columns. When I try to collect this for Factor Analysis using sdf_collect(), it gives me this memory error:

Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 5.0 KB
Is 1,880,573 rows x 629 columns too big for sparklyr to collect? Further, checking the number of rows using data %>% dplyr::count() took 9 minutes - how do I reduce this time?
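For reference, the setup looks roughly like this (the connection master, path, and table name are placeholders, not my actual code):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")   # placeholder master

# ~5 billion row parquet table
data <- spark_read_parquet(sc, name = "my_table", path = "hdfs:///path/to/table")

# ... filters, joins, and feature engineering with dplyr verbs ...
# result: a Spark DataFrame with 1,880,573 rows and 629 columns

local_data <- sdf_collect(data)   # fails with the maxResultSize error above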
Yes, 1,880,573 rows and 629 columns is too big. This isn't just a sparklyr problem: your R instance will have a lot of trouble collecting this into local memory.

As for the count(), 9 minutes isn't THAT long when you're working with data of this size. One thing you could try is to reduce the count to only one variable:
data %>% select(one_var) %>% count()
That being said, I don't believe there is a way to dramatically speed up this time other than increasing your Spark session parameters (i.e. the number of executors). I'd suggest doing the factor analysis in Spark if you can, or using a smaller sample of your data.
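If you do need the data locally, here is a rough sketch of both ideas: reconnecting with a bigger driver and more executors, or collecting only a sample. The memory sizes, executor count, and sampling fraction below are illustrative placeholders, not recommendations for your cluster.

# Reconnect with more driver memory, a larger result-size limit, and more executors
config <- spark_config()
config$spark.driver.memory        <- "16g"   # illustrative value
config$spark.driver.maxResultSize <- "8g"    # default is 4g, which your collect exceeded
config$spark.executor.instances   <- 8       # illustrative value
sc <- spark_connect(master = "yarn", config = config)

# Or: collect only a sample of rows for the factor analysis
sampled <- data %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%
  collect()

If the analysis can stay in Spark, sparklyr's ml_pca() is one option for dimensionality reduction on the full table, though PCA is not identical to factor analysis.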