R optimParallel uses huge amounts of RAM


On my (large) server (Windows, 255 GB RAM) my optimParallel script runs out of memory and then crashes with Error in serialize(data, node$con) : error writing to connection. I would understand this if the data were huge and each node allocated its own copy of it, but that isn't the case.

The data isn't huge (some 2 million rows) and takes about 600 MB of RAM once loaded. With a slightly smaller data set the program worked just fine. I appreciate any help!

Here is the data set: data

and here is my script:

library(data.table)

# Objective function: join the per-vehicle multipliers in `par` onto the routes,
# sum them per edge, and compare the modelled counts (Nmodel) to the detector
# counts (Ndets) with the GEH statistic.
sampler <- function(par, veh_id, routt, ccloops) {
  veh_id[, multi := par]
  routt[veh_id, par := multi, on = .(vehicle_id)]
  sumrout <- routt[, sum(par), .(edge_id, Ndets)]             # computed but never used below
  sumdet  <- routt[, .(Nmodel = sum(par)), .(edge_id, Ndets)]
  routt[, par := NULL]
  geh_inside_cc  <- sumdet[Ndets > 0 & edge_id %in% ccloops$edge_id,
                           mean(sqrt(2 * (Ndets - Nmodel)^2 / (Ndets + Nmodel)))]
  geh_outside_cc <- sumdet[Ndets > 0 & !(edge_id %in% ccloops$edge_id),
                           mean(sqrt(2 * (Ndets - Nmodel)^2 / (Ndets + Nmodel)))]
  # weight geh_inside_cc a bit higher
  return(2 * geh_inside_cc + geh_outside_cc)
}

routt   <- fread("routt.csv")
veh_id  <- fread("veh_id.csv")
ccloops <- fread("ccloops.csv")

library(optimParallel)
cl0 <- makeCluster(5)                     # set the number of processor cores
# registerDoParallel(cl <- makeCluster(2))
setDefaultCluster(cl = cl0)               # set 'cl0' as the default cluster
clusterEvalQ(cl0, library("data.table"))  # load data.table on every worker
opt <- optimParallel(par = rep(1, nrow(veh_id)), fn = sampler,
                     veh_id = veh_id, routt = routt, ccloops = ccloops,
                     lower = 0, upper = 10000,
                     parallel = list(loginfo = TRUE, cl = cl0),
                     control = list(maxit = 5))
stopCluster(cl0)

R version: 4.1, optimParallel version: 1.0-2

1 Answer

So I tested your case with dummy data. First of all, I had to add the forward argument inside the parallel list, because getOption("optimParallel.forward") is NULL for me.
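
As a sketch (not from the original post), here is the question's call with forward passed explicitly in the parallel list, so optimParallel does not have to fall back on getOption("optimParallel.forward"):

# forward = FALSE keeps the central-difference gradient approximation
# (the package default); forward = TRUE switches to forward differences,
# which roughly halves the number of sampler() evaluations per gradient.
opt <- optimParallel(par = rep(1, nrow(veh_id)), fn = sampler,
                     veh_id = veh_id, routt = routt, ccloops = ccloops,
                     lower = 0, upper = 10000,
                     parallel = list(cl = cl0, forward = FALSE, loginfo = TRUE),
                     control = list(maxit = 5))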

Under multisession processing you generally have to perform the following additional steps: create a PSOCK cluster, register the cluster if desired, load the necessary packages on the cluster workers, and export the necessary data and functions to the global environment of the cluster workers. So the data, and with it the computation requirements (RAM included), is cloned/copied across all sessions. Dividing the available memory among the sessions (the five workers plus the main session), each one might have less than about 40 GB to work with.
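
A sketch of those steps, reusing the object names from the question. optimParallel ships the extra fn arguments to the workers on its own (that is where the serialize(data, node$con) call in the error comes from), so the explicit clusterExport() here is mainly to make the duplication visible:

library(parallel)
cl0 <- makeCluster(5)                        # 1. create a PSOCK cluster
setDefaultCluster(cl = cl0)                  # 2. register it as the default
clusterEvalQ(cl0, library(data.table))       # 3. load packages on every worker
clusterExport(cl0, c("sampler", "veh_id",    # 4. copy functions and data into
                     "routt", "ccloops"))    #    each worker's global env
# every worker now holds its own copy of the data, so the memory budget is
# roughly 255 GB / (5 workers + 1 main session), i.e. around 40 GB per session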

Moreover, the size of the dataset is not all the RAM an algorithm consumes. For example, an lm linear regression can need roughly ten times more memory than the dataset itself. The memory demands of many algorithms also grow very quickly, often far faster than linearly, with the number of variables (here, the parameters: one per row of veh_id), so this could be another problem.
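
A quick way to see where the memory actually goes (a sketch, assuming the cluster cl0 from the question is still running):

# size of a single copy of each object in the main session
format(object.size(routt),   units = "MB")
format(object.size(veh_id),  units = "MB")
format(object.size(ccloops), units = "MB")
# peak memory used inside each worker so far ("max used" column of gc())
clusterEvalQ(cl0, gc())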