On my (large) server (Windows, 255 GB RAM) my optimParallel script runs out of memory and then crashes with `Error in serialize(data, node$con) : error writing to connection`.
I would understand it if the data were huge and each node allocated its own copy, but that isn't the case. The data set isn't huge (some 2 million rows) and takes about 600 MB of RAM when loaded. With a slightly smaller data set the program worked just fine. I appreciate any help!
Here is the data set: data
and here is my script:
library(data.table)
sampler <- function(par, veh_id, routt, ccloops) {
  # attach the candidate parameter vector to the vehicles and join it onto the routes
  veh_id[, multi := par]
  routt[veh_id, par := multi, on = .(vehicle_id)]
  sumrout <- routt[, sum(par), .(edge_id, Ndets)]  # note: computed but unused below
  # modelled counts per edge
  sumdet <- routt[, .(Nmodel = sum(par)), .(edge_id, Ndets)]
  routt[, par := NULL]
  # GEH statistic, split by whether the edge has a counting loop
  geh_inside_cc  <- sumdet[Ndets > 0 & edge_id %in% ccloops$edge_id,
                           mean(sqrt(2 * (Ndets - Nmodel)^2 / (Ndets + Nmodel)))]
  geh_outside_cc <- sumdet[Ndets > 0 & !(edge_id %in% ccloops$edge_id),
                           mean(sqrt(2 * (Ndets - Nmodel)^2 / (Ndets + Nmodel)))]
  # weight geh_inside_cc a bit higher
  2 * geh_inside_cc + geh_outside_cc
}
routt <- fread("routt.csv")
veh_id <- fread("veh_id.csv")
ccloops <- fread("ccloops.csv")
library(optimParallel)
cl0 <- makeCluster(5) # set the number of processor cores
# registerDoParallel(cl <- makeCluster(2))
setDefaultCluster(cl = cl0) # set 'cl0' as the default cluster
clusterEvalQ(cl0, library("data.table"))
opt <- optimParallel(par = rep(1, nrow(veh_id)), fn = sampler,
                     veh_id = veh_id, routt = routt, ccloops = ccloops,
                     lower = 0, upper = 10000,
                     parallel = list(loginfo = TRUE, cl = cl0),
                     control = list(maxit = 5))
stopCluster(cl0)
R version: 4.1, optimParallel version: 1.0-2
So I tested your case with dummy data. First of all I had to add a `forward` argument inside the `parallel` list (`getOption("optimParallel.forward")` is `NULL` for me).

Under multisession processing you generally have to perform these additional steps: create a PSOCK cluster, register it if desired, load the necessary packages on the cluster workers, and export the necessary data and functions to the global environment of the workers. So the data is cloned/copied across all sessions, and so is everything else the calculation needs (RAM included). The memory actually available to each session (worker) might therefore be well under 40 GB (roughly 255 GB split across the five workers plus the main session).
Moreover, the size of the data set is not all the RAM an algorithm consumes. For example, a linear regression with `lm` might require roughly ten times more memory than the data set itself. The memory and compute requirements of many algorithms also grow rapidly, often much faster than linearly, with the number of variables (here: parameters), so this could be another problem.
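To get a feel for that overhead, one can compare the size of a data set with the size of the corresponding fitted `lm` object (the exact ratio varies with R version and data):

```r
# Illustrative only: a fitted lm object stores the model frame, the QR
# decomposition, residuals, fitted values, effects, etc., so it is
# considerably larger than the raw data it was fitted on.
set.seed(1)
n <- 1e5
df <- data.frame(x = rnorm(n), y = rnorm(n))
fit <- lm(y ~ x, data = df)

print(object.size(df),  units = "MB")
print(object.size(fit), units = "MB")
```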