%dopar% is very slow when using iterators, compared to loading the data from a text file


%dopar% is phenomenally slower when using iterators, compared to loading the data from a flat file. I've compared 3 use cases:

  • using %dopar% without any iterators dopar1
  • using %dopar% with iterators dopar2
  • using %dopar% when loading data from flat file dopar3

Now, calculating dopar3 is about 2 times faster (~0.5 seconds) than calculating dopar1 or dopar2 (about 1.2 seconds); dopar1 and dopar2 perform similarly. I don't know whether foreach is converting the data frame in dopar1 into an iterator internally.

I was expecting dopar2 to outperform dopar3. Is it because dopar2 is still loading the entire dataset for each thread? If so, why aren't the iterators doing their job and preventing that from happening? Or have I used iterators wrongly? Any help is much appreciated.

(My system has 8 physical cores, and so 4 cores were used for this parallel processing - as per the code)
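For comparison, a sequential baseline is worth timing first (a sketch using the same data1 defined in the code below; sapply over a data frame applies the function to each column):

```r
# Sequential baseline: sum each column with no workers at all.
# With a task as cheap as sum(), this often beats %dopar%, because
# the parallel versions pay to serialize each column to a PSOCK worker.
system.time({
    seq1 = sapply(data1, sum)
})
```

If the sequential version is already fastest, the per-column work is too cheap to amortize the communication overhead, regardless of how the columns are iterated.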

library("foreach")
library("iterators")   # provides iter(), used in Case-2; foreach does not attach it
library("parallelly")
library("parallel")
library("doParallel")
library("data.table")

# Setting up and registering the cluster
cluster1 = makeCluster(ceiling(detectCores(logical=FALSE)/2), type="PSOCK", outfile="")
cluster1 = autoStopCluster(cluster1)
registerDoParallel(cluster1)

# Generating Data: 1,000,000 rows x 100 columns (~800 MB in memory)
data1 = as.data.frame(matrix(round(runif(100000000), 2), ncol=100))

# Storing Data to Flat File -- For Use Case - 3
data_loc = file.path("1.Basics", "ParallelDataset")
dir.create(data_loc, recursive=TRUE, showWarnings=FALSE)
for(var1 in 1:ncol(data1)) {
    fwrite(data1[,var1, drop=FALSE], file.path(data_loc,paste(var1, ".csv", sep="")))
}

# Case-1 Parallel execution Without Iterator
system.time({
    dopar1 = foreach(var1=data1) %dopar% {
        sum(var1)
    }
})

# Case-2 Parallel execution With Iterator
system.time({
    dopar2 = foreach(var1=iter(data1, by="col")) %dopar% {
        sum(var1)
    }
})

# Case-3 Parallel execution - Loading data from flatfile
system.time({
    dopar3 = foreach(file1=list.files(data_loc)) %dopar% {
        sum(data.table::fread(file.path(data_loc, file1)))
    }
})


stopCluster(cluster1)
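If per-task serialization is the bottleneck, one option (a sketch, not from the original post) is to pre-split the data frame into a few column blocks, so each worker receives one large payload instead of many small ones:

```r
# Split the columns into 4 blocks up front (base R only). Iterating over
# pre-built blocks means only each block is serialized to its worker;
# referencing data1 inside the loop body would instead auto-export the
# whole data frame to every worker.
col_chunks = split(seq_len(ncol(data1)),
                   cut(seq_len(ncol(data1)), 4, labels=FALSE))
blocks = lapply(col_chunks, function(idx) data1[, idx, drop=FALSE])

system.time({
    dopar4 = foreach(block=blocks, .combine=c) %dopar% {
        colSums(block)   # one task per block of ~25 columns
    }
})
```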

1 Answer

Answer by M.Viking

One driver is which OS your system is running. I'm not sure parallelly is adding anything, and instead of makeCluster I get faster results using registerDoParallel alone. Try this simplified approach in a clean R session:

library(doParallel)
library(data.table)
# note: data.table adds its own multithreading
##>>(data.table 1.14.4 using 2 threads (see ?getDTthreads).  Latest news: r-datatable.com)

registerDoParallel(cores=2) 

system.time({dopar1 = foreach(var1=data1) %dopar% {sum(var1)}})
#   user  system elapsed 
#  0.117   0.044   0.140 
    
system.time({dopar3 = foreach(file1=list.files(data_loc)) %dopar% {
               sum(data.table::fread(file.path(data_loc, file1)))}})
#   user  system elapsed 
#  0.032   0.031   1.600 

# An mclapply method (fork-based, so not available on Windows;
# iter() requires the iterators package to be attached):
system.time({mcpar1 = mclapply(iter(data1, by="col"), FUN=sum)})
#   user  system elapsed 
#  0.005   0.037   0.143 

stopImplicitCluster()
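Since data.table starts its own threads (the startup message noted above), pinning them explicitly avoids oversubscribing cores when combining it with foreach or mclapply workers; setDTthreads()/getDTthreads() are data.table's documented API for this:

```r
library(data.table)

# Limit data.table to 1 thread per process so its internal OpenMP
# threading doesn't compete with the parallel workers for cores.
setDTthreads(1)
getDTthreads()   # confirm the current setting
```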