I am currently fitting a set of models on a subset of data for each level of a factor variable. As the models take a long time to run, I use the foreach and doParallel packages to estimate the set of models for each level in parallel using %dopar%. To avoid memory issues, I pass only the relevant subset of data to each worker, using the isplit() function from the iterators package.
Now, my question is how to extend my code so that in the first iteration the models are estimated on the whole dataset, by passing the full dataset to one of the workers. In the subsequent iterations, I want to pass only a subset of the data to each worker and estimate the models, as before.
I illustrate my problem using some example data of the mtcars dataset below.
Suppose I want to calculate the average number of forward gears a car has (gear column) by the number of cylinders (cyl column), in parallel.
First, load the packages and import the data:
library(doParallel)
library(foreach)
library(iterators)
library(dplyr)
#get sample data to illustrate problem
data("mtcars")
df <- mtcars
df$cyl <- as.factor(df$cyl) #make cyl categorical
Next, iterate over each level of the cyl column and do the necessary calculations:
mycluster <- makeCluster(3)
registerDoParallel(mycluster)
result <- foreach(subset = isplit(df, df$cyl), .combine = "c", .packages = "dplyr") %dopar% {
  x <- summarise(subset$value, mean(gear, na.rm = T))
  return(x)
}
stopCluster(mycluster)
The result is a list containing the average number of gears for each category of number of cylinders.
> result
$`mean(gear, na.rm = T)`
[1] 4.090909
$`mean(gear, na.rm = T)`
[1] 3.857143
$`mean(gear, na.rm = T)`
[1] 3.285714
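For reference, each element produced by isplit() is a list with a value component (the subset of rows) and a key component (the factor level it belongs to), which is why the loop body above uses subset$value. A minimal sequential check, with no cluster needed, might look like this (the names it and chunk are just illustrative):

```r
library(iterators)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# Build the same iterator that foreach() consumes above
it <- isplit(df, df$cyl)

# Pull the first element by hand
chunk <- nextElem(it)

names(chunk)        # components of one element
unlist(chunk$key)   # factor level this subset belongs to
nrow(chunk$value)   # number of rows in the subset
```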
Now, what I want is to extend this code so that I have four iterations. In the first iteration, I want to pass the full dataset to the first worker and calculate the average number of gears for all cars in the whole dataset. Next, I want to pass the specific subsets of data for each level of cyl to the other workers and calculate the average number of gears, as shown above. So the only new thing is to add one iteration to the isplit() statement in which I pass the full dataset.
Expected output:
> result
$`mean(gear, na.rm = T)` #average number of gears across all cars in dataset
[1] 3.6875
$`mean(gear, na.rm = T)`
[1] 4.090909
$`mean(gear, na.rm = T)`
[1] 3.857143
$`mean(gear, na.rm = T)`
[1] 3.285714
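To make the idea concrete, here is a rough sketch of one possible approach, not necessarily the best one: wrap isplit() in a custom iterator that hands out the full data frame first and then falls through to the per-level subsets. The helper name ifull_then_split and the "all" key are hypothetical and not part of the iterators package. The sketch uses %do% and base mean() so that it runs without a cluster and returns a plain numeric vector rather than the named list above.

```r
library(foreach)
library(iterators)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# Hypothetical helper (not part of iterators): yields the full data frame
# first, then delegates to isplit() for the per-level subsets.
ifull_then_split <- function(data, f) {
  it <- isplit(data, f)
  first <- TRUE
  nextEl <- function() {
    if (first) {
      first <<- FALSE
      # Same shape as an isplit() element; "all" is an arbitrary label
      return(list(value = data, key = list("all")))
    }
    nextElem(it)  # signals 'StopIteration' once all levels are used up
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("abstractiter", "iter")
  obj
}

# Four iterations: the whole dataset, then one per level of cyl
result <- foreach(chunk = ifull_then_split(df, df$cyl), .combine = "c") %do% {
  mean(chunk$value$gear, na.rm = TRUE)
}
```

I would expect the same iterator to work with %dopar% once a cluster is registered, since foreach pulls elements from the iterator on the master and ships each one to a worker, but I have not verified this on the real data.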
I know the example is silly, but it illustrates what I am trying to achieve. In reality, I use a very large dataset and estimate a couple of models that each take a long time to run. However, the data are from a census, so I cannot share even a few lines of them.