How to run R commands in parallel with controlled memory usage?


I have a large object (~50GB) that I wish to break into roughly 10 non-disjoint subsets to perform analysis on concurrently. The issue is that I have had numerous bad experiences with parallelism in R whereby large memory objects from the environment are duplicated in each worker, blowing out the memory usage.

Given that the subsets data[subset, ] are only ~5GB each, what method could I use to perform concurrent processing with properly controlled memory usage?

One solution I found was jobs::jobs(), which allows me to explicitly specify the objects that are exported. Unfortunately, this works by starting RStudio jobs, which I don't have a way of programmatically monitoring for completion. I want to aggregate all of the analyses after they are complete, and currently I would have to watch the RStudio job monitor to check that all jobs have finished before running the next code block.

Ideally I would like to use something like futures, but with a tightly controlled memory footprint: from my understanding, a future will capture any global variables it finds (including my very large full data object) and send them off to each worker. However, if I can control what data is exported, then the promises would allow my program to continue automatically once the heavy concurrent work is done.
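For example, I understand that future::future() accepts a globals argument that can be an explicit named list, which is the kind of control I am after. A minimal sketch of what I have in mind (run_analysis() is the analysis function used in the loop below):

library(future)

plan(multisession, workers = 2)

# Only the objects named in `globals` should be exported to the worker;
# the large `data` object itself stays in the main process.
f <- future(
    run_analysis(data_subset),
    globals = list(data_subset = data[subset, ],
                   run_analysis = run_analysis)
)

res <- value(f)  # blocks until the worker is done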

So I am looking for a package, code pattern, or other solution that would allow me to:

  1. Be able to execute each iteration of
for (subset in subset_list) {
    data_subset <- data[subset, ]
    run_analysis(data_subset)
}

asynchronously (preferably in separate processes).

  2. Be able to wait for the completion of all analysis workers and then continue with the rest of the script.
  3. Strictly control the objects exported into the worker processes (a sketch of the kind of pattern I have in mind follows this list).
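For example, something with roughly the following shape would cover all three points. This is only a sketch of what I imagine, using callr::r_bg() purely as an illustration; subset_list, data and run_analysis are as above, and as far as I understand only the objects passed through args are serialized to each worker.

library(callr)

# Start one background R process per subset; only the objects listed in
# `args` are sent to the worker, nothing else from my environment.
procs <- lapply(subset_list, function(subset) {
    r_bg(
        function(data_subset, analyse) analyse(data_subset),
        args = list(data_subset = data[subset, ],
                    analyse = run_analysis)
    )
})

# Block until every worker has finished, then collect the results.
invisible(lapply(procs, function(p) p$wait()))
results <- lapply(procs, function(p) p$get_result())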

Following Henrik's comment: is it correct that the code below creates duplicates of data and data_split within the workers? If so, is there a strategy to avoid such duplication?

library(furrr)

plan(multisession, workers = 3)

data <- runif(2e8)  # ~1.6 GB of doubles
data_split <- split(data, sample(1:4, 2e8, replace = TRUE))

process <- function(x) {
    Sys.sleep(60)
    exp(x)
}

res <- future_map(data_split, process)
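And would explicitly switching off the automatic globals search avoid the copies? My assumption is that furrr_options(globals = FALSE) does this, with the elements of data_split still sent to the workers one at a time as arguments:

res <- future_map(
    data_split,
    process,
    .options = furrr_options(globals = FALSE)  # no automatic export of globals
)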