I'm testing out the targets package and am running into a problem with customizing parallelization. My workflow has two steps, and I'd like to parallelize the first step over 4 workers and the second step over 16 workers.
I want to know if I can solve this by calling tar_make_future() and then specifying how many workers each step requires in the tar_target() calls. I've got a simple example below where I'd like the data step to execute with 1 worker and the sums step to execute with 3 workers.
library(targets)
tar_dir({
  tar_script({
    library(future)
    library(future.callr)
    library(dplyr)
    plan(callr)
    list(
      # Goal: this step should execute with 1 worker
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        ) %>%
          group_by(id) %>%
          tar_group(),
        iteration = "group"
      ),
      # Goal: this step should execute with 3 workers, in parallel
      tar_target(
        sums,
        sum(data$x),
        pattern = map(data),
        iteration = "vector"
      )
    )
  })
  tar_make_future()
})
I know that one option is to configure the parallel backend separately within each step, and then call tar_make() to execute the workflow serially. I'm curious about whether I can get this kind of result with tar_make_future().
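For reference, here is a rough sketch of that serial alternative, under my own assumptions rather than anything from the original pipeline: it drops the dynamic branching and instead parallelizes inside the sums target with future.apply::future_sapply(), so tar_make() runs the targets themselves one at a time.

library(targets)
tar_dir({
  tar_script({
    library(future)
    library(future.apply)
    library(future.callr)
    list(
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        )
      ),
      tar_target(
        sums,
        {
          # Backend configured inside this one target: 3 transient workers.
          plan(callr, workers = 3)
          # Sum each id group on its own worker.
          future_sapply(split(data$x, data$id), sum)
        }
      )
    )
  })
  tar_make() # targets run serially; the parallelism lives inside sums
})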
I would recommend that you call tar_make_future(workers = <max_parallel_workers>) and let targets figure out how many workers to run in parallel. targets automatically figures out which targets can run in parallel and which need to wait for upstream dependencies to finish. In your case, some of the data branches may finish before others, in which case the corresponding sums branches can start right away. In other words, some sums branches will start running before other sums branches can, and you can trust targets to scale up transient workers when the need arises. The animation at https://books.ropensci.org/targets/hpc.html#future may help visualize this. If you were to micromanage the parallelism for data and sums separately, you would likely have to wait for all of data to finish before any of sums could start, which could take a long time.
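Concretely, for the example above, the call would look something like this; the cap of 3 is illustrative, and you would set it to the largest worker pool you are willing to run:

# Cap the pool of transient workers at 3; targets launches a worker
# only when a target or branch has all of its upstream dependencies done.
tar_make_future(workers = 3)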