I'm testing out the `targets` package and am running into a problem with customizing parallelization. My workflow has two steps, and I'd like to parallelize the first step over 4 workers and the second step over 16 workers.

I want to know if I can solve the problem by calling `tar_make_future()`, and then specifying how many workers each step requires in the `tar_target()` calls. I've got a simple example below, where I'd like the `data` step to execute with 1 worker, and the `sums` step to execute with 3 workers.
```r
library(targets)
tar_dir({
  tar_script({
    library(future)
    library(future.callr)
    library(dplyr)
    plan(callr)
    list(
      # Goal: this step should execute with 1 worker
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        ) %>%
          group_by(id) %>%
          tar_group(),
        iteration = "group"
      ),
      # Goal: this step should execute with 3 workers, in parallel
      tar_target(
        sums,
        sum(data$x),
        pattern = map(data),
        iteration = "vector"
      )
    )
  })
  tar_make_future()
})
```
I know that one option is to configure the parallel backend separately within each step, and then call `tar_make()` to execute the workflow serially. I'm curious about whether I can get this kind of result with `tar_make_future()`.
I would recommend that you call `tar_make_future(workers = <max_parallel_workers>)` and let `targets` figure out how many workers to run in parallel. `targets` automatically figures out which targets can run in parallel and which need to wait for upstream dependencies to finish. In your case, some of the `data` branches may finish before others, in which case `sums` can start right away. In other words, some `sums` branches will start running before other `sums` branches can start, and you can trust `targets` to scale up transient workers when the need arises. The animation at https://books.ropensci.org/targets/hpc.html#future may help visualize this. If you were to micromanage the parallelism for `data` and `sums` separately, you would likely have to wait for all of `data` to finish before any of `sums` can start, which could take a long time.
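For the example pipeline in the question, that recommendation amounts to changing only the final call. A minimal sketch, assuming the pipeline script is unchanged; the value `3` is an assumption chosen to match the three `sums` branches (one per `id` group) in the example, not a required setting:

```r
# Sketch: cap the worker pool at the most branches that can ever run
# at once. Here `sums` has 3 branches, so 3 transient workers suffice;
# targets decides when each branch is ready and dispatches it.
tar_make_future(workers = 3)
```

With this, `targets` keeps at most 3 `future.callr` workers busy, launching each `sums` branch as soon as its upstream `data` group is available rather than waiting for the whole `data` target to finish.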