I have a long-ish `targets` pipeline (takes more than an hour to execute) for which parallel execution is possible. Specifically, many, but not all, calculations can be done in parallel across 155 countries and 60 years. There are times when the country-specific calculations are aggregated from continents to the world, for example, and that sum is not amenable to parallel execution. I am running the pipeline on my local machine only (not on a cluster or networked computer).
When I run the pipeline with 5 countries and `tar_make_clustermq(workers = 8)` (on 10-core machines: a 14-inch Apple silicon MacBook Pro and an iMac Pro), the pipeline is successful. Furthermore, I see 5 processors in use simultaneously. However, when I run the pipeline with 6 or more countries, there are several points at which the pipeline stalls or seemingly switches to single-threaded execution. I have found that I need to restart the pipeline with `tar_make_clustermq(workers = 8)` or (worse) restart with `tar_make()` (single-threaded) to get it going again.
The points in the pipeline where a restart is required are 100% repeatable for a specific set of countries, but they change when the set of countries in the analysis changes.
It would be pretty difficult to develop a reprex for this behavior, because of the large files and pipelines involved. So at this time, I am requesting suggestions for next steps for debugging or changing course altogether. Here are some specific questions:
- I have searched and found only this report (https://github.com/ropensci/targets/issues/182). Have I missed other reports of similar behavior?
- If others have found unreliable behavior from `targets` and `clustermq` on the local machine, what hints can you provide for getting around these problems?
- I have considered switching from `clustermq` to `future` in `targets`. I'm wondering if that switch would provide improvements. I have not tried `future`. So if someone has experience with both, I welcome your input.
Thanks in advance for any hints!
I switched to using `future` with much success. I'm using `future::plan(future.callr::callr)`. My pipeline no longer hangs/stalls, as it did when using `clustermq`. Rather, it completes without intervention, as desired. For this pipeline, at least, `future` is the way to go!
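A minimal sketch of this setup, assuming the plan is declared in `_targets.R` and the pipeline is run with `tar_make_future()`. The target names and `analyze_country()` are hypothetical placeholders, not the actual pipeline; only the `future::plan()` line comes from the setup described above.

```r
# _targets.R (illustrative sketch; target and function names are hypothetical)
library(targets)

# Each worker runs in a fresh background R session started by callr.
future::plan(future.callr::callr)

# Hypothetical stand-in for the real per-country calculation.
analyze_country <- function(country) {
  data.frame(country = country, value = nchar(country))
}

list(
  # Assumed toy input; the real pipeline covers 155 countries.
  tar_target(countries, c("USA", "GBR", "ZAF")),
  # Dynamic branching: one branch per country, eligible to run in parallel.
  tar_target(by_country, analyze_country(countries), pattern = map(countries)),
  # The aggregation waits for every branch and runs as a single target.
  tar_target(world, sum(by_country$value))
)
```

With a file like this in place, the pipeline would be started from the R console with `targets::tar_make_future(workers = 8)` instead of `tar_make_clustermq(workers = 8)`.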