Suppose I have a master program, which is basically a single-rank MPI program that uses MPI_Comm_spawn to spawn 5 worker programs.
Now, if I execute my master using the following command:
aprun -n 1 -N 1 master
The total number of ranks after spawning will be 6. But will all 6 ranks be running on the same node? Is there any way I can distribute the 6 ranks among 3 nodes?
I want exactly one copy of the master process and 5 worker processes.
Cray MPI did not support MPI_Comm_spawn until recently, and its approach to managing resources for spawned MPI jobs is unique. A placeholder job is launched using aprun to manage the resources used to host spawned jobs, i.e., the cores/nodes that will host the spawned MPI ranks. The set of resources managed by the placeholder job is called a "rank pool", by analogy with a memory pool. Here's how you would set up and use a rank pool:
rankpool.c
spawning_app.c
If you want to distribute 6 ranks across three nodes, you can launch your rank pool using aprun -n 6 -N 2, so you have 6 total ranks and 2 ranks per node.
If you want a more specific layout for your spawned ranks, you can reorder the ranks in the communicator that you pass to MPIX_Comm_rankpool to obtain this effect. For example, if your master job spawns various child jobs each with 4 ranks, and you want the ranks for each child job spread evenly across nodes, you can reorder the ranks in MPI_COMM_WORLD
from this:
to this:
MPIX_Comm_rankpool will attempt to assign a contiguous set of ranks to each child job, so child jobs will generally have one rank on each node.
For more details on how this all works, see Cray's dynamic process management whitepaper.
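As a rough illustration of the reordering idea, here is a minimal sketch of the index arithmetic only. It assumes the rank pool was launched with block placement (consecutive ranks of MPI_COMM_WORLD packed onto the same node, ranks_per_node at a time), which you should check for your own aprun options. The helper names round_robin_key and node_of_key are made up for this example and are not part of Cray MPI; the call to MPIX_Comm_rankpool itself is left out because its exact arguments are not shown here.

```c
#include <assert.h>

/* Hypothetical helper (not a Cray or MPI function).
 * Assumes block placement: world rank r lives on node r / ranks_per_node.
 * Returns the position the rank should take in the reordered communicator,
 * so that consecutive positions fall on different nodes. */
int round_robin_key(int world_rank, int ranks_per_node, int num_nodes)
{
    assert(ranks_per_node > 0 && num_nodes > 0);
    assert(world_rank >= 0 && world_rank < ranks_per_node * num_nodes);

    int node = world_rank / ranks_per_node; /* node hosting this rank */
    int slot = world_rank % ranks_per_node; /* index of the rank on that node */

    return slot * num_nodes + node;
}

/* Hypothetical helper: node that hosts a given position after reordering. */
int node_of_key(int key, int num_nodes)
{
    assert(key >= 0 && num_nodes > 0);
    return key % num_nodes;
}

/* Possible use inside the rank pool program (standard MPI call, sketch only):
 *
 *   int rank;
 *   MPI_Comm reordered;
 *   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *   MPI_Comm_split(MPI_COMM_WORLD, 0, round_robin_key(rank, 4, 4), &reordered);
 *
 * MPI_Comm_split orders ranks in the new communicator by the key argument,
 * so "reordered" would be the communicator to hand to MPIX_Comm_rankpool.
 */
```

With 4 nodes and 4 ranks per node, positions 0 to 3 of the reordered communicator land on nodes 0, 1, 2 and 3, so a contiguous block of 4 ranks given to one child job would touch every node once, provided the block-placement assumption holds.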