I am trying to run a program that spawns worker programs using MPI_Comm_spawn. However, if I set the number of processes to spawn to 4, the master process spawns 3 and then crashes with the following error:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: sr530-01
PID: 154333
Message: connect() to myipadd:1028 failed
Error: Operation now in progress (115)
I can always spawn n - 1 worker processes before it crashes. I separated my code into two files, one for the master and one for the worker. In the master code, the variable worker_count determines the number of workers; no matter what value I set, I always get the same error.
Master code
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    int worker_count = 3; // Number of worker processes to spawn
    MPI_Comm worker_comm;
    int array_of_errcodes[3]; // Array to store error codes

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) { // Master process
        printf("Master process is running.\n");

        // Define the command and arguments for the worker program
        const char *worker_program = "./worker"; // Path to the worker program executable
        char *worker_argv[] = {"./worker", NULL}; // Arguments for the worker program
        int maxprocs = worker_count; // Number of worker processes to spawn
        MPI_Info info = MPI_INFO_NULL; // No additional info

        // Spawn worker processes
        MPI_Comm_spawn(worker_program, worker_argv, maxprocs, info, 0, MPI_COMM_SELF, &worker_comm, array_of_errcodes);

        // Optionally, you can perform work with the worker processes here

        // Wait for all worker processes to complete
        MPI_Barrier(worker_comm);

        // Disconnect the intercommunicator only once
        if (worker_comm != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&worker_comm);
        }

        printf("Master process is done.\n");
    }

    MPI_Finalize();
    return 0;
}
Worker code
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) { // Worker processes (rank > 0)
        printf("Worker process %d is running.\n", rank);

        // Perform the work needed by worker processes

        printf("Worker process %d is done.\n", rank);
    }

    MPI_Finalize();
    return 0;
}
Here is the complete output (including the error) when I run the master process. In this case I set worker_count to two:
Master process is running.
Worker process 1 is running.
Worker process 1 is done.
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: sr530-01
PID: 154333
Message: connect() to 0.0.0.0:1028 failed (fake IP address)
Error: Operation now in progress (115)
Firstly, Open MPI is not failing to spawn new processes in your scenario. It is working as intended! The spawned workers share their own MPI_COMM_WORLD with ranks 0 to n-1, and the worker with rank 0 prints nothing because of the if (rank != 0) check, which is why you only see output from n-1 workers.
The MPI_Barrier call in your master process waits for the child processes in the communicator to call the barrier as well, but your child processes have already called MPI_Finalize (there is no MPI_Barrier in the worker code) and exited. As a result, Open MPI shows the warning that it failed to connect to a peer MPI process.
Secondly, it is not trivial to synchronise the master and child processes. Please look at this SO thread.
Lastly, MPI_Comm_disconnect is not ideally intended for use with spawn. Please see the website.
To summarize, if you remove the MPI_Barrier and MPI_Comm_disconnect calls, the program will work as expected!
If you need to use MPI_Barrier, add the following lines to your worker code:
Obtain the communicator of the parent process,
MPI_Comm parent;
MPI_Comm_get_parent(&parent);
Use that communicator for MPI_Barrier inside the worker,
MPI_Barrier(parent);
If you don't want to use the barrier, your child processes could also send a message to the master process (for example, to report the end of a task) using the parent communicator.
To make the code work, modify your worker code as described above, and remove or comment out the MPI_Comm_disconnect call in the master code.
Hope this helps!