Open MPI failed to spawn process


I am trying to run a program that spawns worker programs using MPI_Comm_spawn. However, if I set the number of processes to spawn to, say, 4, the master process appears to spawn 3 and then crashes with the following error:

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: sr530-01
  PID:        154333
  Message:    connect() to myipadd:1028 failed
  Error:      Operation now in progress (115)

I can always spawn n - 1 worker processes before it crashes. I separated my code into two files, one for the master code and one for the worker code. In the master code I set a variable worker_count, which determines the number of workers. No matter what value I set, I always get the same error.

Master code

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    int worker_count = 3;  // Number of worker processes to spawn
    MPI_Comm worker_comm;
    int array_of_errcodes[3];  // Array to store error codes

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {  // Master process
        printf("Master process is running.\n");

        // Define the command and arguments for the worker program
        const char *worker_program = "./worker";  // Path to the worker program executable
        char *worker_argv[] = {"./worker", NULL};  // Arguments for the worker program
        int maxprocs = worker_count;  // Number of worker processes to spawn
        MPI_Info info = MPI_INFO_NULL;  // No additional info

        // Spawn worker processes
        MPI_Comm_spawn(worker_program, worker_argv, maxprocs, info, 0, MPI_COMM_SELF, &worker_comm, array_of_errcodes);

        // Optionally, you can perform work with the worker processes here

        // Wait for all worker processes to complete
        MPI_Barrier(worker_comm);

        // Disconnect the intercommunicator only once
        if (worker_comm != MPI_COMM_NULL) {
            MPI_Comm_disconnect(&worker_comm);
        }

        printf("Master process is done.\n");
    }

    MPI_Finalize();
    return 0;
}

Worker code

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {  // Worker processes (rank > 0)
        printf("Worker process %d is running.\n", rank);

        // Perform the work needed by worker processes

        printf("Worker process %d is done.\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Here is the complete output (standard output plus the error) when I run the master process. In this case I have set worker_count to two:

Master process is running.
Worker process 1 is running.
Worker process 1 is done.
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: sr530-01
  PID:        154333
  Message:    connect() to 0.0.0.0:1028 failed (fake IP address)
  Error:      Operation now in progress (115)


j23 On

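Firstly, Open MPI is not failing to spawn new processes in your scenario. It is working as intended! All the requested workers are started. The spawned workers share their own MPI_COMM_WORLD with ranks starting at 0, and the `if (rank != 0)` check in your worker code means the rank 0 worker prints nothing, so only n - 1 workers show up in the output.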

The MPI_Barrier call in your master process waits for the child processes in the intercommunicator to call the barrier as well, but they never do: there is no MPI_Barrier in your worker program, so each child has already called MPI_Finalize and exited. As a result, Open MPI shows the warning that it failed to connect to a peer MPI process.

Secondly, synchronising the master and child processes is not trivial. Please look at this SO thread.

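Lastly, MPI_Comm_disconnect is not ideal to use with spawn the way it is used here: it is a collective call, so calling it only in the master leaves the master waiting on workers that never make the matching call. Please see the website.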

To summarize, if you remove the MPI_Barrier and MPI_Comm_disconnect, the program will work as expected!

If you need to use the MPI_Barrier, add the following lines to your worker code:

  1. Obtain the communicator of the parent process,

    MPI_Comm parent;

    MPI_Comm_get_parent(&parent);

  2. Use that communicator for MPI_Barrier inside the worker,

    MPI_Barrier(parent);

If you don't want to use the barrier, your child processes could instead send a message to the master process (for example, to report the end of a task) using the parent communicator.

To make the code work, modify your worker code:

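#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Intercommunicator to the master that spawned this worker
    MPI_Comm_get_parent(&parent);

    // Note: worker ranks start at 0, so the rank 0 worker prints nothing here
    if (rank != 0) {  // Worker processes (rank > 0)
        printf("Worker process %d is running.\n", rank);

        // Perform the work needed by worker processes

        printf("Worker process %d is done.\n", rank);
    }

    // Matches the MPI_Barrier(worker_comm) call in the master
    MPI_Barrier(parent);
    MPI_Finalize();
    return 0;
}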

Remove or comment out the MPI_Comm_disconnect call in the master code:

    // Wait for all worker processes to complete
    MPI_Barrier(worker_comm);

    // Disconnect the intercommunicator only once
    /*
    if (worker_comm != MPI_COMM_NULL) {
       MPI_Comm_disconnect(&worker_comm);
    }
    */

Hope this helps!