Failure to connect in mpi4py

I am using mpi4py to parallelize SciPy's differential_evolution.

Here is my attempt, but it sometimes hangs:

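import sys

from scipy.optimize import differential_evolution

# `likelihood` is defined elsewhere in the script (not shown).
# `bounds` is a placeholder: differential_evolution requires it, but the post omits it.

if __name__ == '__main__':

    if len(sys.argv) > 2 and sys.argv[2] == 'mpi':
        from mpi4py.futures import MPIPoolExecutor
        with MPIPoolExecutor() as executor:
            # `workers` accepts a map-like callable, here the pool's map method
            results = differential_evolution(likelihood, bounds, workers=executor.map)

    else:
        # Otherwise the first argument is the number of local worker processes
        num_workers = int(sys.argv[1])
        results = differential_evolution(likelihood, bounds, workers=num_workers)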

This is the job script for the cluster, where each node has 128 physical cores and 256 virtual cores:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-core=2

module load --auto python/3.9.15-gcc-12.2.0-3sr5utz
module load --auto py-pandas/1.5.1-gcc-12.2.0-356d2ew
module load --auto py-scipy/1.8.1-gcc-12.2.0-7uvxgvy
module load --auto py-mpi4py/3.1.3-gcc-12.2.0-xvabib2

mpiexec  --map-by node --mca btl "^openib"  -n 201  python3 -m mpi4py.futures ./mypyscripy.py 200 'mpi'
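
As a rough sanity check on the launch line above: with `python3 -m mpi4py.futures`, one rank acts as the master and the remaining ranks serve as workers, so `-n 201` should give 200 workers. Assuming `--map-by node` deals ranks out to the nodes round-robin and that the two nodes from `#SBATCH -N 2` are the only hosts in the allocation, the sketch below estimates how many ranks land on each node. The function name `ranks_per_node` is hypothetical and used only for illustration.

```python
def ranks_per_node(n_ranks: int, n_nodes: int) -> list[int]:
    """Estimate ranks per node, assuming round-robin placement by node."""
    base, extra = divmod(n_ranks, n_nodes)
    # The first `extra` nodes receive one additional rank.
    return [base + 1 if i < extra else base for i in range(n_nodes)]


n_ranks = 201          # from `-n 201`
n_nodes = 2            # from `#SBATCH -N 2`
workers = n_ranks - 1  # one rank is the mpi4py.futures master

layout = ranks_per_node(n_ranks, n_nodes)
print(workers, layout)
```

Under these assumptions each node hosts about 100 ranks, which fits within 128 physical cores, so core oversubscription looks unlikely to explain the hang on its own.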

Sometimes the above fails with the following error messages:

[n3510-008][[54035,1],198][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: n3511-018
  PID:        43453
  Message:    connect() to 10.191.11.16:1105 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
[n3511-016:43831] 65 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
[n3511-016:43831] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
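
To see whether the failures cluster on particular nodes, one option is to tally the hosts and peer addresses named in the error output. The sketch below is a minimal illustration written against only the message shapes quoted above. The function name `summarize_tcp_failures` and the regular expressions are assumptions, not part of Open MPI, and other message formats would need their own patterns.

```python
import re
from collections import Counter

# Patterns matched against the messages quoted above (assumed shapes).
LOCAL_HOST = re.compile(r"Local host:\s+(\S+)")
PEER_ADDR = re.compile(r"connect\(\) to (\d+\.\d+\.\d+\.\d+):(\d+) failed")
RECV_FAIL = re.compile(
    r"^\[([^\]]+)\]\[\[\d+,\d+\],(\d+)\].*recv\(\d+\) failed: (.+)$"
)


def summarize_tcp_failures(log_text: str) -> dict:
    """Count reporting hosts and unreachable peers in Open MPI TCP errors."""
    hosts = Counter()   # hosts that reported a failed connect()
    peers = Counter()   # (ip, port) pairs that could not be reached
    recv_failures = []  # (host, rank, message) for failed blocking receives
    for line in log_text.splitlines():
        line = line.strip()
        m = LOCAL_HOST.search(line)
        if m:
            hosts[m.group(1)] += 1
        m = PEER_ADDR.search(line)
        if m:
            peers[(m.group(1), int(m.group(2)))] += 1
        m = RECV_FAIL.match(line)
        if m:
            recv_failures.append((m.group(1), int(m.group(2)), m.group(3)))
    return {"hosts": hosts, "peers": peers, "recv_failures": recv_failures}


sample = """\
[n3510-008][[54035,1],198][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
  Local host: n3511-018
  PID:        43453
  Message:    connect() to 10.191.11.16:1105 failed
  Error:      Operation now in progress (115)
"""
summary = summarize_tcp_failures(sample)
print(summary)
```

Running the job with `orte_base_help_aggregate` set to 0, as the last log line suggests, should print every help message rather than one aggregated copy, which gives a summary like this more data to work with.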