I am using mpi4py together with SciPy's differential_evolution.
Here is my attempt, which sometimes hangs:
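import sys
from scipy.optimize import differential_evolution

# likelihood and bounds are defined elsewhere in the script (not shown)
if __name__ == '__main__':
    if sys.argv[2] == 'mpi':
        from mpi4py.futures import MPIPoolExecutor
        with MPIPoolExecutor() as executor:
            # workers accepts a map-like callable, so pass the executor's map method
            results = differential_evolution(likelihood, bounds, workers=executor.map)
    else:
        # fall back to SciPy's own process pool with the requested worker count
        num_workers = int(sys.argv[1])
        results = differential_evolution(likelihood, bounds, workers=num_workers)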
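To make the pattern easier to reproduce without a cluster, here is a minimal sketch of passing a map-like callable to differential_evolution. Everything in it beyond the SciPy call is an assumption for illustration: `sphere` is a hypothetical stand-in for my likelihood, the bounds are made up, and `ThreadPoolExecutor` from the standard library replaces `MPIPoolExecutor` only so that the snippet runs without MPI. As far as I understand the SciPy documentation, parallel evaluation requires `updating="deferred"`; otherwise SciPy overrides the setting and emits a warning.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy.optimize import differential_evolution


def sphere(x):
    """Hypothetical stand-in objective with its minimum at the origin."""
    return float(np.sum(x ** 2))


def run(map_like):
    """Minimize `sphere`, evaluating the population through `map_like`."""
    bounds = [(-5.0, 5.0)] * 3  # made-up bounds, for illustration only
    return differential_evolution(
        sphere,
        bounds,
        workers=map_like,      # any callable with the signature map(func, iterable)
        updating="deferred",   # needed when the population is evaluated in parallel
    )


if __name__ == "__main__":
    # ThreadPoolExecutor is a stand-in here; under MPI the same call would
    # receive the map method of an MPIPoolExecutor instead.
    with ThreadPoolExecutor(max_workers=4) as executor:
        result = run(executor.map)
    print(result.x, result.fun)
```

If this small version runs cleanly on one machine but the MPI variant still hangs, that would suggest the problem lies in the MPI launch or the network transport rather than in how `workers` is used.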
Below is the job script for the cluster, where each node has 128 physical cores and 256 virtual cores:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-core=2
module load --auto python/3.9.15-gcc-12.2.0-3sr5utz
module load --auto py-pandas/1.5.1-gcc-12.2.0-356d2ew
module load --auto py-scipy/1.8.1-gcc-12.2.0-7uvxgvy
module load --auto py-mpi4py/3.1.3-gcc-12.2.0-xvabib2
mpiexec --map-by node --mca btl "^openib" -n 201 python3 -m mpi4py.futures ./mypyscripy.py 200 'mpi'
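For context on the process count: when launched with `python3 -m mpi4py.futures`, rank 0 runs the script and the remaining ranks act as workers, so `-n 201` gives one master plus 200 workers. The sketch below only illustrates how those ranks would be spread over the two nodes, assuming `--map-by node` places ranks round-robin across nodes and that each node has enough slots. The helper name `ranks_per_node` is hypothetical.

```python
from collections import Counter


def ranks_per_node(n_ranks, n_nodes):
    """Count ranks per node when rank i is placed on node i % n_nodes."""
    return Counter(rank % n_nodes for rank in range(n_ranks))


placement = ranks_per_node(201, 2)
print(dict(placement))  # {0: 101, 1: 100}
```

Under that assumption each node hosts about 100 processes, which is below the 128 physical cores per node.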
Sometimes the above fails with the following error messages:
[n3510-008][[54035,1],198][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (104)
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: n3511-018
PID: 43453
Message: connect() to 10.191.11.16:1105 failed
Error: Operation now in progress (115)
--------------------------------------------------------------------------
[n3511-016:43831] 65 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
[n3511-016:43831] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages