I am running an analysis on a cluster, and internally I spawn some processes. Most of the time it works, but sometimes I get the following error:
mm_xpmem.c:135 UCX ERROR failed to attach xpmem apid 0x600005c0e offset 0x2b8cb9183000 length 12288: No such file or directory
mm_ep.c:172 UCX ERROR mm ep failed to connect to remote FIFO id 0x2b8cb9183000: Input/output error
The error occurs at random. What causes it, and how can it be resolved?
OpenMPI: 4.0.5
mpi4py: 3.1.3
I don't know whether this is possible in your case, but removing the xpmem kernel module (which has to be done by an administrator) fixed a similar problem I had with OpenMPI 4.1.1.1.
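If you cannot get the kernel module removed, a possible workaround that needs no admin rights is to ask UCX not to use the xpmem transport at all. I have not verified that this cures the error above, so treat it as a sketch. It assumes your UCX build accepts the `^` deny-list prefix in the `UCX_TLS` environment variable (older UCX versions may not), and `env_without_xpmem` is just a helper name I made up for illustration:

```python
import os


def env_without_xpmem(base_env=None):
    """Return a copy of the environment that asks UCX to skip xpmem.

    Assumption: the installed UCX understands the '^' deny-list prefix
    in UCX_TLS. An existing UCX_TLS setting is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("UCX_TLS", "^xpmem")
    return env


# Usage: apply this before MPI is initialized, i.e. before the first
# `from mpi4py import MPI` in the parent process:
#
#     os.environ.update(env_without_xpmem())
#     from mpi4py import MPI
```

Setting the variable in the job script (before `mpirun`) should have the same effect. Whether spawned child processes see it depends on how your launcher forwards the environment, so that is worth checking on your cluster.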