I'm running an MPI job on a SLURM/DMTCP/MPICH stack, but the execution never completes: the job hits its time limit without ever starting the MPI program. The output file contains:
SLURM_JOBID=4
SLURM_JOB_NODELIST=node[1-2]
SLURM_NNODES=2
SLURMTMPDIR=
working directory = /home/manager
slurmstepd-node1: error: *** JOB 4 ON node1 CANCELLED AT 2024-03-29T17:15:00 DUE TO TIME LIMIT ***
[mpiexec@node1] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
[mpiexec@node1] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec@node1] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@node1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@node1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@node1] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
If I run mpirun -np 4 ./algorithm directly, it works just fine, but when I submit a job that uses DMTCP (dmtcp_launch --rm mpirun ...), I get the error above.
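For context, here is a minimal sketch of the kind of batch script this workflow implies. This is an assumption on my part, not my actual script: the node count, time limit, port-file name, and the ./algorithm binary are placeholders, and it assumes a DMTCP coordinator must be reachable before dmtcp_launch runs (a common cause of DMTCP jobs hanging until the time limit).

```shell
#!/bin/bash
#SBATCH --job-name=dmtcp-mpi
#SBATCH --nodes=2
#SBATCH --time=00:30:00

# Start a DMTCP coordinator on the first allocated node.
# dmtcp_launch needs a reachable coordinator; without one it can
# block waiting to connect, which matches a job that never starts.
dmtcp_coordinator --daemon --exit-on-last --port-file cport.txt
PORT=$(cat cport.txt)

# Launch the MPI program under DMTCP. --rm enables DMTCP's
# resource-manager (SLURM) integration; --coord-port points at
# the coordinator started above. ./algorithm is a placeholder.
dmtcp_launch --rm --coord-port "$PORT" mpirun -np 4 ./algorithm
```

If the coordinator is not started (or its port is not passed through), dmtcp_launch and mpirun can sit idle until SLURM cancels the job, producing exactly the HYDU_sock_write / "unable to send signal downstream" cascade shown in the log when the time limit fires.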