I am facing an issue with a Slurm job submitted to a node in our cluster, which is running Rocky Linux 8.8. The job's status is 'R', but it has been running for over a day without producing any output or errors.
Previously, this job would complete within a few minutes. Also, attempting to cancel the running job results in it freezing in the 'CG' state.
I tried to restart the Slurm service on the node using the commands:
```
systemctl restart slurmd
systemctl restart slurmd.service
systemctl restart sshd
```
I also tried to reboot the node.
However, the issue persists, and it occurs consistently with different submitted jobs.
What can cause this issue, and how can I fix it?
Thanks
This is often caused by a blocked I/O operation: the job cannot write to a filesystem, and Slurm cannot properly cancel it because one of its processes is stuck in the D (uninterruptible sleep) state. From the Slurm controller's point of view, the job remains in the `CG` ("completing") state. A failing network mount, for instance NFS, is often the culprit, but if the problem remains after a node reboot, you should also look for a failing local disk (local scratch, OS disk, etc.).