Slurm jobs are running, but there is no output or errors


I am facing an issue with a Slurm job submitted to a node in our cluster, which is running Rocky Linux 8.8. The job's status is 'R', but it has been running for over a day without producing any output or errors.

Previously, this job would complete within a few minutes. Also, attempting to cancel the running job results in it freezing in the 'CG' state.

I tried to restart the Slurm and SSH services on the node using the commands:

systemctl restart slurmd
systemctl restart slurmd.service
systemctl restart sshd

I also tried to reboot the node.

However, the issue persists, and it occurs consistently across different submitted jobs.

What can cause this issue, and how can I fix it?

Thanks

1 Answer


This is often caused by a blocked I/O operation: the job cannot write to a filesystem, and Slurm is unable to cancel the job properly because one of its processes is stuck in the D (uninterruptible sleep) state. From the Slurm controller's point of view, the job remains in the CG ("completing") state.
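A quick way to confirm this is to log in to the compute node and look for processes stuck in the D state; the following is just one minimal way to do it:

# list processes in uninterruptible sleep (state D), together with the kernel
# function they are waiting in, which often points at the hung filesystem
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

If the job's processes show up here, killing them will not work until the underlying I/O request returns, which is why the job stays in CG.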

Often, a failing network mount, for instance NFS, is the culprit; but if the problem remains after a node reboot, you should probably look for a failing local disk (local scratch, OS disk, etc.).
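To narrow it down, you can probe the mounts and check the kernel log on the node. A rough sketch, assuming /shared is one of your network mounts (replace it with your actual mount points):

# a stat that hangs or times out usually identifies the blocked mount
timeout 5 stat -f /shared || echo "/shared appears to be hung"

# look for NFS timeouts or local disk I/O errors in the kernel log
dmesg -T | grep -iE 'nfs|not responding|i/o error|blk_update_request'

If the network mounts look healthy after the reboot, I/O errors from the local disk in dmesg (or in the SMART data, e.g. smartctl -a on the relevant device) are the next thing to rule out.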