I am facing an issue with a Slurm job submitted to a node in our cluster, which is running Rocky Linux 8.8. The job's status is 'R', but it has been running for over a day without producing any output or errors.
Previously, this job would complete within a few minutes. Also, attempting to cancel the running job results in it freezing in the 'CG' state.
I tried to restart the Slurm service on the node using the commands:
```
systemctl restart slurmd
systemctl restart slurmd.service
systemctl restart sshd
```
I also tried to reboot the node.
However, the issue persists, and it occurs consistently with different submitted jobs.
What can cause this issue, and how can I fix it?
Thanks
This is often caused by a blocked I/O operation: the job cannot write to a filesystem, and Slurm cannot properly cancel it because one of its processes is stuck in the D (uninterruptible sleep) state. From the Slurm controller's point of view, the job remains in the `CG` ("completing") state. A failing network mount, for instance NFS, is often the culprit, but if the problem remains after a node reboot, you should also look for a failing local disk (local scratch, OS disk, etc.).