All slurm jobs fail silently with exit code 0:53


All my slurm jobs fail with exit code 0:53 within two seconds of starting.

When I look at job details with scontrol show jobid <JOBID> it doesn't say anything suspicious.

When I look at the files that stdout and stderr write to, there is nothing there.

I couldn't find anything on the listed signal 53.


There are 3 best solutions below

0
On

It turns out that the directory containing the files that slurm was supposed to write stdout and stderr to didn't exist.

In my submit.sh script, the relevant lines were:

#SBATCH --output=log/%j.out                 # where to store the output ( %j is the JOBID )
#SBATCH --error=log/%j.err                  # where to store error messages

The log directory didn't exist in the working directory from which I was submitting the job. Once I created it, the slurm jobs no longer failed with 0:53.
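A minimal sketch of guarding against this, assuming the script is called submit.sh as above and the log path is relative to the directory you submit from. The helper name ensure_log_dir is hypothetical. Note that the directory has to exist before sbatch is called: putting mkdir inside the job script itself is too late, because the output files are opened before the script body runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the directory that the
# "#SBATCH --output" / "#SBATCH --error" lines point to,
# and complain loudly if it cannot be created.
ensure_log_dir() {
    local dir="$1"
    mkdir -p -- "$dir" || {
        echo "cannot create log directory: $dir" >&2
        return 1
    }
}

# Usage (left commented out so that sourcing this file submits nothing):
# ensure_log_dir log && sbatch submit.sh
```

With this, the job is only submitted if log/ exists or could be created.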

My slurm version is 22.05.2. Per this answer, slurm no longer errors silently when the output directory doesn't exist from version 23.02 upwards. Seems to have been reported in this issue.

1
On

I wanted to add that while this error has happened to me if the directory does not exist, the same thing happens if you exceed your quota.

0
On

I had the same issue as the OP. In my case the log directory existed, but it was on a read-only filesystem. To cite the entry from the ZIH HPC Compendium:

When redirecting [stdout] and stderr into a file using --output= and [--error=], make sure the target path is writeable on the compute nodes, i.e., it may not point to a read-only mounted filesystem like /projects.

https://compendium.hpc.tu-dresden.de/jobs_and_resources/slurm/
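One way to check this before submitting is a small write probe. This is only a rough sketch, assuming bash; the function name can_write_logs is hypothetical. It tries to actually create and write a file, so it should catch a missing directory and a read-only mount, and it will often (not always) catch an exhausted quota as well.

```shell
#!/usr/bin/env bash
# Hypothetical probe: succeeds only if a small file can really be
# created and written inside the given directory.
can_write_logs() {
    local dir="$1" probe status
    # mktemp fails if the directory is missing or not writable
    probe=$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null) || return 1
    # writing a few bytes may also fail, e.g. when a quota is exceeded
    printf 'probe\n' > "$probe" 2>/dev/null
    status=$?
    rm -f -- "$probe"
    return "$status"
}

# Usage (commented out so that sourcing this file has no side effects):
# can_write_logs log || echo "log/ is not writable from this node" >&2
```

Keep in mind that the path has to be writeable on the compute nodes, so a check on the login node alone may not be enough. Running the same probe inside a short srun step would test it from a compute node.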