Slurm : invalid job credential


I tried to set up a Slurm cluster, composed of one compute node and one control node.

Currently, launching tasks doesn't work reliably. The node sometimes goes down even though the queue is not empty. srun never works, but sbatch sometimes does.

#srun -N1 -l /bin/hostname
srun: error: Task launch for StepId=28.0 failed on node toto2: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

I have already set up the munge key with the correct user; encoding and decoding work:

#munge -n | ssh slurm@toto2 unmunge
STATUS:           Success (0)
ENCODE_HOST:      toto2 (10.0.0.2)
ENCODE_TIME:      2023-09-29 11:23:31 +0200 (1695979411)
DECODE_TIME:      2023-09-29 11:23:31 +0200 (1695979411)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

sbatch works sometimes, but if the sbatch job doesn't start right away, the compute node may stop responding, and I have to manually set it back to idle, even though the node is still reachable over SSH.
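For reference, putting a node back into service is done with scontrol; a sketch using the toto2 node name from the question (run on the control node, as a user with admin rights):

```
# Inspect why Slurm marked the node down (check the Reason= field).
scontrol show node toto2

# Clear the DOWN/DRAIN state and return the node to service.
scontrol update NodeName=toto2 State=RESUME
```

The Reason= field usually points at the underlying problem (e.g. "Not responding"), which is worth recording before resuming the node.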

My munge service is:

#cat /lib/systemd/system/munge.service
[Unit]
Description=MUNGE authentication service
Documentation=man:munged(8)
After=network.target
After=time-sync.target

[Service]
Type=forking
ExecStart=/usr/sbin/munged
PIDFile=/var/run/munge/munged.pid
User=munge
Group=munge
Restart=on-abort

[Install]
WantedBy=multi-user.target

In toto1 (the control node), slurmctld is started with User=slurm. In toto2 (the compute node), slurmd is started with User=root.

And of course, the UID/GID of the slurm and munge users are the same on both nodes.
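One quick way to check this is to compare the getent passwd entries from each node. A minimal sketch; the UID/GID values below are made up for illustration, and toto2 is the compute node from the question:

```shell
# uid_gid: print "UID:GID" from a passwd entry ("name:x:uid:gid:...").
uid_gid() {
    printf '%s\n' "$1" | awk -F: '{print $3 ":" $4}'
}

# On a live cluster, compare the local and remote entries, e.g.:
#   [ "$(uid_gid "$(getent passwd slurm)")" = \
#     "$(uid_gid "$(ssh toto2 getent passwd slurm)")" ] || echo "slurm UID/GID mismatch"

# Example with a hypothetical passwd entry:
uid_gid "slurm:x:64030:64030:Slurm workload manager:/nonexistent:/usr/sbin/nologin"
# prints "64030:64030"
```

Repeat the same comparison for the munge user and for any user that submits jobs.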

1 Answer

Note that the UID/GID must match between all nodes for all users who can submit jobs, not only for the slurm user. If you submit jobs as root, also make sure that DisableRootJobs is not set in slurm.conf.
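DisableRootJobs defaults to NO; the relevant fragment in slurm.conf (which must be identical on all nodes) looks like this:

```
# slurm.conf
# DisableRootJobs=YES would reject every job submitted by root.
DisableRootJobs=NO
```

You can confirm the value the running controller actually uses with `scontrol show config | grep -i DisableRootJobs`.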

If the UID/GID match, then you should investigate a possible time drift between the nodes; the munge credential includes a timestamp and can be invalid if the nodes are not all synchronized to the same NTP server. I have observed that even a small time drift can lead to a situation where some jobs run and others fail in a seemingly random way, most probably depending on the resolution of the timestamp.
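The effect of drift can be sketched with the TTL from the unmunge output above (300 s): a credential is only accepted if it is decoded within the TTL window around its encode time, so a drift larger than the TTL rejects everything, while a drift close to it causes intermittent failures. A simplified model of that check:

```shell
# within_ttl ENCODE_EPOCH DECODE_EPOCH TTL
# Returns 0 if the decode time is within TTL seconds of the encode time.
# Simplified model of munge's TTL check, for illustration only.
within_ttl() {
    encode=$1; decode=$2; ttl=$3
    diff=$((decode - encode))
    [ "$diff" -lt 0 ] && diff=$((-diff))
    [ "$diff" -le "$ttl" ]
}

# Same clock on both nodes: credential accepted.
within_ttl 1695979411 1695979411 300 && echo "in window"
# Compute node ~10 minutes ahead: credential rejected.
within_ttl 1695979411 1695980000 300 || echo "drifted"
```

In practice, compare `date +%s` on both nodes (or check `timedatectl` / your NTP daemon's status) rather than guessing.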