Slurm and Munge "Invalid Credential"


I'm installing Slurm for the first time. I've installed the 19.05.1-2 tarball and used the configurator to make a very simple two-node cluster. The control node is sdc; the compute nodes (running slurmd) are sdc and sdc1. Both were rebuilt with Ubuntu 18.04.

I can start the controller and the compute node sdc, and I can also successfully submit jobs with srun. That's great. However, when I start slurmd on the second node, sdc1, I get:

slurmd: error: Unable to register: Zero Bytes were transmitted or received

That quickly led me to my munge configuration. The munge log on the controller (sdc) shows "Invalid credential" every second. I triple-checked that munge.key is identical on both hosts. I verified that ntp is running too.

So by hand I did munge -s foobar | unmunge on SDC1 and of course that worked locally. Then I saved the munged text from SDC1 to a file on SDC and tried unmunge. That did give me the error "Invalid credential" again.
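
The manual test described above can be sketched as a few small shell helpers. This is only a sketch under assumptions: the function names `key_digest`, `keys_match` and `cross_host_check` are made up for illustration, and `cross_host_check` assumes ssh access from one node to the other.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of munge or Slurm.

# Print the SHA-256 digest of a key file (digest only, no filename).
key_digest() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed if two key files have identical contents.
keys_match() {
    [ "$(key_digest "$1")" = "$(key_digest "$2")" ]
}

# Encode a credential locally and decode it on a remote host.
# Assumes ssh access to the host given as $1, e.g.: cross_host_check sdc
cross_host_check() {
    munge -n | ssh "$1" unmunge
}
```

Comparing digests only shows that the key files match on disk. A running munged may still hold an older key in memory, which is why the cross-host decode is the more telling check.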

Because of this I uninstalled and reinstalled munge on both systems, distributed the key and repeated that test with the same result.

I guess I'm missing something simple. I don't know what else to do to properly install munge.


There are 2 best solutions below

0
On

Did you remember to restart the munge daemon after copying the munge.key to /etc/munge? I got the same error after doing the following:

1: install slurm:

$ apt install -y slurm-client

2: copy slurm.conf (perhaps create slurm-llnl beforehand):

$ cp slurm.conf /etc/slurm-llnl

3: copy the munge key to the client (munge.key copied beforehand from the slurm server/slurmctld):

$ cp munge.key /etc/munge

After that I got all the "Invalid credential" errors and problems reported here and in other reports, including the 'Zero Bytes' error on the client side:

[CLIENT]$ sinfo 
slurm_load_partitions: Zero Bytes were transmitted or received

with corresponding entries in the Slurm SERVER/slurmctld logs like

[SERVER]$ tail /var/log/munge/munged.log 
2022-12-30 22:57:23 +0100 Notice:    Running on .. 
2022-12-30 23:01:11 +0100 Info:      Invalid credential ...

and

[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log 
[2022-12-30T23:01:11.440] error: Munge decode failed: Invalid credential
[2022-12-30T23:01:11.440] ENCODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] DECODED: Thu Jan 01 01:00:00 1970
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: REQUEST_PARTITION_INFO has authentication error: Invalid authentication credential
[2022-12-30T23:01:11.440] error: slurm_unpack_received_msg: Protocol authentication error

All of this is fixed by rebooting the client, as suggested by others here, or, slightly less intrusively, by just restarting the client munge daemon:

[CLIENT]$ sudo systemctl restart munge.service

After that, munge on the client / unmunge on the server works, and it also fixes my main problem: getting the client to see the slurm server without the dreaded 'Zero Bytes' error

[CLIENT]$ sinfo 
slurm_load_partitions: Zero Bytes were transmitted or received

with server log entries

[SERVER]$ tail /var/log/slurm-llnl/slurmctld.log 
...
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Invalid Protocol Version 9472 from uid=-1 at XX.XX.XX.XX:44150
[2022-12-30T23:17:14.017] error: slurm_unpack_received_msg: Incompatible versions of client and server code
[2022-12-30T23:17:14.027] error: slurm_receive_msg [XX.XX.XX.XX:44150]: Unspecified error

And, after munge restart, voilà:

[CLIENT]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
LocalQ*      up   infinite      1   idle XXX

For these examples: the SERVER runs Ubuntu 20.04, the CLIENTS run Ubuntu 20.04 (and 22.04, whose slurm version seems to be incompatible with the SERVER's, according to the log).
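
To rule out the "Incompatible versions of client and server code" case before blaming munge, the release series on both machines can be compared. A minimal sketch, assuming the version strings look like the output of `sinfo --version` (for example `slurm 19.05.5`); the helper names `release_series` and `same_series` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing Slurm release series.

# Extract "major.minor" from a version string such as "slurm 19.05.5".
release_series() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed if both version strings belong to the same major.minor series.
same_series() {
    [ "$(release_series "$1")" = "$(release_series "$2")" ]
}

# Example (run sinfo --version on each machine and paste the results):
# same_series "slurm 19.05.5" "slurm 21.08.5" || echo "different release series"
```

A differing series is a hint rather than proof of incompatibility, since Slurm tolerates a limited amount of version skew, but it is a quick check when the server log complains about the protocol version.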

4
On

It was a UID/GID mismatch between nodes. Of course, it's mentioned in the installation guide.
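
A quick way to spot such a mismatch is to compare the passwd entries of the service accounts on each node. This is a sketch only: the answer does not say which account was affected, so checking the `munge` user (and similarly `slurm`) is an assumption, and the helper names `uid_gid` and `ids_match` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for comparing account IDs between nodes.

# Extract "uid:gid" from a passwd(5) entry, e.g. the output of `getent passwd munge`.
uid_gid() {
    printf '%s\n' "$1" | cut -d ':' -f 3,4
}

# Succeed if two passwd entries carry the same UID and GID.
ids_match() {
    [ "$(uid_gid "$1")" = "$(uid_gid "$2")" ]
}

# Example (assumes ssh access to the other node, here sdc1):
# ids_match "$(getent passwd munge)" "$(ssh sdc1 getent passwd munge)" \
#     || echo "munge UID/GID differs between nodes"
```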