Error installing Slurm: slurmd could not be started


I am trying to install Slurm on a small two-PC system, but I get the following error when starting slurmd:

Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.

The output of systemctl status slurmd.service and journalctl -xe is as follows:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2020-12-04 13:18:51 CST; 4min 50s ago
     Docs: man:slurmd(8)
  Process: 26501 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Starting Slurm node daemon...
12月 04 13:18:51 Y-Cluster-Node1 slurmd[26501]: fatal: Unable to determine this slurmd's NodeName
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Control process exited, code=exited status=1
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Failed with result 'exit-code'.
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Failed to start Slurm node daemon.

12月 04 13:21:05 Y-Cluster-Node1 sshd[26624]: Disconnected from authenticating user root 150.158.213.234 port 54962 [preauth]
12月 04 13:21:23 Y-Cluster-Node1 sshd[26632]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=115.68.207.186  user=root
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Failed password for root from 115.68.207.186 port 58882 ssh2
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Received disconnect from 115.68.207.186 port 58882:11: Bye Bye [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Disconnected from authenticating user root 115.68.207.186 port 58882 [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26630]: Connection closed by 212.64.12.236 port 46106 [preauth]
12月 04 13:22:13 Y-Cluster-Node1 sshd[26635]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=118.25.24.84  user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26637]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=111.125.70.22  user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26635]: Failed password for root from 118.25.24.84 port 47018 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Received disconnect from 118.25.24.84 port 47018:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Disconnected from authenticating user root 118.25.24.84 port 47018 [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Failed password for root from 111.125.70.22 port 58216 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Received disconnect from 111.125.70.22 port 58216:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Disconnected from authenticating user root 111.125.70.22 port 58216 [preauth]
12月 04 13:22:16 Y-Cluster-Node1 sshd[26639]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=72.167.227.34  user=root
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Failed password for root from 72.167.227.34 port 56304 ssh2
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Received disconnect from 72.167.227.34 port 56304:11: Bye Bye [preauth]
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Disconnected from authenticating user root 72.167.227.34 port 56304 [preauth]
12月 04 13:22:32 Y-Cluster-Node1 sshd[26641]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=182.138.239.224  user=root
12月 04 13:22:34 Y-Cluster-Node1 sshd[26641]: Failed password for root from 182.138.239.224 port 48870 ssh2
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Received disconnect from 182.138.239.224 port 48870:11: Bye Bye [preauth]
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Disconnected from authenticating user root 182.138.239.224 port 48870 [preauth]
12月 04 13:22:56 Y-Cluster-Node1 sshd[26648]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=81.68.123.185  user=root
12月 04 13:22:58 Y-Cluster-Node1 sshd[26648]: Failed password for root from 81.68.123.185 port 60848 ssh2
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Received disconnect from 81.68.123.185 port 60848:11: Bye Bye [preauth]
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Disconnected from authenticating user root 81.68.123.185 port 60848 [preauth]
12月 04 13:23:02 Y-Cluster-Node1 sshd[26652]: Connection closed by 139.217.221.89 port 35808 [preauth]
12月 04 13:23:13 Y-Cluster-Node1 sshd[26654]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=159.65.1.41  user=root
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Failed password for root from 159.65.1.41 port 40538 ssh2
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Received disconnect from 159.65.1.41 port 40538:11: Bye Bye [preauth]
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Disconnected from authenticating user root 159.65.1.41 port 40538 [preauth]
12月 04 13:23:43 Y-Cluster-Node1 sshd[26656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=222.222.31.70  user=root
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Failed password for root from 222.222.31.70 port 35282 ssh2
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Received disconnect from 222.222.31.70 port 35282:11: Bye Bye [preauth]
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Disconnected from authenticating user root 222.222.31.70 port 35282 [preauth]
12月 04 13:24:02 Y-Cluster-Node1 sshd[26660]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=150.158.213.234  user=root
12月 04 13:24:04 Y-Cluster-Node1 sshd[26660]: Failed password for root from 150.158.213.234 port 36350 ssh2
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Received disconnect from 150.158.213.234 port 36350:11: Bye Bye [preauth]
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Disconnected from authenticating user root 150.158.213.234 port 36350 [preauth]

I tried to understand the problem. It looks like a connection issue in which the control node (node1) cannot reach the compute node (node2).
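
Apart from the connection hypothesis, the fatal line in the log ("Unable to determine this slurmd's NodeName") can be checked directly. As far as the Slurm documentation describes it, slurmd looks for the machine's short host name among the NodeName (or NodeHostname) entries in slurm.conf. The sketch below does only a literal comparison. The function name node_listed and the default config path are assumptions for illustration, and hostlist ranges such as Node[1-2] are not expanded.

```shell
#!/bin/sh
# Succeed if a NodeName= line in the given config names the host literally.
# $1 = host name, $2 = path to slurm.conf
# Limitation: hostlist ranges like Node[1-2] are NOT expanded.
node_listed() {
    awk -v host="$1" '
        tolower($0) ~ /^[ \t]*nodename=/ {
            sub(/^[ \t]*[^=]*=/, "")      # drop the "NodeName=" key
            split($1, names, ",")          # first field is the comma-separated name list
            for (i in names) if (names[i] == host) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$2"
}

# The path is an assumption (Debian/Ubuntu slurm-llnl layout); adjust as needed.
conf=${SLURM_CONF:-/etc/slurm-llnl/slurm.conf}
if [ -r "$conf" ]; then
    if node_listed "$(hostname -s)" "$conf"; then
        echo "$(hostname -s) is listed as a NodeName in $conf"
    else
        echo "$(hostname -s) is NOT listed literally as a NodeName in $conf"
    fi
fi
```

Running this on the node where slurmd fails shows whether its short host name appears in slurm.conf. Comparing against the output of "slurmd -C", which prints a node line for the local machine, may also help.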

I searched around, and some posts mention it could be due to a mismatch of UIDs and GIDs. As the installation guide says, "Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster." I did not find any UID/GID issues myself. Is there a way to check this? Could anyone give me a hand?
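
One way to check the UID/GID question is to print the numeric IDs of the service accounts on each node and compare them by eye. This is a minimal sketch: the helper name uid_gid_of is made up for illustration, and the account names slurm and munge are assumed package defaults that may differ on a given install.

```shell
#!/bin/sh
# Print "name uid gid" for one account, or "name missing" if it does not exist.
uid_gid_of() {
    if id "$1" >/dev/null 2>&1; then
        printf '%s %s %s\n' "$1" "$(id -u "$1")" "$(id -g "$1")"
    else
        printf '%s missing\n' "$1"
    fi
}

# Account names are assumptions; replace with the SlurmUser from slurm.conf if different.
for account in slurm munge; do
    uid_gid_of "$account"
done
```

Running the same script on Y-Cluster-Node1 and Y-Cluster-Node2 should give identical lines if the accounts are synchronized. A differing number or a "missing" line would point to the mismatch the guide warns about.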

Some additional information: running "munge -n | unmunge" gives the following on both nodes:

y-cluster@Y-Cluster-Node1:~$ munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      Y-Cluster-Node1 (192.168.1.111)
ENCODE_TIME:      2020-12-04 15:00:18 +0800 (1607065218)
DECODE_TIME:      2020-12-04 15:00:18 +0800 (1607065218)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              y-cluster (1000)
GID:              y-cluster (1000)
LENGTH:           0
y-cluster@Y-Cluster-Node2:~/.ssh$ munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      Y-Cluster-Node2 (192.168.1.112)
ENCODE_TIME:      2020-12-04 15:00:20 +0800 (1607065220)
DECODE_TIME:      2020-12-04 15:00:20 +0800 (1607065220)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              y-cluster (1000)
GID:              y-cluster (1000)
LENGTH:           0

Both look fine, with matching UID, GID, and time. Running "slurmctld -Dcvvv" gives the following output. I wonder whether it has to do with the ownership of some log files?

y-cluster@Y-Cluster-Node1:~$ slurmctld -Dcvvv
slurmctld: debug:  Log file re-opened
slurmctld: killing old slurmctld[4787]
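
The ownership question can be checked by listing who owns the files and directories slurmctld writes to. The sketch below assumes GNU stat (as on Ubuntu). The helper name owner_of is made up for illustration, and the three paths are hypothetical examples. The real locations are whatever slurm.conf sets for SlurmctldLogFile, SlurmctldPidFile, and StateSaveLocation, and they need to be writable by the configured SlurmUser.

```shell
#!/bin/sh
# Print "uid:gid owner:group path" for a file or directory (GNU stat).
owner_of() {
    stat -c '%u:%g %U:%G %n' "$1"
}

# Hypothetical paths for illustration; take the real ones from slurm.conf.
for path in /var/log/slurm-llnl/slurmctld.log \
            /var/run/slurm-llnl/slurmctld.pid \
            /var/lib/slurm-llnl/slurmctld; do
    if [ -e "$path" ]; then
        owner_of "$path"
    else
        echo "absent: $path"
    fi
done
```

If the owner shown differs from SlurmUser, that would be consistent with an ownership problem, though it does not by itself explain the slurmd failure above.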
