SLURM: Master said to be UP and DOWN at same time

Question

150 Views Asked by paul runner At 28 July 2025 at 05:23

I am setting up a small cluster of 1 master node and 6 compute nodes for academic research purposes. I currently have the master and one compute node up, and I am trying to get those two working first. When I run sinfo on the master node, I get:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 5 down* comp[02-06]
debug* up infinite 1 idle comp01

When I run scontrol ping on the compute node, I get:

Slurmctld(primary) at grid is UP

However, when I run the same command on the master, I get:

Slurmctld(primary) at grid is DOWN

I am able to successfully run "srun hostname" on the compute node, but get this error in my logs when I run it on the master:

[2023-07-17T13:12:30.715] error: _getnameinfo: getnameinfo() failed: Name or service not known
[2023-07-17T13:12:30.715] error: auth_p_get_host: Lookup failed for 193.10.1.171
[2023-07-17T13:12:30.716] sched: _slurm_rpc_allocate_resources JobId=3 NodeList=comp01 usec=20150
[2023-07-17T13:12:30.785] _job_complete: JobId=3 WEXITSTATUS 0
[2023-07-17T13:12:30.785] _job_complete: JobId=3 done
[2023-07-17T13:12:40.172] error: _getnameinfo: getnameinfo() failed: Name or service not known
[2023-07-17T13:12:40.172] error: auth_p_get_host: Lookup failed for 10.125.16.198
[2023-07-17T13:12:40.173] sched: _slurm_rpc_allocate_resources JobId=4 NodeList=comp01 usec=19035
[2023-07-17T13:16:39.219] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:16:39.443] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:17:11.002] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:17:11.004] _job_complete: JobId=4 WTERMSIG 126
[2023-07-17T13:17:11.004] _job_complete: JobId=4 cancelled by interactive user
[2023-07-17T13:17:11.004] _job_complete: JobId=4 done

Any help would be appreciated as my deadline to finish this project is fast approaching.

Here are the relevant lines of my config file (I redacted unrelated IPs with ____):

ClusterName=cluster1
SlurmctldHost=grid
SlurmctldAddr=193.10.1.92


NodeName=comp01 NodeAddr=193.10.1.171 CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp02 NodeAddr=_________ CPUs=40 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
NodeName=comp03 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
NodeName=comp04 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp05 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp06 NodeAddr=_________ CPUs=40 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
#define partitions
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Earlier, both the master and comp01 showed the master as UP, but comp01 could not run srun hostname; I was getting the errors described here. I put the IPs of the master and comp01 in each other's /etc/hosts files, as this post suggested. Now comp01 can run srun hostname, but I have the problem described above.

**damienfrancois** · Answer 1

The error message lists an IP, 10.125.16.198, that is not referenced in the portion of the configuration file that you shared. You should look that up.
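One way to look the address up is to repeat the reverse lookup that fails in the log. The sketch below is a minimal Python illustration, not part of Slurm: `reverse_lookup` is a hypothetical helper name, and it assumes that Python's `socket.getnameinfo` goes through the same resolver (/etc/hosts, DNS) that slurmctld uses on that machine.

```python
import socket


def reverse_lookup(ip):
    """Return the hostname the local resolver maps `ip` to, or None.

    NI_NAMEREQD makes the call fail instead of echoing the numeric
    address back when no name is known for it.
    """
    try:
        host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
        return host
    except socket.gaierror:
        # No name found for this address in the resolver's sources.
        return None
```

Calling `reverse_lookup("193.10.1.171")` and `reverse_lookup("10.125.16.198")` on the master would show which of the two addresses from the log has no name there; a `None` result corresponds to the "Name or service not known" lines.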

Also make sure the configuration file is identical on all nodes if you do not use the configless feature. The same command giving different results on different nodes can be a symptom of different configuration files.
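A quick way to check whether the files match is to compare checksums. The following is a rough sketch under stated assumptions: the copies of slurm.conf from each node have already been fetched to local paths (for example with scp), and `config_digest` and `group_by_digest` are hypothetical helper names, not Slurm tools.

```python
import hashlib
from collections import defaultdict


def config_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def group_by_digest(paths_by_node):
    """Group node names by the digest of their copy of the config file.

    `paths_by_node` maps a node name (e.g. "grid", "comp01") to the
    local path of the copy fetched from that node (an assumption of
    this sketch). More than one group means the files differ.
    """
    groups = defaultdict(list)
    for node, path in paths_by_node.items():
        groups[config_digest(path)].append(node)
    return dict(groups)
```

If `group_by_digest` returns a single group, every copy is byte-for-byte identical; if it returns several, the node lists show which machines disagree. Note that this compares raw bytes, so even a whitespace difference counts as a mismatch.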
