I am setting up a small cluster of 1 master node and 6 compute nodes for academic research purposes. Currently only the master and one compute node (comp01) are up, and I am trying to get those two working first. When I run sinfo on the master node I get:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 5 down* comp[02-06]
debug* up infinite 1 idle comp01
When I run scontrol ping on the compute node I get:
Slurmctld(primary) at grid is UP
However, when I run the same command on the master, I get:
Slurmctld(primary) at grid is DOWN
I am able to successfully run "srun hostname" on the compute node, but get this error in my logs when I run it on the master:
[2023-07-17T13:12:30.715] error: _getnameinfo: getnameinfo() failed: Name or service not known
[2023-07-17T13:12:30.715] error: auth_p_get_host: Lookup failed for 193.10.1.171
[2023-07-17T13:12:30.716] sched: _slurm_rpc_allocate_resources JobId=3 NodeList=comp01 usec=20150
[2023-07-17T13:12:30.785] _job_complete: JobId=3 WEXITSTATUS 0
[2023-07-17T13:12:30.785] _job_complete: JobId=3 done
[2023-07-17T13:12:40.172] error: _getnameinfo: getnameinfo() failed: Name or service not known
[2023-07-17T13:12:40.172] error: auth_p_get_host: Lookup failed for 10.125.16.198
[2023-07-17T13:12:40.173] sched: _slurm_rpc_allocate_resources JobId=4 NodeList=comp01 usec=19035
[2023-07-17T13:16:39.219] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:16:39.443] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:17:11.002] job_step_signal: JobId=4 StepId=0 not found
[2023-07-17T13:17:11.004] _job_complete: JobId=4 WTERMSIG 126
[2023-07-17T13:17:11.004] _job_complete: JobId=4 cancelled by interactive user
[2023-07-17T13:17:11.004] _job_complete: JobId=4 done
Any help would be appreciated as my deadline to finish this project is fast approaching.
Here are the relevant lines of my config file (I redacted unrelated IPs with ____):
ClusterName=cluster1
SlurmctldHost=grid
SlurmctldAddr=193.10.1.92
NodeName=comp01 NodeAddr=193.10.1.171 CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp02 NodeAddr=_________ CPUs=40 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
NodeName=comp03 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
NodeName=comp04 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp05 NodeAddr=_________ CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=15380 State=UNKNOWN
NodeName=comp06 NodeAddr=_________ CPUs=40 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=31506 State=UNKNOWN
#define partitions
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
Earlier, both the master and comp01 would show the master as UP, but comp01 could not run srun hostname. I was getting the errors described here. I put the IPs of both the master and comp01 in each other's /etc/hosts file, as this post suggested. Now comp01 can run srun hostname, but I have the problem described above.
The error message lists an IP, 10.125.16.198, which is not referenced in the portion of the configuration file that you shared. You should look that up.
Also make sure the configuration file is identical on all nodes if you do not use the configless feature. The same command giving different results on different nodes can be a symptom of different configuration files.
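Both checks can be scripted. The following is only a minimal sketch, not part of Slurm: the helper names `reverse_lookup` and `config_digest` are hypothetical, and the path `/etc/slurm/slurm.conf` mentioned afterwards is an assumption about where the file lives on your nodes. The first helper performs the same kind of reverse lookup that fails in the log (`getnameinfo`), and the second produces a checksum that can be compared between nodes.

```python
import hashlib
import socket
from typing import Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the host name that `ip` resolves to, or None if the lookup fails."""
    try:
        # NI_NAMEREQD makes the call raise an error when no name is found,
        # instead of silently returning the numeric address.
        host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
        return host
    except socket.gaierror:
        return None


def config_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, for comparison across nodes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()
```

Running `reverse_lookup("10.125.16.198")` and `reverse_lookup("193.10.1.171")` on the master shows whether each address maps back to a name there. Running `config_digest("/etc/slurm/slurm.conf")` on the master and on comp01 (adjust the path to your installation) should print the same value if the two files are identical.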