We have recently started to work with SLURM. We are operating a cluster with a number of nodes that have 4 GPUs each, plus some nodes with only CPUs. We would like GPU jobs to start with higher priority. To achieve this, we have defined two partitions with overlapping node lists: the partition with GPUs, called 'batch', has a higher 'PriorityTier' value, and the partition without GPUs is called 'cpubatch'.
The main reason for this construction is that we want to use the idle CPUs on the GPU nodes whenever they are not needed for GPU jobs.
Now we encounter the problem that jobs in the 'cpubatch' partition do not start on nodes where jobs of the 'batch' partition are already running, even if sufficiently many CPUs are idle on these nodes.
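For illustration, jobs are submitted roughly along these lines (the script names and resource requests below are only placeholders, not the actual jobs shown further down):
# GPU job, submitted to the higher-priority 'batch' partition
sbatch --partition=batch --gres=gpu:4 --ntasks=4 gpu_job.sh
# CPU-only job, submitted to the default 'cpubatch' partition
sbatch --partition=cpubatch --ntasks=1 --cpus-per-task=4 cpu_job.sh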
Here is our slurm.conf file:
ControlMachine=qbig
AuthType=auth/none
CryptoType=crypto/openssl
JobCredentialPrivateKey=/qbigwork/slurm/etc/slurm.key
JobCredentialPublicCertificate=/qbigwork/slurm/etc/slurm.cert
MailProg=/qbigwork/slurm/etc/mailwrapper.sh
MpiDefault=none
MpiParams=ports=12000-12999
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm.state/
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
#
# SCHEDULING
DefMemPerNode=1024
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
#
#
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityFavorSmall=NO
PriorityWeightJobSize=1000
PriorityWeightQOS=0
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=3
#
#
# GRES
GresTypes=gpu,bandwidth
#
# COMPUTE NODES
NodeName=lnode[01-12] CPUs=8 RealMemory=64525 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Gres=gpu:4
NodeName=lcpunode01 CPUs=32 RealMemory=129174 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=qbig CPUs=4 RealMemory=40000 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1
#
PartitionName=batch Maxtime=48:00:00 Nodes=lnode[01-12] PriorityTier=50000 DefaultTime=30 OverSubscribe=NO MaxNodes=12 State=UP
PartitionName=cpubatch Maxtime=48:00:00 Nodes=lnode[01-12],qbig,lcpunode01 Default=YES PriorityTier=5000 DefaultTime=30 OverSubscribe=NO MaxNodes=14 State=UP
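(In case it helps, the effective partition settings can of course be double-checked with, e.g.:
scontrol show partition batch
scontrol show partition cpubatch
but we omit that output here.)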
and here is the gres.conf file:
NodeName=lnode[01-12] Name=gpu File=/dev/nvidia[0-3]
NodeName=DEFAULT Name=bandwidth Type=lustre Count=4M
We are running a freshly compiled Slurm 17.02.7. 'squeue' shows, for instance:
$> squeue
[...]
1030 batch sWC_A2p1 user1 PD 0:00 1 (Priority)
1029 batch sWC_A2p1 user1 PD 0:00 1 (Resources)
951 cpubatch 002_E_11 user2 PD 0:00 1 (Resources)
1062 batch sWC_A2p1 user1 PD 0:00 1 (Priority)
[...]
but on lnode[02-12], for example, there are resources available:
$scontrol show node lnode02
NodeName=lnode02 Arch=x86_64 CoresPerSocket=4
CPUAlloc=4 CPUErr=0 CPUTot=8 CPULoad=4.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:4
NodeAddr=lnode02 NodeHostName=lnode02 Version=17.02
OS=Linux RealMemory=64525 AllocMem=0 FreeMem=38915 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch,cpubatch
BootTime=2016-09-14T11:55:35 SlurmdStartTime=2017-09-11T13:49:36
CfgTRES=cpu=8,mem=64525M
AllocTRES=cpu=4
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
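(For completeness, the allocated/idle CPU counts per node can also be checked with something like 'sinfo -N -p cpubatch -o "%N %C"', where %C reports allocated/idle/other/total CPUs; it shows the same picture of idle CPUs on these nodes.)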
and job 951 asks for only 4 CPUs:
$scontrol show job 951
JobId=951 JobName=002_E_110000
UserId=user2(1416) GroupId=theorie(149) MCS_label=N/A
Priority=50 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2017-09-11T11:41:01 EligibleTime=2017-09-11T11:41:01
StartTime=2017-09-12T13:31:30 EndTime=2017-09-14T13:31:30 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=cpubatch AllocNode:Sid=qbig:30138
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=lcpunode01
NumNodes=1-1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=1024,node=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/hiskp2/user2/testtoy/corr.sh
WorkDir=/hiskp2/user2/testtoy
StdErr=/hiskp2/user2/testtoy/test.%J.out
StdIn=/dev/null
StdOut=/hiskp2/user2/testtoy/test.%J.out
Power=
The GPU job 1029 (listed as pending above, but already running by the time of this 'scontrol' call) looks as follows:
$scontrol show job 1029
JobId=1029 JobName=sWC_A2p1_Mpi270_L24T96_strange_0589_1
UserId=user1(1407) GroupId=theorie(149) MCS_label=N/A
Priority=50 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:02:49 TimeLimit=07:00:00 TimeMin=N/A
SubmitTime=2017-09-11T15:37:40 EligibleTime=2017-09-11T15:37:40
StartTime=2017-09-12T12:43:34 EndTime=2017-09-12T19:43:34 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=qbig:12473
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lnode06
BatchHost=lnode06
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=25G,node=1
Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryNode=25G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:4 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/hiskp2/user1/peram_generation/0120-Mpi270-L24-T96/strange/cnfg0589/rnd_vec_01/quda.job.slurm.0589_01.cmd
WorkDir=/hiskp2/user1/peram_generation/0120-Mpi270-L24-T96/strange/cnfg0589/rnd_vec_01
StdErr=/hiskp2/user1/peram_generation/0120-Mpi270-L24-T96/strange/cnfg0589/rnd_vec_01/slurm-1029.out
StdIn=/dev/null
StdOut=/hiskp2/user1/peram_generation/0120-Mpi270-L24-T96/strange/cnfg0589/rnd_vec_01/slurm-1029.out
Power=
Any help with this setup, or suggestions for another way of giving GPU jobs higher priority, would be much appreciated. Thanks!