We are seeing a strange issue in our SGE GPU queue. We have plenty of nodes available in the GPU queue, but whenever we launch MPI parallel jobs they always go to one set of nodes only; in our case they always land on the same 4 GPU nodes, and when those get saturated the jobs remain in the "qw" state and do not progress. The remaining nodes in the queue are healthy and have identical settings.
This is our ppn4 config and job submission cmd:
qconf -sp ppn4
pe_name ppn4
slots 999999
used_slots 0
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule 4
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
daemon_forks_slaves FALSE
master_forks_slaves FALSE
mpirun -pe ppn4 16 -l gpu=4 -l <queue name> <job submission script>
Thank you CS
I suppose you have already solved the issue, but just in case: in your command, mpirun -pe ppn4 16 ..., 16 is the total number of slots that will be used across the cluster according to the selected PE. The PE's allocation rule places 4 slots on each node, so 4 nodes x 4 slots = the 16 slots you are ordering. You have to increase that slot number in order to load more nodes.
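For example, keeping your original command and only raising the slot count (the queue name and script are placeholders, as in your post), requesting 32 slots should spread the job over 8 nodes, since allocation_rule 4 assigns 4 slots per node and 32 / 4 = 8:

mpirun -pe ppn4 32 -l gpu=4 -l <queue name> <job submission script>

This assumes 8 nodes in that queue each have 4 free slots and can satisfy the gpu=4 request; otherwise the job will again wait in "qw" until enough nodes free up.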
Best, V