Accessing multiple GPUs on different hosts using LSF


I am using an HPC cluster with the LSF resource manager. The task is to train a TensorFlow model. The graphics queue gq (max jobs 96) has two hosts: host1 with 4 GPUs and host2 with 3 GPUs. I want to use all 7 GPUs together.

Below are the bsub commands I tried and the result of each:

1. bsub -q gq -n 96 -gpu "num=7:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

PENDING REASONS: There are no suitable hosts for the job

2. bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

PENDING REASONS: Not enough hosts to meet the job's spanning requirement;

3. bsub -q gq -n 96 -gpu "num=3:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

Running. 96 tasks started on 2 hosts, but only 3 GPUs from host1 are used.

4. bsub -q gq -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"

Running. 1 task started on host1. All 4 GPUs from host1 are used.
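
To check what LSF actually allocated in the running cases, a diagnostic along these lines could be added at the top of file.sh. It is only a sketch: it assumes LSF exports LSB_MCPU_HOSTS (a space-separated list of host and slot-count pairs) and sets CUDA_VISIBLE_DEVICES for jobs submitted with -gpu. The function name print_allocation is made up for illustration.

```shell
#!/bin/sh
# print_allocation (hypothetical helper): list each allocated host with its
# slot count, parsed from LSB_MCPU_HOSTS.
# Assumed format: "host1 48 host2 48" (illustrative values only).
print_allocation() {
    # Split the variable into positional parameters: host, slots, host, slots, ...
    set -- ${LSB_MCPU_HOSTS:-}
    while [ "$#" -ge 2 ]; do
        echo "host=$1 slots=$2"
        shift 2
    done
}

print_allocation

# Show which GPU indices are visible to this process on the current host.
echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-unset}"
```

This only reports the GPUs visible on the host where the script itself runs, so it would not show what the tasks on the second host see.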

What should I do to use all 7 GPUs and 96 tasks across both hosts?
