I am trying to figure out why multi-GPU training with TensorFlow's MirroredStrategy is not scaling for a 20-block x 128-filter ResNet. On a single GPU, utilization stays at 100% with no gaps, so the input pipeline seems fast enough. With 2 GPUs, however, the epoch time does not drop at all, even though I doubled the batch size. Because of the large input size (128x256x74), the maximum batch size is 8 on one GPU and 16 on two GPUs.

I have attached the TensorFlow Profiler result below, but I am not sure how to interpret it to find the bottleneck. It looks as if GPUs 0 and 1 are working sequentially and the NCCL communication time is rather large, right? I want to understand what is causing the scaling issue: the input pipeline or the interconnect between the GPUs? The data is read from RAM, so I doubt the input pipeline is the cause.
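For context, the training is set up roughly like this (simplified sketch; build_resnet, x_train and y_train stand in for my actual model builder and in-memory arrays, and the optimizer, loss and profiling window are just placeholders):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()         # uses all visible GPUs (2 in this run)
print("replicas:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 8 * strategy.num_replicas_in_sync    # 8 per GPU -> 16 on two GPUs

# the arrays already live in RAM
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(256)
           .batch(GLOBAL_BATCH)
           .prefetch(tf.data.AUTOTUNE))

with strategy.scope():
    model = build_resnet(blocks=20, filters=128)    # placeholder for my ResNet builder
    model.compile(optimizer="adam", loss="mse")

# TensorBoard callback with profiling enabled for a few steps
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(50, 60))
model.fit(dataset, epochs=10, callbacks=[tb_cb])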
GPUs 0 and 1 (the ones used in the 2-GPU run) are connected through a single PCIe bridge (PIX in the topology below); a sketch of how the strategy can be pinned to that pair follows the nvidia-smi output.
> nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity
GPU0 X PIX PHB PHB SYS SYS SYS SYS SYS 0-9 0
GPU1 PIX X PHB PHB SYS SYS SYS SYS SYS 0-9 0
GPU2 PHB PHB X PIX SYS SYS SYS SYS SYS 0-9 0
GPU3 PHB PHB PIX X SYS SYS SYS SYS SYS 0-9 0
GPU4 SYS SYS SYS SYS X PIX PHB PHB PHB 10-19 1
GPU5 SYS SYS SYS SYS PIX X PHB PHB PHB 10-19 1
GPU6 SYS SYS SYS SYS PHB PHB X PIX PHB 10-19 1
GPU7 SYS SYS SYS SYS PHB PHB PIX X PHB 10-19 1
NIC0 SYS SYS SYS SYS PHB PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx4_0
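
A minimal sketch of how the strategy can be pinned to the PIX-connected pair, plus an alternative cross-device op that could be swapped in to compare step times against NCCL (this is just an idea on my side for isolating the all-reduce cost, not something I have verified):

import tensorflow as tf

# default cross-device op on GPUs: NCCL all-reduce between the two PIX-connected devices
strategy_nccl = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.NcclAllReduce())

# alternative all-reduce implementation to compare against NCCL
strategy_hcopy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())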
