Why Does SageMaker Data Parallel Distributed Training Only Support 3 Instance Types?


I see here that the SageMaker distributed data parallel library only supports 3 instance types: ml.p3.16xlarge, ml.p3dn.24xlarge, and ml.p4d.24xlarge.

Why is this? I would have thought there would be use cases for parallel training on other GPUs, and potentially even on CPUs.


1 Answer

Arun Lokanatha

SageMaker's distributed data parallel library is designed to work with GPUs only, and it uses the NVIDIA Collective Communications Library (NCCL) for its all-reduce step. It performs best on instances with many GPUs and high network bandwidth, which I believe is why only a few instance types are supported.
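For intuition, the all-reduce that NCCL performs can be sketched in plain Python: after the operation, every worker holds the element-wise sum of all workers' gradient vectors. This is only an illustrative simulation (the function name and data are made up for the example); real NCCL all-reduce uses bandwidth-optimal ring/tree algorithms over NVLink and the network, which is why GPU count and network bandwidth matter so much.

```python
def all_reduce_sum(worker_grads):
    """Simulate an all-reduce (sum) across workers.

    Each inner list represents one worker's local gradient vector.
    Returns what every worker holds afterwards: identical copies of
    the element-wise sum. Illustrative only; NCCL does this on-device.
    """
    n = len(worker_grads[0])
    total = [0.0] * n
    for grads in worker_grads:
        for i, g in enumerate(grads):
            total[i] += g
    # After all-reduce, every worker has the same reduced vector.
    return [list(total) for _ in worker_grads]

# Example: 3 workers, each with a 2-element local gradient
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(workers)
print(reduced)  # every worker holds [9.0, 12.0]
```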