Why does PyTorch DDP init time out on SageMaker?

I'm using PyTorch DDP on the SageMaker PyTorch Training DLC 1.8.1. The code seems properly DDP-formatted. I'm using instance_count = 2 and launching the script via torch.distributed.launch, and I believe the ranks and world size are properly set. However, dist.init_process_group waits and then times out:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)

What could be going wrong? Could the machines not be networked together?
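
For context, the per-process initialization boils down to the standard pattern below (a simplified sketch, not my exact code; torch.distributed.launch is expected to have set the RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT environment variables):

    import os
    import torch.distributed as dist

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # This is the call that hangs and eventually raises the timeout above.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)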

2 Answers

Arun Lokanatha

This is usually caused by the way local_rank is retrieved and used during initialization. Please refer to the example below and see if you can spot the difference:

https://github.com/aruncs2005/pytorch-ddp-sagemaker-example
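
For instance, with PyTorch 1.8.x the launcher passes --local_rank as a command-line argument by default (the LOCAL_RANK environment variable is only set when you launch with --use_env), so a minimal retrieval sketch looks like this:

    import argparse
    import os

    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    # torch.distributed.launch injects --local_rank unless --use_env is used.
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)   # bind this process to its GPU
    dist.init_process_group(backend="nccl")  # rank/world size come from env vars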

juvchan

torch.distributed.launch is a helper utility in the torch.distributed package that launches multiple processes per node for distributed training. It tells all workers the IP address of rank 0, which is set via MASTER_ADDR.

Each rank needs to be able to reach MASTER_ADDR on the port MASTER_PORT. If those are set but the workers cannot actually reach MASTER_ADDR, that is a likely root cause of the hang and timeout of the job.
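
A quick way to verify reachability from each node is a plain TCP connection to the rendezvous endpoint (a small sketch using the same environment variables the launcher sets):

    import os
    import socket

    addr = os.environ["MASTER_ADDR"]
    port = int(os.environ["MASTER_PORT"])

    # If this raises or hangs on the non-rank-0 node, the nodes cannot
    # reach each other and init_process_group will time out the same way.
    with socket.create_connection((addr, port), timeout=10):
        print(f"reached {addr}:{port} from {socket.gethostname()}")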

Besides that, the launcher will also wait until all of the nodes specified by --nnodes have reported in.
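
On SageMaker, the node rank and master address are typically derived from the SM_HOSTS and SM_CURRENT_HOST environment variables in the entry point. A hypothetical launch sketch (train.py stands in for your training script):

    import json
    import os
    import subprocess
    import sys

    hosts = json.loads(os.environ["SM_HOSTS"])   # e.g. ["algo-1", "algo-2"]
    current = os.environ["SM_CURRENT_HOST"]

    subprocess.check_call([
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",                # launcher waits for all nodes
        f"--node_rank={hosts.index(current)}",
        f"--nproc_per_node={os.environ.get('SM_NUM_GPUS', '1')}",
        f"--master_addr={hosts[0]}",             # rank 0 runs on the first host
        "--master_port=29500",
        "train.py",                              # hypothetical entry script
    ])

If only one node ever connects to the store, the error surfaces exactly as in your traceback: world_size=2 but worker_count=1.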