Communication using PyTorch distributed package extremely slow

I am using the PyTorch distributed package to write a simple program that lets multiple GPUs on a single server communicate with each other. Here is the code:

import time

import torch
import torch.multiprocessing as mp

def run(rank, world_size):
    # torch.cuda.set_device(rank)

    device = f'cuda:{rank}'
    # Communicator is my own wrapper class, described below
    comm = Communicator(rank, world_size, '127.0.0.1', 'nccl', device=device)
    # Each rank sends its own rank number to every rank
    nums = [rank for _ in range(world_size)]

    start_time = time.time()
    result = comm.exchange_single_number(nums)
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Process on GPU {rank}, numbers after exchange: {result}")
    print(f"Time taken for exchange_single_number: {elapsed_time} seconds")

if __name__ == "__main__":
    # Spawn one process per visible GPU
    num_gpus = torch.cuda.device_count()
    mp.spawn(run, args=(num_gpus,), nprocs=num_gpus)

The __init__ method of the Communicator class mainly calls torch.distributed.init_process_group(), which initializes the PyTorch distributed environment.
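Stripped down to its essentials, it looks roughly like this (a simplified sketch; the port number here is illustrative):

import torch.distributed as dist

class Communicator:
    def __init__(self, rank, world_size, master_addr, backend, device, port=29500):
        self.rank = rank
        self.world_size = world_size
        self.device = device
        # Set up the default process group over TCP (port is illustrative)
        dist.init_process_group(
            backend=backend,
            init_method=f'tcp://{master_addr}:{port}',
            rank=rank,
            world_size=world_size,
        )

Below is the exchange_single_number() method: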

def exchange_single_number(self, send_nums: List) -> List:
    """
    Send a single number to every other rank.

    send_nums: A list of the numbers to send to each rank. Its length must match the world size.

    --Returns--
    A list of the numbers received from the other ranks
    """
    # One single-element tensor per destination rank
    send_tensors = [torch.tensor([send_nums[rank]], dtype=torch.int64, device=self.device) for rank in range(self.world_size)]
    recv_tensors = [torch.empty(1, dtype=torch.int64, device=self.device) for _ in range(self.world_size)]
    dist.all_to_all(recv_tensors, send_tensors)
    # .item() copies each result back to the CPU, which synchronizes with the GPU
    return [x.item() for x in recv_tensors]

which uses the torch.distributed.all_to_all() collective to exchange data among the GPUs.
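To be explicit about the semantics: rank i's send_tensors[j] ends up in rank j's recv_tensors[i], so with nums = [rank] * world_size every rank should receive [0, 1, ..., world_size - 1]. For example, with two GPUs:

# rank 0 sends tensor([0]) to ranks 0 and 1
# rank 1 sends tensor([1]) to ranks 0 and 1
# -> on both ranks, recv_tensors = [tensor([0]), tensor([1])], i.e. [0, 1]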

However, on server A with two NVIDIA RTX A6000 GPUs, the output of this program is:

Process on GPU 0, numbers after exchange: [0, 1]
Time taken for exchange_single_number: 3247.1511476039886 seconds
Process on GPU 1, numbers after exchange: [0, 1]
Time taken for exchange_single_number: 3247.144986629486 seconds

The received data shows that the program runs correctly, yet it takes nearly an hour (about 3247 seconds) to complete the communication.

I ran the same program on another server B with four NVIDIA GeForce RTX 2080 Ti GPUs, and the output is:

Process on GPU 0, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.9615700244903564 seconds
Process on GPU 3, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.955087661743164 seconds
Process on GPU 1, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.9523494243621826 seconds
Process on GPU 2, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.95304536819458 seconds

This time cost is much more reasonable.

P.S. The commented-out call to torch.cuda.set_device() in the code above also takes an unreasonably long time to finish on Server A.
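That call can be timed on its own, outside the distributed setup, with a trivial snippet like this:

import time
import torch

start = time.time()
torch.cuda.set_device(0)  # on Server A, even this single call is very slow
print(f"torch.cuda.set_device took {time.time() - start:.1f} seconds")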

What is the possible cause of this phenomenon?
