I am using the PyTorch distributed package to write a simple program that lets multiple GPUs on a single server communicate with each other. Here is the code:
import time

import torch
import torch.multiprocessing as mp


def run(rank, world_size):
    # torch.cuda.set_device(rank)
    device = f'cuda:{rank}'
    comm = Communicator(rank, world_size, '127.0.0.1', 'nccl', device=device)
    nums = [rank for _ in range(world_size)]

    start_time = time.time()
    result = comm.exchange_single_number(nums)
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Process on GPU {rank}, numbers after exchange: {result}")
    print(f"Time taken for exchange_single_number: {elapsed_time} seconds")


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(run, args=(num_gpus,), nprocs=num_gpus)
The __init__ method of the Communicator class mainly calls torch.distributed.init_process_group() to initialize the PyTorch distributed environment. Below is the exchange_single_number() method:
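For reference, a minimal sketch of what that Communicator initializer might look like (the master_port default and exact argument names are my assumptions; the real class may differ):

```python
import torch.distributed as dist


class Communicator:
    """Sketch of the Communicator described above (assumed shape).

    It stores the rank/world-size/device and wires up the process group
    via a TCP rendezvous on the given master address.
    """

    def __init__(self, rank, world_size, master_addr, backend,
                 device='cpu', master_port=29500):
        self.rank = rank
        self.world_size = world_size
        self.device = device
        # Every spawned process calls this with the same master address/port;
        # the call blocks until all world_size processes have joined.
        dist.init_process_group(
            backend=backend,
            init_method=f'tcp://{master_addr}:{master_port}',
            rank=rank,
            world_size=world_size,
        )
```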
def exchange_single_number(self, send_nums: List) -> List:
    """
    Send a single number to every other rank.

    @send_nums: A list with one number to send to each rank; its length
                must equal the world size.
    --Returns--
    A list of numbers received from the other ranks.
    """
    send_tensors = [torch.tensor([send_nums[rank]], dtype=torch.int64, device=self.device)
                    for rank in range(self.world_size)]
    recv_tensors = [torch.empty(1, dtype=torch.int64, device=self.device)
                    for _ in range(self.world_size)]
    dist.all_to_all(recv_tensors, send_tensors)
    return [x.item() for x in recv_tensors]
inside which I use the torch.distributed.all_to_all() function to exchange data among the GPUs.
However, on server A with two NVIDIA RTX A6000 GPUs, the output of this program is:
Process on GPU 0, numbers after exchange: [0, 1]
Time taken for exchange_single_number: 3247.1511476039886 seconds
Process on GPU 1, numbers after exchange: [0, 1]
Time taken for exchange_single_number: 3247.144986629486 seconds
The received data shows the program runs correctly, yet the communication takes nearly an hour to complete.
I ran the same program on another server B with four NVIDIA GeForce RTX 2080 Ti GPUs, and the output is:
Process on GPU 0, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.9615700244903564 seconds
Process on GPU 3, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.955087661743164 seconds
Process on GPU 1, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.9523494243621826 seconds
Process on GPU 2, numbers after exchange: [0, 1, 2, 3]
Time taken for exchange_single_number: 3.95304536819458 seconds
This time cost is much more reasonable.
P.S. The commented-out call to torch.cuda.set_device() in the code above also takes an unreasonably long time to finish on server A.
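To check whether the slowdown is in the collective itself or already in plain CUDA context creation, I can run a small timing sketch like the following on each server (time_cuda_context_init is a hypothetical helper name; it only needs torch and returns an empty list on a machine without GPUs):

```python
import time

import torch


def time_cuda_context_init():
    """Time the first CUDA call on each visible device.

    On a healthy machine, creating each CUDA context should take at most
    a few seconds. Returns a list of (device_index, seconds) pairs.
    """
    timings = []
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            t0 = time.time()
            torch.cuda.set_device(i)
            torch.cuda.synchronize(i)  # wait for context creation to finish
            timings.append((i, time.time() - t0))
    return timings


if __name__ == "__main__":
    for idx, secs in time_cuda_context_init():
        print(f"cuda:{idx} context init took {secs:.3f}s")
```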
What is the possible cause of this phenomenon?