[PyTorch] Error when using DistributedDataParallel in the broadcasting stage of initialization


I'm currently working on GroupFormer, which uses DistributedDataParallel for training. The error message is listed below; it shows that the failure is caused by a tensor size mismatch while broadcasting during the initialization stage.

The error first occurred when I set --nnodes=1 and --nproc_per_node=2 (training with 2 GPUs on 1 machine), but it also occurs with --nnodes=1 and --nproc_per_node=1 (training with 1 GPU on 1 machine). As far as I know, these broadcasting functions (_sync_params_and_buffers, dist._broadcast_coalesced) are designed to broadcast the parameters from the main GPU to the other GPUs, so it doesn't make sense that the error still occurs when training with a single GPU.
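For context, my (simplified, possibly inaccurate) understanding of what DDP does during that step is roughly the following. This is my own sketch, not the actual PyTorch implementation:

import torch.distributed as dist

def sync_module_states_sketch(module, src_rank=0):
    # Rough idea only: DDP broadcasts every parameter and buffer of the wrapped
    # module from the source rank so that all ranks start from identical states.
    # (The real implementation coalesces the tensors into flat buckets before
    # broadcasting, which is where the size-mismatch error in the traceback is raised.)
    for _, tensor in list(module.named_parameters()) + list(module.named_buffers()):
        dist.broadcast(tensor.data, src=src_rank)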

Traceback (most recent call last):
  File "main.py", line 53, in <module>
    main()
  File "main.py", line 43, in main
    group_helper = Group(config, work_dir=config['basedir'])
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 54, in __init__
    self._build()
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 59, in _build
    self._build_model()
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 104, in _build_model
    self.model = DistributedDataParallel(model.cuda(), device_ids=[self.rank % torch.cuda.device_count()],
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 648, in __init__
    _sync_module_states(
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 113, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 131, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: The size of tensor a (64) must match the size of tensor b (0) at non-singleton dimension 3

The command I used for training is also listed below.

python -m torch.distributed.run --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_port=22332 main.py

I tried using a simple toy model, shown below, and the error did not occur. How can this happen? What could be wrong with the original GroupFormer model, and how can I fix it?

import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
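One thing I'm planning to try (my own diagnostic sketch, not code from the repository; `model` here stands for the GroupFormer model built in `_build_model`, before it is wrapped in DistributedDataParallel) is to scan all parameters and buffers for a zero-sized dimension, since the RuntimeError complains about a tensor of size 0:

# Diagnostic sketch: print any parameter or buffer that has a 0-sized dimension,
# since the error message mentions "tensor b (0)".
for name, p in model.named_parameters():
    if 0 in p.shape:
        print("zero-sized parameter:", name, tuple(p.shape))
for name, b in model.named_buffers():
    if 0 in b.shape:
        print("zero-sized buffer:", name, tuple(b.shape))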