I'm currently working on GroupFormer, which uses DistributedDataParallel for training. The error message is listed below; it shows a tensor size mismatch during the parameter broadcast in the initialization stage.
This error first occurred when I set --nnodes=1 and --nproc_per_node=2 (training with 2 GPUs on 1 machine), but the same error also occurs with --nnodes=1 and --nproc_per_node=1 (training with 1 GPU on 1 machine). As far as I know, these broadcasting functions (_sync_params_and_buffers, dist._broadcast_coalesced) are designed to broadcast parameters from the main GPU to the others, so it doesn't make sense to me that the error still occurs when training with 1 GPU.
Traceback (most recent call last):
  File "main.py", line 53, in <module>
    main()
  File "main.py", line 43, in main
    group_helper = Group(config, work_dir=config['basedir'])
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 54, in __init__
    self._build()
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 59, in _build
    self._build_model()
  File "/home/disk1/wgf/project/GroupFormer/group/group.py", line 104, in _build_model
    self.model = DistributedDataParallel(model.cuda(), device_ids=[self.rank % torch.cuda.device_count()],
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 648, in __init__
    _sync_module_states(
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 113, in _sync_module_states
    _sync_params_and_buffers(
  File "/home/disk1/wgf/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/utils.py", line 131, in _sync_params_and_buffers
    dist._broadcast_coalesced(
RuntimeError: The size of tensor a (64) must match the size of tensor b (0) at non-singleton dimension 3
The command I used for training is also listed below.
python -m torch.distributed.run --nnodes=1 --nproc_per_node=1 --node_rank=0 --master_port=22332 main.py
I also tried the simple model below in place of the original one, and the error did not occur. How could this happen? What could be wrong with the original model in GroupFormer, and how can I fix it?
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyModel(nn.Module):
        def __init__(self):
            super(ToyModel, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            # Flatten the 16 x 5 x 5 feature maps before the fully connected layers
            x = x.view(-1, 16 * 5 * 5)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
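
Since the error message mentions a tensor of size 0, one way to narrow down the difference between the two models might be to list every parameter and buffer of the model that has a zero-sized dimension or is non-contiguous. The following is only a rough diagnostic sketch: the helper name find_suspect_tensors is hypothetical, and the idea that such a tensor is what trips the broadcast is an assumption, not something the traceback confirms.

```python
import itertools

import torch
import torch.nn as nn


def find_suspect_tensors(module: nn.Module):
    """Return (name, shape, reasons) for parameters/buffers that look unusual.

    Hypothetical helper: it flags tensors with a zero-sized dimension or a
    non-contiguous layout. Whether these actually cause the DDP error is an
    assumption to be checked, not an established fact.
    """
    suspects = []
    named_tensors = itertools.chain(
        module.named_parameters(), module.named_buffers()
    )
    for name, tensor in named_tensors:
        reasons = []
        if 0 in tensor.shape:
            reasons.append("zero-sized dimension")
        if not tensor.is_contiguous():
            reasons.append("non-contiguous")
        if reasons:
            suspects.append((name, tuple(tensor.shape), reasons))
    return suspects


if __name__ == "__main__":
    # Small stand-in model; replace with the real model before it is wrapped
    # in DistributedDataParallel.
    model = nn.Linear(4, 2)
    model.register_buffer("empty_buf", torch.zeros(1, 2, 3, 0))
    for name, shape, reasons in find_suspect_tensors(model):
        print(name, shape, ", ".join(reasons))
```

Running this on the GroupFormer model just before the DistributedDataParallel(...) call in _build_model, and again on ToyModel, would show whether the original model carries any tensor that the toy model does not.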