I ran into a problem while training an image comparison model and reduced it to the following minimal setup.
I feed pairs of images (3x128x128) to a model. Each image is either completely black or completely white. The model passes the two images through separate resnet encoders, concatenates the outputs, and feeds the result through a fully connected layer. It should return 1.0 if both images are the same color (both black or both white) and 0.0 otherwise. However, the model converges to always predicting ~0.5, even though this task should be simple.
The model:
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TemplateEvaluator(nn.Module):
    def __init__(self, q_encoder=resnet18(), t_encoder=resnet18()):
        super(TemplateEvaluator, self).__init__()
        self.q_encoder = q_encoder
        self.t_encoder = t_encoder
        # Set requires_grad to True to train the resnet encoders
        for param in self.q_encoder.parameters():
            param.requires_grad = True
        for param in self.t_encoder.parameters():
            param.requires_grad = True
        # Each resnet18 outputs 1000 features, so the concatenation has 2000
        self.fc = nn.Sequential(
            nn.Linear(2000, 1),
            nn.Sigmoid()
        )

    def forward(self, data):
        q = data[0]
        t = data[1]
        # Add a batch dimension for single images
        if q.ndim == 3: q = q.unsqueeze(0)
        if t.ndim == 3: t = t.unsqueeze(0)
        q = self.q_encoder(q)
        t = self.t_encoder(t)
        res = self.fc(torch.cat([q, t], -1)).flatten()
        return res
The dataloader:
import numpy as np
from PIL import Image
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from torchvision import transforms

img_width = 128  # images are 3x128x128

class BlackOrWhiteDataset(Dataset):
    def __init__(self):
        self.tf = transforms.ToTensor()

    def __getitem__(self, i):
        black = (0, 0, 0)
        white = (255, 255, 255)
        # Pick each image's color independently; y = 1.0 iff the colors match
        x1_col = black if (np.random.random() > 0.5) else white
        x2_col = black if (np.random.random() > 0.5) else white
        y = torch.tensor(x1_col == x2_col, dtype=torch.float)
        x1 = Image.new('RGB', (img_width, img_width), x1_col)
        x2 = Image.new('RGB', (img_width, img_width), x2_col)
        return self.tf(x1), self.tf(x2), y

    def __len__(self):
        return 100

def create_data_loader(dataset, batch_size, verbose=True):
    # Move every collated tensor to the target device (`device` is defined elsewhere)
    dl = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True,
        collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
    return dl
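The training loop below iterates over a loader `dl`, whose creation was not shown; it would be built along these lines (the batch size here is an assumed value, the original run's setting isn't shown):

dl = create_data_loader(BlackOrWhiteDataset(), batch_size=25)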
The training:
import sys
import torch.optim as optim
import torch.nn.functional as F
from tqdm import tqdm
import matplotlib.pyplot as plt

t_eval = TemplateEvaluator().to(device)
opt = optim.SGD(t_eval.parameters(), lr=0.001, momentum=0.01)
epochs = 10
losses = []
for epoch in tqdm(range(epochs)):
    t_eval.train()
    for X1, X2, Y in dl:
        Y_pred = t_eval(torch.stack([X1, X2]))
        loss = F.mse_loss(Y_pred, Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sys.stdout.write('\r')
        sys.stdout.write("loss: %f" % loss.item())
        sys.stdout.flush()
    # Record the last batch loss of each epoch
    losses.append(loss.item())
plt.plot(losses)
plt.ylim(0, 1)
And the results:
0%| | 0/10 [00:00<?, ?it/s]
loss: 0.259106
10%|█ | 1/10 [00:01<00:13, 1.54s/it]
loss: 0.241787
20%|██ | 2/10 [00:02<00:11, 1.40s/it]
loss: 0.258519
30%|███ | 3/10 [00:04<00:09, 1.36s/it]
loss: 0.250100
40%|████ | 4/10 [00:05<00:08, 1.35s/it]
loss: 0.257565
50%|█████ | 5/10 [00:06<00:06, 1.35s/it]
loss: 0.264662
60%|██████ | 6/10 [00:08<00:05, 1.35s/it]
loss: 0.246792
70%|███████ | 7/10 [00:09<00:04, 1.34s/it]
loss: 0.260988
80%|████████ | 8/10 [00:10<00:02, 1.34s/it]
loss: 0.241590
90%|█████████ | 9/10 [00:12<00:01, 1.34s/it]
loss: 0.250159
100%|██████████| 10/10 [00:13<00:00, 1.35s/it]
Example case:
t_eval.eval()
for X1, X2, Y in dl:
    view([X1[0], X2[0]])  # `view` is a helper (defined elsewhere) that displays the image pair
    print(Y[0].item())
    print(t_eval(torch.stack([X1[0], X2[0]])).item())
    break
gives a prediction of ~0.5 for both matching and non-matching pairs, regardless of the true label (example image pairs omitted).
When I set 'Y' to all zeros, the model converges and Y_pred approaches zero, so the optimizer works. When I set 'Y' to indicate whether the first image is black, the model also converges as expected, and the same holds for the second image. So the model can interpret each image individually.
Thus it seems the model fails to combine the information from the two inputs, and I do not see why.
Update
Thanks to user23818208 I found a solution.
A single-layer perceptron cannot compute equality; this is the classic XOR/XNOR problem. Instead of combining the image features by concatenation, I now combine them with element-wise multiplication, like so:
class TemplateEvaluator(nn.Module):
    def __init__(self, q_encoder=resnet18(), t_encoder=resnet18()):
        super(TemplateEvaluator, self).__init__()
        self.q_encoder = q_encoder
        self.t_encoder = t_encoder
        # The element-wise product keeps the feature size at 1000
        self.fc = nn.Sequential(
            nn.Linear(1000, 1),
            nn.Sigmoid()
        )

    def forward(self, data):
        q = data[0]
        t = data[1]
        if q.ndim == 3: q = q.unsqueeze(0)
        if t.ndim == 3: t = t.unsqueeze(0)
        q_features = self.q_encoder(q)
        t_features = self.t_encoder(t)
        # Multiplication lets a single linear layer detect feature agreement
        combined_features = q_features * t_features
        res = self.fc(combined_features).flatten()
        return res
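The intuition: if an encoder feature responds with opposite signs to black and white images, the product of the two encoders' features is positive exactly when the images match, and a single linear layer can threshold that. A minimal sketch of this intuition, assuming idealized ±1 features rather than real resnet outputs:

import torch

# Idealized features: +1 for a white image, -1 for a black image
pairs = torch.tensor([[+1., +1.], [+1., -1.], [-1., +1.], [-1., -1.]])
prod = pairs[:, 0] * pairs[:, 1]
print(prod)  # tensor([ 1., -1., -1.,  1.]): positive iff the two images match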
The model now converges:
0%| | 0/10 [00:00<?, ?it/s]
loss: 0.065883
10%|█ | 1/10 [00:01<00:16, 1.89s/it]
loss: 0.002977
20%|██ | 2/10 [00:03<00:14, 1.76s/it]
loss: 0.000158
30%|███ | 3/10 [00:05<00:12, 1.74s/it]
loss: 0.000015
40%|████ | 4/10 [00:06<00:10, 1.71s/it]
loss: 0.000003
50%|█████ | 5/10 [00:08<00:08, 1.71s/it]
loss: 0.000002
60%|██████ | 6/10 [00:10<00:06, 1.70s/it]
loss: 0.000001
70%|███████ | 7/10 [00:12<00:05, 1.70s/it]
loss: 0.000001
80%|████████ | 8/10 [00:13<00:03, 1.69s/it]
loss: 0.000000
90%|█████████ | 9/10 [00:15<00:01, 1.70s/it]
loss: 0.000000
100%|██████████| 10/10 [00:17<00:00, 1.71s/it]
Answer:
Your model checks for equality with a single dense layer. A single-layer perceptron, however, cannot learn the XOR function, nor by extension XNOR (which is equality); this is a famous result from early machine learning history.
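A minimal sketch of that result, independent of the code above: fit to the XNOR truth table with a mean-squared-error objective, a single Linear + Sigmoid layer gets stuck near 0.5 on every input, while one hidden layer suffices (the hidden size and learning rate here are arbitrary choices):

import torch
import torch.nn as nn

# XNOR truth table: output 1 iff the two inputs are equal
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([1., 0., 0., 1.])

def fit(model, steps=5000):
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(X).flatten(), Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(X).flatten().detach()

single = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
mlp = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())

print(fit(single))  # stays near 0.5 for all four inputs
print(fit(mlp))     # typically ends close to [1, 0, 0, 1]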