I am trying to train ResNet50 from scratch on a dataset similar to ImageNet. I wrote the following training loop:
import copy
import time

import torch


def train_network(epochs, train_loader, val_loader, optimizer, network):
    # Assumes `loss` (e.g. nn.CrossEntropyLoss()) and `device` are defined in the enclosing scope.
    since = time.time()
    train_acc_history = []
    val_acc_history = []
    best_model_weights = copy.deepcopy(network.state_dict())
    best_accuracy = 0.0
    for epoch in range(epochs):
        correct_train = 0
        correct_val = 0
        # Training pass
        for x, t in train_loader:
            x = x.to(device)
            t = t.to(device)
            optimizer.zero_grad()
            z = network(x)
            J = loss(z, t)
            J.backward()
            optimizer.step()
            _, y = torch.max(z, 1)
            correct_train += torch.sum(y == t.data)
        # Validation pass
        with torch.no_grad():
            network.eval()
            for x_val, t_val in val_loader:
                x_val = x_val.to(device)
                t_val = t_val.to(device)
                z_val = network(x_val)
                _, y_val = torch.max(z_val, 1)
                correct_val += torch.sum(y_val == t_val.data)
        network.train()
        train_accuracy = correct_train.float() / len(train_loader.dataset)
        val_accuracy = correct_val.float() / len(val_loader.dataset)
        print(
            f"Epoch: {epoch + 1} train_accuracy: {(train_accuracy.item() * 100):.3f}% val_accuracy: {(val_accuracy.item() * 100):.3f}%",
            flush=True)
        # time_elapsed_epoch = time.time() - since
        # print('Time taken for Epoch {} is {:.0f}m {:.0f}s'.format(epoch + 1, time_elapsed_epoch // 60, time_elapsed_epoch % 60))
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            best_model_weights = copy.deepcopy(network.state_dict())
        train_acc_history.append(train_accuracy)
        val_acc_history.append(val_accuracy)
        print()
    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best Validation Accuracy: {:.3f}'.format(best_accuracy * 100))
    network.load_state_dict(best_model_weights)
    return network, train_acc_history, val_acc_history
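For context, the model, loss, and optimizer are created outside the function (since `loss` and `device` are globals here). My actual setup is not shown above, so the snippet below is only an illustrative sketch of how the loop is driven; the learning rate, epoch count, and other hyperparameters are placeholders, and `train_loader` / `val_loader` are assumed to be built elsewhere:

import torch
import torch.nn as nn
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ResNet-50 with randomly initialised weights (training from scratch).
# Older torchvision versions use pretrained=False instead of weights=None.
network = torchvision.models.resnet50(weights=None, num_classes=1000).to(device)

loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(network.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

network, train_hist, val_hist = train_network(
    epochs=90, train_loader=train_loader, val_loader=val_loader,
    optimizer=optimizer, network=network)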
But I am getting extremely poor training and validation accuracies, as shown below:
> Epoch: 1 train_accuracy: 3.573% val_accuracy: 3.481%
> Epoch: 2 train_accuracy: 3.414% val_accuracy: 3.273%
> Epoch: 3 train_accuracy: 3.515% val_accuracy: 4.039%
> Epoch: 4 train_accuracy: 3.567% val_accuracy: 4.195%
Upon googling, I found that accuracies when training from scratch are usually not this poor; in fact, they typically start off at around 40%-50%. I am finding it difficult to understand where the glitch might be. It would be great if someone could help me figure out where I am going wrong.
Thanks
I tried your training loop (without the best-weights checkpointing) and got over 90% accuracy on the FashionMNIST dataset with my own ResNet, so the loop itself looks fine. Assuming your loss and optimizer are reasonable, I would look at the network architecture and at how the data loaders are created.
In other words, if you run essentially this loop (yours with small modifications) and it still doesn't work, I would check the data loaders first, for example with a quick sanity check like the sketch below, and then play around with the network architecture.
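Something along these lines is what I'd expect for ImageNet-style loaders; the dataset paths, batch size, and worker count here are placeholders, not your actual settings:

import torch
from torchvision import datasets, transforms

# Standard ImageNet normalisation statistics.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

train_set = datasets.ImageFolder("path/to/train", transform=train_tf)
val_set = datasets.ImageFolder("path/to/val", transform=val_tf)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=256,
                                         shuffle=False, num_workers=8)

# Quick sanity check: batch shape, label range, and number of classes should all agree
# with what the final layer of the network expects.
x, t = next(iter(train_loader))
print(x.shape, t.min().item(), t.max().item(), len(train_set.classes))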