Why can't the weights be updated when using detach()?

I'm programming a VGG model for a school project, but the model causes problems during training.

If I call detach() on the two tensors, scaled_similarity and target_tensor, the model's weights are not updated. But if I don't call detach() on those tensors, I get the following error:

RuntimeError: one of the variables needed for gradient computation 
has been modified by an inplace operation: 
[torch.cuda.FloatTensor [1000, 800]], which is output 0 of AsStridedBackward0,
is at version 14; expected version 13 instead. 

Hint: the backtrace further above shows the operation that failed to compute its gradient.
The variable in question was changed in there or anywhere later. Good luck!

This model is a drawing-style recognition AI: each image is converted into a tensor and passed through the VGG model, and the resulting feature vectors are compared.

for i, (_image1, _label1) in enumerate(train_loader):
    optimizer.zero_grad()
    image1 = _image1.to(DEVICE)
    label1 = _label1[0]
    vector1_tensor = model(image1)

    if (i == 0): #Exception Case
        image2 = image1
        label2 = label1
        vector2_tensor = vector1_tensor

    similarity = Similarity(vector1_tensor, vector2_tensor)
    similarity_value = similarity.item()
    similarity_vector = [similarity_value]

    if label1 == label2:
        target_vector = [1]
    else:
        target_vector = [0]
    similarity_tensor = torch.tensor(similarity_vector).float()
    target_tensor = torch.tensor(target_vector).float()
    cost = loss(similarity_tensor, target_tensor)
    cost.requires_grad_(True)
    cost.backward()
    optimizer.step()

    # Reuse tensors for the next iteration (to reduce computation)
    image2 = image1
    label2 = label1
    vector2_tensor = vector1_tensor

Model & hyperparameter definition

import torch.nn.init as init

seed = time.time()

def custom_init_weights(m):
  if seed is not None:
    torch.manual_seed(seed)
  if isinstance(m, torch.nn.Linear) and m.weight is not None:
    init.normal_(m.weight, mean=1, std=0.01)  # weight initialization (mean 1, std 0.01)
    if m.bias is not None:
      init.constant_(m.bias, 0)  # bias initialization (0)

model = trans_VGG(base_dim=64)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(DEVICE)
model.apply(custom_init_weights)
loss = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5.5, momentum=0.9, weight_decay=0.0005)
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.1, verbose=True)

transform = transforms.Compose(
    [transforms.ToTensor(), transforms.RandomCrop(224)])

Training output

Cost: 0.3133
Epoch: 000/050 | Batch 040/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 080/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 120/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 160/3200 | Cost: 1.3124
Epoch: 000/050 | Batch 200/3200 | Cost: 1.3117
Epoch: 000/050 | Batch 240/3200 | Cost: 1.3060
Epoch: 000/050 | Batch 280/3200 | Cost: 0.5588
Epoch: 000/050 | Batch 320/3200 | Cost: 1.3115
Epoch: 000/050 | Batch 360/3200 | Cost: 1.2896
Epoch: 000/050 | Batch 400/3200 | Cost: 0.9186
Epoch: 000/050 | Batch 440/3200 | Cost: 1.3095
Epoch: 000/050 | Batch 480/3200 | Cost: 0.6058
Epoch: 000/050 | Batch 520/3200 | Cost: 0.9654
Epoch: 000/050 | Batch 560/3200 | Cost: 0.9650
Epoch: 000/050 | Batch 600/3200 | Cost: 1.3003
Epoch: 000/050 | Batch 640/3200 | Cost: 0.7438
Epoch: 000/050 | Batch 680/3200 | Cost: 1.3058
Epoch: 000/050 | Batch 720/3200 | Cost: 0.5621
Epoch: 000/050 | Batch 760/3200 | Cost: 0.5050
Epoch: 000/050 | Batch 800/3200 | Cost: 0.4707
Epoch: 000/050 | Batch 840/3200 | Cost: 0.4403
Epoch: 000/050 | Batch 880/3200 | Cost: 1.1651
Epoch: 000/050 | Batch 920/3200 | Cost: 0.4814
Epoch: 000/050 | Batch 960/3200 | Cost: 1.2855
Epoch: 000/050 | Batch 1000/3200 | Cost: 0.7209
Epoch: 000/050 | Batch 1040/3200 | Cost: 0.6030
Epoch: 000/050 | Batch 1080/3200 | Cost: 0.5533
Epoch: 000/050 | Batch 1120/3200 | Cost: 1.1723
Epoch: 000/050 | Batch 1160/3200 | Cost: 1.3111
Epoch: 000/050 | Batch 1200/3200 | Cost: 0.3397
Epoch: 000/050 | Batch 1240/3200 | Cost: 1.3123
Epoch: 000/050 | Batch 1280/3200 | Cost: 1.3025
Epoch: 000/050 | Batch 1320/3200 | Cost: 1.3132
Epoch: 000/050 | Batch 1360/3200 | Cost: 1.3131
Epoch: 000/050 | Batch 1400/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1440/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1480/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1520/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1560/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1600/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1640/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1680/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1720/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1760/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1800/3200 | Cost: 1.3132
Epoch: 000/050 | Batch 1840/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1880/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 1920/3200 | Cost: 1.3132
Epoch: 000/050 | Batch 1960/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2000/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2040/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2080/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2120/3200 | Cost: 1.3131
Epoch: 000/050 | Batch 2160/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2200/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2240/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2280/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2320/3200 | Cost: 1.3132
Epoch: 000/050 | Batch 2360/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2400/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2440/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2480/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2520/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2560/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2600/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2640/3200 | Cost: 1.3133
Epoch: 000/050 | Batch 2680/3200 | Cost: 1.3133

How can I modify this model so that the weights are updated and the error is avoided?


1 Answer


detach() is used to detach a tensor from the computation graph; it then becomes irrelevant for backpropagation and for the weight updates. It should be used when the follow-up calculations are not related to the loss, or rather when they will not be backpropagated; otherwise you will either end up with an error immediately or, sooner or later, with an out-of-memory error, because all of these steps are kept in memory unnecessarily.
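
To illustrate: a minimal toy sketch (two tensors, not your model) showing that once a tensor is detached, gradients no longer flow back to the weights, even if requires_grad_(True) is called on the result afterwards:

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

y = (w * x).sum()
y_detached = y.detach()        # cut off from the graph

z = y_detached * 2
z.requires_grad_(True)         # z becomes a new leaf with no history
z.backward()

print(w.grad)                  # None: the gradient never reaches w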

I cannot reproduce the errors with your code; I am not getting them. Logically, however, you should detach the old result before handing it to the next iteration, otherwise the graph would double:

if (i == 0): #Exception Case
    image2 = image1 # inputs have no gradients
    label2 = label1
    vector2_tensor = vector1_tensor.detach() # arguably you could try to leave the detach out here
    # Note: do you want this behavior also in your second epoch?

...
vector2_tensor = vector1_tensor.detach() # we used this for the loss calculation and do not want to reuse it in the next loop
# <end of loop>
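
As a general pattern, carrying an output over to the next iteration could look roughly like this (a minimal sketch with a dummy linear model, not your trans_VGG):

import torch
import torch.nn as nn

model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

prev_out = None
for step in range(3):
    optimizer.zero_grad()
    x = torch.randn(2, 8)
    out = model(x)

    if prev_out is None:               # first iteration: compare with itself
        prev_out = out.detach()

    # prev_out is detached, so the loss backpropagates only through `out`
    loss = ((out - prev_out) ** 2).mean()
    loss.backward()
    optimizer.step()

    prev_out = out.detach()            # carry over, cut off from the old graph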

About the graph duplication: what you are doing is roughly this: your model M creates an output O, which is used to calculate the loss L.

When you update the weights via backpropagation, the path goes backward from L to M. If you reuse an output O1 for a second loss L2, the graph becomes twice as long, because the second loss still depends on the first forward pass.

This graphic demonstrates the situation:

           ┌───┐
           │   │
    ┌───┐ ┌┴─┐ ▼  ┌──┐
I1> │ M │►│O1├─┬─►│L1│
    └───┘ └──┘ │  └──┘
               │ <--- want to detach here.
    ┌───┐ ┌──┐ ▼  ┌──┐
I2> │ M │►│O2├─┬─►│L2│
    └───┘ └──┘ │  └──┘
               ▼
              ...
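
If you want to see this effect in isolation, here is a toy sketch (independent of your model) in which an old, non-detached output is reused for a second loss; the second backward() then fails because it tries to walk back through the already-freed graph of the first iteration:

import torch
import torch.nn as nn

model = nn.Linear(4, 4)

o1 = model(torch.randn(1, 4))
loss1 = o1.sum()
loss1.backward()                 # frees the graph behind o1

o2 = model(torch.randn(1, 4))
loss2 = (o2 - o1).sum()          # still depends on o1's freed graph
try:
    loss2.backward()             # RuntimeError: trying to backward through the graph a second time
except RuntimeError as e:
    print(e)

# With o1.detach() instead, loss2 depends only on o2's graph and backward() succeeds.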