While fine-tuning GPT-2 for text generation, the custom loss function receives the PAD_TOKEN_ID as a label


While training, the custom loss function receives the PAD_TOKEN_ID as a label, resulting in the error below. 50257 is both the PAD_TOKEN_ID and the vocabulary size of GPT-2.

InvalidArgumentError: {{function_node __wrapped__SparseSoftmaxCrossEntropyWithLogits_device_/job:localhost/replica:0/task:0/device:CPU:0}} Received a label value of 50257 which is outside the valid range of [0, 50257).  Label values: 389 1976 1437 264 649 24867 1762 503 5633 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 50257 5025...

To remove this I tried masking both the labels and the logits. The labels have shape (1260,) before masking and (132,) after masking. The logits have shape (1260, 50257) before masking, but after masking they come out as (63323820,), i.e. (1260 * 50257,) flattened into one dimension. The code I am using to mask the logits is as follows:

shift_logits = logits[..., :-1, :]                                             # drop the last position so logits line up with next-token labels
shift_logits = tf.reshape(shift_logits, [-1, shift_logits.shape[-1]])          # (1260, 50257)
mask_logits = tf.math.logical_not(tf.math.equal(shift_logits, pad_token_id))   # element-wise compare, so the mask is also (1260, 50257)
shift_logits_masked = tf.boolean_mask(shift_logits, mask_logits)               # a 2-D mask flattens the result to 1-D
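For comparison, here is a minimal sketch (with small toy tensors standing in for my real data, so the numbers are only illustrative) of building the mask from the labels instead of the logits, so that the same 1-D boolean mask can be applied to both tensors and their first dimensions stay aligned:

import tensorflow as tf

pad_token_id = 50257
vocab_size = 50258  # assuming the embedding was resized after adding the pad token
labels = tf.constant([389, 1976, 1437, pad_token_id, pad_token_id])  # (5,) toy labels
logits = tf.random.normal([5, vocab_size])                           # (5, vocab) toy logits

mask = tf.not_equal(labels, pad_token_id)       # (5,) boolean mask built from the labels
labels_masked = tf.boolean_mask(labels, mask)   # (3,)
logits_masked = tf.boolean_mask(logits, mask)   # (3, vocab) - whole rows are kept, so the shapes still match
print(labels_masked.shape, logits_masked.shape)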

So the primary problem is that the label value 50257 is being passed to the loss, and when I try to remove it by masking both the logits and the labels, the masking fails because the two tensors end up with different shapes. This is probably a simple question, but I am running out of ideas, so it would be really helpful if someone could have a look.

I tried masking both the labels and the logits, but as mentioned above the labels have shape (1260,) and the logits (1260, 50257), so whenever I apply tf.boolean_mask it fails with a shape-mismatch error. I expect to calculate the loss as shown below:

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
loss = loss_fn(shift_labels_masked, shift_logits_masked)
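As a sanity check on the shapes (again with toy values, not my real tensors), Reduction.NONE returns one loss value per remaining token, so the result still has to be averaged, which is what the tf.reduce_mean in the training loop below is for:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
labels = tf.constant([1, 0, 2])           # (3,) toy labels
logits = tf.random.normal([3, 4])         # (3, 4) toy logits with a matching first dimension
per_token_loss = loss_fn(labels, logits)  # (3,) one loss value per token
print(per_token_loss.shape, tf.reduce_mean(per_token_loss).numpy())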

Since this is text generation, in my training loop I am passing the input_ids as the labels, as shown below:

for epoch in range(num_epochs):
  for batch in train_ds:
    input_ids = batch["input_ids"]
    with tf.GradientTape() as tape:
      outputs = model(input_ids)
      # loss_fn here is my custom loss function, not the Keras loss object above
      loss = loss_fn(outputs, labels=batch["input_ids"], pad_token_id=tokenizer.pad_token_id)
      loss = tf.reduce_mean(loss)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    #if optimizer.iterations % 100 == 0:
    print("Epoch {} Batch {} Loss {:.4f}".format(epoch + 1, optimizer.iterations.numpy(), loss.numpy()))
