I'm trying to fine-tune Hugging Face's implementation of DistilBERT for multi-class classification (100 classes) on a custom dataset, following the tutorial at https://huggingface.co/transformers/custom_datasets.html.
I'm fine-tuning in native TensorFlow, i.e. I use the following part of the tutorial for dataset creation:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
And this one for fine-tuning:
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=100)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Everything seems to go fine with fine-tuning, but when I try to predict on the test dataset (2000 examples) with model.predict(test_dataset), the model seems to yield one prediction per token rather than one prediction per sequence.
That is, instead of getting an output of shape (1, 2000, 100), I get an output of shape (1, 1024000, 100), where 1024000 is the number of test examples (2000) times the sequence length (512).
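For reference, the prediction step is essentially the following (a rough sketch of what I run; the comment shows what I observe versus what I expect):
import numpy as np

preds = model.predict(test_dataset)  # test_dataset built as above, 2000 examples
print(np.asarray(preds).shape)       # expected something like (2000, 100),
                                     # observed (1, 1024000, 100) = (1, 2000 * 512, 100)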
Any hint on what's going on here? (Sorry if this is naive, I'm very new to TensorFlow.)
I had exactly the same problem. I do not know why it happens, since it looks like the right code going by the tutorial.
But for me it worked to create NumPy arrays out of the train_encodings and pass them directly to the fit method instead of creating the Dataset.
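Something along these lines worked for me — a rough sketch, assuming your encodings were padded to a uniform length and contain input_ids and attention_mask (DistilBERT does not use token_type_ids):
import numpy as np

# Convert the tokenizer output to plain NumPy arrays
# (assumes every example was padded/truncated to the same length)
train_x = {
    'input_ids': np.array(train_encodings['input_ids']),
    'attention_mask': np.array(train_encodings['attention_mask']),
}
train_y = np.array(train_labels)

model.fit(train_x, train_y, epochs=3, batch_size=16)

# Same idea at prediction time: pass arrays instead of the tf.data.Dataset
test_x = {
    'input_ids': np.array(test_encodings['input_ids']),
    'attention_mask': np.array(test_encodings['attention_mask']),
}
preds = model.predict(test_x)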