I have a CSV file with two input text columns and one label column that can take multiple values, i.e. I'm trying to do multi-class classification with a fine-tuned RoBERTa model. This is the structure of my CSV file (df):
text                                            text2                             label
murray returns scotland fold euan murray named  People generally approve of dogs  3
concerns school diploma plan appeal             I'll have you know I've written   4
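(For context, I read the file in with pandas; the file name here is just a placeholder:)
import pandas as pd

# "my_data.csv" is a placeholder for my local file; adjust sep= if your columns are tab-separated
df = pd.read_csv("my_data.csv")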
I followed this HuggingFace tutorial and saw that they use a DatasetDict, so I transformed my CSV file into a DatasetDict structure with:
import datasets
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train_dataset = datasets.Dataset.from_dict(train)
test_dataset = datasets.Dataset.from_dict(test)
my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "test": test_dataset})
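(As an aside, since df is a pandas DataFrame, I think Dataset.from_pandas is the more natural constructor here; this sketch assumes the same train/test split as above:)
# from_pandas avoids going through a dict and can drop the DataFrame index
train_dataset = datasets.Dataset.from_pandas(train, preserve_index=False)
test_dataset = datasets.Dataset.from_pandas(test, preserve_index=False)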
"""
Received:
DatasetDict({
train: Dataset({
features: ['text', 'text2', 'label'],
num_rows: 1780
})
test: Dataset({
features: ['text', 'text2', 'label'],
num_rows: 445
})
})
"""
After this, I proceeded to tokenize the data:
from transformers import RobertaTokenizer

MODEL_NAME = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(dataset_x):
    return tokenizer(dataset_x["text"], dataset_x["text2"], truncation=True)

tokenized_datasets = my_dataset_dict.map(tokenize_function)
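(The tutorial passes batched=True to map, which tokenizes whole batches at once; as far as I can tell it produces the same columns, just faster:)
tokenized_datasets = my_dataset_dict.map(tokenize_function, batched=True)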
I kept following the tutorial and proceeded with:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8
)
tf_validation_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8
)
I then initialize the model and compile it so I can fit it later on. I have 5 classes, so I set num_labels=5:
import tensorflow as tf
from transformers import TFRobertaModel

roberta_model = TFRobertaModel.from_pretrained(MODEL_NAME, num_labels=5)
roberta_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
roberta_model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
However, the last line throws an error that says:
Node: 'model_4/tf_roberta_model_7/roberta/Reshape' Input to reshape is a tensor with 3368 values, but the requested shape has 2048 [[{{node model_4/tf_roberta_model_7/roberta/Reshape}}]] [Op:__inference_train_function_113985]
I'm still learning this, so I have no idea where this error comes from. I followed the same steps as the tutorial I linked, except that they use BERT and have two classes; the only things I changed are the MODEL_NAME and having five classes. Do you happen to know how I can fix this, what I should pay more attention to, and how I can avoid errors like this in the future?
I'm not really an expert, so I may be wrong, but shouldn't you use a model head appropriate for your task, something like TFRobertaForSequenceClassification (or TFRobertaForTokenClassification, etc.)? See the documentation for the full list. As far as I know, TFRobertaModel is the bare model with no head on top, so num_labels=5 has no classification layer to configure there. Maybe that's the problem.
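Something like this is what I have in mind (an untested sketch based on your snippets; note the from_logits loss, since the classification head returns raw logits rather than probabilities):
import tensorflow as tf
from transformers import TFRobertaForSequenceClassification

# A model with a sequence-classification head; num_labels adds a 5-way output layer
model = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=5)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    # the head outputs raw logits, hence from_logits=True
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)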