I have a CSV file with two input text columns and one label column that takes several class values, so I'm trying to do multi-class classification with a fine-tuned RoBERTa model. This is the structure of my CSV file (df):

text                                             text2                                label
murray returns scotland fold euan murray named   People generally approve of dogs     3
concerns school diploma plan appeal              I'll have you know I've written      4

I followed this Hugging Face tutorial and saw that they use a DatasetDict, so I converted my CSV data into a DatasetDict like this:

import datasets
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

train_dataset = datasets.Dataset.from_dict(train)
test_dataset = datasets.Dataset.from_dict(test)

my_dataset_dict = datasets.DatasetDict({"train":train_dataset,"test":test_dataset})

"""
Received:
DatasetDict({
    train: Dataset({
        features: ['text', 'text2', 'label'],
        num_rows: 1780
    })
    test: Dataset({
        features: ['text', 'text2', 'label'],
        num_rows: 445
    })
})
"""

After this, I tokenized the data:

from transformers import RobertaTokenizer

MODEL_NAME = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(dataset_x):
    return tokenizer(dataset_x["text"], dataset_x["text2"], truncation=True)

tokenized_datasets = my_dataset_dict.map(tokenize_function)

I kept following the tutorial and built the TensorFlow datasets:

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8
)

tf_validation_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8
)

I then initialize the model and compile it so I can fit it later. I have 5 classes, so I set num_labels=5:

roberta_model = TFRobertaModel.from_pretrained(MODEL_NAME, num_labels=5)
roberta_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=["accuracy"])
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

However, the last line throws an error:

Node: 'model_4/tf_roberta_model_7/roberta/Reshape' Input to reshape is a tensor with 3368 values, but the requested shape has 2048 [[{{node model_4/tf_roberta_model_7/roberta/Reshape}}]] [Op:__inference_train_function_113985]

I'm still learning this, so I have no idea where this comes from. I followed the same steps as the tutorial I linked, except that they use BERT and have two classes, while the only things I changed are MODEL_NAME and the number of classes (five). Do you know how I can fix this, what I should pay more attention to, and how I can avoid errors like this in the future?

There is 1 best solution below

I'm not really an expert, so I may be wrong. However, shouldn't you use a model appropriate for your task, something like TFRobertaForSequenceClassification (or TFRobertaForTokenClassification if you needed one label per token)? See the documentation for the full list of task-specific classes.

As far as I know, TFRobertaModel has no head on top. Maybe that's the problem.