RoBERTa with multi-class labels and one-hot encoding


I have a fake-news dataset with 4 classes: true, false, partially true, other. Currently my code uses LabelEncoder for these labels, but I would like to switch to one-hot encoding, so now I am trying to turn the labels into one-hot vectors. How can I do that in a way that still lets me pass the labels to a RoBERTa model? Here is my current code.

First, I convert the labels to numerical values (0-3):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
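
For reference, LabelEncoder sorts the classes alphabetically, so (assuming the label strings are exactly the ones listed above) the integers are just arbitrary indices:

# classes are ordered alphabetically, roughly:
# ['false', 'other', 'partially true', 'true'] -> 0, 1, 2, 3
print(list(le.classes_))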

After that I split the data into train and validation sets:

from sklearn.model_selection import train_test_split

texts = []
labels = []
for i in range(len(df)):
    # combine title and body into a single input text
    text = df["title"].iloc[i] + " - " + df["text"].iloc[i]
    texts.append(text)
    labels.append(df["label"].iloc[i])

train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, test_size=test_size
)

Finally, I make the data compatible with the RoBERTa model:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(model_name)

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Where NewsGroupsDataset looks like this:

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)
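
For completeness, this is roughly how the datasets reach the model. It is a simplified sketch of my setup; model_name comes from my config and the training arguments are placeholders:

from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments

# 4 output neurons, one per class
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=4)

training_args = TrainingArguments(output_dir="./results", num_train_epochs=3)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)
trainer.train()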

How can I switch to one-hot encoding? I do not want the model to assume that there is a natural order between the labels.
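
This is roughly what I imagine the one-hot version would look like; it is only a sketch and assumes the labels list already holds the integer values 0-3 produced by LabelEncoder:

import torch
import torch.nn.functional as F

# turn each integer label into a one-hot float vector, e.g. 2 -> [0., 0., 1., 0.]
onehot_labels = F.one_hot(torch.tensor(labels), num_classes=4).float()

I assume __getitem__ in NewsGroupsDataset would then return the vector directly instead of wrapping the integer in a tensor:

item["labels"] = self.labels[idx]  # a float tensor of shape (4,)

Is this the right way to pass one-hot labels to RoBERTa, or does the loss function expect something different?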
