I have a dataset for fake news, which has 4 different classes: true, false, partially true, other.
Currently my code applies label encoding to these labels, but I would like to switch to one-hot encoding.
So now I am trying to turn these labels into one-hot vectors. How can I do that in a way that still lets me pass the labels to a RoBERTa model?
Here is my current code:
First, I convert the labels to numerical values (0-3):
from sklearn.preprocessing import LabelEncoder

# Map the four string classes to integer ids 0-3
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
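To make the goal concrete, here is a minimal sketch of the conversion I have in mind, written in plain Python. The sample labels and the `to_one_hot` helper are hypothetical and only for illustration; the mapping mimics `LabelEncoder`, which assigns ids in sorted order of the class names.

```python
# Hypothetical sample labels; the real ones come from df['label'].
raw_labels = ["true", "false", "partially true", "other", "false"]

# LabelEncoder assigns ids by sorted class name; reproduce that mapping by hand.
classes = sorted(set(raw_labels))
class_to_id = {name: i for i, name in enumerate(classes)}
label_ids = [class_to_id[name] for name in raw_labels]


def to_one_hot(label_id, num_classes):
    """Return a list of floats with a single 1.0 at position label_id."""
    vec = [0.0] * num_classes
    vec[label_id] = 1.0
    return vec


one_hot_labels = [to_one_hot(i, len(classes)) for i in label_ids]
```

Each integer id becomes a length-4 vector, so no class is numerically "larger" than another.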
After that, I split the data into training and validation sets:
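from sklearn.model_selection import train_test_split

texts = []
labels = []
for i in range(len(df)):
    text = df["text"].iloc[i]
    label = df["label"].iloc[i]
    # Prepend the title to the article body
    text = df["title"].iloc[i] + " - " + text
    texts.append(text)
    labels.append(label)
# Keep the four returned splits so they can be used below
train_texts, valid_texts, train_labels, valid_labels = train_test_split(texts, labels, test_size=test_size)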
Finally, I make the data compatible with the RoBERTa model:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
# Tokenize both splits, truncating and padding to at most max_length tokens
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
Where NewsGroupsDataset looks like this:
import torch

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)
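What I imagine is a variant of this dataset that returns a one-hot vector instead of the integer id, roughly like the sketch below. The class name `OneHotNewsDataset`, the `num_classes` argument, and the toy encodings are my own assumptions for illustration. My understanding is that float label vectors like these are what the `problem_type="multi_label_classification"` setting of Hugging Face sequence classification models expects, but I have not confirmed that this is the right route for a single-label task.

```python
import torch
import torch.nn.functional as F


class OneHotNewsDataset(torch.utils.data.Dataset):
    """Hypothetical variant of NewsGroupsDataset that returns one-hot labels."""

    def __init__(self, encodings, labels, num_classes):
        self.encodings = encodings
        self.labels = labels
        self.num_classes = num_classes

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # Integer class id -> float vector with a single 1.0 at that index.
        label_id = torch.tensor(int(self.labels[idx]), dtype=torch.long)
        item["labels"] = F.one_hot(label_id, num_classes=self.num_classes).float()
        return item

    def __len__(self):
        return len(self.labels)


# Toy stand-in for tokenizer output, only to show the shapes involved.
toy_encodings = {
    "input_ids": [[0, 5, 2], [0, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = OneHotNewsDataset(toy_encodings, labels=[3, 0], num_classes=4)
first_item = toy_dataset[0]
```

Here `first_item["labels"]` is a float tensor of length 4 rather than a single integer.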
How can I switch to one-hot encoding? I do not want the model to assume that there is a natural order between the labels.