I have a question regarding "on-the-fly" tokenization. It was prompted by reading "How to train a new language model from scratch using Transformers and Tokenizers" here. Towards the end there is this sentence: "If your dataset is very large, you can opt to load and tokenize examples on the fly, rather than as a preprocessing step". I've tried coming up with a solution that combines both the datasets and tokenizers libraries, but did not manage to find a good pattern.
I guess the solution would entail wrapping a dataset into a PyTorch Dataset. As a concrete example from the docs:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # instead of doing this beforehand, I'd like to do tokenization on the fly
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
How would one implement this with "on-the-fly" tokenization exploiting the vectorized capabilities of tokenizers?
UPDATE Feb 2021
As of v1.3.0, datasets supports lazy evaluation of functions via the set_transform method. Therefore, you can apply on-the-fly tokenization directly, as shown here.
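For illustration, a minimal sketch of what that can look like; the "bert-base-uncased" checkpoint, the "imdb" dataset and its "text" column are assumptions used only for this example:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb", split="train")

def tokenize(batch):
    # receives a batch (dict of lists) and tokenizes it in one vectorized call
    return tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt")

# the transform is applied lazily, only when rows are actually accessed,
# instead of tokenizing the whole dataset as a preprocessing step
dataset.set_transform(tokenize)

sample = dataset[0]  # tokenization happens here, on the fly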
OLD ANSWER

In the end I settled for the approach sketched after the list below. I do not like that the batch_size is now controlled at the dataset level, but it does its job.
In this way we exploit two nice things:
- fast indexing of the HuggingFace datasets
- the vectorization capabilities of the HuggingFace tokenizer
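A minimal sketch of that approach (not the original code verbatim): the wrapper returns a whole tokenized batch per index, so the batch size lives in the dataset itself. The checkpoint, the "imdb" dataset, its "text" column and the LazyTokenizedDataset name are assumptions used only for illustration:

import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

class LazyTokenizedDataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset, tokenizer, batch_size=32):
        self.hf_dataset = hf_dataset
        self.tokenizer = tokenizer
        self.batch_size = batch_size

    def __len__(self):
        # one item == one batch of examples, hence the length is the number of batches
        return math.ceil(len(self.hf_dataset) / self.batch_size)

    def __getitem__(self, idx):
        # fast slice indexing into the Arrow-backed HuggingFace dataset
        batch = self.hf_dataset[idx * self.batch_size : (idx + 1) * self.batch_size]
        # vectorized tokenization of the whole slice in a single call
        return self.tokenizer(
            batch["text"], padding=True, truncation=True, return_tensors="pt"
        )

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_dataset = load_dataset("imdb", split="train")
train_dataset = LazyTokenizedDataset(hf_dataset, tokenizer, batch_size=32)

# batch_size=None: each item returned by the dataset is already a full batch,
# so the DataLoader passes it through without further collation
loader = torch.utils.data.DataLoader(train_dataset, batch_size=None, shuffle=True)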