I'm trying to train an LLM (mT5-XL) using the Transformers library, but I keep getting the error
torch.cuda.OutOfMemoryError: CUDA out of memory
even though I have 80 GB of RAM and this model should only need about 48 GB according to https://huggingface.co/spaces/hf-accelerate/model-memory-usage.
So I figured this must be due to the memory taken up by the data (100k pairs of queries and documents), and I thought that loading the data batch by batch instead of loading the whole dataset at once would solve the problem.
This is what the code looks like now:
    train_dataset = IndexingTrainDataset(path_to_data="path_to_train_dataset.json",
                                         max_length=256,
                                         cache_dir='cache',
                                         tokenizer=tokenizer)

    valid_dataset = IndexingTrainDataset(path_to_data="path_to_dev_dataset.json",
                                         max_length=256,
                                         cache_dir='cache',
                                         remove_prompt=True,
                                         tokenizer=tokenizer)

    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        data_collator=IndexingCollator(
            tokenizer,
            padding='longest',
        ),
        compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
        restrict_decode_vocab=restrict_decode_vocab,
        id_max_length=256,
    )
I don't know how to give the Trainer the path to the dataset and have it load the data batch by batch (if that's even possible).
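
Something like the sketch below is what I had in mind: keep only the raw JSON lines in memory and tokenize one example at a time in __getitem__, so the Trainer's DataLoader builds each batch on the fly. This is only a rough sketch, and I'm guessing at the file format (JSON lines with "query" and "doc" fields) and at what IndexingCollator expects, so it probably won't drop in as-is:

    import json
    from torch.utils.data import Dataset

    class LazyIndexingDataset(Dataset):
        """Keeps only the raw text in memory and tokenizes lazily, one item at a time."""

        def __init__(self, path_to_data, tokenizer, max_length=256):
            self.tokenizer = tokenizer
            self.max_length = max_length
            # Only the raw lines are held in memory; nothing is tokenized up front.
            with open(path_to_data, encoding="utf-8") as f:
                self.examples = [json.loads(line) for line in f]

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, idx):
            # Tokenization happens here, so tensors are only created for the
            # examples the DataLoader actually pulls into the current batch.
            example = self.examples[idx]  # assumed fields: "query" and "doc"
            model_inputs = self.tokenizer(
                example["query"],
                max_length=self.max_length,
                truncation=True,
            )
            labels = self.tokenizer(
                example["doc"],
                max_length=self.max_length,
                truncation=True,
            )
            model_inputs["labels"] = labels["input_ids"]
            return model_inputs

Would swapping something like this in for train_dataset be enough, or does the Trainer need to be configured differently to load the data lazily?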