I am trying to fine-tune t5-small on the xsum dataset with PyTorch on Windows 10 (CUDA 12.1).
Unfortunately, I can't use the Trainer (or Seq2SeqTrainer) class because its bitsandbytes dependency is not available for Windows, so it was necessary to write a manual epoch loop:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, get_scheduler
from torch.utils.data import DataLoader
from torch.optim import AdamW
import torch
from tqdm.auto import tqdm
dataset = load_dataset("xsum")
MODEL_NAME = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
prefix = "summarize: "
max_input_length = 1024
max_target_length = 128
def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['document', 'summary', 'id'])
tokenized_datasets.set_format("torch")
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
model.save_pretrained("outputs/trained")
I got this error:
RuntimeError: stack expects each tensor to be equal size, but got [352] at entry 0 and [930] at entry 1
How can I fix that?
"equal size"?
The tensors in each batch have different lengths. In your tokenize_function you truncate the inputs to a maximum length (max_input_length) and the targets to a different maximum length (max_target_length), but truncation only caps the length; it does not pad, so the sequences in a batch still end up with different lengths and the default collation cannot stack them. A common practice for text data of varying lengths is to pad the sequences to a consistent length within each batch. Check whether you can pass a padding argument to the tokenizer, set to True or 'longest', so that the sequences in a batch are padded to the same length (a sketch of this approach is shown below). Additionally, make sure the DataLoader is handling the batches properly. In some cases you might need to define a custom collate function (as in this thread) to make sure the batches are formed correctly, especially when dealing with text data of varying lengths.
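As an illustration only (this code is not from the original answer): because tokenize_function is applied through map(batched=True), padding=True or 'longest' would pad per map batch rather than per DataLoader batch, so a simple variant at tokenization time is to pad everything to a fixed max_length, at the cost of extra memory:

def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    # padding="max_length" gives every example the same fixed length,
    # so the default DataLoader collation can stack them into one tensor
    model_inputs = tokenizer(
        inputs, max_length=max_input_length, padding="max_length", truncation=True
    )
    labels = tokenizer(
        text_target=examples["summary"], max_length=max_target_length,
        padding="max_length", truncation=True,
    )
    # Note: the padded label positions still contain real pad-token ids here;
    # ideally they would be replaced with -100 so the loss ignores them.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs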
If the issue persists, the tensor sizes are still inconsistent when the batches are formed. A possible solution is to create a custom collate function that makes sure all tensors within a batch are padded to the same length before being fed to the model. In the collate function you can use the tokenizer's padding functionality, or PyTorch's pad_sequence, to pad all sequences in a batch to the length of the longest sequence; a sketch of such a function follows.
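Here is a minimal sketch of such a collate function. The name custom_collate_fn matches the description below, but the code itself is my illustration; in particular, padding the labels with -100 (so the loss ignores padded positions) is an assumption, not part of the original answer. It assumes each example still holds variable-length input_ids, attention_mask and labels tensors, as produced by tokenize_function with set_format("torch"):

from torch.nn.utils.rnn import pad_sequence

def custom_collate_fn(batch):
    # Collect the variable-length 1-D tensors from each example
    input_ids = [example["input_ids"] for example in batch]
    attention_mask = [example["attention_mask"] for example in batch]
    labels = [example["labels"] for example in batch]

    # Pad the inputs with the tokenizer's pad token and the mask with 0
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
    # Pad the labels with -100 so padded positions are ignored by the loss
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}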
In this collate function, the pad_sequence function from PyTorch pads the input_ids and labels tensors to the length of the longest sequence in the batch, and the attention_mask is padded accordingly so it indicates where the actual tokens are and where the padding is. The collate function is then passed to the collate_fn argument of your DataLoader instances, which ensures that all tensors within a batch are padded to the same length before being fed to the model.
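Using the variable names from the question, hooking it up could look like this (again just a sketch):

train_dataloader = DataLoader(
    small_train_dataset, shuffle=True, batch_size=8, collate_fn=custom_collate_fn
)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8, collate_fn=custom_collate_fn)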