Stack size errors when fine-tuning T5 on XSum using PyTorch

I am trying to fine-tune t5-small on the XSum dataset with PyTorch on Windows 10 (CUDA 12.1).

Unfortunately, the Trainer (or Seq2SeqTrainer) class from bitsandbytes is not available for Windows, so it was necessary to write an epoch loop by hand:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, get_scheduler
from torch.utils.data import DataLoader
from torch.optim import AdamW
import torch
from tqdm.auto import tqdm

dataset = load_dataset("xsum")
MODEL_NAME = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

prefix = "summarize: "
max_input_length = 1024
max_target_length = 128

def tokenize_function(examples):
    
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['document', 'summary', 'id'])
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.save_pretrained("outputs/trained")

I got this error:

RuntimeError: stack expects each tensor to be equal size, but got [352] at entry 0 and [930] at entry 1

How can I fix that?

Best answer

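The key phrase in the error is "equal size": the default DataLoader collation stacks the examples of a batch into one tensor, and that only works when every example has the same length.

In your tokenize_function, you truncate the inputs to max_input_length and the targets to max_target_length, but you never pad them, so shorter examples keep their original length (352 and 930 tokens in the error above). The usual fix is to pad the sequences to a common length. The simplest option is to pass padding='max_length' to the tokenizer. Note that padding='longest' is not enough here: inside dataset.map(batched=True) it pads only to the longest sequence of each map batch (1000 examples by default), so examples from different map batches can still differ in length once the dataset is shuffled.

def tokenize_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding='max_length')

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True, padding='max_length')

    # Replace padding in the labels with -100 so the loss ignores those positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs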
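
To see why the default collation fails, here is a minimal sketch that uses only PyTorch and two dummy sequences with the lengths from the error message. PAD_ID is an assumption standing in for tokenizer.pad_token_id (0 for the T5 tokenizers, but check your own tokenizer).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two dummy token-id sequences with the lengths from the error message.
a = torch.ones(352, dtype=torch.long)
b = torch.ones(930, dtype=torch.long)

# The default DataLoader collation stacks examples, which needs equal sizes.
try:
    torch.stack([a, b])
    stack_failed = False
except RuntimeError as err:
    stack_failed = True
    print(err)

# Padding to the longest sequence makes the batch stackable.
PAD_ID = 0  # assumption: stands in for tokenizer.pad_token_id
padded = pad_sequence([a, b], batch_first=True, padding_value=PAD_ID)
print(padded.shape)  # torch.Size([2, 930])
```

The same reasoning applies to input_ids, attention_mask and labels alike: each key has to be the same length across all examples in a batch.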

Padding at tokenization time is applied before the DataLoader forms its batches, so it cannot adapt to the examples that actually end up together in a batch. A more robust alternative is a custom collate function (as in this thread) that pads all tensors within a batch to the length of the longest sequence in that batch, just before they are fed to the model. PyTorch's pad_sequence can do the padding:

from torch.nn.utils.rnn import pad_sequence

def custom_collate_fn(batch):
    inputs = [item['input_ids'] for item in batch]
    labels = [item['labels'] for item in batch]

    # Pad the inputs within the batch with the tokenizer's pad token
    padded_inputs = pad_sequence(inputs, batch_first=True, padding_value=tokenizer.pad_token_id)
    # Pad the labels with -100 so the loss ignores the padded positions
    padded_labels = pad_sequence(labels, batch_first=True, padding_value=-100)

    # Create a new batch with padded sequences
    new_batch = {
        'input_ids': padded_inputs,
        'labels': padded_labels,
        # 1 for real tokens, 0 for padding
        'attention_mask': padded_inputs.ne(tokenizer.pad_token_id).long()
    }
    return new_batch

# rest of your code

# Use the custom collate function in your DataLoaders
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8, collate_fn=custom_collate_fn)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8, collate_fn=custom_collate_fn)

# rest of your code

In custom_collate_fn, PyTorch's pad_sequence pads the input_ids and labels tensors to the length of the longest sequence in the batch, and the attention_mask is built to match, marking which positions hold real tokens and which hold padding. The function is then passed as the collate_fn argument of your DataLoader instances, so all tensors within a batch have the same length before they are fed to the model.
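
If you want to check the collate logic without downloading the model or the dataset, a sketch like the following may help. It is not your training code: the toy dataset, the name collate_with_padding, and the constants PAD_ID and LABEL_IGNORE_ID are all made up for illustration. Padding the labels with -100 is an assumption on my part, chosen because -100 is the default ignore_index of PyTorch's cross-entropy loss, so padded label positions do not contribute to the loss.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0              # assumption: stands in for tokenizer.pad_token_id
LABEL_IGNORE_ID = -100  # default ignore_index of PyTorch's cross-entropy loss

def collate_with_padding(batch):
    """Pad a list of {'input_ids', 'labels'} examples to the batch maximum."""
    input_ids = pad_sequence([ex["input_ids"] for ex in batch],
                             batch_first=True, padding_value=PAD_ID)
    labels = pad_sequence([ex["labels"] for ex in batch],
                          batch_first=True, padding_value=LABEL_IGNORE_ID)
    return {
        "input_ids": input_ids,
        "attention_mask": (input_ids != PAD_ID).long(),
        "labels": labels,
    }

# Hypothetical toy data: token ids start at 1 so they never collide with PAD_ID.
lengths = [(5, 2), (9, 4), (3, 6), (7, 1)]
toy_dataset = [
    {"input_ids": torch.arange(1, n_in + 1), "labels": torch.arange(1, n_out + 1)}
    for n_in, n_out in lengths
]

loader = DataLoader(toy_dataset, batch_size=2, collate_fn=collate_with_padding)
batches = list(loader)
for batch in batches:
    print({name: tuple(t.shape) for name, t in batch.items()})
```

Each batch is padded only to its own longest example, so the two batches come out with different widths. That is fine for the model, because shapes only need to agree within a batch.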