HuggingFace Transformers ValueError: not enough values to unpack (expected 2, got 1) when training a RoBERTa model


I'm trying to train a RoBERTa model using the Hugging Face Transformers library. The relevant part of my Python code is as follows:

# Relevant code from train_prompt_model.py
import json

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# ...

# Load pre-trained model and tokenizer
model_type = "CLTL/MedRoBERTa.nl"
model = AutoModelForCausalLM.from_pretrained(model_type)
tokenizer = AutoTokenizer.from_pretrained(model_type)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


# Load dataset
def gen():
    '''
    Read the JSON Lines dataset file and yield one example per line.
    '''
    with open('datasets/HealthCareMagic-100k/HealthCareMagic100k.json', 'r') as f:
        for line in f:
            record = json.loads(line)
            yield {
                "context": record["system_prompt"],
                "prompt": record["question_text"],
                "output": record["orig_answer_texts"],
            }

def preprocess_function(example):
    # Concatenate the context and the prompt into a single input string
    text = ' '.join(example['context']) + ' ' + ' '.join(example['prompt'])
    target = example['output']
    model_inputs = tokenizer(text, truncation=True, padding="max_length", max_length=1000)
    # Tokenize the target answer and use its token ids as the labels
    model_inputs["labels"] = tokenizer(target, truncation=True, padding="max_length", max_length=1000)["input_ids"]

    return model_inputs

# ...
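# (Dataset construction is not shown in this excerpt. A hypothetical
# reconstruction, purely for context -- these exact calls are an assumption:)
# raw_dataset = Dataset.from_generator(gen)        # would require: from datasets import Dataset
# tokenized = raw_dataset.map(preprocess_function)
# train_dataset, test_dataset = ...                # some train/test split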

# Define training arguments and instantiate Trainer
training_args = TrainingArguments(
    output_dir="test-trainer", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
)

trainer = Trainer(
    model=model, 
    args=training_args, 
    train_dataset=train_dataset, 
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

trainer.train()

However, when I run the script, I get the following error:

ValueError: not enough values to unpack (expected 2, got 1)

The error seems to originate from this line in the Transformers library (.env/lib/python3.11/site-packages/transformers/models/roberta/modeling_roberta.py):

batch_size, seq_length = input_shape

I'm not sure what's going wrong here: the input_shape variable apparently holds only one value where two are expected. Any ideas on what might be causing this and how to fix it?
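For reference, here is a minimal standalone sketch of the unpacking failure itself, assuming input_shape really is a one-element torch.Size:

import torch

# Unpacking a 1-D shape into two names raises exactly this error,
# because torch.Size([16]) contains only one element.
input_shape = torch.Size([16])
batch_size, seq_length = input_shape  # ValueError: not enough values to unpack (expected 2, got 1)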

I tried printing the input_shape value, and it yields torch.Size([16]). That matches my batch size, but the sequence-length dimension is missing, and I don't understand how the inputs should be reshaped (I assume they need to be) before being passed to the model. Any help is much appreciated!
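In case it helps with diagnosis, here is a rough sketch of how I've been inspecting the shapes; train_dataset and data_collator are the objects from the script above, and the "expected" comments are what I assume a correctly tokenized dataset should produce:

# Rough debugging sketch: inspect one preprocessed example and one
# collated batch before they reach the model.
sample = train_dataset[0]
print(type(sample["input_ids"]))  # expected: a list of token ids per example
print(len(sample["input_ids"]))   # expected: 1000 (the max_length used above)

batch = data_collator([train_dataset[i] for i in range(4)])
print(batch["input_ids"].shape)   # expected: torch.Size([4, 1000]), not torch.Size([4])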

NB: After loading, each example in my dataset is a dict with the keys "context", "prompt", and "output".
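For concreteness, here is a hypothetical example of one record as yielded by gen(); the field values are invented for illustration and are not real data:

# Hypothetical record; values invented purely for illustration.
example = {
    "context": "You are a doctor; answer the patient's question.",
    "prompt": "I have had a persistent headache for three days. What could cause this?",
    "output": "Persistent headaches can have many causes, including ...",
}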
