I'm trying to train a RoBERTa model using the Hugging Face Transformers library. My Python code is as follows:
# Relevant code from train_prompt_model.py
# ...
# Load pre-trained model and tokenizer
model_type = "CLTL/MedRoBERTa.nl"
model = AutoModelForCausalLM.from_pretrained(model_type)
tokenizer = AutoTokenizer.from_pretrained(model_type)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Load dataset
def gen() -> dict:
    '''
    Read the json file at path and yield each line.
    '''
    with open('datasets/HealthCareMagic-100k/HealthCareMagic100k.json', 'r') as f:
        for line in f:
            line = json.loads(line)
            yield {"context": line["system_prompt"], "prompt": line["question_text"], "output": line["orig_answer_texts"]}
def preprocess_function(example):
    text = ' '.join(example['context']) + ' ' + ' '.join(example['prompt'])
    target = example['output']
    model_inputs = tokenizer(text, truncation=True, padding="max_length", max_length=1000)
    model_inputs["labels"] = tokenizer(target, truncation=True, padding="max_length", max_length=1000)["input_ids"]
    return model_inputs
# ...
# Define training arguments and instantiate Trainer
training_args = TrainingArguments(
    output_dir="test-trainer",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()
However, when I run the script, I get the following error:
ValueError: not enough values to unpack (expected 2, got 1)
The error seems to originate from this line in the Transformers library (.env/lib/python3.11/site-packages/transformers/models/roberta/modeling_roberta.py):
batch_size, seq_length = input_shape
I'm not sure what's causing this error. It looks like input_shape has only one dimension where two (batch size and sequence length) are expected. Any ideas on what might be causing this and how to fix it?
Printing input_shape gives torch.Size([16]). That matches the batch size, but I don't understand how the input should be reshaped (I assume) before it is passed to the model. Any help is much appreciated!
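To show what I mean, the failure can be reproduced without Transformers. This is only a minimal sketch: plain tuples stand in for torch.Size (which is a tuple subclass), and unpack_shape is a hypothetical helper of mine that mimics the failing line, not the library's actual code.

```python
def unpack_shape(input_shape):
    """Mimic the failing line: expects a 2-D shape (batch_size, seq_length)."""
    batch_size, seq_length = input_shape
    return batch_size, seq_length


# A 2-D shape, which is what the model seems to expect.
ok = unpack_shape((16, 1000))

# A 1-D shape, which is what I actually observe.
try:
    unpack_shape((16,))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

So the unpacking itself behaves as expected; the open question is why my input_ids arrive with a single dimension.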
NB: My data is initially formatted as records with the following keys: {"context", "prompt", "output"}
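Related to the data format, one thing I'm unsure about in preprocess_function: if a field is a plain string rather than a list of strings, ' '.join iterates over its characters. I don't know whether this is connected to the error. The record below is made up for illustration and is not taken from my dataset.

```python
# Hypothetical record: "context" is a plain string, "prompt" is a list of strings.
example = {
    "context": "You are a doctor.",
    "prompt": ["What", "is", "flu?"],
}

# Joining a string puts a space between every character.
joined_str = ' '.join(example["context"])

# Joining a list of strings puts a space between the items.
joined_list = ' '.join(example["prompt"])

print(repr(joined_str))
print(repr(joined_list))
```

If my fields turn out to be plain strings, I would presumably concatenate them directly instead of calling join on each one.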