Fine-tuning a pre-trained LLM for question-answering

Objective

My goal is to fine-tune a pre-trained LLM on a dataset about Manchester United's (MU's) 2021/22 season (they had a poor season). I want to be able to prompt the fine-tuned model with questions such as "How can MU improve?" or "What are MU's biggest weaknesses?". The ideal responses would be insightful/logical and at least 100 words long.

Data

  • I will simply use text from the relevant wiki page as my data: https://en.wikipedia.org/wiki/2021%E2%80%9322_Manchester_United_F.C._season
  • How should I structure my data? Should it be a list of dictionaries where the keys are the questions and the values are the answers (i.e. a list of question-answer pairs), a single long string containing all the text (for context), or a combination of both? (The sketch after this list illustrates the options I have in mind.)
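For concreteness, the options I am considering look something like this (the file name mu_2021_22_season.txt is just a placeholder for the scraped wiki text):

```python
# Option 1: a list of question-answer dictionaries (what I have used so far)
qa_pairs = [
    {
        "question": "How could Manchester United improve their consistency "
                    "in the Premier League next season?",
        "answer": "To improve consistency, Manchester United could focus on "
                  "strengthening their squad depth...",
    },
    # ... more pairs
]

# Option 2: one long string of raw text scraped from the wiki page
with open("mu_2021_22_season.txt", encoding="utf-8") as f:
    context = f.read()

# Option 3: a combination, where each pair also carries the passage it came from
qa_with_context = [
    {"context": context, "question": p["question"], "answer": p["answer"]}
    for p in qa_pairs
]
```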

Notes

  • I have mainly been experimenting with variations of Google's T5 (e.g.: https://huggingface.co/t5-base) which I have imported from the Hugging Face Transformers library
  • So far I have only fine-tuned the model on a list of 30 dictionaries (question-answer pairs), e.g.: {"question": "How could Manchester United improve their consistency in the Premier League next season?", "answer": "To improve consistency, Manchester United could focus on strengthening their squad depth to cope with injuries and fatigue throughout the season. Tactical adjustments could also be explored to deal with teams of different strengths and styles."} (a simplified sketch of this setup follows this list)
  • Fine-tuning on this small dataset (a list of 30 dictionaries) has given poor results
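For reference, my fine-tuning code is roughly along these lines (a simplified sketch; the hyperparameter values shown are placeholders rather than my exact settings):

```python
import torch
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

qa_pairs = [
    {"question": "How could Manchester United improve their consistency "
                 "in the Premier League next season?",
     "answer": "To improve consistency, Manchester United could focus on "
               "strengthening their squad depth."},
    # ... 29 more pairs like this
]


class QADataset(torch.utils.data.Dataset):
    """Wraps the question-answer pairs as tokenized T5 inputs/targets."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        # T5 is text-to-text, so the task is signalled with a text prefix.
        inputs = tokenizer("question: " + pair["question"], truncation=True,
                           max_length=512, padding="max_length",
                           return_tensors="pt")
        targets = tokenizer(pair["answer"], truncation=True, max_length=256,
                            padding="max_length", return_tensors="pt")
        labels = targets["input_ids"].squeeze(0)
        labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0),
            "labels": labels,
        }


trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="t5-mu-qa", num_train_epochs=10,
                           per_device_train_batch_size=4, learning_rate=3e-4),
    train_dataset=QADataset(qa_pairs),
)
trainer.train()
```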

Further Questions and Notes

  • Other than increasing the size of my dataset, is my approach sound?
  • What would you recommend as a minimum number of dictionaries to train/fine-tune the model on?
  • I am also aware that I can tune the hyperparameters to improve performance, but for now I am more concerned about my general approach being logical
1 Answer

You can try to see how far you can get with LLMs and prompting (e.g., use Alpaca-LoRA or libraries like LangChain and FastChat).
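For example, the prompting route can be tried with nothing more than the transformers pipeline. The checkpoint and file name below are only illustrative assumptions, and proper chunking/retrieval of the context is left out:

```python
from transformers import pipeline

# Any instruction-tuned text-to-text checkpoint will do; flan-t5-large is one example.
generator = pipeline("text2text-generation", model="google/flan-t5-large")

# Placeholder file containing the scraped wiki text; truncated crudely here.
with open("mu_2021_22_season.txt", encoding="utf-8") as f:
    context = f.read()[:2000]

prompt = (
    "Answer the question in at least 100 words, using the context below.\n\n"
    f"Context: {context}\n\n"
    "Question: What are Manchester United's biggest weaknesses?"
)
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```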

However, if you want to persist with an approach similar to your current one, then given the limited data you have, I would highly recommend considering a zero-shot approach: fine-tune your T5 model on a large Q&A dataset that is unrelated to your problem domain, and then test it on your current annotated data. If you are satisfied with the model's performance, you can stop there.
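As a simplified sketch of that workflow, assuming SQuAD as the large out-of-domain Q&A dataset (any well-established one would do):

```python
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Large, well-established Q&A dataset that has nothing to do with football.
squad = load_dataset("squad", split="train")

def preprocess(example):
    # T5-style text-to-text formatting: question + context in, answer out.
    model_inputs = tokenizer(
        "question: " + example["question"] + " context: " + example["context"],
        truncation=True, max_length=512,
    )
    model_inputs["labels"] = tokenizer(
        example["answers"]["text"][0], truncation=True, max_length=64
    )["input_ids"]
    return model_inputs

tokenized = squad.map(preprocess, remove_columns=squad.column_names)

# Train with Trainer as usual on `tokenized`, then generate answers for the
# 30 Manchester United questions and compare them with the hand-written ones,
# without any further in-domain fine-tuning.
```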

You can refer to my paper "To tune or not to tune? Zero-shot models for legal case entailment", where I deal with a very similar problem. The conclusion of the paper is that if you don't have enough data for fine-tuning, it is sometimes better to simply forgo the target domain and fine-tune your models on a well-established dataset, even if it may be on a completely different subject.

As for how you should structure your test data, I can't provide a specific answer because it's highly dependent on what is happening in your code. It's difficult to prescribe what kind of preprocessing should be done in a high-level discussion like this.