How much time can data preprocessing and annotation take when fine-tuning an LLM on around 1k documents?


For data preprocessing, I estimate having to do data cleaning, text normalization, parsing, tokenization, jargon handling, and data structuring for a question-answering LLM.
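To make the scope concrete, here is a minimal sketch of those steps as a single pipeline, using only the Python standard library. Every helper name, the glossary entries, and the sentence-splitting regex are my own placeholder assumptions, not an established recipe; tokenization itself is deferred to the model's tokenizer.

```python
import re
import json

def preprocess(raw_text: str) -> list[dict]:
    """Hypothetical helper: clean and structure one document into
    QA-ready records. Step names mirror the list above."""
    # Data cleaning: collapse whitespace left over from PDF extraction
    text = re.sub(r"\s+", " ", raw_text).strip()
    # Text normalization: unify curly quotes (add case-folding etc. as needed)
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Parsing: naive sentence split -- note it breaks on legal abbreviations,
    # so a real legal-text parser would be substituted here
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Jargon handling: expand a tiny placeholder glossary of abbreviations
    glossary = {"s.": "section", "v.": "versus"}
    sentences = [
        " ".join(glossary.get(tok, tok) for tok in s.split())
        for s in sentences
    ]
    # Data structuring: one JSON-serializable record per sentence;
    # subword tokenization is left to the model's own tokenizer later
    return [{"id": i, "text": s} for i, s in enumerate(sentences) if s]

records = preprocess("The court ruled v. the motion.  See s. 12.")
print(json.dumps(records))
```

Even this toy version shows where the manual labor hides: the naive splitter breaks on "v.", which is exactly the kind of jargon handling that has to be reviewed by hand.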

I want an estimate of how much labor and time preprocessing and annotation would take for a corpus of around 1,000 legal documents of approximately 100-200 pages each. My base model is pile-of-law/legalbert-large-1.7M-2 (https://huggingface.co/pile-of-law/legalbert-large-1.7M-2), which I will fine-tune further on more domain-specific documents.
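One way to scope this myself is a parametric back-of-envelope calculation; every rate below is a placeholder assumption I would replace with measured rates from a pilot on a handful of documents, not a real figure.

```python
# Back-of-envelope labor estimate; all rates are placeholder assumptions.
docs = 1000
pages_per_doc = 150            # midpoint of the 100-200 page range
minutes_per_page_clean = 0.5   # assumed rate for reviewing automated cleaning
qa_pairs_per_doc = 20          # assumed annotation target per document
minutes_per_qa_pair = 5        # assumed time to write and verify one QA pair

preprocess_hours = docs * pages_per_doc * minutes_per_page_clean / 60
annotation_hours = docs * qa_pairs_per_doc * minutes_per_qa_pair / 60
print(f"preprocessing ~ {preprocess_hours:.0f} h, "
      f"annotation ~ {annotation_hours:.0f} h")
```

The point of the sketch is that the answer is dominated by the per-page and per-QA-pair rates, which only a small pilot run can pin down.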

I am still working out the timeline for my project and have so far looked at a few pre-trained base models for my use case.
