For data preprocessing, I expect to need data cleaning, text normalization, parsing, tokenization, handling of legal jargon, and data structuring for a question-answering LLM.
I want to get an estimate of how much labor and time preprocessing and annotation might take if I am training the LLM on a corpus of around 1,000 legal documents, each roughly 100-200 pages. My base model is pile-of-law/legalbert-large-1.7M-2 (https://huggingface.co/pile-of-law/legalbert-large-1.7M-2), which I will fine-tune further on more domain-specific documents.
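To make the scope concrete, here is a minimal sketch of the per-document pipeline I have in mind (assuming plain text has already been extracted from the PDFs; the cleaning rules and the qa_pairs record layout are placeholders, not a finished design):

```python
import re
from transformers import AutoTokenizer

# Tokenizer from the base model I plan to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-2")

def clean_and_normalize(text: str) -> str:
    # Cleaning/normalization step: collapse whitespace, normalize curly quotes.
    text = re.sub(r"\s+", " ", text)
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return text.strip()

def chunk_for_model(text: str, max_tokens: int = 512):
    # Tokenization step: split a long document into chunks that fit the
    # BERT-style 512-token limit (leaving room for special tokens).
    tokens = tokenizer(text, add_special_tokens=False)["input_ids"]
    for i in range(0, len(tokens), max_tokens - 2):
        yield tokenizer.decode(tokens[i : i + max_tokens - 2])

def to_qa_records(doc_id: str, text: str):
    # Data structuring step: each chunk becomes a context record whose
    # question-answer pairs get filled in during manual annotation.
    cleaned = clean_and_normalize(text)
    return [
        {"doc_id": doc_id, "context": chunk, "qa_pairs": []}
        for chunk in chunk_for_model(cleaned)
    ]
```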
I am still working out the timeline for the project, and so far I have only looked at some pre-trained base models for my use case.