I'm a novice in the field of LLMs, interning at a research institute. My current project involves building a dataset for supervised fine-tuning of the Llama2 model with a focus on long context windows; we're targeting a window length of 100k tokens. This means assembling a corpus of texts roughly 100k tokens long, starting with general testing across various data types and eventually pivoting to medical data for a healthcare-focused model. I'm seeking guidance on how to build such a corpus, as my literature search hasn't yielded substantial leads. Any advice or pointers to relevant resources would be greatly appreciated.
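For concreteness, this is roughly how I'm currently measuring document lengths and packing texts up to the 100k-token target (a minimal sketch only; the tokenizer id and the greedy packing strategy are assumptions on my part, not an established recipe):

```python
# Minimal sketch: pack plain-text documents into ~100k-token training samples
# using the Llama 2 tokenizer from Hugging Face.
# The model id is an assumption (the official Llama 2 repos are gated).
from transformers import AutoTokenizer

TARGET_TOKENS = 100_000  # desired context length per training sample

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def pack_documents(docs):
    """Greedily concatenate documents until each pack is ~TARGET_TOKENS long."""
    packs, current, current_len = [], [], 0
    for doc in docs:
        n_tokens = len(tokenizer.encode(doc, add_special_tokens=False))
        if current and current_len + n_tokens > TARGET_TOKENS:
            packs.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(doc)
        current_len += n_tokens
    if current:
        packs.append("\n\n".join(current))
    return packs
```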
I've been exploring the use of arXiv, treating article bodies as long inputs and their abstracts as summary-style answers, and splitting Harry Potter books 1-7 into long-text segments with corresponding summaries as labels. However, this approach feels quite monotonous. Based on some of the literature, I'm contemplating using GPT to automatically generate diverse questions and their answers over long documents. Could this be a viable approach to enrich my corpus, and if so, how might I implement it? Something like the sketch below is roughly what I have in mind.
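A minimal sketch of the pipeline I'm imagining (the model name, prompt wording, and JSON output format are all assumptions on my part, not a tested recipe):

```python
# Sketch: ask GPT to generate question-answer pairs grounded in one long
# document chunk, for use as long-context SFT labels.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Read the following text and write 3 diverse questions about it, "
    "each with a faithful answer grounded only in the text. "
    "Return a JSON list of objects with 'question' and 'answer' fields.\n\n"
    "Text:\n{chunk}"
)

def generate_qa_pairs(chunk: str):
    """Ask the model for QA pairs about one document chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(chunk=chunk)}],
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return []  # the model may not return valid JSON; would need retries/validation
```

The idea would be to vary the question types per chunk (summarization, fact lookup, reasoning across sections) so the corpus isn't just summarization pairs, and then assemble each full document plus its generated QA pairs into a long-context training sample.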