How to improve/preprocess text (in special cases) so the embeddings and LLM will have better context?


I have been working on ingesting local documents into a vector DB so that their embeddings can be used as context for an LLM.

The problem is that the local documents are very high level (more details below). After they are chunked and embedded:

  • When asking a question related to a heading, only the first few text chunks are returned (ex: heading_1 : list of items --- when the vector DB is asked about heading_1, it only returns the few chunks where heading_1 itself appears).
  • Certain questions pull in the previous point's (statement/bullet point) data as well (ex: 1. item 1: blah blah \n item 2: foo foo ---- when asked about item 2, the vector DB also returns item 1's "blah blah")

Most of the time the returned chunks only partially cover the answer, and sometimes nothing relevant is returned even though the information is in the documents...

More information -

Local documents - very high level; mostly contain bullet or numbered lists of points/updates/statements about the topic (PDF files)
PDF reader - PyMuPDF
Vector DB - Chroma
LLM - GPT4All
Sentence transformer - all-MiniLM-L6-v2

(BTW, I am a data engineer, and learning while doing this...)
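To make the failure mode concrete, here is a minimal sketch of that kind of pipeline with naive fixed-size chunking (the file name, chunk size, and collection name are placeholders, not my actual setup):

```python
# Minimal sketch of the current pipeline (naive fixed-size chunking).
# File name, chunk size and collection name are placeholders.
import fitz  # PyMuPDF
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().get_or_create_collection("local_docs")

# Extract raw text from the PDF
doc = fitz.open("updates.pdf")
text = "\n".join(page.get_text() for page in doc)

# Fixed-size chunking: a heading and its list items often land in different chunks
chunk_size = 500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
)

# Query: only the chunks semantically close to the heading text come back
results = collection.query(
    query_embeddings=model.encode(["What is listed under heading_1?"]).tolist(),
    n_results=3,
)
print(results["documents"])
```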

I guess this is because of missing context (neither the model nor the embeddings know how the items relate to their headings). So I planned to add more context to the documents by:

  • Identifying headings and their lists of items, and adding context such as "below/above are the list of items..."
  • (just an idea) creating a nested dict from the unstructured data (how: PyMuPDF gives access to text sizes, so use the font size to build a nested dict, with the heading as key and the content/children as value)
  • or just breaking it into (heading, content) pairs and pushing each as a separate source to the vector DB with some metadata (see the sketch after this list)
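To illustrate the second and third ideas, here is a rough sketch that uses PyMuPDF span sizes to detect headings, groups the following lines under them, and stores each (heading, content) block in Chroma with metadata. The size threshold, file name, and collection name are placeholders; I have not verified this against real documents:

```python
# Sketch: heading-aware chunking using PyMuPDF font sizes (threshold is a guess).
import fitz
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().get_or_create_collection("local_docs_structured")

HEADING_SIZE = 14  # assumed: anything larger than body text counts as a heading

sections = []            # list of (heading, [content lines])
current = ("untitled", [])

doc = fitz.open("updates.pdf")
for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):
            line_text = " ".join(span["text"] for span in line["spans"]).strip()
            if not line_text:
                continue
            max_size = max(span["size"] for span in line["spans"])
            if max_size >= HEADING_SIZE:
                sections.append(current)
                current = (line_text, [])
            else:
                current[1].append(line_text)
sections.append(current)

for i, (heading, lines) in enumerate(sections):
    if not lines:
        continue
    # Prepend the heading so every chunk carries its own context
    chunk = f"Below are the items listed under '{heading}':\n" + "\n".join(lines)
    collection.add(
        ids=[f"section_{i}"],
        documents=[chunk],
        embeddings=model.encode([chunk]).tolist(),
        metadatas=[{"heading": heading, "source": "updates.pdf"}],
    )
```

With this, every retrieved chunk contains its heading, and the heading metadata could also be used for filtering at query time.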

Will these approaches work, or is there a better solution for this? (Training a model would be a last resort at this point.)

There is 1 answer below


Have you thought about:

Bigger chunk size to get a more holistic context?
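For example, a larger chunk size with some overlap keeps a heading and its bullet items together more often. A minimal sketch (the chunk size and overlap are arbitrary numbers to illustrate the idea, not tested recommendations):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so list items keep their heading nearby."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```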

Transformers from doctran? Maybe its interrogate transformation might be useful in your case?
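If you try that route, an untested sketch using LangChain's wrapper around doctran's interrogation (this assumes you add LangChain to your stack and have an OpenAI API key, since doctran calls OpenAI under the hood; the exact API may differ between versions):

```python
# Untested sketch: turn documents into Q&A pairs via doctran's interrogation,
# using LangChain's wrapper. Requires OPENAI_API_KEY in the environment.
import asyncio
from langchain.schema import Document
from langchain.document_transformers import DoctranQATransformer

docs = [Document(page_content="heading_1\n1. item 1: blah blah\n2. item 2: foo foo")]

qa_transformer = DoctranQATransformer()
transformed = asyncio.run(qa_transformer.atransform_documents(docs))

# The generated question/answer pairs land in the document metadata
print(transformed[0].metadata)
```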

Furthermore it’s quite difficult to answer your question without a good example of a document.