I'm working on a Python project that processes documents through a language model inside a for loop. I have a set of questions that I want to ask an LLM about each of the PDF documents in a folder, collecting the answers in a table. I'm using a free, local LLM and embeddings model (downloaded from GPT4All).
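The kind of table I'm aiming for looks roughly like this (illustrative values only, one row per PDF):

document      Question 1               Question 2
report_a.pdf  <answer from report_a>   <answer from report_a>
report_b.pdf  <answer from report_b>   <answer from report_b>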
Here's a simplified version of what my code does:
First, the function that produces the answers for a single document:
import os
import pandas as pd
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

def question_pdf(base_questions, pdf_name):
    # Load the PDF and split it into overlapping chunks
    documents = PyPDFLoader(pdf_name).load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    texts = text_splitter.split_documents(documents)

    # Embed the chunks and index them in Chroma
    embeddings = HuggingFaceEmbeddings(model_name="X:/.../sentence-transformers_all-MiniLM-L6-v2")
    db2 = Chroma.from_documents(texts, embeddings, persist_directory="db2")

    # Local GPT4All model
    model_path = "X:/.../mistral-7b-openorca.Q4_0.gguf"
    llm = GPT4All(model=model_path, backend="gptj", verbose=False)

    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db2.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        verbose=False,
    )

    # One row per document, one column per question
    data = {"document": os.path.basename(pdf_name)}
    for question in base_questions:
        res = qa.invoke(question)
        data[question] = res["result"]

    return pd.DataFrame([data])
And here's the loop that applies the function to every document:
# folder_path is the folder containing the PDFs
pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith(".pdf")]

final_df_results = pd.DataFrame()
for pdf_file in pdf_files:
    pdf_path = os.path.join(folder_path, pdf_file)
    df_results = question_pdf(base_questions, pdf_path)
    final_df_results = pd.concat([final_df_results, df_results], ignore_index=True)
However, I've encountered unexpected behaviour: the model appears to be fine-tuning on the input data at each iteration of the loop. The process also generates unwanted folders and files in the directory where the embeddings model is located, which I presume are related to this fine-tuning activity. Because of this, the model "learns" as it goes and takes the previously analysed documents into account when it analyses the next one. I know this because when I look at the sources of the answers, I can see that the model based its analysis on the previous documents. This is something I don't want, because the analyses must be independent.
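This is roughly how I inspect the sources (a sketch; the source_documents field is available because return_source_documents=True is set on the chain above):

res = qa.invoke(question)
# Each retrieved chunk carries metadata from PyPDFLoader, including the
# path of the PDF it was extracted from.
for doc in res["source_documents"]:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])

The printed source paths sometimes point at PDFs from earlier loop iterations, not the document currently being analysed.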
I experimented with modifying the code to explicitly disable any training or fine-tuning mode in the model's configuration, but it didn't change the behaviour.
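For example, this is the kind of change I tried (a sketch only: it loads the embeddings model directly with sentence-transformers and forces it into evaluation mode, on the assumption that some training behaviour might be switched on):

from sentence_transformers import SentenceTransformer

# Load the same embeddings model directly and make sure it cannot train
st_model = SentenceTransformer("X:/.../sentence-transformers_all-MiniLM-L6-v2")
st_model.eval()                  # evaluation mode (disables dropout etc.)
for p in st_model.parameters():  # block gradient updates
    p.requires_grad = False

None of this stopped the previous documents from showing up in the answers.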