Preventing Automatic Fine-Tuning during Inference Loop in Python


I'm working on a Python project that processes documents through a language model inside a for loop. Basically, I have a set of questions that I want to ask an LLM about each of the many PDF documents in a folder, with the answers collected into a table. I'm using a free, local LLM and embeddings model (downloaded from GPT4All).

Here's a simplified version of what my code does:

Here's the function that produces the answers for one document:

import os
import pandas as pd
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA
from sentence_transformers import SentenceTransformer


def question_pdf(base_questions, pdf_path):
    # Load the PDF and split it into overlapping chunks
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    texts = text_splitter.split_documents(documents)

    # Local embeddings model and a persisted Chroma vector store
    embeddings = HuggingFaceEmbeddings(model_name="X:/.../sentence-transformers_all-MiniLM-L6-v2")
    db2 = Chroma.from_documents(texts, embeddings, persist_directory="db2")

    # Local GPT4All model used as the LLM
    model_path = "X:/.../mistral-7b-openorca.Q4_0.gguf"
    llm = GPT4All(model=model_path, backend="gptj", verbose=False)

    # Retrieval-augmented QA chain over the 3 best-matching chunks
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db2.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        verbose=False,
    )

    
    # Ask every question against this document and collect the answers
    data = {}
    for question in base_questions:
        res = qa.invoke(question)
        data[question] = res["result"]

    # One row per document: columns are the questions, values the answers
    df_results = pd.DataFrame([data])
    return df_results

And here is the loop that applies the function to every document:

final_df_results = pd.DataFrame()


for pdf_file in pdf_files:
    pdf_path = os.path.join(folder_path, pdf_file)  
    df_results = question_pdf(base_questions, pdf_path) 
    final_df_results = pd.concat([final_df_results, df_results], ignore_index=True) 
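As a side note on the loop itself: calling `pd.concat` on every iteration re-copies the accumulated frame each time. A common alternative is to collect one dict per document and build the DataFrame once at the end. A minimal sketch (using a stand-in `fake_question_pdf` instead of the real chain, so it runs without any models):

```python
import pandas as pd

def fake_question_pdf(base_questions, pdf_path):
    # Stand-in for question_pdf: one {question: answer} dict per document
    return {q: f"answer to {q!r} for {pdf_path}" for q in base_questions}

base_questions = ["What is the title?", "Who is the author?"]
pdf_files = ["a.pdf", "b.pdf"]

# Build all rows first, then create the DataFrame in one step
rows = [fake_question_pdf(base_questions, p) for p in pdf_files]
final_df_results = pd.DataFrame(rows)
print(final_df_results.shape)  # (2, 2)
```

This produces the same table layout (one row per PDF, one column per question) without repeated concatenation.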

However, I've encountered unexpected behavior: the model appears to "fine-tune" on the input data at each iteration of the loop. The process also generates unwanted folders and files in the directory where the embeddings model is located, which I presume are related to this. Because of this, the model "learns" by itself and takes the previously analysed documents into account when it analyses the next ones. I know this because when I look at the sources of the answers, I can see that the model based its analysis on the previous documents. I don't want this to happen, because the analyses must be independent.

I experimented with modifying the code to explicitly disable training or fine-tuning modes in the model's configuration, but it didn't work.
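One pattern I'm considering to force independence, regardless of any model settings, is giving every document its own throwaway vector-store directory and deleting it after the run. A minimal sketch of that pattern (the chain calls are commented out and the helper name `run_isolated` is hypothetical; a placeholder answer keeps the sketch runnable):

```python
import shutil
import tempfile

def run_isolated(pdf_path):
    # Fresh directory per document, so nothing persists between runs
    workdir = tempfile.mkdtemp(prefix="chroma_run_")
    try:
        # db = Chroma.from_documents(texts, embeddings, persist_directory=workdir)
        # qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever(), ...)
        # answers = {q: qa.invoke(q)["result"] for q in base_questions}
        answers = {"doc": pdf_path}  # placeholder so the sketch runs as-is
    finally:
        # Remove the per-document store so the next run starts from scratch
        shutil.rmtree(workdir, ignore_errors=True)
    return answers

result = run_isolated("a.pdf")
```

If I understand the langchain API correctly, omitting `persist_directory` entirely should also keep the Chroma collection in memory only, which may be simpler.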
