How to detect whether ConversationalRetrievalChain called the OpenAI LLM?


I have the following code:

from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

chat_history = []
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())
result = qa({"question": "What is stack overflow", "chat_history": chat_history})

The code creates embeddings, builds an in-memory FAISS vector store from the text I have in the chunks array, creates a ConversationalRetrievalChain, and then asks a question.

Based on what I understand from ConversationalRetrievalChain, when asked a question, it will first query the FAISS vector db, then, if it can't find anything matching, it will go to OpenAI to answer that question. (is my understanding correct?)

How can I detect if it actually called OpenAI to get the answer or it was able to get it from the in-memory vector DB? The result object contains question, chat_history and answer properties and nothing else.


4 Answers

BEST ANSWER (score: 3)

"Based on what I understand from ConversationalRetrievalChain, when asked a question, it will first query the FAISS vector db, then, if it can't find anything matching, it will go to OpenAI to answer that question."

This part is not correct. Each time ConversationalRetrievalChain receives a query in the conversation, it rephrases the question, retrieves documents from your vector store (FAISS in your case), and returns an answer generated by the LLM (OpenAI in your case). In other words, ConversationalRetrievalChain is the conversational version of RetrievalQA, and the LLM is called for every question.
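If you want to see which chunks were actually retrieved and fed to the LLM for a given question, you can ask the chain to return them. A minimal sketch, reusing the db and chat_history from the question; return_source_documents is an option on the chain:

from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI

qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0.1),
    db.as_retriever(),
    return_source_documents=True,  # include the retrieved chunks in the result
)

result = qa({"question": "What is stack overflow", "chat_history": chat_history})
# the answer text is still generated by the OpenAI LLM; the retrieved chunks
# are only the context that was stuffed into its prompt
print(result["answer"])
print(result["source_documents"])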

ANSWER (score: 0)

I personally don't think ConversationalRetrievalChain can get you any answer from the documents without sending an API request to OpenAI in the provided example. But I'm not an expert on it, so I could be wrong.

But you could use another, cheaper (or local) LLM to condense the follow-up question into a standalone question, which helps optimize the token count.

Here is their example:

from langchain.chat_models import ChatOpenAI

qa = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(temperature=0, model="gpt-4"),
    vectorstore.as_retriever(),
    # cheaper model used only to rewrite the follow-up question
    condense_question_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
)

One way to trace API usage is with the OpenAI callback:

from langchain.callbacks import get_openai_callback

# every OpenAI call made inside this block is counted, so you can wrap
# the qa({...}) call from the question in exactly the same way
with get_openai_callback() as cb:
    result = llm("Tell me a joke")
    print(cb)

Tokens Used: 42
    Prompt Tokens: 4
    Completion Tokens: 38
Successful Requests: 1
Total Cost (USD): $0.00084
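Applied to the chain from the question, the same callback tells you directly whether OpenAI was called. A sketch, assuming the qa and chat_history objects defined there:

from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = qa({"question": "What is stack overflow", "chat_history": chat_history})

# if anything reached OpenAI, successful_requests will be greater than zero
print(cb.successful_requests, cb.total_tokens, cb.total_cost)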

Another useful way is to use an additional tool to trace requests: https://github.com/amosjyng/langchain-visualizer

ANSWER (score: 1)

You can detect if the answer was obtained from the in-memory vector database by checking if the "answer" property exists and is not empty in the result object. If it's present, the answer came from the database; otherwise, it was generated by the OpenAI model.

ANSWER (score: 0)

Hi, you can apply for access to https://smith.langchain.com/ to visually trace the ConversationalRetrievalChain.

(Screenshot: LangSmith trace of the chain's runs.)
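Tracing is usually switched on through environment variables before the chain is built; a minimal sketch (the project name here is just an example):

import os

# send every chain and LLM run to LangSmith so each call can be inspected
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "conversational-retrieval-demo"  # example name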

Here I'm using AzureChatOpenAI. The first call to the LLMChain is for "Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language."

The second call is for your specific prompt, or the LangChain default prompt.

In addition, you can set verbose=True on ConversationalRetrievalChain.from_llm to see what is happening, as in the sketch below.
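A minimal sketch of that, reusing the db retriever from the question:

from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI

# verbose=True prints the prompts sent to the LLM, including both the
# condense-question step and the final question-answering step
qa = ConversationalRetrievalChain.from_llm(
    OpenAI(temperature=0.1),
    db.as_retriever(),
    verbose=True,
)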

Hope it helps. Regards.