I am following various tutorials on LangChain, and am now trying to figure out how to use a subset of the documents in the vectorstore instead of the whole database.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
Imagine a chat scenario.
- User: I am looking for X.
- Chatbot: (asks a deterministic question e.g.) From what date is this document?
- User: 2023-11-16
- Backend: (filters the vectorstore somehow, e.g.)
retriever = vectorstore.as_retriever(filters="document_name matches '2023-11-16*'")
- Chatbot: here are some relevant documents: ...
In the documentation, they list an example:
docsearch.as_retriever(
search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}}
)
What isn't clear is:
- What is paper_title? Is that metadata or text inside the document?
- If this is metadata, how do I specify it?
paper_title is a metadata field on the document (think of it as a column name). With 'paper_title': 'GPT-4 Technical Report' you are restricting the search to documents whose paper_title has that value.
chromadb uses SQLite to store all the embeddings. You can read more here.
Yes, that is metadata, and the docs show how you specify it.
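To make the metadata-versus-text distinction concrete, here is a minimal pure-Python sketch of what an equality filter like {'paper_title': 'GPT-4 Technical Report'} does. It is not LangChain's or Chroma's actual implementation: Doc and filter_by_metadata are hypothetical stand-ins, and the sample texts, titles, and dates are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: list[Doc], where: dict[str, Any]) -> list[Doc]:
    """Keep only docs whose metadata equals `where` on every listed key."""
    return [
        d for d in docs
        if all(k in d.metadata and d.metadata[k] == v for k, v in where.items())
    ]


# Made-up sample data: metadata is attached when the document is created.
docs = [
    Doc(
        "GPT-4 is a large multimodal model.",
        {"paper_title": "GPT-4 Technical Report", "date": "2023-11-16"},
    ),
    Doc(
        "This text mentions the GPT-4 Technical Report.",
        {"paper_title": "Some Other Paper", "date": "2023-11-17"},
    ),
]

# The filter looks at metadata only. The second doc mentions the title
# in its text, but its metadata does not match, so it is excluded.
matches = filter_by_metadata(docs, {"paper_title": "GPT-4 Technical Report"})
print([d.page_content for d in matches])

# The chat scenario from the question: narrow by a date stored in metadata.
by_date = filter_by_metadata(docs, {"date": "2023-11-17"})
print([d.metadata["paper_title"] for d in by_date])
```

In LangChain itself, the metadata is the metadata dict you pass when building each Document (for example Document(page_content=..., metadata={'date': '2023-11-16'})), and the retriever call from the question, as_retriever(search_kwargs={'filter': {...}}), then matches against those keys rather than against the document text.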