I am working on a chat application in Python using LangChain. The idea is that a user submits some PDF files, which the chat model uses as its knowledge source, and then asks the model questions about those documents. The embeddings are stored in a ChromaDB vector database, so it is effectively a RAG-based solution.
Both the creation and the storage of embeddings work fine, and the chat works well too. However, I am also storing custom metadata and some IDs with the embeddings. The code for that is as follows:

    def read_docs(pdf_file):
        pdf_loader = PyPDFLoader(pdf_file)
        pdf_documents = pdf_loader.load()
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        documents = text_splitter.split_documents(pdf_documents)
        return documents

    def generate_and_store_embeddings(documents, pdf_file, user_id):
        client = chromadb.PersistentClient(path="./trained_db")
        collection = client.get_or_create_collection(
            "PDF_Embeddings",
            embedding_function=embedding_functions.OpenAIEmbeddingFunction(
                api_key=config["OPENAI_API_KEY"],
                model_name=configs.EMBEDDINGS_MODEL,
            ),
        )
        now = datetime.now()
        # custom metadata and ids I want to store along with the embeddings for each pdf
        metadata = {
            "source": pdf_file.filename,
            "user": str(user_id),
            "created_at": now.strftime("%d/%m/%Y %H:%M:%S"),
        }
        ids = [str(uuid.uuid4()) for _ in range(len(documents))]
        try:
            vectordb = Chroma.from_documents(
                documents,
                embedding=OpenAIEmbeddings(
                    openai_api_key=config["OPENAI_API_KEY"],
                    model=configs.EMBEDDINGS_MODEL,
                ),
                persist_directory="./trained_db",
                collection_name=collection.name,
                client=client,
                ids=ids,
                collection_metadata={item: value for (item, value) in metadata.items()},
            )
            vectordb.persist()
        except Exception as err:
            print(f"An error occurred: {err=}, {type(err)=}")
            return {"answer": "An error occurred while generating embeddings. Please check terminal for more details."}
        return vectordb

Now, what I want is to retrieve only the IDs and metadata associated with a given PDF file, rather than all the IDs and metadata in the collection. That way, when a user names a PDF file whose embeddings should be deleted, I can look up the IDs for that file alone and use them to delete its embeddings from the collection.
I know about the vectordb._collection.get() function, but it returns all the IDs.
I also tried print(vectordb.get(where={"source": pdf_file.filename})), but it returns:
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}
The where filter in vectordb.get(where={"source": pdf_file.filename}) applies to the metadata on each individual Document, not to the metadata of the collection. You'll have to add the source metadata to each Document in order to query for all documents of a single source. This is how you can get all the document IDs of a single PDF so that they can be deleted later.
When generate_and_store_embeddings(documents, pdf_file, user_id) is called, the Document metadata will be persisted. You don't need to pass the metadata to the collection_metadata parameter of the Chroma.from_documents() function.
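As a rough illustration of the approach described above, here is a minimal sketch. The helper names tag_documents and delete_pdf_embeddings are hypothetical (they are not part of LangChain or Chroma). The sketch assumes that documents is the list of LangChain Document objects returned by read_docs, and that vectordb is a LangChain Chroma store exposing get(where=...) and delete(ids=...).

```python
from datetime import datetime


def tag_documents(documents, source, user_id):
    """Copy the per-PDF metadata onto every chunk (hypothetical helper)."""
    created_at = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    for doc in documents:
        # Each Document carries its own metadata dict, which is stored
        # alongside that chunk's embedding and can be filtered with `where`.
        doc.metadata.update(
            {"source": source, "user": str(user_id), "created_at": created_at}
        )
    return documents


def delete_pdf_embeddings(vectordb, source):
    """Delete all chunks whose 'source' metadata matches (hypothetical helper)."""
    # Assumes `vectordb` behaves like LangChain's Chroma wrapper.
    matches = vectordb.get(where={"source": source})
    ids = matches["ids"]
    if ids:
        # Skip the delete call when nothing matched.
        vectordb.delete(ids=ids)
    return ids
```

With this sketch, you would call tag_documents(documents, pdf_file.filename, user_id) before Chroma.from_documents(...) and drop the collection_metadata argument. Later, delete_pdf_embeddings(vectordb, pdf_file.filename) should return the IDs it removed for that PDF.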