Cannot load persisted db using Chroma / Langchain


I ingested all docs and created a collection / embeddings using Chroma. I have a local directory db. Within db there is chroma-collections.parquet and chroma-embeddings.parquet. These are not empty. Chroma-collections.parquet when opened returns a collection name, uuid, and null metadata.

When I load it later using LangChain, the collection comes back empty.

from chromadb.config import Settings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# embeddings_model_name is defined elsewhere in my script
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory='db',
        anonymized_telemetry=False
)

db = Chroma(persist_directory='db', embedding_function=embeddings, client_settings=CHROMA_SETTINGS)

db.get() returns {'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

I've tried several alternative approaches I found online, e.g.

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory='./db'))
coll = client.get_or_create_collection("langchain", embedding_function=embeddings)
coll.count() returns 0

I'm expecting all the docs and embeddings to be available. What am I missing?


There are 3 best solutions below

0
On

I ran into this problem too and found that it happened because my program ran chromadb in JupyterLab (the same applies to Jupyter Notebook).

An example in the official chromadb Git repository says:

In a notebook, we should call persist() to ensure the embeddings are written to disk. This isn't necessary in a script - the database will be automatically persisted when the client object is destroyed.

So, if your program also runs in a Jupyter environment, the best approach is to call client.persist() every time you need to save your modifications to chromadb's local persistence. Example code follows:

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory='./db'))
coll = client.get_or_create_collection("langchain", embedding_function=embeddings)

... # any modifications to chromadb, including create, upsert, delete...

client.persist() # save modifications above to chroma's local persistence
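
To rule out the case where nothing was written at all, a quick check of the persist directory can help after calling persist(). The following is only a minimal sketch using the standard library: the two file names are the ones mentioned in the question (duckdb+parquet backend), and persisted_file_sizes / looks_persisted are hypothetical helper names, not part of chromadb.

```python
from pathlib import Path

# File names taken from the question; other Chroma backends may use different files.
EXPECTED_FILES = ("chroma-collections.parquet", "chroma-embeddings.parquet")


def persisted_file_sizes(persist_directory):
    """Return {file name: size in bytes, or None if the file is missing}."""
    root = Path(persist_directory)
    sizes = {}
    for name in EXPECTED_FILES:
        path = root / name
        sizes[name] = path.stat().st_size if path.is_file() else None
    return sizes


def looks_persisted(persist_directory):
    """True only if every expected file exists and is non-empty."""
    return all(size for size in persisted_file_sizes(persist_directory).values())


if __name__ == "__main__":
    # './db' matches the persist_directory used in the snippets above.
    print(persisted_file_sizes("./db"))
```

Note that non-empty files only show that something was written; they do not prove that the collection you later open is the one that holds the documents.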
0
On

We need to pass the same collection_name both when saving and when loading the Chroma db.

save to disk

db2 = Chroma.from_documents(docs, embedding_function,  persist_directory="./chroma_db", collection_name='v_db')
db2.persist()
docs = db2.similarity_search(query)

load from disk

db3 = Chroma(collection_name='v_db', persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)
print(docs[0].page_content)
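
If you are not sure which collection name was used when the documents were ingested, listing the collections and their item counts can show where the data actually lives. This is a hedged sketch: summarize_collections is a hypothetical helper, and it assumes only that the client offers list_collections() and get_collection(name) and that a collection offers name and count(), as chromadb clients do. Depending on the Chroma version, list_collections() may return collection objects or plain names, so both cases are handled.

```python
def summarize_collections(client):
    """Return {collection name: item count} for every collection the client knows."""
    summary = {}
    for entry in client.list_collections():
        # Depending on the Chroma version, entries are collection objects or plain names.
        collection = client.get_collection(entry) if isinstance(entry, str) else entry
        summary[collection.name] = collection.count()
    return summary


# Usage (assumes a chromadb client created as in the snippets above):
#   print(summarize_collections(client))
```

If the documents turn up under a name other than the one you are opening, pass that name as collection_name. As far as I know, LangChain's Chroma wrapper falls back to a collection called "langchain" when collection_name is omitted.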
1
On

It looks like the LangChain documentation was wrong: https://github.com/langchain-ai/langchain/issues/19807

You can change

from langchain_community.vectorstores import Chroma

to

from langchain_community.vectorstores.chroma import Chroma