How to combine two Chroma databases

Question

Asked by randomQs on 18 April 2023 at 20:52 (5.1k views)

I created two dbs like this (same embeddings) using langchain 0.0.143:

db1 = Chroma.from_documents(
    documents=texts1,
    embedding=embeddings, 
    persist_directory=persist_directory1,
)
db1.persist()

db2 = Chroma.from_documents(
    documents=texts2,
    embedding=embeddings, 
    persist_directory=persist_directory2,
)
db2.persist()

then later accessing them with

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

How do I combine db1 and db2? I want to use them in a ConversationalRetrievalChain setting retriever=db.as_retriever().

I tried a couple of suggestions I found by searching, but I'm missing something obvious.

**Simon Podhajsky** · Answer 1 · 2023-04-30T10:14:19.807000

The simpler option is going to be loading the two documents into the same Chroma object. They'll retain separate metadata, so you can still tell which document each embedding came from:

from langchain.vectorstores import Chroma

chroma_directory = 'db/'

# `embeddings` is the same embedding object used in the question
db = Chroma(persist_directory=chroma_directory, embedding_function=embeddings)

db.add_documents(documents=texts1)
db.add_documents(documents=texts2)

db.similarity_search_with_score(query="Introduction to the document")
# --> results from both documents
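
If the documents don't already carry metadata that identifies their origin, one way to keep them distinguishable in a shared store is to tag each one before adding it. This is a minimal sketch, not part of the original answer: `tag_source` and the `source_db` key are hypothetical names, and the only assumption about the documents is that each has a `metadata` dict, as LangChain `Document` objects do.

```python
from typing import Any, Iterable, List


def tag_source(docs: Iterable[Any], source_name: str, key: str = "source_db") -> List[Any]:
    """Set metadata[key] = source_name on each document (in place) and return them as a list.

    `tag_source` and the default key "source_db" are illustrative names only.
    Each item in `docs` is assumed to expose a `metadata` dict.
    """
    tagged = []
    for doc in docs:
        doc.metadata[key] = source_name
        tagged.append(doc)
    return tagged


# Hypothetical usage with the objects from the answer above:
# db.add_documents(documents=tag_source(texts1, "texts1"))
# db.add_documents(documents=tag_source(texts2, "texts2"))
```

Depending on your LangChain version, the search methods may also accept a metadata filter (for example `filter={"source_db": "texts1"}`) to restrict results to one source; check the signature in the version you have installed before relying on it.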

The more complicated option: default Chroma storage is two parquet files and an index. If you could guarantee no index conflicts, you could theoretically merge the respective parquet files and merge the two index/ folders by copying the content of each into a new index/ folder adjacent to the two new parquet files.

**Jordy** · Answer 2 · 2023-06-06T13:39:56.150000

Another option would be to add the items from one Chroma db into the other Chroma db like so:


db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

#can add collections up to 100K+
db1._collection.add(
     embeddings=db2.get()['embeddings'],
     metadatas=db2.get()['metadatas'],
     documents=db2.get()['documents'],
     ids=db2.get()['ids']
)

Note that the documentation mentions adding up to 100k+ items, so there is a limit to how much you can add to the collection in a single call.

Source: https://docs.trychroma.com/api-reference#methods-related-to-collections

Note: this method merges the source data (db2) into the target collection (db1). If db1 has a collection named 'db1_collection' and db2 has a collection named 'db2_collection', the merged data ends up only in 'db1_collection'.
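
If the source collection is large enough to run into that per-call limit, the add can be split into smaller calls. The following is a rough sketch under stated assumptions: `iter_batches`, `add_in_batches`, and the default batch size of 5000 are hypothetical and not taken from the Chroma docs, and `data` is assumed to be a dict with equal-length `ids`, `embeddings`, `metadatas`, and `documents` entries, as returned by a collection's `get()`.

```python
from typing import Any, Dict, Iterator

# Fields that collection.add() accepts and that a get() result may contain.
_FIELDS = ("ids", "embeddings", "metadatas", "documents")


def iter_batches(data: Dict[str, Any], batch_size: int) -> Iterator[Dict[str, Any]]:
    """Yield slices of a get()-style result, at most batch_size records each.

    Fields that are missing or None (e.g. embeddings that were not requested)
    are left out of every batch.
    """
    total = len(data["ids"])
    for start in range(0, total, batch_size):
        end = start + batch_size
        yield {
            key: data[key][start:end]
            for key in _FIELDS
            if data.get(key) is not None
        }


def add_in_batches(collection: Any, data: Dict[str, Any], batch_size: int = 5000) -> int:
    """Call collection.add() once per batch and return the number of records sent.

    The batch size is an arbitrary illustrative default, not a documented limit.
    """
    added = 0
    for batch in iter_batches(data, batch_size):
        collection.add(**batch)
        added += len(batch["ids"])
    return added


# Hypothetical usage with the objects from the answer above:
# data = db2._collection.get(include=["documents", "metadatas", "embeddings"])
# add_in_batches(db1._collection, data)
```

Slicing works the same way whether the embeddings come back as a list of lists or as an array, so the helper does not need to know which form your chromadb version returns.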

**Mike Feng** · Answer 3 · 2023-07-26T15:05:57.640000

Building on the above answer by Jordy, this is how I ended up doing it without rebuilding embeddings every time:

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

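# Fetch everything from db2 once, explicitly asking for the stored embeddings
db2_data = db2._collection.get(include=['documents', 'metadatas', 'embeddings'])

# Copy the records into db1, reusing the embeddings instead of recomputing them
db1._collection.add(
    embeddings=db2_data['embeddings'],
    metadatas=db2_data['metadatas'],
    documents=db2_data['documents'],
    ids=db2_data['ids'],
)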

LangChain's Chroma wrapper does not include embeddings in its default get(), so it is necessary to call get() on the underlying chromadb collection and ask for embeddings explicitly.
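
One thing to watch for when re-running a merge like this: the ids from db2 may already exist in db1. How chromadb reacts to an `add()` with an existing id appears to vary by version (it may raise an error or skip the record), so it can be safer to filter first. A minimal sketch, where `drop_existing_ids` is a hypothetical helper and the input is assumed to be a `get()`-style dict with equal-length fields:

```python
from typing import Any, Dict, Iterable


def drop_existing_ids(data: Dict[str, Any], existing_ids: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of a get()-style result without records whose id is already taken.

    Only the fields that collection.add() accepts are copied; fields that are
    missing or None in `data` are left out of the result.
    """
    taken = set(existing_ids)
    # Positions of the records that are safe to add.
    keep = [i for i, record_id in enumerate(data["ids"]) if record_id not in taken]
    result: Dict[str, Any] = {}
    for key in ("ids", "embeddings", "metadatas", "documents"):
        values = data.get(key)
        if values is not None:
            result[key] = [values[i] for i in keep]
    return result


# Hypothetical usage with the objects from the answer above:
# new_data = drop_existing_ids(db2_data, db1.get()["ids"])
# if new_data["ids"]:
#     db1._collection.add(**new_data)
```

The check on `new_data["ids"]` is there because calling `add()` with empty lists may be rejected.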
