How can I add collections/objects to a Chroma database?


I'm trying to run a few documents through OpenAI's text embedding API and insert the resulting embeddings, along with the text, into a local Chroma database.

from langchain.vectorstores import Chroma

# Combine both document splits and embed them into one Chroma collection
sales_data = medium_data_split + yt_data_split
sales_store = Chroma.from_documents(
    sales_data, embeddings, collection_name="sales"
)

This fails with RateLimitError: Rate limit reached for default-text-embedding-ada-002 from the OpenAI API, because I'm using my personal account.

As a workaround, I want to split the large medium_data_split list into smaller batches and run them through the OpenAI embedding API in a loop, with a one-minute gap between batches.

To do that, I need to join/combine the resulting Chroma databases, but I haven't found a way to do it yet. Can someone suggest one?
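Roughly what I have in mind is sketched below (the batch size of 25 and the 60-second sleep are just placeholders I picked):

import time

# Embed medium_data_split in small batches, pausing between OpenAI calls
batch_size = 25
sales_stores = []
for i in range(0, len(medium_data_split), batch_size):
    batch = medium_data_split[i:i + batch_size]
    store = Chroma.from_documents(batch, embeddings, collection_name="sales")
    sales_stores.append(store)
    time.sleep(60)  # wait a minute to stay under the rate limit
# ...but I don't see how to combine the stores in sales_stores into one database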

Here is what I tried:

# First batch: the YouTube split
sales_data1 = yt_data_split
sales_store1 = Chroma.from_documents(
    sales_data1, embeddings, collection_name="sales"
)

# Second batch: the first 25 Medium documents
sales_data2 = medium_data_split[0:25]
sales_store2 = Chroma.from_documents(
    sales_data2, embeddings, collection_name="sales"
)

sales_store_concat = sales_store1.add(sales_store2)

I get the following error: AttributeError: 'Chroma' object has no attribute 'add'

1 Answer

One solution would be to use a text splitter to split the documents into multiple chunks and store them on disk.

Split the documents into chunks:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

You can also use open-source embeddings such as SentenceTransformerEmbeddings to create the embeddings.

Create the open-source embedding function:

from langchain.embeddings import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Then you can search the embeddings for similarity and, depending on the similarity score, select the relevant chunks and send them to OpenAI.

Save to disk and query:

db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()

docs_and_scores = db2.similarity_search_with_score("search query", k=2)
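
For example, a rough sketch of selecting chunks by score before sending them to OpenAI (the 0.5 threshold and the prompt assembly are illustrative choices, not fixed values):

# Chroma returns a distance score (lower = more similar), so keep the closest chunks
relevant_docs = [doc for doc, score in docs_and_scores if score < 0.5]

# Join the selected chunks into a context string to include in the OpenAI prompt
context = "\n\n".join(doc.page_content for doc in relevant_docs)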

There is also an option of merging vector stores (FAISS supports this). Please check: https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/faiss
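For illustration, a minimal sketch of merging two stores with FAISS (merge_from is a FAISS method, not available on Chroma):

from langchain.vectorstores import FAISS

# Build a FAISS store per batch, then merge the second into the first
faiss_store1 = FAISS.from_documents(sales_data1, embedding_function)
faiss_store2 = FAISS.from_documents(sales_data2, embedding_function)
faiss_store1.merge_from(faiss_store2)  # faiss_store1 now holds both batches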

References:
https://python.langchain.com/docs/modules/data_connection/document_transformers/
https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers