I'm trying to run a few documents through OpenAI's text embedding API and insert the resulting embeddings, along with the text, into a local Chroma database.
sales_data = medium_data_split + yt_data_split
sales_store = Chroma.from_documents(
    sales_data, embeddings, collection_name="sales"
)
This fails with RateLimitError: Rate limit reached for default-text-embedding-ada-002 from the OpenAI API, since I'm using my personal account.
As a workaround, I want to split the large medium_data_split list into smaller batches and run each batch through the OpenAI embedding API in a loop, with a one-minute gap between batches. To make that work I need to join/combine the resulting Chroma databases, but I haven't found a way to do that yet. Can someone suggest one?
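A rough sketch of the loop I have in mind (the batch size and the one-minute sleep are just placeholders):

import time

batch_size = 25
partial_stores = []
for i in range(0, len(medium_data_split), batch_size):
    batch = medium_data_split[i:i + batch_size]
    # embed one small batch at a time to stay under the rate limit
    store = Chroma.from_documents(batch, embeddings, collection_name="sales")
    partial_stores.append(store)
    time.sleep(60)
# ...and this is where I'm stuck: how do I combine the stores in partial_stores into one?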
Here is what I tried:
sales_data1 = yt_data_split
sales_store1 = Chroma.from_documents(
    sales_data1, embeddings, collection_name="sales"
)
sales_data2 = medium_data_split[0:25]
sales_store2 = Chroma.from_documents(
    sales_data2, embeddings, collection_name="sales"
)
sales_store_concat = sales_store1.add(sales_store2)
I get the following error: AttributeError: 'Chroma' object has no attribute 'add'
One solution would be to use a TextSplitter to split the documents into multiple chunks and store them on disk.
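For example, a minimal sketch (the loader, file path, and chunk sizes are just placeholders for however you load your data):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# load the document
loader = TextLoader("my_docs.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)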
You can also use open-source embeddings such as SentenceTransformerEmbeddings to create the embeddings instead of calling the OpenAI endpoint.
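Something along these lines (the model name is just the usual default from the LangChain examples):

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load the chunks into Chroma using the local embedding function
db = Chroma.from_documents(docs, embedding_function, collection_name="sales")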
Then you can search those embeddings by similarity and, depending on the similarity score, select which chunks to send to OpenAI.
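Roughly like this (the query and the score threshold are placeholders; Chroma returns a distance, so lower means more similar):

query = "your question here"
docs_and_scores = db.similarity_search_with_score(query, k=4)

# keep only the chunks whose distance is below some threshold
selected = [doc for doc, score in docs_and_scores if score < 0.5]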
You can also save the Chroma store to disk and load it back later, so you don't have to re-embed everything.
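For instance (the directory name is just a placeholder):

# save to disk
db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db.persist()

# load from disk later
db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)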
There is also an option of merging vector stores if you use FAISS instead of Chroma. Please check: https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/faiss
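A rough sketch of that merge, reusing your variable names (note this swaps Chroma out for FAISS):

from langchain.vectorstores import FAISS

db1 = FAISS.from_documents(yt_data_split, embeddings)
db2 = FAISS.from_documents(medium_data_split[0:25], embeddings)

# merge db2 into db1 so a single store holds both sets of vectors
db1.merge_from(db2)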
References:
https://python.langchain.com/docs/modules/data_connection/document_transformers/
https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers