Context:
I learned that to improve the search results of a RAG-based approach, we can fine-tune an open-source embedding model. I currently have one PDF file (my private dataset) from which I will create a train/eval dataset, fine-tune my embedding model, and store the embeddings in a vector DB. Suppose n embeddings are stored in the DB.
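
For concreteness, here is roughly what that pipeline looks like. This is only a minimal sketch; the model name, the loss, and the FAISS index are choices I am assuming for illustration, not fixed parts of the question:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import faiss
import numpy as np

# 1) Fine-tune an open-source embedding model on (query, passage) pairs
#    mined from the PDF text.
model = SentenceTransformer("all-MiniLM-L6-v2")
pairs = [
    InputExample(texts=["a question about my pdf", "the passage that answers it"]),
    # ... more pairs generated from the PDF
]
loader = DataLoader(pairs, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-model")

# 2) Embed the n text chunks and store them in a vector index.
chunks = ["chunk 1 ...", "chunk 2 ..."]  # the n chunks extracted from the PDF
emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(emb, dtype="float32"))
```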
Question:
Now suppose a new PDF file comes to me tomorrow. Should I re-fine-tune the earlier fine-tuned embedding model?
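
By "re-fine-tune" I mean something like the sketch below, i.e. continuing training from the saved checkpoint on pairs mined from the new PDF (paths and names are placeholders I am assuming):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load the previously fine-tuned checkpoint and keep training on new pairs.
model = SentenceTransformer("finetuned-model")
new_pairs = [
    InputExample(texts=["a question about the new pdf", "its matching passage"]),
    # ... more pairs generated from the new PDF
]
loader = DataLoader(new_pairs, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-model-v2")
```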
Because this re-fine-tuned model will generate slightly different embeddings, will the earlier n embeddings stored in my DB become outdated? Will I have to delete them? And if the second PDF file yields m new embeddings, will I then have to re-embed everything and store a total of (n + m) embeddings in the DB?
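
If the answer is yes, I imagine the update step would look something like this (again just a sketch, under my assumption that embeddings from two model versions cannot be mixed in one index):

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Vectors from two different model versions don't live in the same space,
# so the index is rebuilt from scratch with the re-fine-tuned model.
new_model = SentenceTransformer("finetuned-model-v2")
old_chunks = ["chunk 1 ...", "chunk 2 ..."]  # the n chunks from the first PDF
new_chunks = ["new chunk 1 ..."]             # the m chunks from the second PDF
emb = new_model.encode(old_chunks + new_chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])      # fresh index; the old one is discarded
index.add(np.asarray(emb, dtype="float32"))
```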
If so, then as new data keeps arriving, the total embedding work grows quadratically, because every update forces me to re-embed everything stored so far rather than just the new chunks.
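
As a back-of-the-envelope check (the numbers are purely illustrative):

```python
m = 1000  # assumed chunks per incoming PDF
k = 10    # number of updates
full_rebuild = sum(m * (i + 1) for i in range(k))  # re-embed everything every time
incremental = m * k                                # embed only the new chunks
print(full_rebuild, incremental)                   # 55000 vs 10000 encode calls
```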