LangChain vectorstore search based on cosine similarity distance


Suppose we have n documents that we chunked, embedded, and stored in a LangChain vectorstore. I have two related questions:

1- Is it possible to query and retrieve "all" entries close to an embedded query (not just a certain number of them)?

2- Is there a way to find out how many of the original n documents are represented among the retrieved vectors?


There is 1 solution below

  1. Querying all entries close to an embedded query: Vector search engines typically return the 'k' nearest neighbors, where 'k' is a number you specify. If you want "all" entries close to an embedded query, you first have to define "close". You could set 'k' to a very large value (up to the total number of stored chunks), but the search then returns everything in ranked order rather than only the relevant entries, and it can be slow when the store is large. The alternative is a threshold: retrieve every entry whose cosine distance to the query vector is below a chosen cutoff (equivalently, whose cosine similarity is above one). The size of the result set then varies with the query, and it contains only entries that meet your definition of similar. A suitable cutoff depends on the embedding model, so tune it on your own data.

  2. Finding out how many original documents are involved: This depends on how vectors map to documents. If each document is represented by a single vector, the number of retrieved vectors is the number of documents involved. If a document is split into several chunks, each with its own vector, every stored chunk needs to carry an identifier of its source document (for example, in its metadata). You can then collect the source identifiers of the retrieved chunks and count the distinct values. Note that chunks from different documents can have very similar embeddings when the documents share content or sections, so a single passage may pull in several documents. That does not break the mapping as long as every chunk records its source, but it can inflate the document count with near-duplicates.
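The threshold approach from point 1 can be sketched in plain Python. This is only an illustration, not LangChain's implementation: the helper names (`cosine_similarity`, `search_within_threshold`), the dict layout, and the toy 2-D vectors are all made up for the example. In LangChain itself, something similar is usually available through `similarity_search_with_relevance_scores(query, k=..., score_threshold=...)` or a retriever created with `search_type="similarity_score_threshold"`, but score semantics vary by vectorstore backend, so check the documentation for the one you use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_within_threshold(query_vec, chunks, min_similarity):
    """Return every (similarity, chunk) pair at or above min_similarity, best first.

    Unlike a top-k search, the number of results is not fixed in advance.
    """
    hits = []
    for chunk in chunks:
        score = cosine_similarity(query_vec, chunk["vector"])
        if score >= min_similarity:
            hits.append((score, chunk))
    # Sort on the score only, so the chunk dicts are never compared.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


# Hypothetical toy data: real vectors come from an embedding model.
chunks = [
    {"text": "chunk A1", "source": "doc_a.txt", "vector": [1.0, 0.0]},
    {"text": "chunk A2", "source": "doc_a.txt", "vector": [0.9, 0.1]},
    {"text": "chunk B1", "source": "doc_b.txt", "vector": [0.7, 0.7]},
    {"text": "chunk C1", "source": "doc_c.txt", "vector": [0.0, 1.0]},
]

hits = search_within_threshold([1.0, 0.0], chunks, min_similarity=0.8)
for score, chunk in hits:
    print(f"{score:.3f}  {chunk['source']}  {chunk['text']}")
```

With this cutoff only the two chunks pointing almost the same way as the query are returned. A brute-force scan like this is fine for small collections; a real vectorstore would use its index instead.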

In summary, querying all entries close to an embedded query can be done by setting a high 'k' value or using a distance threshold, and determining the number of original documents involved requires a mapping mechanism in the vectorstore to link vectors back to their source documents.
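The mapping back to source documents can be sketched as follows, assuming every chunk carries a `source` entry in its metadata. In LangChain, retrieved `Document` objects expose a `metadata` dict and many loaders fill in a `source` key, but that depends on the loader, so treat the key name as an assumption. Plain dicts stand in for retrieved chunks here, and `count_source_documents` is a hypothetical helper.

```python
from collections import Counter


def count_source_documents(retrieved_chunks, id_key="source"):
    """Count how many retrieved chunks came from each source document."""
    return Counter(chunk["metadata"][id_key] for chunk in retrieved_chunks)


# Hypothetical search results: three chunks drawn from two documents.
retrieved = [
    {"text": "first chunk of A", "metadata": {"source": "doc_a.txt"}},
    {"text": "second chunk of A", "metadata": {"source": "doc_a.txt"}},
    {"text": "first chunk of B", "metadata": {"source": "doc_b.txt"}},
]

per_doc = count_source_documents(retrieved)
print(len(per_doc), "distinct source documents in the results")
print(per_doc.most_common())
```

The number of keys in the counter answers "how many of the n documents were involved", and the per-key counts show which documents contributed the most chunks.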