How is vector search able to match exact keywords even for words which are randomly generated and have no meaning?


I'm doing a proof of concept (POC) for my LLM-based project, and I'm using a vector database for document retrieval (IR).

Recently, I came across several blog posts from some of the best-known vector database vendors that suggest using hybrid search (vector search + keyword search) for better IR, mainly because it helps with domain-specific keywords.

Before implementing hybrid search, I ran some tests and was surprised to find that those blogs seem to be wrong: with vector search alone, I'm able to match domain-specific keywords from the query.

My Testing

  • I generated some keywords that have no meaning and, moreover, don't exist

  • I'm using ChromaDB as the vector database, which uses hnswlib for ANN

    Sample Documents

    {
        "document_name": "Return Policy",
        "Category": "Fashion",
        "Product Name": "Zinsace",
        "Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsace products are non-returnable."
    },
    {
        "document_name": "Return Policy",
        "Category": "Electronics",
        "Product Name": "Zisava",
        "Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisava products that have been used or show signs of damage are non-returnable."
    },
    {
        "document_name": "Return Policy",
        "Category": "Fashion",
        "Product Name": "Zinsape",
        "Policy": "Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized Zinsape products are non-returnable."
    },
    {
        "document_name": "Return Policy",
        "Category": "Electronics",
        "Product Name": "Zisada",
        "Policy": "Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, Zisada products that have been used or show signs of damage are non-returnable."
    }
    

    Script to Index & Search

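    import uuid
    
    import chromadb
    from chromadb.config import Settings
    from chromadb.utils import embedding_functions
    
    # Sample documents shown above, stored as a list of dicts
    from hybrid.dummy_data import DUMMY_DATA
    
    # Legacy client configuration (chromadb < 0.4);
    # newer releases use chromadb.PersistentClient(path=...) instead
    client = chromadb.Client(Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory="./hybrid"
    ))
    
    # Embedding functions for the models under test; swap the one passed
    # to the collection below to repeat the experiment with another model
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key="XXXX",
        model_name="text-embedding-ada-002"
    )
    
    st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name='all-mpnet-base-v2')
    
    # Default SentenceTransformer model (all-MiniLM-L6-v2)
    # st_ef_mini = embedding_functions.SentenceTransformerEmbeddingFunction()
    
    # Only the policy text is embedded; the other fields become metadata
    texts = [doc['Policy'] for doc in DUMMY_DATA]
    
    metadatas = [{k: v for k, v in d.items() if k != 'Policy'} for d in DUMMY_DATA]
    
    # HNSW index using L2 (Euclidean) distance
    collection = client.get_or_create_collection(name="mpnet", metadata={'hnsw:space': 'l2'},
                                                 embedding_function=st_ef)
    ids = [str(uuid.uuid4()) for _ in texts]
    
    collection.add(
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )
    
    # The query text is embedded with the same function and matched by distance
    res = collection.query(
        query_texts=["I want to return Zinsace"],
        n_results=10
    )
    
    print(res.get('documents'))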
    

    Output

        [['Customers can return the product within 14 days of purchase if it is unworn, with all tags attached and in its original condition. A refund will be provided in the original form of payment. However, customized or personalized **Zinsace** products are non-returnable.',
          'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisada** products that have been used or show signs of damage are non-returnable.',
          'Customers can return the product within 30 days of purchase if it is unopened and in its original packaging. A refund will be issued in the original form of payment, excluding any shipping fees. However, **Zisava** products that have been used or show signs of damage are non-returnable.']]
    

    Output Analysis

    • I've used 3 models for embeddings

      • text-embedding-ada-002
      • all-mpnet-base-v2
      • all-MiniLM-L6-v2
    • I indexed some refund-policy documents with product names that are random (they have no meaning and don't exist)

    • When I query with "I want to return Zinsace" or "I want to buy Zinsace", the first result is always the correct one with all 3 embedding models, so it appears to do an exact keyword match

This left me confused about how these models can generate embeddings that also support exact keyword matches, even for words the models have never seen before.

If vector search can do keyword matching, why do all the vector database vendors suggest using hybrid search? Haven't they tested properly, or are they biased?
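For context, this is roughly what "hybrid search" means in practice: a keyword ranking and a vector ranking are produced separately and then fused. The sketch below assumes reciprocal rank fusion (RRF) as the combiner, which is only one common option and not necessarily what those blogs use; the document ids, both rankings, and the constant `k=60` are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is a conventional default, used here as an assumption.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the query "I want to return Zinsace".
vector_ranking = ["zinsace", "zisada", "zisava", "zinsape"]  # from ANN search
keyword_ranking = ["zinsace"]                                # exact term match only

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

A document that both retrievers agree on ends up first, while documents found by only one retriever keep their relative order.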


1
Franck Dernoncourt On

How vector search is able to match exact keyword (even words which are randomly generated and have no meaning)

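Because the embedding models use subword segmentation such as WordPiece, a made-up word is split into known sub-word pieces rather than dropped, so the same string in the query and in a document yields the same tokens: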

To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces").
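To make this concrete, here is a minimal sketch of greedy longest-match-first segmentation in the style of WordPiece. The vocabulary is a tiny hypothetical one chosen for the product names in the question; real models ship vocabularies of tens of thousands of pieces, and some (text-embedding-ada-002, for example) use byte-pair encoding instead, although the effect on unseen words is similar.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into sub-word pieces by greedy longest match.

    Pieces that continue a word carry the "##" prefix, following the
    BERT-style WordPiece convention.
    """
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from any real model).
TOY_VOCAB = {"zin", "zis", "##sa", "##ce", "##pe", "##a", "##va", "##da"}

for name in ["Zinsace", "Zinsape", "Zisava", "Zisada"]:
    print(name, wordpiece_tokenize(name, TOY_VOCAB))
```

"Zinsace" in the query and "Zinsace" in the document produce the same piece sequence, which is why the embeddings can end up close even though the model has never seen the word.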

1
Prabhat Jha On

Your examples are very similar to each other, which means there is not much noise, so your ANN query is spot on.

Another possible reason: if you are generating random words that don't exist in the dictionary, those words may be getting ignored.

2
Erick Ramirez On

Vector search is not the same as traditional text search.

Vector search performs a similarity comparison on vectors (embeddings), which are numeric encodings of semantic information about the data, so you can run a semantic similarity search using another vector/embedding. It is not the traditional keyword match that you are familiar with from regular text search.
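As a rough illustration of that comparison, the sketch below ranks documents by L2 distance to a query vector, the metric configured in the question via `'hnsw:space': 'l2'`. The 3-dimensional vectors and document names are made up for readability; real embeddings have hundreds or thousands of dimensions, and an ANN index such as HNSW approximates this brute-force scan.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, docs):
    """Return document names sorted from closest to farthest."""
    return sorted(docs, key=lambda name: l2_distance(query, docs[name]))


# Hypothetical embeddings, not produced by any real model.
docs = {
    "zinsace_policy": [0.8, 0.2, 0.1],
    "zinsape_policy": [0.7, 0.3, 0.2],
    "zisava_policy": [0.1, 0.9, 0.3],
}
query = [0.9, 0.1, 0.2]  # stands in for "I want to return Zinsace"

ranked = nearest(query, docs)
print(ranked)
```

Nothing here checks whether a keyword is present; the ranking depends only on how close the vectors are.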

Since vector search looks for similarity based on the semantics (meaning) of the text, it can return results that appear to be nonsense but are in fact somehow relevant due to similarities with the query. This is one of the things that lead to "hallucinations". Cheers!