Can anyone explain how vector databases work? At which step are the indexes created?


So if I understood correctly, at first we need to create an embedding. Then is it inserted directly into the database? Or should it be indexed first? Or is it indexed first and only after that stored in the database?

I'm just trying to figure out the process for explaining it in my essay.


Yes, that's the correct understanding. A developer shouldn't need to worry about the indexing part of a database that offers vector search capability; they only do the ingestion with the appropriate embeddings. The index-building process happens at the database level, and each database does its indexing in its own fashion. For example, some databases index immediately, while others index asynchronously, which takes time for the indexes to catch up before they can serve fully relevant results. See this report that talks about how hard vector search problems were solved.

Where the developer does need to worry is when the indexes for just-inserted data aren't available on the read side immediately (i.e., in real time), giving poor relevancy/throughput/latency until indexing catches up and completes fully.
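To make the asynchronous-indexing point concrete, here is a toy simulation (not any real database client API): writes land immediately, but rows only become searchable after a separate indexing pass, mimicking databases whose indexes lag behind ingestion.

```python
# Toy in-memory "vector database" that indexes asynchronously.
# All names here are hypothetical; real databases do this internally.
class ToyVectorDB:
    def __init__(self):
        self.pending = []   # written, but not yet indexed
        self.indexed = []   # searchable rows

    def insert(self, row_id, vector):
        # writes are acknowledged immediately...
        self.pending.append((row_id, vector))

    def run_indexing_pass(self):
        # ...but in an async-indexing database, a background job
        # later makes them searchable
        self.indexed.extend(self.pending)
        self.pending.clear()

    def search(self, query, top_k=1):
        # brute-force dot-product similarity over *indexed* rows only
        scored = sorted(self.indexed,
                        key=lambda r: -sum(a * b for a, b in zip(r[1], query)))
        return [row_id for row_id, _ in scored[:top_k]]

db = ToyVectorDB()
db.insert("p1", [0.1, 0.9])
print(db.search([0.1, 0.9]))   # [] -- written, but not yet visible to reads
db.run_indexing_pass()
print(db.search([0.1, 0.9]))   # ['p1'] -- visible once indexing catches up
```

The gap between the first and second search is exactly the read-side lag described above.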


In addition to Madhavan's answer, I can point you to a repository where I have done exactly this process: https://github.com/aar0np/ecomProductEmbeddingLoader/tree/main

This Python repository takes E-commerce product data and generates embeddings based on the product names. These embeddings are later used to provide recommendations of similar products.

On the backend, I'm using Astra DB which is a Cassandra DBaaS with Vector Search. Astra DB defines its tables with CQL (similar to SQL), and our product_vectors table looks a little like this:

CREATE TABLE product_vectors (
    product_id TEXT PRIMARY KEY,
    name TEXT,
    product_group TEXT,
    parent_id UUID,
    category_id UUID,
    images SET<TEXT>,
    product_vector vector<float,384>);

CREATE CUSTOM INDEX ON product_vectors(product_vector) USING 'StorageAttachedIndex';

The product_vector column is a 384-dimensional float vector, to match the embeddings returned from the all-MiniLM-L6-v2 model from HuggingFace.
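Since the column is declared as vector&lt;float,384&gt;, the embedding dimension must match exactly, or inserts will fail. A small hypothetical guard (not part of the repository above) makes this explicit:

```python
# Hypothetical pre-insert check: the dimension must match the
# vector<float,384> column declared in the schema above.
EXPECTED_DIM = 384

def validate_embedding(vector, expected_dim=EXPECTED_DIM):
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, "
            f"but the column expects {expected_dim}")
    return vector

validate_embedding([0.0] * 384)    # passes
# validate_embedding([0.0] * 768)  # would raise ValueError
```

Swapping models (say, to one that emits 768-dimensional embeddings) means changing the column definition too.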

The data loader code looks like this:

import csv

# LangChain's HuggingFaceEmbeddings wrapper (import path varies by version)
from langchain_community.embeddings import HuggingFaceEmbeddings

# initialize the all-MiniLM-L6-v2 model locally
model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# read file
filename = "data/product_ids.csv"

with open(filename, encoding='utf8') as file:
    reader = csv.reader(file)
    next(reader)  # skip header line

    for dataline in reader:

        productId = dataline[0]
        product = getProduct(productId)

        # embedding generated for product name
        vectorEmb = model.embed_query(product["name"])

        # row inserted and indexed
        # (session is the Cassandra driver session and insertVector is a
        # prepared INSERT statement, both defined elsewhere in the loader)
        session.execute(insertVector, [productId, product["name"],
                                       product["product_group"],
                                       product["images"], vectorEmb])

As indicated in the code above, the data is indexed at write-time.

Many vector databases use the Lucene HNSW library for vector search and indexing, which is (unfortunately) single-threaded. As Madhavan mentioned, this can lead to delays before newly-written vectors can be queried.
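For context on what HNSW is approximating: a vector search is conceptually an exact nearest-neighbour scan by similarity, which is O(n) per query. Here is a brute-force sketch with toy data (this is what an HNSW index speeds up, not how it works internally):

```python
from math import sqrt

# Exact nearest-neighbour search by cosine similarity over toy vectors.
# An HNSW index returns approximately this result without the full scan.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def exact_nearest(query, rows, top_k=2):
    # rows: list of (id, vector); scores every row -- O(n) per query
    return sorted(rows, key=lambda r: cosine(query, r[1]), reverse=True)[:top_k]

rows = [("shirt", [1.0, 0.0]), ("tee", [0.9, 0.1]), ("kettle", [0.0, 1.0])]
print([rid for rid, _ in exact_nearest([1.0, 0.05], rows)])  # ['shirt', 'tee']
```

HNSW trades a little recall for dramatically better query time by navigating a graph of vectors instead of scanning all of them.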

However, the cool thing about using Astra DB or Apache Cassandra® is that they use the JVector library, which allows indexing to run concurrently with the rest of the database operations, making your embeddings queryable nigh-immediately.