NoSQL DB for searching in vector space

I am completely new to NoSQL databases such as Cassandra, MongoDB, Redis, etc., and I want to store records with this kind of structure:

{
  "item_id": "ABC1",
  "x1": 0.55,
  "x2": -0.29,
  ...
  "x100": 0.17
}

Basically, I have millions of items, each with 100 floats associated with it. My main task is to search for items that are near a given vector of floats (in a 100-dimensional vector space) and retrieve, for example, the top k nearest items or all items within a distance d.

Is there a NoSQL database that is particularly suited for this kind of task?

Thank you for any hint, Patrick

There are 4 best solutions below

1
On BEST ANSWER

As far as I know, there are no databases with out-of-the-box support for non-(2|3)D spatial indexes yet, but you can implement your own inside your application layer.

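In general, you want an efficient algorithm for N-dimensional nearest-neighbour search, such as:

  • k-d tree, with O(log N) average query time in low dimensions (it degrades toward a linear scan as the number of dimensions grows, which matters with 100 dimensions)
  • Geohash (defined for 2D latitude/longitude; higher dimensions need a generalized space-filling curve such as a Z-order curve)

But both are quite tricky to implement correctly.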
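To make the application-layer idea concrete, here is a minimal pure-Python sketch of a k-d tree with an exact top-k query. The names build_kdtree and knn, the dict-based node layout, and the random sample data are assumptions made for illustration, not part of any library. It is meant to show the pruning logic, not to be a tuned index for millions of items.

```python
import heapq
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree from a list of (item_id, vector) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][1])              # cycle through the dimensions
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2                        # median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def knn(node, query, k, heap=None):
    """Return the k nearest (distance, item_id) pairs, closest first."""
    top = heap is None
    if top:
        heap = []                                 # max-heap via negated distances
    if node is not None:
        item_id, vec = node["point"]
        dist = math.dist(vec, query)              # Euclidean distance
        if len(heap) < k:
            heapq.heappush(heap, (-dist, item_id))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, item_id))
        diff = query[node["axis"]] - vec[node["axis"]]
        if diff < 0:
            near, far = node["left"], node["right"]
        else:
            near, far = node["right"], node["left"]
        knn(near, query, k, heap)
        # Visit the far side only if the splitting plane is closer
        # than the current k-th best distance (or we still need points).
        if len(heap) < k or abs(diff) < -heap[0][0]:
            knn(far, query, k, heap)
    if top:
        return sorted((-d, i) for d, i in heap)
    return heap

# Hypothetical sample data: 500 items with 100 floats each.
random.seed(0)
items = [(f"ITEM{i}", [random.uniform(-1, 1) for _ in range(100)])
         for i in range(500)]
tree = build_kdtree(items)
query = [0.0] * 100
nearest = knn(tree, query, k=5)
print(nearest)
```

A "within distance d" query follows the same traversal: collect every point with dist <= d and descend into the far side only when abs(diff) <= d.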

0
On

2020 update for this question: Elasticsearch has an out-of-the-box cosine similarity function for vectors with up to 2048 features (using the "dense vector" data type). I'm using it now, and it works well for data sets with several hundred thousand vectors.
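As a rough illustration of what this looks like, the sketch below builds the two JSON bodies involved: a mapping with a dense_vector field and a script_score query that calls cosineSimilarity. The field name embedding, the helper names, and the default sizes are assumptions for this example; the exact script syntax has changed between Elasticsearch versions, so check it against the version you run.

```python
import json

def dense_vector_mapping(field="embedding", dims=100):
    """Index mapping body with one dense_vector field (field name is hypothetical)."""
    return {
        "mappings": {
            "properties": {
                "item_id": {"type": "keyword"},
                field: {"type": "dense_vector", "dims": dims},
            }
        }
    }

def cosine_query(query_vector, k=10, field="embedding"):
    """Search body that scores every document by cosine similarity to query_vector."""
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # + 1.0 keeps the score non-negative, since cosine lies in [-1, 1]
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }

mapping_body = dense_vector_mapping()
search_body = cosine_query([0.55, -0.29] + [0.0] * 98, k=10)
print(json.dumps(search_body)[:120])
```

Note that a script_score query over match_all compares the query vector against every document, which fits the "several hundred thousand vectors" scale mentioned above better than it fits many millions.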

0
On

2023 update to the question: Cassandra now supports vector similarity search, so it is a great alternative for this need. It was created to handle large volumes of data. You can try it with DataStax Astra. Just create a free account and run it.

After creating a database, you can create your table like this:

CREATE TABLE IF NOT EXISTS my_table (
    id UUID,
    embedding vector<float, 100>,
    PRIMARY KEY (id)
);

Then, you have to create an index:

CREATE CUSTOM INDEX IF NOT EXISTS IX_my_table
ON my_table(embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
    'similarity_function': 'dot_product'
};

You can choose any of three similarity functions: cosine, euclidean, or dot_product.

Then, after loading the data, you can query it with:

SELECT id, similarity_dot_product(embedding, :my_vector) AS similarity
FROM my_table
ORDER BY embedding ANN OF :my_vector
LIMIT 10;

Here, :my_vector is the 100-dimensional query vector for which we want the 10 most similar rows.
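One practical point when choosing dot_product: the dot product gives the same ranking as cosine similarity only when all vectors have unit length. My understanding is that Cassandra's dot_product option is intended for normalized vectors, but treat that as an assumption and check the documentation for your version. The sketch below normalizes a vector on the client side and formats it as a CQL vector literal; the helper names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def to_cql_vector(vector):
    """Format a vector as a CQL vector literal such as [0.6, 0.8]."""
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"

# Small 2-dimensional example; real rows would have 100 components.
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
print(to_cql_vector(a))
print(dot(a, b), cosine([3.0, 4.0], [1.0, 0.0]))
```

With normalized inputs, the value returned by the dot product matches the cosine similarity of the original vectors, so the two metrics rank rows identically.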

3
On

I believe none of the mentioned databases would give you what you need, especially with the amount of data you have. I recommend using Solr; I had a similar case, and Solr was the best solution.