Cosine similarity search returns empty

I am trying to retrieve the most similar vectors, but the search returns an empty result and I don't understand why.

I am following this documentation: https://redis-py.readthedocs.io/en/stable/examples/search_vector_similarity_examples.html

And this is my schema:

schema = (
    TagField("ticket_url"),
    NumericField("ticket_id"),
    NumericField("entity_id"),
    VectorField(
        "embedding",
        "HNSW",
        {
            "TYPE": "FLOAT32",
            "DIM": self.vector_dim,
            "DISTANCE_METRIC": "COSINE",
        },
    ),
)
definition = IndexDefinition(
    prefix=[self.doc_prefix], index_type=IndexType.HASH)
self.r.ft(self.index_name).create_index(
    fields=schema, definition=definition)

The function to search for similar vectors:

def search_similar_documents(self, entity_id, vector, topK=5, ticket_id=None):
    query = (
        Query("*=>[KNN 2 @embedding $vec as score]")
        .sort_by("score")
        .return_fields("score")
        .paging(0, 2)
        .dialect(2)
    )

    query_params = {"vec": vector}
    return self.r.ft(self.index_name).search(query, query_params).docs
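Note that the function above accepts topK but hard-codes 2 in both the KNN clause and the paging. A minimal sketch of building the query string from the argument instead; build_knn_query is a hypothetical helper (not part of redis-py), and the field and parameter names are taken from the code above:

```python
def build_knn_query(k, field="embedding", param="vec", score_alias="score",
                    base_filter="*"):
    """Build a RediSearch KNN query string (DIALECT 2 syntax).

    build_knn_query is a hypothetical helper, not a redis-py function.
    base_filter is the pre-filter expression; "*" means "all documents".
    """
    if not isinstance(k, int) or k < 1:
        raise ValueError("k must be a positive integer")
    # A non-trivial filter is wrapped in parentheses before the => arrow.
    prefix = "*" if base_filter == "*" else f"({base_filter})"
    return f"{prefix}=>[KNN {k} @{field} ${param} AS {score_alias}]"


print(build_knn_query(5))
print(build_knn_query(5, base_filter="@entity_id:[7 7]"))
```

The resulting string would be passed to Query(...) together with .paging(0, k) and .dialect(2), so that the page size matches the number of neighbours requested.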

Vectors are generated from an OpenAI response and converted to bytes:

def embedding_openai(self, text):
    try:
        response = openai.Embedding.create(
            input=text,
            model="text-embedding-ada-002"
        )
        embedding = response['data'][0]['embedding']
        array_embedding = np.array(embedding, dtype=np.float32)
        return array_embedding.tobytes()
    except Exception as ex:
        print(ex)
        return None
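One sanity check that may help here: for a FLOAT32 vector field, each blob (stored or passed as the query parameter) should be exactly DIM * 4 bytes, and text-embedding-ada-002 returns 1536 floats, so 6144 bytes. A minimal sketch using only the standard library; to_float32_blob and blob_matches_dim are hypothetical helper names, and the little-endian packing assumes the same byte order NumPy produces on typical machines:

```python
import struct

FLOAT32_SIZE = 4      # bytes per FLOAT32 component
ADA_002_DIM = 1536    # output size of text-embedding-ada-002


def to_float32_blob(values):
    """Pack floats as little-endian float32 (hypothetical helper)."""
    return struct.pack(f"<{len(values)}f", *values)


def blob_matches_dim(blob, dim):
    """True if the blob has the byte length a FLOAT32 index of `dim` expects."""
    return len(blob) == dim * FLOAT32_SIZE


blob = to_float32_blob([0.0] * ADA_002_DIM)
print(len(blob), blob_matches_dim(blob, ADA_002_DIM))
```

If the check fails for the query vector or for the stored values, the value configured as DIM and the actual embedding size are worth comparing first.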

And redis.ft(index).info() returns this:

{'index_name': 'conversations', 'index_options': [], 'index_definition': [b'key_type', b'HASH', b'prefixes', [b'tickets:'], b'default_score', b'1'], 'attributes': [[b'identifier', b'ticket_url', b'attribute', b'ticket_url', b'type', b'TAG', b'SEPARATOR', b','], [b'identifier', b'ticket_id', b'attribute', b'ticket_id', b'type', b'NUMERIC'], [b'identifier', b'entity_id', b'attribute', b'entity_id', b'type', b'NUMERIC'], [b'identifier', b'embedding', b'attribute', b'embedding', b'type', b'VECTOR']], 'num_docs': '973', 'max_doc_id': '973', 'num_terms': '0', 'num_records': '3892', 'inverted_sz_mb': '0.00634765625', 'vector_index_sz_mb': '6.00555419921875', 'total_inverted_index_blocks': '2999', 'offset_vectors_sz_mb': '0', 'doc_table_size_mb': '0.086483001708984375', 'sortable_values_size_mb': '0', 'key_table_size_mb': '0.030145645141601562', 'records_per_doc_avg': '4', 'bytes_per_record_avg': '1.7101746797561646', 'offsets_per_term_avg': '0', 'offset_bits_per_record_avg': '-nan', 'hash_indexing_failures': '0', 'total_indexing_time': '347.62900000000002', 'indexing': '0', 'percent_indexed': '1', 'number_of_uses': 1, 'gc_stats': [b'bytes_collected', b'0', b'total_ms_run', b'0', b'total_cycles', b'0', b'average_cycle_time_ms', b'-nan', b'last_run_time_ms', b'0', b'gc_numeric_trees_missed', b'0', b'gc_blocks_denied', b'0'], 'cursor_stats': [b'global_idle', 0, b'global_total', 0, b'index_capacity', 128, b'index_total', 0], 'dialect_stats': [b'dialect_1', 0, b'dialect_2', 0, b'dialect_3', 0]}

The vectors are stored as bytes. I don't know whether the problem is the algorithm or something I'm doing wrong.

There is 1 best solution below

Spartee On

Was this resolved? A few things it could be:

If you have a lot of documents and you're using the FLAT index, it could be that the search simply isn't returning within the allotted 500 ms timeout. The timeout can be configured on startup, or you can just use HNSW.
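The timeout mentioned above can also be changed at runtime rather than only at startup. A minimal sketch, assuming a redis-py client and a RediSearch version that accepts FT.CONFIG SET TIMEOUT (newer releases may expose the setting under a different name); set_search_timeout is a hypothetical helper:

```python
def set_search_timeout(client, timeout_ms):
    """Ask RediSearch to allow queries up to timeout_ms milliseconds.

    `client` is assumed to be a redis.Redis instance (redis-py); any object
    with a compatible execute_command method works. This helper is
    hypothetical and only wraps a single command.
    """
    if timeout_ms < 0:
        raise ValueError("timeout_ms must be non-negative")
    # Sends: FT.CONFIG SET TIMEOUT <timeout_ms>
    return client.execute_command("FT.CONFIG", "SET", "TIMEOUT", str(timeout_ms))
```

For example, set_search_timeout(r, 2000) with r = redis.Redis() would raise the limit to two seconds. If queries start returning results afterwards, the timeout was the likely cause.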

https://redisvl.com has a few examples of this in the user guide.