I am trying to retrieve the most similar vectors, but the search returns an empty result and I don't understand why.
I am following this documentation: https://redis-py.readthedocs.io/en/stable/examples/search_vector_similarity_examples.html
And this is my schema:
schema = (
    TagField("ticket_url"),
    NumericField("ticket_id"),
    NumericField("entity_id"),
    VectorField(
        "embedding",
        "HNSW",
        {
            "TYPE": "FLOAT32",
            "DIM": self.vector_dim,
            "DISTANCE_METRIC": "COSINE",
        },
    ),
)
definition = IndexDefinition(
    prefix=[self.doc_prefix], index_type=IndexType.HASH)
self.r.ft(self.index_name).create_index(
    fields=schema, definition=definition)
This is the function that searches for similar vectors:
def search_similar_documents(self, entity_id, vector, topK=5, ticket_id=None):
    query = (
        Query("*=>[KNN 2 @embedding $vec as score]")
        .sort_by("score")
        .return_fields("score")
        .paging(0, 2)
        .dialect(2)
    )
    query_params = {"vec": vector}
    return self.r.ft(self.index_name).search(query, query_params).docs
The vectors are generated from an OpenAI response and converted to bytes:
def embedding_openai(self, text):
    try:
        response = openai.Embedding.create(
            input=text,
            model="text-embedding-ada-002"
        )
        embedding = response['data'][0]['embedding']
        array_embedding = np.array(embedding, dtype=np.float32)
        return array_embedding.tobytes()
    except Exception as ex:
        print(ex)
        return None
And redis.ft(index).info() returns this:
{'index_name': 'conversations', 'index_options': [], 'index_definition': [b'key_type', b'HASH', b'prefixes', [b'tickets:'], b'default_score', b'1'], 'attributes': [[b'identifier', b'ticket_url', b'attribute', b'ticket_url', b'type', b'TAG', b'SEPARATOR', b','], [b'identifier', b'ticket_id', b'attribute', b'ticket_id', b'type', b'NUMERIC'], [b'identifier', b'entity_id', b'attribute', b'entity_id', b'type', b'NUMERIC'], [b'identifier', b'embedding', b'attribute', b'embedding', b'type', b'VECTOR']], 'num_docs': '973', 'max_doc_id': '973', 'num_terms': '0', 'num_records': '3892', 'inverted_sz_mb': '0.00634765625', 'vector_index_sz_mb': '6.00555419921875', 'total_inverted_index_blocks': '2999', 'offset_vectors_sz_mb': '0', 'doc_table_size_mb': '0.086483001708984375', 'sortable_values_size_mb': '0', 'key_table_size_mb': '0.030145645141601562', 'records_per_doc_avg': '4', 'bytes_per_record_avg': '1.7101746797561646', 'offsets_per_term_avg': '0', 'offset_bits_per_record_avg': '-nan', 'hash_indexing_failures': '0', 'total_indexing_time': '347.62900000000002', 'indexing': '0', 'percent_indexed': '1', 'number_of_uses': 1, 'gc_stats': [b'bytes_collected', b'0', b'total_ms_run', b'0', b'total_cycles', b'0', b'average_cycle_time_ms', b'-nan', b'last_run_time_ms', b'0', b'gc_numeric_trees_missed', b'0', b'gc_blocks_denied', b'0'], 'cursor_stats': [b'global_idle', 0, b'global_total', 0, b'index_capacity', 128, b'index_total', 0], 'dialect_stats': [b'dialect_1', 0, b'dialect_2', 0, b'dialect_3', 0]}
The vectors are stored as bytes. I don't know whether the problem is the algorithm or me :/
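One thing worth ruling out with byte-encoded vectors is a size mismatch: a FLOAT32 vector of dimension DIM should be exactly DIM * 4 bytes, and the embedding function above returns None when the OpenAI call fails. Below is a minimal sketch of such a check using only the standard library. The helper name `check_vector_blob` and the 3-dimensional sample are hypothetical, and little-endian byte order is assumed (what `tobytes()` produces on typical hardware).

```python
import struct

FLOAT32_SIZE = 4  # bytes per FLOAT32 component


def check_vector_blob(blob, dim):
    """Decode blob as `dim` little-endian float32 values, or raise ValueError."""
    if blob is None:
        # embedding_openai() in the question returns None on any exception
        raise ValueError("embedding is None (the embedding call probably failed)")
    expected = dim * FLOAT32_SIZE
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes for DIM={dim}, got {len(blob)}")
    return struct.unpack(f"<{dim}f", blob)


# Tiny hypothetical 3-dimensional vector, packed the same way as float32 bytes.
blob = struct.pack("<3f", 0.1, 0.2, 0.3)
values = check_vector_blob(blob, 3)
print(len(blob), len(values))
```

Running the same check on the query vector and on a stored `embedding` field, with `dim` set to the index's DIM, would show whether both sides agree on the vector size.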

Was this resolved? A few things it could be:
If you have a lot of documents and you're using the FLAT index, the search may simply not be returning within the allotted 500 ms timeout. The timeout can be configured on startup, or you can just use HNSW.
https://redisvl.com has a few examples of this in the user guide.