I have a Weaviate instance running (ver 1.12.2). I am playing around with the Python client https://weaviate-python-client.readthedocs.io/en/stable/ (ver 3.4.2): adding, retrieving and deleting objects, etc.
I am trying to understand how filtered vector search works (outlined here https://weaviate.io/developers/weaviate/current/architecture/prefiltering.html#recall-on-pre-filtered-searches)
When applying pre-filtering, an 'allow-list' of object ids is constructed before carrying out vector search. This is done by using some property to filter out objects.
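To make sure I have the idea right, here is how I picture pre-filtering, as a toy sketch in plain Python. This is only my mental model, not Weaviate's actual implementation: the `objects` dict, the `cosine` helper and `prefiltered_search` are all made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prefiltered_search(objects, query_vector, user, limit=100):
    # Step 1: build the allow-list of object ids that match the property filter.
    allow_list = [oid for oid, obj in objects.items() if obj["user"] == user]
    # Step 2: run the (brute-force) vector search only over allow-listed ids.
    scored = [(cosine(objects[oid]["vector"], query_vector), oid) for oid in allow_list]
    scored.sort(reverse=True)
    return [oid for _, oid in scored[:limit]]

# Hypothetical toy data: three images owned by two users.
objects = {
    "img-1": {"user": "billy", "vector": [1.0, 0.0]},
    "img-2": {"user": "billy", "vector": [0.0, 1.0]},
    "img-3": {"user": "alice", "vector": [1.0, 0.0]},
}

billy_hits = prefiltered_search(objects, [0.9, 0.1], user="billy")
print(billy_hits)
```

In this picture the filter touches every candidate object before any distance is computed, and objects owned by other users never reach the vector search.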
For example the Where filter I'm using is:
where_filter_1 = {
    "path": ["user"],
    "operator": "Equal",
    "valueText": "billy"
}
This is because I've got many users whose data are kept in this DB and I would like for each user to be able to search their own data. In this case it is Image data.
This is how I implement this using the python client:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
    .with_where(where_filter_1)\
    .with_near_vector(nearVector)\
    .do()
I do not use any vectorization modules, so I create my own vector and pass it to the DB for vector search using .with_near_vector(nearVector) after I have applied the filter with .with_where(where_filter_1). This works as I expect, so I think I'm doing this correctly.
I'm less sure if I'm applying post-filtering correctly: Each image has some text attached to it. I use the Where filter to search through the text by using the inverted index structure.
where_filter_2 = {
    "path": ["image_text"],
    "operator": "Like",
    "valueText": "Paris France"
}
I apply post filtering like this:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
.with_near_vector(nearVector)\
.with_where(where_filter_2).do()
However, I don't think I'm doing this properly. A basic inverted index search: (so just searching with text)
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
.with_where(where_filter_2).do()
gives me about 5 iters/sec (measured with the tqdm module), with 38k objects in the DB,
while the post-filtering approach gives me the same performance, at 5 iters/sec.
Am I wrong to find this weird? I was expecting performance closer to pure vector search:
result = client.query.get("Image", ["image_uri", "_additional {certainty}"])\
.with_near_vector(nearVector).do()
Which is close to 60 iters/sec (The flat search cut-off is set to 60k, so only brute-force search is used here)
Is the 'Where' filter applied only on the results supplied by the vector search? If so, shouldn't it be much faster? The filter would only be applied to 100 objects at most since that is the default number of results of vector search.
This is kind of confusing. Am I wrong in my understanding of how search works? Thanks for reading my question !
Your question seems to imply that you are switching between a pre- and post-filtering approach. But as of v1.13 all filtered vector searches are using pre-filtering. There is currently no option for post-filtering. That explains why both of your searches behave identically: you are mostly experiencing the cost of building the filter.

Side-Note 1:
I see that you are using a Like operator. The Like operator only differs from the Equal operator if you are using wildcards. Since you are not using them, you can also use the Equal operator, which tends to be more efficient in many cases. (I'm not sure if that applies to your case, but it tends to be true overall.)

Side-Note 2:
If you are measuring throughput from a single client thread, i.e. using tqdm from a Python script (without using multi-threading), you're not maxing out Weaviate. Since you only start sending the second query once the first has been processed client-side, Weaviate will be idle most of the time. If you are interested in the maximum throughput, you need to make sure that you have at least as many client threads as you have cores on the server to max out Weaviate.
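As a rough sketch of what a multi-threaded measurement could look like, the snippet below sends queries from several client threads using Python's standard `concurrent.futures`. The `measure_throughput` helper and the `fake_query` stand-in are hypothetical names for illustration. In a real test you would pass a function that runs your own `client.query.get(...)...do()` chain instead, and this assumes sharing your client object across threads works in your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(run_query, n_queries, n_threads):
    # Send n_queries from n_threads concurrent client threads and
    # return (queries per second, list of results).
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda _: run_query(), range(n_queries)))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed, results

def fake_query():
    # Placeholder for a real query; the sleep simulates waiting on the server.
    time.sleep(0.01)
    return {"data": {"Get": {"Image": []}}}

# n_threads would ideally be at least the number of cores on the server.
qps, results = measure_throughput(fake_query, n_queries=40, n_threads=8)
print(f"{qps:.1f} queries/sec over {len(results)} queries")
```

Comparing the number you get with `n_threads=1` against a higher thread count should show how much of the 5 iters/sec is the server's filter cost and how much is the client simply waiting between requests.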