Does Lucene HNSW KNN Vector search support pre-filtering?

Lucene recently added HNSW approximate nearest neighbor (ANN) search in Lucene 9.0.0, based on this original branch: https://issues.apache.org/jira/browse/LUCENE-9004 .

Does Lucene support pre-filtering? For example, let's say we want to do a vector search over documents created after 2020. Is it possible to filter for these documents in the same request as the vector search? Or must we post-filter after getting back the ANN search results?

I notice there is a member acceptOrds under the query method here: https://javadoc.io/doc/org.apache.lucene/lucene-core/latest/org/apache/lucene/util/hnsw/HnswGraph.html . Might that be used for filtering?

There are 3 best solutions below

1
James Briggs On

As far as I know, it doesn't. I imagine it is in the pipeline, but it will take time. You could look into Pinecone; from what I've seen, Pinecone's metadata filtering is well ahead.

The reason is that with pre-filtering you're restricting your search scope, which filters nodes out of your HNSW graph, so you can no longer perform an ANN search with the graph; the filter has 'broken' it. The search therefore reverts to an exact kNN search, i.e. it's slow.

Post-filtering can be fast because you keep the graph structure intact and perform an ANN search, but then you filter the results afterwards. So if you ask for the top 5 most similar results, you can end up with 4, 2, or in the worst-case scenario 0 results.
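To make the difference concrete, here is a minimal Python sketch. It uses an exact brute-force search as a stand-in for the ANN step, and the toy corpus, field layout, and function names are all made up for illustration; it is not how Lucene or Pinecone implement anything.

```python
import math

# Hypothetical toy corpus: (doc_id, year, vector). All values are made up.
DOCS = [
    (0, 2018, [0.10, 0.10]),
    (1, 2019, [0.12, 0.09]),
    (2, 2021, [0.90, 0.80]),
    (3, 2017, [0.11, 0.12]),
    (4, 2022, [0.50, 0.40]),
    (5, 2016, [0.09, 0.11]),
]


def top_k(query, docs, k):
    """Exact nearest-neighbor search; stands in for the ANN step."""
    return sorted(docs, key=lambda d: math.dist(query, d[2]))[:k]


def post_filter(query, k, predicate):
    """Search first, then drop results that fail the filter."""
    return [d for d in top_k(query, DOCS, k) if predicate(d)]


def pre_filter(query, k, predicate):
    """Restrict the candidate set first, then search only within it."""
    return top_k(query, [d for d in DOCS if predicate(d)], k)


def created_after_2020(doc):
    return doc[1] > 2020


query = [0.1, 0.1]
post_ids = [d[0] for d in post_filter(query, 2, created_after_2020)]
pre_ids = [d[0] for d in pre_filter(query, 2, created_after_2020)]
print(post_ids)  # no matches: both nearest docs predate 2021
print(pre_ids)   # the two matching docs, nearest first
```

In this toy data the two nearest documents were both created before 2021, so post-filtering returns nothing even though two matching documents exist, while pre-filtering finds both.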

Pinecone has introduced something called 'single-stage filtering' which manages to maintain accuracy like pre-filtering and return the exact number of matches you requested, while (typically) offering a speed increase like with post-filtering. So you get the best of both worlds.

0
xeraa On

Quoting from "Vector search in Elasticsearch: The rationale behind the design" (Elasticsearch will be the most common way for people to consume Lucene's kNN search):

By having its own HNSW graph that is tied to a segment and where nodes are indexed by doc ID, Lucene can make interesting decisions about how best to pre-filter vector searches: either by linearly scanning documents that match the filter if it is selective, or by traversing the graph and only considering nodes that match the filter as candidates for top-k vectors otherwise.
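The two strategies in that quote can be sketched roughly as follows. This is only an illustration in Python, not Lucene's code: it assumes a single-layer neighbor graph instead of a real hierarchical HNSW graph, and the `selective_ratio` threshold and all function names are hypothetical.

```python
import heapq
import math


def exact_scan(query, vectors, accepted, k):
    """Linear scan over only the documents that match the filter."""
    return sorted(accepted, key=lambda i: math.dist(query, vectors[i]))[:k]


def filtered_graph_search(query, vectors, neighbors, accepted, k, entry=0):
    """Best-first graph traversal: any node may be visited to keep the graph
    connected, but only nodes matching the filter are kept as results."""
    visited = {entry}
    frontier = [(math.dist(query, vectors[entry]), entry)]  # min-heap
    results = []  # max-heap of accepted nodes, stored as (-distance, node)
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(results) == k and dist > -results[0][0]:
            break  # closest unexplored node is worse than the worst result
        if node in accepted:
            heapq.heappush(results, (-dist, node))
            if len(results) > k:
                heapq.heappop(results)  # drop the farthest result
        for nb in neighbors[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (math.dist(query, vectors[nb]), nb))
    return [n for _, n in sorted((-neg, n) for neg, n in results)]


def knn_with_filter(query, vectors, neighbors, accepted, k, selective_ratio=0.1):
    """Pick a strategy from the filter's selectivity (threshold is made up)."""
    if len(accepted) <= selective_ratio * len(vectors):
        return exact_scan(query, vectors, accepted, k)
    return filtered_graph_search(query, vectors, neighbors, accepted, k)


# Toy index: ten 1-D vectors linked in a chain (0-1-2-...-9).
vectors = [[float(i)] for i in range(10)]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

broad = knn_with_filter([6.2], vectors, neighbors, {0, 2, 4, 6, 8}, k=2)
narrow = knn_with_filter([6.2], vectors, neighbors, {9}, k=2)
print(broad, narrow)
```

With the broad filter the traversal walks through non-matching nodes such as 5 and 7 but only collects matching ones; with the narrow filter it skips the graph entirely and scans the single matching document.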

0
Prabhat Jha On

Astra Vector search is powered by Lucene and Apache Cassandra and it does support pre-filtering. You can do something like:

SELECT * FROM vsearch.products WHERE created_at > '2005-12-13'
ORDER BY item_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 1;