Lucene recently added HNSW approximate nearest neighbor (ANN) search in Lucene 9.0.0, based on this original branch: https://issues.apache.org/jira/browse/LUCENE-9004 .
Does Lucene support pre-filtering? For example, let's say we want to do a vector search over documents created after the year 2020. Is it possible to filter for these documents in the same request as the vector search? Or must we post-filter after getting back the ANN search results?
I notice there is a member acceptOrds under the query method here: https://javadoc.io/doc/org.apache.lucene/lucene-core/latest/org/apache/lucene/util/hnsw/HnswGraph.html . Might that be used for filtering?
As far as I know it doesn't. I imagine it is in the pipeline, but it will probably take time. You could look into Pinecone; from what I've seen, Pinecone's metadata filtering is well ahead.
The reason is that pre-filtering restricts your search scope: it filters nodes out of your HNSW graph, so you can no longer perform an ANN search with the graph, because the filter has 'broken' it. The search therefore reverts to an exact kNN search, i.e. it's slow.
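To make that fallback concrete, here is a minimal sketch in plain Python (not Lucene or Pinecone code). The document layout, the `year` field and the helper names are all made up for illustration. It applies the filter first and then scores every surviving document exactly, which is why the cost grows with the number of matching documents:

```python
import math

def cosine(a, b):
    # Exact cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prefilter_exact_knn(docs, query, k, predicate):
    # Step 1: apply the metadata filter before any vector work.
    candidates = [d for d in docs if predicate(d)]
    # Step 2: no graph is used here; every remaining vector is scored (O(n)).
    candidates.sort(key=lambda d: cosine(d["vector"], query), reverse=True)
    return candidates[:k]

# Hypothetical toy corpus: id, creation year, 2-d embedding.
docs = [
    {"id": 1, "year": 2019, "vector": [1.0, 0.0]},
    {"id": 2, "year": 2021, "vector": [0.9, 0.1]},
    {"id": 3, "year": 2022, "vector": [0.0, 1.0]},
    {"id": 4, "year": 2021, "vector": [0.7, 0.7]},
]

hits = prefilter_exact_knn(docs, [1.0, 0.0], 2, lambda d: d["year"] > 2020)
hit_ids = [d["id"] for d in hits]
print(hit_ids)
```

The result always respects the filter and contains k items whenever at least k documents match, but nothing in it is approximate or sub-linear.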
Post-filtering can be fast because you keep the graph structure and perform an ANN search, but you then filter the results. So if you ask for the top 5 most similar results, you can end up with 4, 2, or in the worst-case scenario 0 results.
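The shortfall is easy to reproduce with a toy example. The sketch below is plain Python with made-up data, and `top_k` is an exact search standing in for the ANN step (an assumption for illustration; a real HNSW search would be approximate). It also shows the common over-fetching workaround, where you ask for more neighbours than you need and trim after filtering:

```python
def dot(a, b):
    # Dot-product similarity between two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def top_k(docs, query, k):
    # Stand-in for the ANN search: returns the k most similar documents.
    return sorted(docs, key=lambda d: dot(d["vector"], query), reverse=True)[:k]

def postfilter(docs, query, k, predicate):
    # Search first, filter afterwards: may return fewer than k results.
    return [d for d in top_k(docs, query, k) if predicate(d)]

def postfilter_overfetch(docs, query, k, predicate, fetch):
    # Workaround: fetch more than k neighbours, filter, then trim to k.
    return [d for d in top_k(docs, query, fetch) if predicate(d)][:k]

# Hypothetical toy corpus: the closest vectors are the older documents.
docs = [
    {"id": 1, "year": 2019, "vector": [1.0, 0.0]},
    {"id": 2, "year": 2018, "vector": [0.9, 0.1]},
    {"id": 3, "year": 2021, "vector": [0.8, 0.2]},
    {"id": 4, "year": 2022, "vector": [0.1, 0.9]},
    {"id": 5, "year": 2021, "vector": [0.5, 0.5]},
]
query = [1.0, 0.0]
after_2020 = lambda d: d["year"] > 2020

short_ids = [d["id"] for d in postfilter(docs, query, 3, after_2020)]
full_ids = [d["id"] for d in postfilter_overfetch(docs, query, 3, after_2020, fetch=5)]
print(short_ids, full_ids)
```

Three documents satisfy the filter, yet the plain post-filter returns only one of them because the other two never made it into the top 3. Over-fetching helps, but the right `fetch` size depends on how selective the filter is, so it is a heuristic rather than a guarantee.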
Pinecone has introduced something called 'single-stage filtering', which manages to maintain accuracy like pre-filtering and return the exact number of matches you requested, while (typically) offering a speed increase like post-filtering. So you get the best of both worlds.