Retrieve text in Lucene proximity query

I'm using Lucene to index a set of sentences. My queries involve two "entities", and I create a proximity query like this:

"EntityA EntityB"~22 

I want to retrieve all the sentences that contain these two entities within a maximum range of 22 characters. Now I want to use the Lucene Highlighter to retrieve the words between the two entities. I am using code like the following to split the content into fragments, but I don't know how to make a fragment cover exactly the span between the two entities.

for (int i = 0; i < numTotalHits; i++) {
    int id = hits[i].doc;
    Document doc = searcher.doc(id);
    String text = doc.get("content");
    // Token stream for the "content" field (not used by the helper call below).
    TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), id, "content", analyzer);
    // getFragmentsWithHighlightedTerms is a helper method defined elsewhere (not shown).
    String[] frag = getFragmentsWithHighlightedTerms(analyzer, query, "content", text, 10, 10);

    // Print every fragment returned for this hit.
    for (int j = 0; j < frag.length; j++) {
        System.out.println(frag[j]);
    }
}

My aim is to retrieve the text between the two entities, for example:

entity1 --> Canada
entity2 --> Ottawa
sentence --> Natural Resources Canada, Canadian Forest Service, Ottawa.
result --> , Canadian Forest Service, 
There is 1 solution below

To the best of my knowledge, the "foo bar"~22 syntax creates a phrase query with a slop of 22. The slop is the maximum number of position moves allowed to bring the two tokens next to one another in the same order as in the query. Those moves are counted in token positions (in this context a token means a word), so the 22 has nothing to do with the length of the text in characters.
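To make the difference between token positions and characters concrete, here is a small sketch in plain Java that measures both gaps for the sentence from the question. It does not use Lucene: the `tokens` method is a hypothetical, simplified stand-in for an analyzer (it lowercases and splits on anything that is not a letter or digit), so a real analyzer may tokenize differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SlopVsCharacters {
    // Assumption: lowercase + split on non-alphanumerics roughly mimics an analyzer.
    static List<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Number of tokens strictly between the first occurrences of a and b, or -1 if either is missing.
    static int tokensBetween(String text, String a, String b) {
        List<String> t = tokens(text);
        int i = t.indexOf(a.toLowerCase());
        int j = t.indexOf(b.toLowerCase());
        return (i < 0 || j < 0) ? -1 : Math.abs(j - i) - 1;
    }

    // Number of characters between the end of the first a and the next b, or -1 if not found.
    static int charsBetween(String text, String a, String b) {
        int i = text.indexOf(a);
        if (i < 0) return -1;
        int end = i + a.length();
        int j = text.indexOf(b, end);
        return j < 0 ? -1 : j - end;
    }

    public static void main(String[] args) {
        String s = "Natural Resources Canada, Canadian Forest Service, Ottawa.";
        System.out.println("tokens between: " + tokensBetween(s, "Canada", "Ottawa"));
        System.out.println("chars between:  " + charsBetween(s, "Canada", "Ottawa"));
    }
}
```

For the example sentence this reports 3 tokens but 27 characters between "Canada" and "Ottawa", so under this simplified tokenization a small slop would already be expected to match, even though the character gap is larger than 22.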

Once you retrieve a relevant result with a phrase query, I don't think there's any reliable way to get the entire fragment between the two entities.

If you can build the query object yourself, I'd go with a regex query instead, since you already mentioned the 22-character range, and highlight on that. Then you can easily trim the two entities from the highlighted text.
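As a rough illustration of the trimming step, the sketch below applies plain `java.util.regex` to the stored field text after a hit has been retrieved. This is not a Lucene query or the Highlighter API; the class name `BetweenEntities` and the `maxGap` parameter are made up for the example, and it assumes entity A appears before entity B in the sentence.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenEntities {
    // Returns the text between entityA and entityB if at most maxGap characters separate them.
    static Optional<String> between(String sentence, String entityA, String entityB, int maxGap) {
        // Pattern.quote keeps entity names from being interpreted as regex syntax;
        // the reluctant quantifier stops at the nearest entityB.
        Pattern p = Pattern.compile(
                Pattern.quote(entityA) + "(.{0," + maxGap + "}?)" + Pattern.quote(entityB),
                Pattern.DOTALL);
        Matcher m = p.matcher(sentence);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String s = "Natural Resources Canada, Canadian Forest Service, Ottawa.";
        // The gap in this sentence is 27 characters, so a limit of 22 finds nothing.
        System.out.println(between(s, "Canada", "Ottawa", 22).orElse("<no match>"));
        System.out.println(between(s, "Canada", "Ottawa", 30).orElse("<no match>"));
    }
}
```

Note that with a 22-character limit the example sentence from the question does not match, because the text between the two entities is 27 characters long; the limit would need to be raised for that case.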