I have already indexed the documents, with each word carrying a payload that contains its part-of-speech (POS) tag. I want to retrieve only those documents in which the query words have a given POS tag. E.g. in 'access google', google is a noun, so the search should return only docs where google is tagged as a noun. Can writing a custom analyser help? How can I access the Term when the payload is being accessed in the Similarity class?
Lucene searching using payload and NLP tags
333 Views Asked by igopimac13 At
There are 3 best solutions below
0

I would recommend using span queries. A span query can return a Spans object, which allows you to inspect the payload of every matching token.
See PayloadTermQuery.
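To make the idea concrete, here is a rough plain-Java model of what "inspect the payload of every matching token" amounts to. It does not use the Lucene API at all: `PosPayloadFilter`, `Posting` and `matchingDocs` are made-up names for illustration, and the in-memory list stands in for the postings that a Spans object would walk over.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of payload inspection; none of these types are Lucene classes.
final class PosPayloadFilter {

    /** One indexed token occurrence: the term text plus its raw payload bytes. */
    static final class Posting {
        final int docId;
        final String term;
        final byte[] payload;

        Posting(int docId, String term, String tag) {
            this.docId = docId;
            this.term = term;
            this.payload = tag.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Returns ids of documents where term occurs at least once with the wanted tag. */
    static List<Integer> matchingDocs(List<Posting> postings, String term, String wantedTag) {
        List<Integer> hits = new ArrayList<>();
        for (Posting p : postings) {
            // Decode the payload of this occurrence and compare it with the wanted tag.
            String tag = new String(p.payload, StandardCharsets.UTF_8);
            if (p.term.equals(term) && tag.equals(wantedTag) && !hits.contains(p.docId)) {
                hits.add(p.docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        postings.add(new Posting(0, "google", "noun")); // e.g. "access google"
        postings.add(new Posting(1, "google", "verb")); // e.g. "google it"
        postings.add(new Posting(2, "google", "noun"));
        System.out.println(matchingDocs(postings, "google", "noun")); // prints [0, 2]
    }
}
```

The point of the sketch is that the tag check happens per occurrence, so a document is kept only if the query word appears in it with the wanted tag, which is the filtering behaviour the question asks for.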
1

You can use the PayloadAttribute class to store the tags as payloads and then override the scorePayload method of DefaultSimilarity class to make use of the tags. In your case you would want to return 1 if the tag content is noun and zero otherwise.
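The following code snippet sets the payload information:
String tag = "noun";
// Encode explicitly as UTF-8 so it matches utf8ToString() on the read side
byte[] payload = tag.getBytes(StandardCharsets.UTF_8);
// Lucene 4.x takes a BytesRef here; Lucene 3.x used new Payload(payload) instead
payloadAttr.setPayload(new BytesRef(payload));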
Now use the following lines of code to make use of the tags during retrieval. This has to be done by extending the DefaultSimilarity class.
class PayloadSimilarity extends DefaultSimilarity {
    ...
    ...
    // scorePayload is public in DefaultSimilarity, so the override must be public as well
    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        String payloadData = payload.utf8ToString();
        return payloadData.equals("noun") ? 1 : 0;
    }
    ...
    ...
}
Finally, just set your similarity class to your extended class during retrieval. Note that scorePayload is only consulted by payload-aware queries such as PayloadTermQuery.
searcher.setSimilarity(new PayloadSimilarity());
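The round trip is easy to get subtly wrong (for example, encoding with one charset and decoding with another), so here is a small stand-alone sketch of it in plain Java. It is not Lucene code: `PosTagPayload`, `encode` and `score` are hypothetical names, and `score` only mirrors the 1-or-0 rule from scorePayload above. Passing the wanted tag as a parameter, instead of hard-coding "noun", is my assumption about how you might handle query words with different tags.

```java
import java.nio.charset.StandardCharsets;

// Stand-alone model of the tag round trip; not a Lucene class.
final class PosTagPayload {

    /** Index side: turn a POS tag into payload bytes with an explicit charset. */
    static byte[] encode(String tag) {
        return tag.getBytes(StandardCharsets.UTF_8);
    }

    /** Search side: mirrors the scorePayload rule, 1 for the wanted tag and 0 otherwise. */
    static float score(byte[] payload, String wantedTag) {
        if (payload == null) {
            return 0f; // token was indexed without a payload
        }
        String tag = new String(payload, StandardCharsets.UTF_8);
        return tag.equals(wantedTag) ? 1f : 0f;
    }

    public static void main(String[] args) {
        System.out.println(score(encode("noun"), "noun")); // prints 1.0
        System.out.println(score(encode("verb"), "noun")); // prints 0.0
    }
}
```

In a real similarity class the wanted tag would have to come from somewhere, for instance a constructor argument set per query after tagging the query text; that part is left open here.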
Doing exact (:google AND :'noun') queries in Lucene can be tricky... What is your query, and how are you writing the docs to the index?