I am using the official Elasticsearch-PHP client on a personal Debian server. What I am trying to do involves indexing, searching and highlighting individual documents, i.e. each search will return only one document, which will then be highlighted for "simple query string" searches. I am also using FVH (fast vector highlighting).
My question is similar to this one, "Position as result, instead of highlighting", and the test code is basically the same, so I won't repeat it here. However, in my case I need both position and highlighting. I followed the link to the documentation about term vectors, but just like the other OP, my searches are not exact words per se. In some cases they are phrases. How would I approach this?
My use case is to search only one document (for each query), and present a summary of results with links which the user can click to go to the specific place in the document where that result came from. If I have the index / position I can simply use that against the full source of the document. I have checked the documentation to no avail.
You could try installing a plugin developed by the Wikimedia Foundation called Experimental Highlighter (GitHub here).
You can install it for Elasticsearch 7.5 in this way (for other Elasticsearch versions, please refer to the GitHub project page):
Then restart Elasticsearch.
If you also need to retrieve the positions (if offsets can replace positions for your use case, skip to the next paragraph), you should declare your field with the term_vector option "with_positions_offsets_payloads" (doc here).
For other cases that don't need the positions, it is faster and uses much less space to use the index option "offsets" (elastic doc here, plugin doc here):
Then you could query with the experimental highlighter and return only the offsets of the highlighted part:
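As a rough sketch of the two mapping variants and the query described above, written as plain Python dictionaries so it runs without a cluster: the field name text comes from this answer, while the sample query string, the "experimental" highlighter type and the "return_offsets" option are my assumptions about the plugin's request format, so please check them against the plugin README for your version. The same structures translate directly to PHP arrays for Elasticsearch-PHP.

```python
import json

FIELD = "text"  # field name used in this answer

# Variant A: store term vectors with positions, offsets and payloads.
mapping_with_positions = {
    "mappings": {
        "properties": {
            FIELD: {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
            }
        }
    }
}

# Variant B: store only offsets in the postings (smaller and faster).
mapping_offsets_only = {
    "mappings": {
        "properties": {
            FIELD: {"type": "text", "index_options": "offsets"}
        }
    }
}

# Search body: simple_query_string plus the plugin's highlighter.
# "experimental" and "return_offsets" are assumed plugin option names.
search_body = {
    "_source": [FIELD],
    "query": {
        "simple_query_string": {
            "query": '"fast vector" highlighting',  # hypothetical query
            "fields": [FIELD],
        }
    },
    "highlight": {
        "fields": {
            FIELD: {
                "type": "experimental",
                "options": {"return_offsets": True},
            }
        }
    },
}

# Print the JSON that would be sent to Elasticsearch.
print(json.dumps(mapping_with_positions, indent=2))
print(json.dumps(mapping_offsets_only, indent=2))
print(json.dumps(search_body, indent=2))
```

Only one of the two mappings is needed; pick variant A if you need positions as well, variant B otherwise.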
In this way no text is returned from your query, only the start offset and the end offset, numbers that represent a position in the field's text. To retrieve your highlighted content, read ['hits']['hits'][0]['_source']['text'] (text is your field name) and extract the substring between the start offset and the end offset. Make sure to use the correct string encoding (UTF-8), otherwise the offsets won't match the text. According to the doc:
Let me know if that plugin could help!
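To illustrate the extraction step and why the encoding matters, here is a minimal Python sketch. It assumes the offsets count characters rather than bytes, which you should verify against your plugin version; the response, the sample text and the offsets 5 and 12 are made up for the example, and extract_span is a hypothetical helper name.

```python
def extract_span(text: str, start: int, end: int) -> str:
    """Return text[start:end], treating both offsets as character offsets."""
    if not 0 <= start <= end <= len(text):
        raise ValueError(f"offsets {start}-{end} outside text of length {len(text)}")
    return text[start:end]


# Hypothetical response, shaped like ['hits']['hits'][0]['_source']['text'].
response = {
    "hits": {"hits": [{"_source": {"text": "Café au lait is a coffee drink."}}]}
}
source_text = response["hits"]["hits"][0]["_source"]["text"]

start, end = 5, 12  # made-up offsets for the phrase "au lait"

# Slicing the decoded string gives the intended phrase.
by_chars = extract_span(source_text, start, end)

# Slicing the raw UTF-8 bytes drifts by one, because "é" takes two bytes.
by_bytes = source_text.encode("utf-8")[start:end].decode("utf-8")

print(repr(by_chars))  # 'au lait'
print(repr(by_bytes))  # ' au lai'
```

The same drift appears in PHP if you use byte-based functions such as substr on multibyte text; mb_substr with the UTF-8 encoding is the character-based counterpart.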