Learning ES at the moment, but I'm very keen to implement this.
I know you can highlight different fields with different tags, using the pre_tags and post_tags keys of highlight in a query... but is it possible to deliver a marked-up string where the returned fragment has a different HTML colour tag for each separate identified word, e.g. using a simple query string?
So I query with "interesting data" and a document field is returned like so:
the other day I was walking through the woods and I had an <font color="blue">interesting</font> thought about some <font color="red">data</font>
What I'm getting at is not simply that the tags alternate "mindlessly": that you can already do with the Fast Vector Highlighter, e.g.:
"highlight": {
  "fields": {
    "description": {
      "pre_tags": ["<b>", "<em>"],
      "post_tags": ["</b>", "</em>"]
    }
  }
}
Instead, I would like the field
"the other data day data was walking through some interesting woods and data had an interesting thought about some data"
to be returned thus:
the other <font color="red">data</font> day <font color="red">data</font> was walking through some <font color="blue">interesting</font> woods and <font color="red">data</font> had an <font color="blue">interesting</font> thought about some <font color="red">data</font>
I've previously coded using Lucene, i.e. Java, and I did manage to implement this sort of thing, by majorly jumping through hoops.
NB one answer to this might be "forget about ES returning marked up text, just apply your own tags using re.sub( r'\bdata\b', '<font color="red">data</font>', field_string )".
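To make that client-side idea concrete, here is a minimal sketch in Python. The function name colour_terms and the word-to-colour mapping are my own illustration, not anything Elasticsearch provides, and it only matches the exact query words:

```python
import re


def colour_terms(text, colours):
    """Wrap each exact query word in its own <font> tag.

    `colours` maps a query word to a colour name. Both the function and the
    mapping are illustrative and not part of any Elasticsearch API.
    """
    # Use one alternation so the text is scanned once and markup that has
    # already been inserted is never matched again.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in colours) + r")\b"
    )

    def wrap(match):
        word = match.group(1)
        return '<font color="{}">{}</font>'.format(colours[word], word)

    return pattern.sub(wrap, text)


field_string = "I had an interesting thought about some data"
print(colour_terms(field_string, {"interesting": "blue", "data": "red"}))
```

Because the pattern is built from the literal query words, inflected forms such as "éléments" for the query word "élément" are not picked up.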
This would be OK for a simple use-case like this. But it doesn't work once a stemming analyser is involved. E.g., to give a French example: search query is "changer élément". I want the following marked-up result:
Les autres <font color="red">éléments</font> ont été <font color="blue">changés</font> car on a appliqué un <font color="blue">changement</font> à chaque <font color="red">élément</font>
i.e. "changer", "changés" and "changement" all stem to "chang", and "élément" and "éléments" stem to "element". A standard highlighted return of this field would thus be:
Les autres <em>éléments</em> ont été <em>changés</em> car on a appliqué un <em>changement</em> à chaque <em>élément</em>
The fast vector highlighter is a good place to start. I haven't worked with French yet, so don't consider the following authoritative, but based on the built-in french analyzer, we could do something like this:
FYI, the french analyzer could be reimplemented/extended as shown here.
After ingesting the English & French examples:
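A minimal sketch of both steps (index setup, then ingestion), written as Python helpers that only build the request bodies. The index name highlight_demo and the helper names are hypothetical, and I'm assuming a single description field analysed with the built-in french analyzer for both example documents. The fast vector highlighter needs term vectors with positions and offsets on the field, hence the term_vector setting:

```python
import json

# Hypothetical index name, chosen for illustration only.
INDEX_NAME = "highlight_demo"

# The two example field values from the question.
EXAMPLE_DOCS = [
    {
        "description": "the other data day data was walking through some "
        "interesting woods and data had an interesting thought about some data"
    },
    {
        "description": "Les autres éléments ont été changés car on a appliqué "
        "un changement à chaque élément"
    },
]


def build_index_body():
    """Body for creating an index whose description field supports fvh."""
    return {
        "mappings": {
            "properties": {
                "description": {
                    "type": "text",
                    "analyzer": "french",
                    # Required by the fast vector highlighter.
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    }


def build_bulk_lines(index_name, docs):
    """Serialise documents as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format expects a trailing newline.
    return "\n".join(lines) + "\n"


print(json.dumps(build_index_body(), indent=2))
print(build_bulk_lines(INDEX_NAME, EXAMPLE_DOCS))
```

The first body would go to PUT /highlight_demo and the second to POST /_bulk, using whichever HTTP client you prefer.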
We can query for "interesting data" like so:
yielding
and analogously for "changer élément":
yielding
which, to me, looks correctly stemmed.
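The two requests above might look roughly like the following sketch, again as a Python helper that only builds the search body. The helper name build_highlight_query is hypothetical, and I'm assuming the field is called description and was mapped with term vectors so that the fvh highlighter type can be used:

```python
import json


def build_highlight_query(query_text, colours):
    """Search body with one <font> tag pair per colour, in the given order."""
    pre_tags = ['<font color="{}">'.format(colour) for colour in colours]
    post_tags = ["</font>"] * len(colours)
    return {
        "query": {
            "simple_query_string": {
                "query": query_text,
                "fields": ["description"],
            }
        },
        "highlight": {
            "fields": {
                "description": {
                    # Multiple tag pairs are only honoured by the
                    # fast vector highlighter.
                    "type": "fvh",
                    # 0 asks for the whole field instead of fragments.
                    "number_of_fragments": 0,
                    "pre_tags": pre_tags,
                    "post_tags": post_tags,
                }
            }
        },
    }


body = build_highlight_query("changer élément", ["red", "blue"])
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The resulting body would be sent to the index's _search endpoint; the same helper covers the "interesting data" query by changing the first argument.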
Note that the pre_tags order is enforced based on which token inside the simple_query_string query produced the match. When querying for "changer élément", the term éléments in the description is matched first, but what caused it to match is the 2nd token (élément), hence the blue HTML tag instead of the red.
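If the tag index really does follow the token position in the query, as observed above, then a specific colour per term can be had by ordering the tag lists to match the query. A small sketch under that assumption. The helper tags_for_query is hypothetical and it splits the query on whitespace only, so it ignores simple_query_string operators:

```python
def tags_for_query(query_text, colour_by_term):
    """Return (pre_tags, post_tags) ordered like the tokens in query_text.

    Assumption: the highlighter picks the n-th tag pair for the n-th query
    token. Tokens are split on whitespace only.
    """
    pre_tags = []
    post_tags = []
    for token in query_text.split():
        colour = colour_by_term.get(token)
        if colour is None:
            # Fall back to a plain <em> for tokens without a chosen colour.
            pre_tags.append("<em>")
            post_tags.append("</em>")
        else:
            pre_tags.append('<font color="{}">'.format(colour))
            post_tags.append("</font>")
    return pre_tags, post_tags


pre, post = tags_for_query(
    "changer élément", {"élément": "red", "changer": "blue"}
)
print(pre)   # blue first, because "changer" is the first query token
print(post)
```

The two lists can then be used as the pre_tags and post_tags of the highlight section.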