I'm using dtSearch to highlight text search matches within a document. The code to do this, minus some details and cleanup, is roughly along these lines:
SearchJob sj = new SearchJob();
sj.Request = "\"audit trail\""; // the user query
sj.FoldersToSearch.Add(path_to_src_document);
sj.Execute();
FileConverter fileConverter = new FileConverter();
fileConverter.SetInputItem(sj.Results, 0);
fileConverter.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
fileConverter.AfterHit = "</b>";
fileConverter.Execute();
string myHighlightedDoc = fileConverter.OutputString;
If I give dtSearch a quoted phrase query like
"audit trail"
then dtSearch will do hit highlighting like this:
An <a name="HH_0"/><b>audit</b> <a name="HH_1"/><b>trail</b> is a fun thing to have an <a name="HH_2"/><b>audit</b> <a name="HH_last"/><b>trail</b> about!
Note that each word of the phrase is highlighted separately. Instead, I would like phrases to get highlighted as whole units, like this:
An <a name="HH_0"/><b>audit trail</b> is a fun thing to have an <a name="HH_last"/><b>audit trail</b> about!
This would A) make highlighting look better, B) improve the behavior of my JavaScript that helps users navigate from hit to hit, and C) give a more accurate count of the total number of hits.
Is there a good way to make dtSearch highlight phrases this way?
Note: I think the text and code here could use some more work. If people want to help revise the answer or the code, this can probably become community wiki.
I asked dtSearch about this (4/26/2010). Their response was two-part:
First, it's not possible to get the desired highlighting behavior just by, say, altering a flag.
Second, it is possible to get some lower-level hit information in which phrase matches are treated as wholes. In particular, if you set both the dtsSearchWantHitsByWord and the dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets at which each word or phrase in your query matches. For example, if your input document is
An audit trail is a fun thing to have an audit trail about!
and your query is
"audit trail"
then (in the .NET API), sj.Results.CurrentItem.HitsByWord[0] will contain a string indicating that the phrase "audit trail" is found starting at the 2nd word and the 11th word in the document.
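To see how those phrase-start offsets relate to the per-word highlights, here is a minimal sketch. It assumes you have already extracted two lists of 1-based word offsets from the search results: one for every highlighted word, and one for the positions where a whole query term or phrase begins. The class and method names (ContinuationHitsSketch, FindContinuationHits) are made up for illustration and are not part of the dtSearch API, and parsing the HitsByWord string itself is not shown.

```csharp
using System;
using System.Collections.Generic;

public static class ContinuationHitsSketch
{
    // hitOffsets:   word offsets of every highlighted word, in document order
    //               (assumed to have been read from the search results already).
    // phraseStarts: word offsets where a whole query term or phrase begins
    //               (assumed to have been parsed from the HitsByWord data).
    // Returns the 0-based indexes of hits that merely continue a phrase.
    public static List<int> FindContinuationHits(IList<int> hitOffsets,
                                                 IEnumerable<int> phraseStarts)
    {
        var starts = new HashSet<int>(phraseStarts);
        var continuations = new List<int>();
        for (int i = 0; i < hitOffsets.Count; i++)
        {
            // A hit that does not begin a term or phrase is a continuation.
            if (!starts.Contains(hitOffsets[i]))
            {
                continuations.Add(i);
            }
        }
        return continuations;
    }

    public static void Main()
    {
        // "An audit trail is a fun thing to have an audit trail about!"
        int[] hitOffsets = { 2, 3, 11, 12 };   // audit, trail, audit, trail
        int[] phraseStarts = { 2, 11 };        // each "audit trail" starts here
        List<int> continuations = FindContinuationHits(hitOffsets, phraseStarts);
        Console.WriteLine(string.Join(", ", continuations)); // prints "1, 3"
    }
}
```

For the example sentence, hits 1 and 3 (the two "trail" highlights) come out as continuations, while hits 0 and 2 mark the starts of the two phrase matches.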
One thing you can do with this information is to create a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which ones are phrase continuations, rather than being the start of a word or phrase). For example, if your skip list was [4, 7, 9], that might mean that the 4th, 7th and 9th hits were insignificant, whereas the other hits were legit. A "skip list" of this sort could be used in at least two ways:
Supposing these "skip lists" are indeed useful, how would you generate them? Well here's some code that mostly works: