I would like to create a word concordance hit list with Solr, which gives all occurrences of the given word with context.
An English example:
...bla bla1 <b>dog</b> bla bla 1...
...bla bla2 <b>dog</b> bla bla 2...
...bla bla3 <b>dogs</b> bla bla 3
...bla bla4 <b>dogging</b> bla bla 4...
...bla bla5 <b>dog</b> bla bla 5...
It's important to be able to customize the size of the context. (Sometimes more than 1 sentence.)
My question: how can I do this with Solr?
Lucene 4.1 is able to do this, for example with FastVectorHighlighter:
//indexing
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
offsetsType.setStored(true);
offsetsType.setIndexed(true);
offsetsType.setStoreTermVectors(true);
offsetsType.setStoreTermVectorOffsets(true);
offsetsType.setStoreTermVectorPositions(true);
offsetsType.setStoreTermVectorPayloads(true);
doc.add(new Field("content", fileContent, offsetsType));
//searching
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
QueryParser parser = new QueryParser(Version.LUCENE_41, "content", analyzer);
Query query = parser.parse("dog");
TopDocs results = searcher.search(query, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
    int id = results.scoreDocs[i].doc;
    Document doc = searcher.doc(id);
    FastVectorHighlighter h = new FastVectorHighlighter();
    // contextSize is the fragment size in characters; 10000 is the maximum number of fragments
    String[] hs = h.getBestFragments(h.getFieldQuery(query), reader, id, "content", contextSize, 10000);
    if (hs != null)
        for (String f : hs)
            System.out.println(" highlight: " + f);
}
But how can I ask Solr to do the same?
My attempt was this (solrconfig.xml):
<fragmentsBuilder name="colored" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[
<b style="background:yellow">,<b style="background:lawgreen">,
<b style="background:aquamarine">,<b style="background:magenta">,
<b style="background:palegreen">,<b style="background:coral">,
<b style="background:wheat">,<b style="background:khaki">,
<b style="background:lime">,<b style="background:deepskyblue">]]></str>
<str name="hl.tag.post"><![CDATA[</b>]]></str>
</lst>
</fragmentsBuilder>
<requestHandler name="drupal" class="solr.SearchHandler" default="true">
...
<str name="hl">true</str>
<str name="hl.fl">content</str>
<int name="hl.snippets">5000</int>
<int name="hl.fragsize">300</int>
<str name="hl.simple.pre"><![CDATA[ <b style="background:yellow"><i> ]]></str>
<str name="hl.simple.post"><![CDATA[ </i></b> ]]></str>
<str name="hl.mergeContiguous">true</str>
<str name="hl.fragListBuilder">single</str>
<str name="hl.useFastVectorHighlighter">true</str>
But it always returns one large fragment per document, not one fragment per occurrence.
Thanks, Steve
Can you try with
hl.fragsize=100
and hl.mergeContiguous=false
and see how many fragments you get? (Before adding the params to your SearchHandler in solrconfig.xml, you can try various options by passing them as request parameters in the query. Once you find a set of params you are happy with, put those in solrconfig.xml.)
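To try those options per request, something like the following could build the query URL. This is only a sketch: the class name `ConcordanceQuery`, the helper `buildUrl`, and the base URL `http://localhost:8983/solr/select` are made up for illustration, so adjust them to your setup. It also sets hl.fragListBuilder=simple instead of `single`, on the assumption that your solrconfig.xml registers the `simple` builder as the stock example config does; as far as I know, `single` treats the whole field as one fragment, which may be why you see only one fragment per document.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConcordanceQuery {

    // Builds a Solr select URL with the highlighting params passed on the request,
    // so they can be tuned without editing solrconfig.xml.
    // baseUrl, the "content" field, and the chosen values are assumptions for illustration.
    static String buildUrl(String baseUrl, String word, int fragSize, int maxSnippets)
            throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", "content:" + word);
        params.put("hl", "true");
        params.put("hl.fl", "content");
        params.put("hl.useFastVectorHighlighter", "true");
        // Assumption: a fragListBuilder named "simple" is registered in solrconfig.xml.
        params.put("hl.fragListBuilder", "simple");
        params.put("hl.fragsize", String.valueOf(fragSize));
        params.put("hl.mergeContiguous", "false");
        params.put("hl.snippets", String.valueOf(maxSnippets));

        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(separator)
               .append(URLEncoder.encode(e.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local Solr instance; paste the printed URL into a browser or curl.
        System.out.println(buildUrl("http://localhost:8983/solr/select", "dog", 100, 5000));
    }
}
```

Request parameters override the defaults of the handler, so you can vary hl.fragsize and hl.snippets from call to call and compare the number of fragments returned for each document.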