Lucene tokenization / filters not working as expected | Solr analysis confusion


I am trying to figure out the correct analyzer configuration for my Solr/Lucidworks setup.

The results that I am seeing in Solr analysis seem to indicate that I should be getting matches, but when I do the Solr query (native or in the Lucidworks UI), no results are returned.

The relevant fragments from the schema are:

<field name="content" indexed="true" multiValued="false" required="false" stored="true" type="dlowe_text_en"/>


<dynamicField indexed="true" name="*_txt_en_dlowe_split_tight" stored="true" type="dlowe_text_en"/>
<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="dlowe_text_en" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

I have indexed some content that contains the string:

Administrator's Guide

Now, when I use the Solr analysis screen, these are the results that I get:

[Screenshot: Solr analysis output]

My understanding is that if any of the results are highlighted, this represents a match, but when I search in Solr on "Administrator", no results are found:

[Screenshot: Solr query returning no results]

If I search on:

Administrator's

I do get the expected result.

Am I totally misunderstanding how the analysis tool should work?

What I am trying to achieve is a search index that supports a lot of technical items, which should only match on exact values. For example:

  • V-123-1231-1231
  • WILL_NOT_CHANGE
  • /mnt/abc/Drivers/
  • 4040:5050

So the WhitespaceTokenizer seems to make the most sense, but I also need stemming on the non-technical strings. The technical strings would be indicated by periods (.), dashes (-), underscores (_), slashes (\ or /), etc.

Any insight / suggestions would be greatly appreciated.

There is 1 best solution below


After further investigation, and after trying the latest version of Solr (8.7) instead of the very old corporate version that we are using (6.4.2), plus the reinforcement from Abhijit above, I found out that the "full record" search in Solr doesn't work the way I expected.

Instead, I needed to:

  • copy all the fields that I want searched into a single multivalued field (e.g. content_all)
  • then add the query parameter df=content_all to the request.

Once I did that, I started getting the results that I expected.
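
For anyone who wants to see roughly what that looks like, here is a minimal sketch of the two steps, not my exact configuration. The field name content_all is the one mentioned above; reusing the dlowe_text_en type, the stored="false" setting, and the /select handler defaults are assumptions for illustration. The first fragment goes in the schema, the second in solrconfig.xml:

```xml
<!-- Schema: catch-all field that every searchable field is copied into.
     Reusing dlowe_text_en and stored="false" are assumptions. -->
<field name="content_all" type="dlowe_text_en" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the source fields into the catch-all field at index time. -->
<copyField source="content" dest="content_all"/>
<copyField source="*_txt_en_dlowe_split_tight" dest="content_all"/>

<!-- solrconfig.xml: optionally make content_all the default search field
     for the handler, instead of passing df on every request. -->
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="df">content_all</str>
    </lst>
</requestHandler>
```

Without the handler default, the same effect comes from passing the parameter on each request, for example q=Administrator&df=content_all. Note that copyField copies the raw source value, so content_all is analyzed with its own field type, and a reindex is needed after adding it.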

This is probably obvious to those who use Solr/Lucene on a regular basis, but it wasn't clear to me. Switching to 8.7, which doesn't have a 'default field', led me down the path to this solution.

Hopefully this will be of help to others in the future.