Solr: Ignore casing of strings when calculating facet numbers


I have these values for the title field in my database:

"I Am A String"
"I am A string"

I want to make the title field available as facets in my search results.

Current result:

<lst name="title">
    <int name="I Am A String">4</int>
    <int name="I am A string">3</int>
</lst>

Desired result:

<lst name="title">
    <int name="I Am A String">7</int>
</lst>

I don't care which of the two strings ends up as the facet label, as long as strings that are equal when case is ignored are counted under the same facet.

I tried the following field types for the title field. Next to each one is the faceting behavior I observed.

string = sees casing as different strings
string_exact = sees casing as different strings
text_ws = breaks up into words with casing intact
text = breaks into separate words
textTight = breaks into separate words
textTrue = breaks up in words with casing intact
string_exacttest = breaks up in words with casing intact

Here's my schema.xml:

<field name="title" type="string" indexed="true" stored="true"/>


<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />

<fieldType name="string_exact" class="solr.TextField"
    sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>           
    </analyzer>
</fieldType>    

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
    Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.-->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<!-- Less flexible matching, but fewer false matches. Probably not ideal for product names, but may be good for SKUs.  Can insert dashes in the wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
    <!--
      this filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjunction with
      stemming.
    -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
  </analyzer>
</fieldType>    

How can I make sure that the same strings (ignoring case) are grouped together when calculating the facets?

Answer:

The string_exact definition is almost what you need, but you also need to apply a LowerCaseFilterFactory so that each value is lowercased before it is indexed. The KeywordTokenizer keeps the whole value as a single token, so it is not broken into separate terms on whitespace. A StrField doesn't allow any additional processing, whereas a TextField with a KeywordTokenizer behaves the same way but lets you add filters that process the token afterwards. Facet values come from the indexed terms, so the facet label will be returned in lowercase ("i am a string"), and you need to reindex after changing the field type.

<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>     
        <filter class="solr.LowerCaseFilterFactory"/>      
    </analyzer>
</fieldType>
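
To put the field type to use, one option is to keep the original title field as it is and copy it into a separate field used only for faceting. This is a minimal sketch, not the only way to wire it up: the field name title_facet is a hypothetical name chosen for illustration, and it assumes the string_facet type above has been added to schema.xml.

```xml
<!-- Original field, unchanged: keeps the stored value with its original casing. -->
<field name="title" type="string" indexed="true" stored="true"/>

<!-- Hypothetical facet-only field (the name title_facet is an assumption).
     It is indexed with the lowercasing string_facet type and does not
     need to be stored, since only the indexed terms are used for facets. -->
<field name="title_facet" type="string_facet" indexed="true" stored="false"/>

<!-- Copy every title value into the facet field at index time. -->
<copyField source="title" dest="title_facet"/>
```

After reindexing, faceting on title_facet (facet.field=title_facet) instead of title should group "I Am A String" and "I am A string" under one lowercased entry with the combined count, while title still returns the original casing in the search results.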