How to ignore stop words and use remaining words to fetch expected results alone using Solr?

316 Views Asked by At

I have a name field, with the following definition:

    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
      </analyzer>
    </fieldType>

and I would like to search for contents based on the following criteria:

Example values:

test a value
test of solr
not a value
test me says value

So, if i do search for test value, I should only get results containing both test and values. And, even if I do test a value, I should still get only the same result as stop words will be excluded. But, with this current setup, even with including edismax in the solr query, I get all of the results. It goes by a record with either of those two tokens. Could someone suggest me the update I could do to the definition to get a result as expected? And, am I using stopwords as expected? I do not want stopwords in the search consuming execution time.

I updated the definition as per the suggestion and even then the result does not make any sense to me.

I have a value what a term. And, there are other values like what the term ; about a term; about the term; description test; Name a Nothing etc. A search for what a term returns all of these. And, I also had a value just a and the. They were also getting returned in the result. Though, for what a term, as per the below screenshot, the query omits the stop word, the result does not make any sense to me.

enter image description here

1

There are 1 best solutions below

4
On

You can ignore the stopsword during the search and index time. You cannot ignore these words in the response. The response will come as the text is stores as it is. The data stored for search and response is different. Search happens on the indexed data. In the response you get the data stored.

In your field type definition you are using KeywordTokenizerFactory.

KeywordTokenizerFactory : is used to when you dont want to create any tokens of your text. This tokenizer treats the entire text field as a single token.

Use of any other filter is of no use after this.

You can use StandardTokenizerFactory instead KeywordTokenizerFactory. StandardTokenizerFactory : This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
      </analyzer>
</fieldType>

On the analysis page, I analyzed the data for the above field. Index the data "test a value" and query it with "test value", the result is found. Here you can see while indexing the data the stopwords are skipped as we have applied the stopwordfilterfactory

sc1

Now use "test a value" while indexing as well query in the analyze page. It skips the stopword "a" as filter is applied and matched the result.

sc2