Solr query fails for inputs with spaces even though the analysis output looks correct


I'm stuck on the following issue. I have a text field that stores bed and bath information; while indexing, I store values like 2b 3bt for 2 beds and 3 baths respectively. I need to support queries like "2 beds 3 baths", "beds 2 3 baths", "2 bed rooms 3 baths", "2bd 3bth", and so on.

To achieve this, I use a text field of type text_general, defined as below:

    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>


    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />       
       <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)((\d\.?\d{0,2})\s*(bed\s*rooms|bed\s*room|beds|bed|bdr|bd|br|b)|(bed\s*rooms|bed\s+room|beds|bed|bdr|bd|br|b)\s*(\d\.?\d{0,2}))" replacement="$2$5b" />
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)((\d\.?\d{0,2})\s*(bath\s*rooms|bath\s*room|baths|bath|bth|bt|bh|ba)|(bath\s*rooms|bath\s*room|baths|bath|bth|bt|bh|ba)\s*(\d\.?\d{0,2}))" replacement="$2$5bt" />     
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.TrimFilterFactory" updateOffsets="true"/>        
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />        
       <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    </fieldType>

I tried Solr queries through the admin interface, and they work for almost all combinations. The exception is input with intermediate spaces, like "6 beds 6 baths" or "6 bed room 6 bath room"; "6beds 6baths", on the other hand, returns correct results. Here is the URL with the parameters that I pass to Solr for this query:

    /solr/select?q=6b+6ba&wt=xml&indent=true&q.op=AND

I checked the Solr admin analysis interface for each of these cases and found no difference at all. Since the analysis phase produces the same tokens, I expected both queries to behave the same way. Can anyone explain why these two queries behave differently?

This is what I see in the Solr admin analysis interface for the two queries in question:

    For input : 6 beds 6 bath room,

    PRCF 6b 6bath room
    PRCF 6b 6bt
    ST   6b | 6bt
    TF   6b | 6bt
    SF   6b | 6bt
    LCF  6b | 6bt

    For input : 6b 6bt
    PRCF 6b 6bt
    PRCF 6b 6bt
    ST   6b | 6bt
    TF   6b | 6bt
    SF   6b | 6bt
    LCF  6b | 6bt

Sample inputs and outputs: here are some sample inputs that I tried with the field definition above. Note: (#) is just a serial number and is not part of the input.

   (1) 2beds 3baths Fresno
   (2) 3baths 2beds Fresno
   (3) Fresno 2bedroom 3bathroom
   (4) beds2 3baths Fresno
   (5) beds2 bathrooms3 Fresno

All of the above work fine. Here are some inputs that are still a problem with the current field definition:

   (6) 2 beds 3 baths Fresno
   (7) 2 bed rooms 3 baths Fresno
   (8) Fresno 2 bed room  3 baths
   (9) Fresno 3baths 2   bed rooms

The output that I expect for the above inputs after the analysis phase, in the same serial-number order, is as below (while indexing 2 beds 3 baths, I index the data as 2b 3bt):

   (1) 2b 3bt Fresno
   (2) 3bt 2b Fresno
   (3) Fresno 2b 3bt
   (4) 2b 3bt Fresno
   (5) 2b 3bt Fresno
   (6) 2b 3bt Fresno
   (7) 2b 3bt Fresno
   (8) Fresno 2b 3bt
   (9) Fresno 3bt 2b 

Up to this point I think I'm doing fine, as I get exactly this output from analysis, which I confirmed through the Solr admin Analysis interface. The real issue is that the query fetches correct search results for the first group of inputs (up to #5), but for inputs #6 to #9 I don't get any results.

This is the query I use for input #1, i.e. 2beds 3baths Fresno:

    /solr/collection1/select?q=2beds+3baths+Fresno&wt=xml&indent=true&q.op=AND

And this one for #6, i.e. 2 beds 3 baths Fresno:

    /solr/collection1/select?q=2+beds+3+baths+Fresno&wt=xml&indent=true&q.op=AND

Best answer

The final solution that I applied is as follows.

I removed the PatternReplaceCharFilterFactory char filters for bed and bath from the query-time analyzer and now do a similar pattern replacement on the input text in my servlet instead.

So now for the following input text

    2 beds 3 baths Fresno

my servlet code converts it to

    2b 3bt Fresno

This is what I then pass on to Solr, and it now works fine.

Here is the modified fieldType definition for text_general:
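The servlet code itself is not shown in the answer, so the following is only a minimal sketch of what such a pre-processing step could look like. It reuses the two patterns and replacements from the original query analyzer; the class name `BedBathNormalizer` and the method `normalize` are hypothetical, and a real servlet would call something like this on the `q` parameter before building the Solr request.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the author's actual servlet code.
final class BedBathNormalizer {

    // A number such as 2, 2.5 or 10 (same sub-pattern as in the original char filters).
    private static final String NUM = "(\\d\\.?\\d{0,2})";

    // Same expression as the first PatternReplaceCharFilterFactory in the question.
    private static final Pattern BEDS = Pattern.compile(
        "(?i)(" + NUM + "\\s*(bed\\s*rooms|bed\\s*room|beds|bed|bdr|bd|br|b)"
        + "|(bed\\s*rooms|bed\\s+room|beds|bed|bdr|bd|br|b)\\s*" + NUM + ")");

    // Same expression as the second PatternReplaceCharFilterFactory in the question.
    private static final Pattern BATHS = Pattern.compile(
        "(?i)(" + NUM + "\\s*(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)"
        + "|(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)\\s*" + NUM + ")");

    private BedBathNormalizer() {
    }

    // Rewrites bed/bath phrases to the indexed form, e.g. "2 beds" -> "2b", "3 baths" -> "3bt".
    static String normalize(String userInput) {
        // Only one of groups 2 and 5 takes part in a match; the other contributes nothing.
        String bedsDone = BEDS.matcher(userInput).replaceAll("$2$5b");
        return BATHS.matcher(bedsDone).replaceAll("$2$5bt");
    }

    public static void main(String[] args) {
        // Prints: 2b 3bt Fresno
        System.out.println(normalize("2 beds 3 baths Fresno"));
    }
}
```

Because the replacement runs on the complete input string, phrases that span a space ("2 beds", "2 bed rooms") are rewritten before Solr ever sees the query.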

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
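Why the query-time char filters did not help is not spelled out above. A likely explanation, offered here as an assumption rather than something the post confirms: in Solr versions of this era the standard query parser splits the query string on whitespace before analysis, so each chunk ("6", "beds", ...) reaches the char filters on its own, while the admin Analysis screen passes the whole string through the analyzer at once. The sketch below imitates that difference in plain Java using the bed pattern from the question; the class and method names are hypothetical and no Solr code is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo: imitates whitespace pre-splitting, it does not call Solr.
final class WhitespaceSplitDemo {

    // Bed pattern copied from the original query-time char filter.
    private static final Pattern BEDS = Pattern.compile(
        "(?i)((\\d\\.?\\d{0,2})\\s*(bed\\s*rooms|bed\\s*room|beds|bed|bdr|bd|br|b)"
        + "|(bed\\s*rooms|bed\\s+room|beds|bed|bdr|bd|br|b)\\s*(\\d\\.?\\d{0,2}))");

    private WhitespaceSplitDemo() {
    }

    // What the Analysis screen effectively does: the pattern sees the whole input.
    static String wholeString(String query) {
        return BEDS.matcher(query).replaceAll("$2$5b");
    }

    // What a whitespace-splitting query parser effectively does: the pattern
    // sees each chunk separately and can never match across a space.
    static List<String> perChunk(String query) {
        List<String> result = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            result.add(BEDS.matcher(chunk).replaceAll("$2$5b"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wholeString("6 beds")); // 6b
        System.out.println(perChunk("6 beds"));    // [6, beds]
        System.out.println(perChunk("6beds"));     // [6b]
    }
}
```

Under that assumption, "6beds" still works because the number and the unit arrive in the same chunk, whereas "6 beds" is searched as the two separate terms 6 and beds, neither of which is in the index. Doing the replacement in the servlet sidesteps the problem because it happens before any splitting.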