solr: how to search for date ranges with at least X days?

887 Views Asked by At

Given: a list of consultants with a list of intervals when they are NOT available:

<consultant>
    <id>1</id>
    <not-available>
        <interval><from>2013-01-01</from><to>2013-01-10</to>
        <interval><from>2013-20-01</from><to>2013-01-30</to>
        ...
    </not-available>
</consultant>
...

I'd like to search for consultants that are available (!) for at least X days in a specific interval from STARTDATE to ENDDATE.

Example: Show me all consultants that are available for at least 5 days in the range 2013-01-01 - 2013-02-01 (this would match consultant 1 because he is free from 2013-01-11 to 2013-01-19).

  • Question 1: How should my solr document look like?
  • Question 2: How has the query to look like?
1

There are 1 best solutions below

1
On

As a general advice: precalculate as much as you can, store the data that you are querying for rather than the data you are getting as input.

Also, use several indexes based on different entities - if you have the liberty to do so, and if the queries would become simpler and more straight forward.

Ok, generalities aside and on to your question.


From your example I take it that you currently store in the index if a consultant is not available - probably, because that is what you get as input. But what you want to query is when they are available. So, you should think about storing the availability rather then the non-availability.

EDIT: The most forward way to query this is to use the intervals as entities such that you do not have to resort to special SOLR features to query the start and the end of an interval on two multi valued fields.

Once you have stored the availability intervals you can also precalculate and store their lengths:

<!-- id of the interval -->
<field name="id" type="int" indexed="true" stored="true" multiValued="false" />
<field name="consultant_id" type="int" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_start" type="date" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_end" type="date" indexed="true" stored="true" multiValued="false" />
<field name="interval_length" type="int" indexed="true" stored="true" multiValued="false" />

Your query:

(1.) Optionally, retrieve all intervals that have at least the requested length:

fq=interval_length:[5 to *]

This is an optional step. You might want to benchmark whether it improves the query performance. Additionally, you could also filter on certain consultant_ids.

(2.) The essential query is for the interval (use q.alt in case of dismax handler):

q=interval_start:[2013-01-01T00:00:00.000Z TO 2013-02-01T00:00:00.000Z-5DAYS]
  interval_end:[2013-01-01T00:00:00.000Z+5DAYS TO 2013-02-01T00:00:00.000Z] 

(added linebreak for readability, the two components of the query should be separated by regular space)

Make sure that you always set the time to the same value. Best is 00:00:00 because that is what /DAY does: http://lucene.apache.org/solr/4_4_0/solr-core/org/apache/solr/util/DateMathParser.html . The less different values the better the caching.


More info: http://wiki.apache.org/solr/SolrQuerySyntax - Solr Range Query http://wiki.apache.org/solr/SolrCaching#filterCache - caching of fq filter results


EDIT:

More info on q and fq parameters:

http://wiki.apache.org/solr/CommonQueryParameters They are handled differently when it comes to caching. That's why I added the other link (see above), in the first place. Use fq for filters that you expect to see often in your queries. You can combine multiple fq parameters while you can only specify q once per request.

How can I "use several indexes based on different entities"?

Have a look at the multicore feature: http://wiki.apache.org/solr/CoreAdmin

Would it be overkill to save for each available day: date;num_of_days_to_end_of_interval - should make querying much simpler?

Depends a bit on how much more data you are expecting in that case. I'm also not exactly sure that it would really help you for the query you posted. The date range queries are very flexible and fast. You don't need to avoid them. Just make sure you specify the time as broad as you can to allow for caching.