Lucene StandardAnalyzer - multiple spaces in the query phrase


When creating an org.apache.lucene.document.Document during indexing, I add an org.apache.lucene.document.StringField whose value contains several consecutive spaces, e.g. "ID____45_2013". I use org.apache.lucene.analysis.standard.StandardAnalyzer both for creating the index and for querying it.

When querying the index using phrases with multiple spaces, e.g. "ID__45_2013" (where _ is a space), I get an empty result.

I examined my query using Luke and realized that the multiple spaces are collapsed into a single space.

What should I do to be able to use multiple spaces in a query phrase and get the right result?


There are 2 best solutions below


The problem isn't just multiple spaces. Even if you had only single spaces, your query would be tokenized, while the indexed data wouldn't be (since it's created with a StringField, which indexes the whole value as one token). You would be searching for the tokens id, 45, and 2013 (StandardAnalyzer also lowercases them) against the single indexed token ID 45 2013, which would still get you no results.
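To make the mismatch above concrete, here is a toy model in plain Java. It does not use Lucene at all: `analyze` and `indexVerbatim` are hypothetical helpers that only approximate what an analyzing tokenizer and a StringField do (the real StandardAnalyzer follows Unicode segmentation rules, not a simple whitespace split). It is meant as a sketch of why no analyzed query term can equal the stored term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TokenMismatchDemo {

    // Hypothetical stand-in for an analyzer: split on runs of whitespace
    // and lowercase each piece. Runs of spaces vanish in the process.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a StringField: the whole value is kept
    // verbatim as one single term, spaces and case included.
    public static List<String> indexVerbatim(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String stored = "ID   45 2013";
        List<String> indexedTerms = indexVerbatim(stored);
        List<String> queryTerms = analyze(stored);

        System.out.println("indexed terms: " + indexedTerms);
        System.out.println("query terms:   " + queryTerms);
        // None of the analyzed query terms equals the single indexed term.
        System.out.println("analyzed query matches: "
                + queryTerms.stream().anyMatch(indexedTerms::contains));
        // An unanalyzed, exact term does match.
        System.out.println("verbatim query matches: "
                + indexedTerms.contains(stored));
    }
}
```

In this model the only way to hit the stored term is to query with the exact, unanalyzed string, which is the situation both suggestions below aim for.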

You can keep the field as a StringField and set the Analyzer used by the QueryParser to a KeywordAnalyzer. You'd still need to be careful with query syntax, of course, but quoting the string as mentioned should do the trick.

The nicer way to query StringFields, I think, is to construct the TermQuery yourself. This removes the need for you to worry about the Analyzer. Simply create the query like:

Query query = new TermQuery(new Term("id", "ID   45 2013"));

Alternatively, if you wish to use a phrase query like the one you've mentioned, you should use a TextField, analyzed with the same analyzer you use for querying (StandardAnalyzer, in this case). This would provide more free-text search capability, if that is what you are looking for. It doesn't sound to me like that's the desired representation, but I mention it for your consideration.