I'm working with an index where there's a mix of documents and some might contain custom tags like:
- "Some long sentence <custom-tag attr="value" />which ends here"
- "Some long sentence <custom-tag attr="value" />which ends<custom-tag-2 attr="value2" />here"
- "Another long sentence <another-custom-tag attr="value" />which ends<another-custom-tag attr=value />here"
I'm supposed to find exact matches that are completely agnostic to tag names and attributes. When building such a hypothetical query, the first thing that comes to mind is regular expressions, for example:
- "Some long sentence regex(<[^>]*>?which ends here"
would return the first document, and
- "Some long sentence regex(<[^>]*>?which endsregex(<[^>]*>?here"
would return the second document.
Is this something I could achieve with Lucene 3.x? I'm even considering migrating to Lucene 4.8 Beta if that would justify it.
Has anyone dealt with something similar? Are there pitfalls I should consider?
I guess the easiest way would be to store the same text, stripped of tags, in a second field and perform the search on that one instead. I'd appreciate any input or suggestions.
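A rough sketch of that second-field idea in plain Java, assuming a tag never contains a literal `>` inside an attribute value. `TagStripper` and `strip` are hypothetical names for illustration, not part of Lucene; the stripped string is what would go into the second field, and the query text would presumably need the same treatment.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like <...> from a string.
final class TagStripper {
    // Assumption: tags are well formed and contain no '>' inside attribute values.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the text with every tag removed; surrounding characters are left untouched.
    static String strip(String text) {
        return TAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" />which ends here";
        System.out.println(strip(doc));
    }
}
```

One thing to watch for: because the tag is replaced by nothing, `which ends<custom-tag-2 attr="value2" />here` becomes `which endshere`. Replacing tags with a single space instead is an equally plausible choice, depending on whether a tag should count as a word boundary.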
Your best option (in any version) is to create a TokenFilter which would recognise the tag/regex and omit them from the token stream.
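To illustrate the idea without depending on a particular Lucene version, here is a plain-Java sketch of what such a filter would do to the token stream. It is not Lucene code: `TagSkippingTokens`, `tokenize` and `withoutTags` are made-up names, and the tokenizer is assumed to emit each `<...>` tag as a single token (a whitespace tokenizer would split a tag with attributes into several tokens, which a real filter would have to handle).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tokenizer + TokenFilter pair; not the Lucene API.
final class TagSkippingTokens {
    // A token is either a whole tag (<...>) or a run of characters
    // that are neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");

    // Splits the text into word tokens and tag tokens, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // The "filter" step: drop every token that is a tag, keep the rest.
    static List<String> withoutTags(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!token.startsWith("<")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" />which ends"
                + "<custom-tag-2 attr=\"value2\" />here";
        System.out.println(withoutTags(tokenize(doc)));
    }
}
```

With the tags dropped, the first two example documents yield the same token sequence, so a phrase query run through the same analysis would match both regardless of tag names or attributes.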
Btw: I've found it "good" to never store the fields (possibly excepting the "identifier" field), then serialize the object into a binary field. This separates the "index" from the "data". There is some benefit in search speed and IO requirements.