
Finding exact matches while ignoring custom tags


I'm working with an index that holds a mix of documents, some of which may contain custom tags like:

  • "Some long sentence <custom-tag attr="value" /> which ends here"

  • "Some long sentence <custom-tag attr="value" /> which ends <custom-tag-2 attr="value2" /> here"

  • "Another long sentence <another-custom-tag attr="value" /> which ends <another-custom-tag attr=value /> here"

I'm supposed to find exact matches, completely agnostic to the tags' names and attributes. To build such a hypothetical query, the first thing that comes to mind is regular expressions, for example:

  • "Some long sentence regex(<[^>]*>? which ends here"

would return the first document, and

  • "Some long sentence regex(<[^>]*>? which ends regex(<[^>]*>? here"

would return the second document.

Is this something I could achieve with Lucene 3.x? I'm even considering migrating to Lucene 4.8 Beta if that is justified.

Has anyone dealt with something similar? Are there pitfalls I should consider?

I guess the easiest way would be to store the same text, stripped of tags, in a second field and perform the search on that one instead. I'd appreciate any input or suggestions.
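A minimal sketch of that second-field idea, using only `java.util.regex` and no Lucene classes. `TagStripper` and `stripTags` are hypothetical names, and the pattern is the same `<[^>]*>` as in the query above; it assumes tags never contain a literal `>` inside an attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): produces the tag-free text
// that would be indexed in the second field.
final class TagStripper {
    // Matches everything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TagStripper() {}

    static String stripTags(String text) {
        // Replace each tag with a space so neighbouring words stay separated.
        String withoutTags = TAG.matcher(text).replaceAll(" ");
        // Collapse the leftover whitespace so phrases compare cleanly.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" /> which ends here";
        System.out.println(stripTags(doc));
    }
}
```

One thing this makes visible: the first and second example documents both reduce to the same stripped text, so a search on the stripped field cannot tell apart documents that differ only in how many tags they contain.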

AndyPook On BEST ANSWER

Your best option (in any version) is to create a TokenFilter that recognises the tags (e.g. with your regex) and omits them from the token stream.
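To illustrate the idea without depending on a particular Lucene version, here is a plain-Java sketch of "drop the tag tokens from the stream". It does not use Lucene's `TokenStream`/`TokenFilter` classes; `TagSkippingTokens` is a hypothetical name. In a real filter the same check would live in `incrementToken()`, and it assumes the tokenizer in front of it emits each tag as a single token rather than splitting it up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; not Lucene API.
final class TagSkippingTokens {
    // A token is either a whole tag or a run of characters that are
    // neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");

    private TagSkippingTokens() {}

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("<")) {
                continue; // the "filter" step: tags never reach the output
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" /> which ends here";
        String query = "Some long sentence which ends here";
        // Both sides yield the same token sequence once tags are skipped.
        System.out.println(tokens(doc).equals(tokens(query)));
    }
}
```

Because both the indexed text and the query pass through the same step, the tags' names and attributes never take part in the match.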

btw: I've found it "good" to never store the fields (possibly excepting the "identifier" field) and instead to serialize the object into a binary field. This separates the "index" from the "data". There is some benefit in search speed and IO requirements.
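A rough sketch of the serialization half of that, using plain `java.io` serialization. `PayloadCodec` and `Sentence` are hypothetical names, and the Lucene call that would put the resulting `byte[]` into a stored binary field is deliberately left out, since it differs between versions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical codec: turns the original object into bytes and back.
final class PayloadCodec {
    // Hypothetical document object; the raw text keeps its tags.
    static final class Sentence implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final String rawText;

        Sentence(String id, String rawText) {
            this.id = id;
            this.rawText = rawText;
        }
    }

    private PayloadCodec() {}

    static byte[] toBytes(Sentence sentence) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(sentence);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed (and flushed) before the bytes are read.
        return buffer.toByteArray();
    }

    static Sentence fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Sentence) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this split, the searchable fields only need to be indexed, and the original text (tags included) comes back from the binary payload after a hit.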