Finding exact matches while ignoring custom tags

39 Views Asked by pelican_george At 20 August 2016 at 08:06

I'm working with an index where there's a mix of documents and some might contain custom tags like:

"Some long sentence <custom-tag attr="value" /> which ends here"

"Some long sentence <custom-tag attr="value" /> which ends <custom-tag-2 attr="value2" /> here"

"Another long sentence <another-custom-tag attr="value" /> which ends <another-custom-tag attr=value /> here"

I'm supposed to find exact matches that are completely agnostic to tag names and attributes. To build such a hypothetical query, the first thing that comes to mind is regular expressions, for example:

"Some long sentence regex(<[^>]*>? which ends here"

would return the first document, and

"Some long sentence regex(<[^>]*>? which ends regex(<[^>]*>? here"

would return the second document.

Is this something I could achieve with Lucene 3.x? I'm even considering migrating to Lucene 4.8 Beta if that would justify it.

Has anyone dealt with something similar? Are there pitfalls I should consider?

I guess the easiest way would be to store the same text, stripped of tags, in a second field and perform the search on that one instead. I'd appreciate any input or suggestions.
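To make the second-field idea concrete, here is a minimal sketch in plain Java of the stripping step only. The class name `TagStripper` and its `strip` method are hypothetical helpers, not part of Lucene, and the sketch assumes a tag is anything between `<` and the next `>`. The result is the text that would go into the extra, tag-free field.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: produces the tag-free text
// that would be indexed in a second field.
final class TagStripper {
    // Assumption: a tag is '<', any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    private TagStripper() {}

    static String strip(String text) {
        // Replace each tag with a space so neighbouring words never fuse together.
        String withoutTags = TAG.matcher(text).replaceAll(" ");
        // Collapse the leftover runs of whitespace into single spaces.
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }
}
```

One thing to keep in mind: with this sketch the first two example documents both reduce to "Some long sentence which ends here", so the stripped field alone cannot tell them apart by how many tags they contained.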


There is 1 best solution below

AndyPook On 16 January 2017 at 13:48 BEST ANSWER

Your best option (in any version) is to create a TokenFilter which would recognise the tag/regex and omit them from the token stream.
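As a rough illustration of the idea, the sketch below shows only the filtering logic in plain Java, without any Lucene dependency. The class `TagSkippingFilter` is hypothetical, and it assumes whitespace tokenization and that a tag starts at a token beginning with `<` and ends at the next token ending with `>`. In an actual Lucene TokenFilter the same state would presumably live in the filter and be applied inside `incrementToken()`, skipping tokens instead of collecting them into a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Lucene class: drops every token that
// belongs to a tag and keeps the remaining words in order.
final class TagSkippingFilter {
    private TagSkippingFilter() {}

    static List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        boolean insideTag = false;
        // Assumption: tokens are separated by whitespace.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for empty input
            }
            if (!insideTag && token.startsWith("<")) {
                insideTag = true; // a tag opens at this token
            }
            if (!insideTag) {
                kept.add(token); // ordinary word, passes through
            }
            if (insideTag && token.endsWith(">")) {
                insideTag = false; // the tag closes at this token
            }
        }
        return kept;
    }
}
```

With the tag tokens omitted at both index time and query time, the remaining words sit next to each other, which is what an exact phrase match needs.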

By the way: I've found it "good" to never store the fields (possibly excepting the "identifier" field) and to instead serialize the object into a binary field. This separates the "index" from the "data". There is some benefit in search speed and IO requirements.
