How do we filter all tokens belonging to a certain language using SOLR?

136 Views Asked by Swetha Baskaran At 22 June 2015 at 18:50

In my case, I want to filter out all English words from documents that predominantly contain Arabic words.

There is 1 best solution below

Alexandre Rafalovitch On 25 June 2015 at 14:05

Assuming the text is stored as Unicode, English and Arabic letters belong to different scripts (Latin and Arabic) and occupy different character ranges, so you can filter one of them out with regular expressions.

So, in Solr, you would use something like PatternReplaceFilterFactory with standard Java regular expressions. Note that Java's regex implementation is quite rich: it supports Unicode scripts, blocks and other shorthand classes (for example \p{IsLatin} or \p{IsArabic}), so you do not have to spell out code-point ranges by hand.
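To get a feel for the regular expression before putting it into a Solr analyzer chain, here is a minimal plain-Java sketch. The class name `ScriptTokenFilter` and the method `dropLatinTokens` are made up for illustration and are not part of Solr; the whitespace split is only a stand-in for a real tokenizer. The Arabic sample words are written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a Solr class: shows the script-based regex in isolation.
public class ScriptTokenFilter {

    // One or more characters that all belong to the Latin script.
    // \p{IsLatin} is Java's script syntax (Java 7 and later).
    private static final Pattern LATIN_TOKEN = Pattern.compile("\\p{IsLatin}+");

    // Splits on whitespace and keeps every token that is NOT purely Latin-script.
    static List<String> dropLatinTokens(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // matches() requires the whole token to match the pattern
            if (!token.isEmpty() && !LATIN_TOKEN.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two Arabic words interleaved with two English words
        String text = "\u0645\u0631\u062D\u0628\u0627 hello \u0639\u0627\u0644\u0645 world";
        // Only the two Arabic tokens should remain
        System.out.println(dropLatinTokens(text));
    }
}
```

In Solr itself the same pattern would go into the `pattern` attribute of PatternReplaceFilterFactory with an empty `replacement`. As far as I know that filter rewrites the token text rather than dropping the token, so a fully Latin token may be left behind as an empty token; a length filter placed after it is one way to remove those, but check this against your Solr version.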

Solr also has some ICU filters and tokenizers, but they are more for transliteration, transformation and normalization of complex characters.
