In Python's `CountVectorizer`, I want Persian words that contain a half-space (ZWNJ) to be kept as one token rather than split into two words.
I would be grateful for any guidance, thank you.
I used "درخت‌های زیبا" in `CountVectorizer`. I wanted it to turn into `["درخت‌های", "زیبا"]`, but it turned into `["درخت", "های", "زیبا"]`.
`CountVectorizer` is using the default `token_pattern` `(?u)\b\w\w+\b`. The regex metacharacter `\w` in Python's core regular expression engine does not include ZWJ and ZWNJ. There are two approaches that can be taken: define a custom `token_pattern`; or set `token_pattern` to `None` and define your own `tokenizer`. Python's `\w`, as used by scikit-learn, is not compatible with the Unicode definition of a word character; where that definition matters, the second approach would be preferred.
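To illustrate the failure (a minimal sketch; the sample string is the one from the question):

```python
import re

# Core Python's \w does not match ZWNJ (U+200C), so the half-space
# splits the word at tokenization time.
print(re.findall(r"(?u)\b\w\w+\b", "درخت\u200cهای زیبا"))
# ['درخت', 'های', 'زیبا']
```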
1) Custom token_pattern

In this scenario, we specify a custom regex pattern that adds ZWJ and ZWNJ to the characters allowed inside a token:
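A minimal sketch of this approach; the exact pattern is my assumption, and any pattern that also admits U+200C/U+200D inside a word would do:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Extend the default (?u)\b\w\w+\b so that the join controls
# ZWNJ (U+200C) and ZWJ (U+200D) may appear inside a token.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w[\w\u200c\u200d]+\b")
print(vectorizer.build_tokenizer()("درخت\u200cهای زیبا"))
# ['درخت\u200cهای', 'زیبا']
```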
The input string is divided into two words.
2) Custom tokenizer
In this scenario, I will use an ICU4C break iterator; using ICU allows language-specific boundary analysis. The break iterator returns the indexes of the break boundaries, so it is necessary to post-process the results of the break iteration into tokens.
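A minimal sketch using PyICU (the `icu` package); the `fa` locale and the filtering out of whitespace-only segments are my assumptions:

```python
import icu  # PyICU bindings for ICU4C
from sklearn.feature_extraction.text import CountVectorizer

def icu_tokenizer(text):
    """Tokenize with an ICU word-break iterator for the Persian locale."""
    bi = icu.BreakIterator.createWordInstance(icu.Locale("fa"))
    bi.setText(text)
    tokens, start = [], 0
    for end in bi:              # iterating yields successive boundary indexes
        segment = text[start:end]
        if segment.strip():     # skip whitespace-only segments
            tokens.append(segment)
        start = end
    return tokens

vectorizer = CountVectorizer(tokenizer=icu_tokenizer, token_pattern=None)
print(vectorizer.build_tokenizer()("درخت\u200cهای زیبا"))
# ['درخت\u200cهای', 'زیبا']
```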
N.B. `token_pattern` needs to be set to `None` to use `tokenizer`.

2B) Custom tokenizer using regex
There is a variation of the custom tokeniser where we keep the default regular expression pattern for tokenisation but run it with an alternative regular expression engine. The default behaviour, and the reason it fails for Persian and many other languages, is that the definition of `\w` in core Python differs from the Unicode definition, which includes the join controls ZWNJ and ZWJ. If we use the more Unicode-compliant `regex` package, the original pattern used by `CountVectorizer` works for most languages, including Persian.
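A minimal sketch, assuming the third-party `regex` package is installed (`pip install regex`):

```python
import regex
from sklearn.feature_extraction.text import CountVectorizer

# regex's \w follows the Unicode definition of a word character, which
# includes ZWNJ (U+200C) and ZWJ (U+200D), so the default pattern keeps
# half-space joined words intact.
default_pattern = regex.compile(r"(?u)\b\w\w+\b")

vectorizer = CountVectorizer(tokenizer=default_pattern.findall, token_pattern=None)
print(vectorizer.build_tokenizer()("درخت\u200cهای زیبا"))
# ['درخت\u200cهای', 'زیبا']
```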