Half space (\u200c) is not supported in CountVectorizer


In Python's CountVectorizer (scikit-learn), I want Persian words that contain a half space to be kept as one token rather than split into two words.

I would be grateful for any guidance. Thank you.

I used "درخت‌های زیبا" with CountVectorizer. I wanted it to turn into ["درخت‌های", "زیبا"], but it turned into ["درخت", "ها", "زیبا"].


Answer by Andj

CountVectorizer uses the default token_pattern (?u)\b\w\w+\b. The regex metacharacter \w in Python's core regular expression engine does not match ZWJ (U+200D) or ZWNJ (U+200C), so a word containing a half space is broken at that character.

There are two approaches that can be taken:

  1. Use a custom token_pattern; or
  2. Set token_pattern to None and define your own tokenizer.

Python's \w, which is what scikit-learn relies on, does not follow the Unicode definition of a word character. Where that definition matters, the second approach is preferable.
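As a quick illustration of the problem (a minimal sketch, not part of the original answer), the default pattern applied with the core re engine splits a half-spaced word:

import re

# ZWNJ (U+200C) is a format (Cf) character, so core re's \w does not match it
print(re.match(r'(?u)\w', '\u200c'))                   # None
print(re.findall(r'(?u)\b\w\w+\b', "درخت\u200cهای"))   # the word is split in two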

1) Custom token_pattern

In this scenario, we specify a custom regex pattern that allows ZWJ and ZWNJ inside a token:

from sklearn.feature_extraction.text import CountVectorizer

s = ["درخت‌های زیبا"]
cv1 = CountVectorizer(
    # allow an optional ZWNJ/ZWJ between the two parts of a word
    token_pattern = r'(?u)\b\w+[\u200C\u200D]?\w+\b'
)
cv1.fit(s)
print(*cv1.vocabulary_, sep="\n")
# درخت‌های
# زیبا

The input string is now divided into the two expected tokens.
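As a quick usage check (a sketch, not part of the original answer; get_feature_names_out assumes scikit-learn >= 1.0), the fitted vectorizer can turn the same document into a count matrix:

X = cv1.transform(s)
print(cv1.get_feature_names_out())
print(X.toarray())   # each of the two tokens occurs once in the single document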

2) Custom tokenizer

In this scenario, I will use an ICU4C break iterator, which allows language-specific boundary analysis. The break iterator returns the indexes of break boundaries, so it is necessary to post-process the results of the iteration to assemble the tokens.

N.B. token_pattern needs to be set to None when a custom tokenizer is supplied.

import icu
from sklearn.feature_extraction.text import CountVectorizer
import regex as re

# Locale-aware word boundary analysis for Persian
bi = icu.BreakIterator.createWordInstance(icu.Locale('fa_IR'))

def tokenise(text, iterator=bi, strip_punct=True):
    iterator.setText(text)
    tokens = []
    start = iterator.first()
    for end in iterator:
        segment = text[start:end]
        if strip_punct:
            # drop segments that are only whitespace, digits or punctuation
            if not re.match(r'[\p{Z}\p{N}\p{P}]+', segment):
                tokens.append(segment)
        else:
            tokens.append(segment)
        start = end
    return tokens

s = ["درخت‌های زیبا"]
cv2 = CountVectorizer(
    tokenizer = tokenise,
    token_pattern = None
)
cv2.fit(s)
print(*cv2.vocabulary_, sep="\n")
# درخت‌های
# زیبا
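To see what the break iterator produces on its own, the tokeniser above can also be called directly (a quick sketch; the exact segmentation can vary slightly between ICU versions):

print(tokenise("درخت‌های زیبا"))
# the ZWNJ stays inside the first word: ['درخت‌های', 'زیبا']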

2B) Custom tokenizer using regex

There is a variation of the custom tokeniser where we reuse the default regular expression pattern for tokenisation with an alternative regular expression engine. The default behaviour fails for Persian and many other languages because the definition of \w in core Python differs from the Unicode definition, which treats the join controls ZWJ and ZWNJ as word characters. If we use the more Unicode-compliant regex module, the original pattern used by CountVectorizer works for Persian as well.

from sklearn.feature_extraction.text import CountVectorizer
import regex as re

s = ["درخت‌های زیبا"]

def tokenise(text):
    # same pattern as the CountVectorizer default, but the regex module's \w
    # follows the Unicode definition and treats ZWJ/ZWNJ as word characters
    return re.findall(r'(?u)\b\w\w+\b', text)

cv3 = CountVectorizer(
    tokenizer = tokenise,
    token_pattern = None
)
cv3.fit(s)
print(*cv3.vocabulary_, sep="\n")
# درخت‌های
# زیبا
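For completeness, a minimal side-by-side sketch (not from the original answer) of the same default pattern under the two engines:

import re
import regex

text = "درخت‌های زیبا"
pattern = r'(?u)\b\w\w+\b'

print(re.findall(pattern, text))     # core re splits the first word at the ZWNJ
print(regex.findall(pattern, text))  # regex keeps the half-spaced word intact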