wordninja does not work with other languages

300 Views Asked by Ivan Stankov At 16 August 2021 at 14:58

I have a question that I can't solve alone. I am currently building an NLP preprocessing pipeline and thought about using wordninja with Cyrillic languages (Russian and Ukrainian). I have set up the dictionaries as described and everything seemed to look alright, but I can't make it work.

import wordninja
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('setup/ru_ninja_dict.txt.gz')
wordninja.split("приветпока")

(the output is an empty list [], while ["привет", "пока"] was expected)

My main assumption is that there is an issue with encodings. However, I do not know how to check it myself.
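One way to test the encoding assumption is to open the gzipped word list directly and decode it as UTF-8. The sketch below is only an illustration: `peek_dictionary` and the temporary `sample_dict.txt.gz` file are hypothetical stand-ins so the snippet runs on its own; in practice the path would be the real dictionary file.

```python
import gzip
import os
import tempfile


def peek_dictionary(path, n=5):
    """Return the first n words of a gzipped word list, decoded as UTF-8.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        words = f.read().split()
    return words[:n]


# Hypothetical stand-in for the real dictionary, written as UTF-8.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_dict.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("привет\nпока\n")

first_words = peek_dictionary(sample_path)
print(first_words)
```

If the words print as readable Cyrillic and no UnicodeDecodeError is raised, the encoding of the dictionary is probably not the problem.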

Please let me know if you have any ideas!


There are 1 best solutions below

Ivan Stankov On 19 August 2021 at 09:10

OK. So, as I've figured out, the issue was in the regex pattern that wordninja uses to split the input before segmenting it. In the original wordninja code there is

_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

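which treats every character other than ASCII letters, digits and the apostrophe as a separator, so it only works for languages written in the basic Latin alphabet (definitely not Cyrillic).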
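The effect of that pattern can be reproduced with `re` alone, without wordninja installed. This is a minimal sketch using the pattern quoted above; the variable names are illustrative only.

```python
import re

# The pattern quoted above: any run of characters that is NOT an ASCII
# letter, digit or apostrophe is treated as a separator.
split_re = re.compile("[^a-zA-Z0-9']+")

ascii_chunks = split_re.split("hello world")
cyrillic_chunks = split_re.split("приветпока")

print(ascii_chunks)     # ['hello', 'world']
print(cyrillic_chunks)  # ['', '']
```

The whole Cyrillic string is consumed as one separator, leaving only empty chunks. With nothing left to segment, an empty result like the [] seen in the question is what one would expect.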

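Replace it with a pattern that also keeps the Cyrillic block (U+0400 to U+04FF) inside the negated character class:

_SPLIT_RE = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")

so that Cyrillic letters are no longer treated as separators. With this change it works appropriately with Russian, Ukrainian and other Slavic languages written in Cyrillic.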
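In Python's `re` syntax, one way to spell the Cyrillic range is with `\u` escapes inside the negated character class. The sketch below only checks the splitting step with `re`; it does not call wordninja, and the name `cyrillic_split_re` is illustrative.

```python
import re

# Assumed pattern: ASCII letters, digits, the apostrophe and the Cyrillic
# block (U+0400 to U+04FF) are kept; everything else is a separator.
cyrillic_split_re = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")

chunks = cyrillic_split_re.split("приветпока, hello-world")
print(chunks)  # ['приветпока', 'hello', 'world']
```

The Cyrillic run now survives as a single chunk that the language model can segment. If `_SPLIT_RE` is a module-level name in the installed wordninja version, as the quoted line suggests, assigning a new pattern to `wordninja._SPLIT_RE` before calling `split` might avoid editing the library source. That is an assumption and is not tested here.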
