I have a question that I can't solve alone. I am currently building an NLP preprocessing pipeline and thought about using wordninja with Cyrillic languages (Russian and Ukrainian). I have set up the dictionaries as described and everything seemed to look all right, but I can't make it work.
import wordninja
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('setup/ru_ninja_dict.txt.gz')
wordninja.split("приветпока")
(the output is an empty list [], while ["привет", "пока"] was expected)
My main assumption is that there is an issue with encodings. However, I do not know how to check it myself.
Please let me know if you have any ideas!
OK, so as I've figured out, the issue was in how the regex pattern is compiled. The original wordninja code compiles a splitting pattern
which only works with a limited number of languages (definitely not Cyrillic).
Replace it with a pattern that also accepts Cyrillic letters
for it to work properly with Russian, Ukrainian, and other Slavic languages.
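To illustrate the effect, here is a minimal sketch using only the standard `re` module. The ASCII-only pattern is how I remember the splitting regex from the wordninja source, so treat it as an assumption and check your installed version. The Unicode-aware pattern and the names `ASCII_SPLIT_RE`, `UNICODE_SPLIT_RE`, and `non_empty_chunks` are my own illustration, not part of the library.

```python
import re

# Assumption: this mirrors the ASCII-only splitting pattern in wordninja.
# Every character that is not a Latin letter, digit or apostrophe is
# treated as a separator.
ASCII_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

# Hypothetical replacement: in Python 3, \w on str patterns is Unicode-aware,
# so Cyrillic letters count as word characters and are no longer separators.
UNICODE_SPLIT_RE = re.compile(r"[^\w']+")


def non_empty_chunks(pattern, text):
    """Split text on the pattern and drop the empty pieces."""
    return [chunk for chunk in pattern.split(text) if chunk]


text = "приветпока"

# The whole input is consumed as one long separator, so nothing is left.
ascii_chunks = non_empty_chunks(ASCII_SPLIT_RE, text)

# The input survives as a single chunk that the language model can segment.
unicode_chunks = non_empty_chunks(UNICODE_SPLIT_RE, text)

print(ascii_chunks)
print(unicode_chunks)
```

With the ASCII-only pattern nothing reaches the language model, which matches the empty list from the question. If the pattern in your version is a module-level variable, reassigning it after `import wordninja` may be enough, but I have only sketched the regex behaviour here, not the patch itself.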