How to handle words that have a space between their characters?


I am using nltk.word_tokenize on Dari text. The problem is that some single words are written with a space inside them.
For example, the word "زنده گی", which means "life", and there are many other words like it. Whenever a part of a word ends with the character "ه", we have to put a space after it; otherwise the letters join, as in "زندهگی".

Can anyone help me, using regex or any other approach, to avoid splitting a word when one part of it ends with "ه" and the next part starts with the character "گ"?

There is 1 solution below

To solve this problem, Persian uses a character called the zero-width non-joiner (نیم‌فاصله in Persian; also called half space or semi space). In practice two code points are used for it: one is the standard, and the other is not standard for this purpose but is widely used:

  1. \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
  2. \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
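If you want to check which of these two characters actually occurs in your text, a minimal Python sketch like the following can help. It only uses the standard unicodedata module; the names ZWNJ, RLM and describe are just illustrative choices of mine.

```python
import unicodedata

ZWNJ = "\u200c"  # the standard half space
RLM = "\u200f"   # right-to-left mark, sometimes found where a half space was intended

def describe(ch):
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in (ZWNJ, RLM):
    print(describe(ch))
```

Calling describe on a suspicious invisible character copied from your corpus tells you whether it is one of the two above or something else entirely.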

As far as I know, Dari is very similar to Persian. So first of all, correct words like زنده گی to زنده‌گی, converting every wrong space to a half space. Then you can simply use this regex to match all the words of a sentence:

[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+

Online demo (the black bullet in the test string is the half space, which regex101 does not display properly; if you check the match information panel and look at Match 5, you will see that the match is correct).
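In Python, the same regex can be used as a simple tokenizer in place of nltk.word_tokenize. This is only a rough sketch with the standard re module: the names WORD_RE and tokenize are my own, and it assumes the text already uses half spaces inside words.

```python
import re

# The character class from above: the Arabic block plus the two half-space code points.
# Note that the range \u0600-\u06FF also contains Arabic punctuation such as "،" and "؟",
# so punctuation written directly after a word stays attached to that word.
WORD_RE = re.compile(r"[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")

def tokenize(text):
    """Return every run of matching characters as one token."""
    return WORD_RE.findall(text)

ZWNJ = "\u200c"
with_half_space = "زنده" + ZWNJ + "گی"   # one token: the half space is inside the class
with_plain_space = "زنده" + " " + "گی"   # two tokens: a normal space is not in the class

print(tokenize(with_half_space))
print(tokenize(with_plain_space))
```

If you would rather stay inside NLTK, nltk.tokenize.RegexpTokenizer accepts a pattern like this one and should give the same kind of result.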

To convert the wrong spaces of a large text to half spaces, there is a free and open-source add-on for Microsoft Word called Virastyar. You can install it and clean up your whole text. Keep in mind, though, that this add-on was created for Persian, not Dari. For example, in Persian زنده‌گی is written as زندگی, so it cannot correct this word for you. Other words, such as می شود, are easily corrected and converted to می‌شود. You can also add custom words to its database.
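For the specific rule in the question ("ه", then a space, then "گ"), the replacement can also be done directly in Python before tokenizing. This is a minimal sketch, not a complete normalizer: join_heh_gaf and HEH_SPACE_GAF are hypothetical names, and I am assuming the "ه" in your text is U+0647 and the "گ" is U+06AF.

```python
import re

ZWNJ = "\u200c"
HEH = "\u0647"  # ه (assumed code point)
GAF = "\u06af"  # گ (assumed code point)

# HEH, then one or more spaces or tabs, but only when GAF comes right after them.
HEH_SPACE_GAF = re.compile(HEH + r"[ \t]+(?=" + GAF + ")")

def join_heh_gaf(text):
    """Replace the space(s) between HEH and GAF with a single half space."""
    return HEH_SPACE_GAF.sub(HEH + ZWNJ, text)

# Caveat: the rule is purely mechanical, so it also joins two genuinely separate
# words when the first ends in HEH and the second starts with GAF.
print(join_heh_gaf("زنده" + " " + "گی"))
```

Because of that caveat, it is probably safer to run it on a sample first and check the joined pairs by eye, or to restrict the pattern to the suffixes you actually need.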