Unicode text normalization in Bengali


I want to perform Unicode text normalization for the Bengali language. For example, the sentences প্রায়শ্চিত্ত - মনীন্দ্র ও তার পড়াশুনা and প্রায়শ্চিত্ত - মণীন্দ্র ও তার পড়াশুনা differ in their Unicode values as listed here. Note the difference between ন and ণ in the word মনীন্দ্র in the first and second sentences:

SENTENCE 1: প্রায়শ্চিত্ত - মনীন্দ্র ও তার পড়াশুনা

[('প', 2474), ('্', 2509), ('র', 2480), ('া', 2494), ('য়', 2527), ('শ', 2486), ('্', 2509), ('চ', 2458), ('ি', 2495), ('ত', 2468), ('্', 2509), ('ত', 2468), (' ', 32), ('-', 45), (' ', 32), ('ম', 2478), ('ন', 2472), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480), (' ', 32), ('ও', 2451), (' ', 32), ('ত', 2468), ('া', 2494), ('র', 2480), (' ', 32), ('প', 2474), ('ড়', 2524), ('া', 2494), ('শ', 2486), ('ু', 2497), ('ন', 2472), ('া', 2494)]

SENTENCE 2: প্রায়শ্চিত্ত - মণীন্দ্র ও তার পড়াশুনা

[('প', 2474), ('্', 2509), ('র', 2480), ('া', 2494), ('য়', 2527), ('শ', 2486), ('্', 2509), ('চ', 2458), ('ি', 2495), ('ত', 2468), ('্', 2509), ('ত', 2468), (' ', 32), ('-', 45), (' ', 32), ('ম', 2478), ('ণ', 2467), ('ী', 2496), ('ন', 2472), ('্', 2509), ('দ', 2470), ('্', 2509), ('র', 2480), (' ', 32), ('ও', 2451), (' ', 32), ('ত', 2468), ('া', 2494), ('র', 2480), (' ', 32), ('প', 2474), ('ড়', 2524), ('া', 2494), ('শ', 2486), ('ু', 2497), ('ন', 2472), ('া', 2494)]
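For reference, listings like the two above can be produced with Python's built-in `ord`. This is only a minimal sketch: the helper names `code_points` and `differences` are made up for illustration, and the comparison assumes both strings have the same length.

```python
def code_points(text):
    """Return (character, decimal code point) pairs for each character."""
    return [(ch, ord(ch)) for ch in text]


def differences(a, b):
    """Return (index, code point in a, code point in b) wherever a and b differ.

    Assumes len(a) == len(b); zip() silently ignores any extra tail.
    """
    return [(i, ord(x), ord(y)) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# The word from the two sentences, written with escapes so the code points are explicit.
word_1 = "\u09ae\u09a8\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with ন (2472)
word_2 = "\u09ae\u09a3\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with ণ (2467)

print(code_points(word_1))
print(differences(word_1, word_2))
```

Only the second character differs between the two spellings, which matches the listings above.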

I found the library https://github.com/csebuetnlp/normalizer for normalization, but the Unicode values of the input text are unchanged after normalizing it. With https://github.com/anoopkunchukuttan/indic_nlp_library, normalization happens only for a limited set of characters, such as the poorna viram ('|', the full stop). Any suggestions for performing this normalization would be helpful.
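A quick check with Python's standard `unicodedata` module (not with either library above) suggests why no difference shows up: ন and ণ are separate code points with no decomposition, so none of the four standard normalization forms maps one to the other. The sketch below also shows the kind of variation that standard normalization does unify, using ড় (2524) against its two-code-point spelling. The variable names are illustrative only.

```python
import unicodedata

na = "\u09a8"   # ন, decimal 2472
nna = "\u09a3"  # ণ, decimal 2467

# For each standard form, record whether the two letters normalize to the same string.
merged = {
    form: unicodedata.normalize(form, na) == unicodedata.normalize(form, nna)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(merged)  # every value is False: the letters stay distinct

# What canonical normalization does handle: ড় as one code point (U+09DC)
# versus ড (U+09A1) followed by the nukta (U+09BC).
precomposed = "\u09dc"
decomposed = "\u09a1\u09bc"
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)  # True
```

So merging ন and ণ is not Unicode normalization in the standard sense; it needs an explicit, application-level mapping.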

Detailed Explanation:

The issue is that what I consider the same character does not always have the same Unicode value. Suppose I search for the string "apple", where 'a' has Unicode value 200, and two of the n strings in the system are candidates: String 1 contains "apple" with 'a' at Unicode value 200, and String 2 contains "apple" with 'a' at Unicode value 300. I want both String 1 and String 2 to show up. Currently, only String 1 shows up because it is the only exact match for the query string.

Both ন and ণ are the same characters, but they are treated differently since their Unicode values are different. For this particular case, I can replace ণ with ন. I am doing this because when I am performing a string search and I want to get words containing 'ন' and 'ণ'. However, there can be cases where some other letters have such ambiguity, or maybe ন is written in some other fashion where its Unicode value is different than 2472 and 2467. I want to know about a principled approach to handling this scenario.
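To make the intended behavior concrete, here is a minimal sketch of the replacement approach described above: both the query and the candidates are reduced to a folded search key before comparison, while the stored strings stay unchanged. The names `FOLD_TABLE`, `search_key`, and `find_matches` are hypothetical, and the table holds only the single ণ to ন mapping from this question; it is an application-specific assumption, not a Unicode-defined equivalence.

```python
import unicodedata

# Hypothetical folding table; only the one pair from the question is included.
FOLD_TABLE = str.maketrans({"\u09a3": "\u09a8"})  # ণ (2467) -> ন (2472)


def search_key(text):
    """Apply standard NFC normalization, then the custom folding table."""
    return unicodedata.normalize("NFC", text).translate(FOLD_TABLE)


def find_matches(query, candidates):
    """Return the candidates whose folded form contains the folded query."""
    key = search_key(query)
    return [c for c in candidates if key in search_key(c)]


word_1 = "\u09ae\u09a8\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with ন
word_2 = "\u09ae\u09a3\u09c0\u09a8\u09cd\u09a6\u09cd\u09b0"  # spelled with ণ
other = "\u09a4\u09be\u09b0"                                  # তার, unrelated word

print(find_matches(word_1, [word_1, word_2, other]))
```

With this folding, a query in either spelling returns both spellings. Which further pairs belong in the table is exactly the open question here.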

P.S. It would also be really helpful if you could point me to any Bengali-specific resource for the canonical representations.
