Count the number of characters in Indic languages (Hindi, Tamil, and other Indian languages)

Is there an optimal way to implement a character count for Indic languages like Hindi and Tamil? For example, the English word "Mother" has 6 letters. The same word in Hindi (माता) has two letters (मा + ता), but its string length comes out as 4. Is there any way to count the number of real characters?

माता -> actual -> 4, expected -> 2
जगदीश -> actual -> 5, expected -> 4
क्रमश -> actual -> 5, expected -> 3

Any help on this would be greatly appreciated...

There are 1 best solutions below

I know that answering after 5 years may not help the original poster, but it might help a few others who are searching for the same thing.

I have the same requirement. From what I have found, there isn't any plug-and-play package for this. The problem with Indic languages is that the word माता is stored as "ma" + "aa" (matra) + "ta" + "aa" (matra), so its length becomes 4. To avoid this, you have to hardcode the Unicode ranges that correspond to full letters only, and ignore the remaining characters.
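To see that breakdown concretely, here is a small sketch in Python (my choice for illustration; the question does not name a language). It uses only the standard `unicodedata` module to list the code points that make up माता.

```python
import unicodedata

word = "\u092E\u093E\u0924\u093E"  # माता

# len() counts code points, so each matra is counted as its own "character".
print(len(word))

# Name every code point to see which are letters and which are vowel signs.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X} {name}")
```

Two of the four entries are named as letters (MA, TA) and the other two as the vowel sign AA, which is why the raw length is double the expected count.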

Look into this: https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)

In that table, the ranges U+0904 to U+0939 and U+0958 to U+095F are the normal letters, and the others are matras and similar signs, which you should ignore. So in whatever programming language you use, apply a .filter() or similar operation over the string to count only the letters.
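A minimal sketch of that filter in Python, assuming Devanagari text only (Tamil and other scripts would need their own ranges). The function names `count_letters` and `count_aksharas` are made up for this example. The second function goes beyond the plain range filter: it is my assumption about how to get 3 for क्रमश, where the virama (U+094D) joins क and र into one conjunct that the range filter would count as two letters.

```python
# Code point ranges treated as full letters (Devanagari block only).
LETTER_RANGES = ((0x0904, 0x0939), (0x0958, 0x095F))
VIRAMA = "\u094D"  # halant: joins the consonant before it to the one after


def is_letter(ch: str) -> bool:
    """True if ch falls inside one of the full-letter ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LETTER_RANGES)


def count_letters(text: str) -> int:
    """Plain range filter: count full letters, ignore matras and signs."""
    return sum(1 for ch in text if is_letter(ch))


def count_aksharas(text: str) -> int:
    """Assumed refinement: a letter right after a virama is not counted,
    so a conjunct such as क्र contributes one unit instead of two."""
    count = 0
    prev = ""
    for ch in text:
        if is_letter(ch) and prev != VIRAMA:
            count += 1
        prev = ch
    return count


mata = "\u092E\u093E\u0924\u093E"            # माता
jagdish = "\u091C\u0917\u0926\u0940\u0936"   # जगदीश
kramash = "\u0915\u094D\u0930\u092E\u0936"   # क्रमश

print(count_letters(mata), count_letters(jagdish), count_letters(kramash))
print(count_aksharas(mata), count_aksharas(jagdish), count_aksharas(kramash))
```

With the plain filter, माता gives 2 and जगदीश gives 4 as expected, but क्रमश gives 4. The virama check brings it down to the expected 3. Treat this as a starting point rather than a complete solution, since it has only been checked against the three words in the question.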