Is there an optimal way to count characters in Indic languages such as Hindi and Tamil? For example, the English word "Mother" has 6 letters. The same word in Hindi (माता) has two letters (मा + ता), but its character length comes out as 4. Is there any way to count the number of real characters?
माता -> actual -> 4, expected -> 2
जगदीश -> actual -> 5, expected -> 4
क्रमश -> actual -> 5, expected -> 3
Any help on this would be greatly appreciated...
I know answering after 5 years is not much help to the asker, but it might help others who are searching for the same thing.
I have the same requirement. From what I have found, there isn't any plug-and-play package for this. The problem with Indic languages is that the word माता is stored as "ma" + "aa" (matra) + "ta" + "aa" (matra), so its length becomes 4. To avoid this, you have to hardcode the ranges of Unicode characters that correspond to full letters only, and ignore the rest.
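To see that breakdown for yourself, here is a small Python sketch that lists the code points of a word. The helper name `describe` is just made up for illustration; `unicodedata.name` is from the standard library.

```python
import unicodedata

def describe(word):
    """Return one 'U+XXXX NAME' string per code point in the word."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in word]

# The word from the question: two letters, each followed by a vowel sign (matra).
for line in describe("माता"):
    print(line)
```

For माता this should list a LETTER, a VOWEL SIGN, another LETTER and another VOWEL SIGN, which is why the plain length is 4.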
Look into this: [https://en.wikipedia.org/wiki/Devanagari_(Unicode_block)][1]
In that table, U+0904 to U+0939 and U+0958 to U+095F are the normal letters, and the others are matras and other signs, which you should ignore. So in whatever programming language you use, apply a .filter() or similar operation to count only the letters.
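A minimal Python sketch of that filter, covering Devanagari only. The function names are made up for illustration. `count_letters` is the range filter described above. `count_aksharas` is an extra assumption of mine, not part of the approach above: it also skips a letter that directly follows a virama (U+094D), so that a conjunct such as क्र is counted once. It is a rough heuristic and does not handle every case (for example ZWJ/ZWNJ sequences or other scripts).

```python
# Ranges from the Devanagari block: independent vowels and consonants,
# plus the additional (nukta) consonants.
LETTER_RANGES = ((0x0904, 0x0939), (0x0958, 0x095F))
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, joins consonants into a conjunct

def is_full_letter(ch):
    """True if ch falls inside one of the hardcoded letter ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LETTER_RANGES)

def count_letters(word):
    """Count only full letters, ignoring matras and other signs."""
    return sum(1 for ch in word if is_full_letter(ch))

def count_aksharas(word):
    """Like count_letters, but a letter right after a virama is not counted
    again (assumption: it belongs to the preceding conjunct)."""
    count = 0
    prev = ""
    for ch in word:
        if is_full_letter(ch) and prev != VIRAMA:
            count += 1
        prev = ch
    return count

for word in ("माता", "जगदीश", "क्रमश"):
    print(word, len(word), count_letters(word), count_aksharas(word))
```

With the words from the question, the plain filter gives 2 and 4 for माता and जगदीश, but 4 for क्रमश, because क and र are both full letters joined by a virama. The virama-aware variant gives 3 for that word, which matches the expected value in the question.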