Hindi words length


I am trying to find the length of Hindi words in Python. For example, 'प्रवीण' should have a length of 3, as far as I know.

w1 = 'प्रवीण'
print(len(w1))

I tried this code, but it did not print 3 as I expected.


There are 3 best solutions below

1
On

Here is working Kotlin code corresponding to the pseudocode provided by Codeman. It gives you two things:

  1. The length of the string in terms of base characters
  2. The string split into parts on the basis of base characters

const val HINDI_LETTERS = "कखगघङचछजझञटठडढणतथदधनपफबभमक़ख़ग़ज़ड़ढ़फ़यरलळवहशषसऱऴअआइईउऊऋॠऌॡएऐओऔॐऍऑऎऒ"

fun getHindiWordLength(word: String): Int {
    var count = 0
    for (i in word.indices) {
        println(word[i])    // Just to see what each character in the string looks like
        // Count a base letter unless it follows a virama (i.e. it completes a half-letter)
        if (word[i] in HINDI_LETTERS && (i == 0 || word[i - 1] != '्'))
            count++
    }
    return count
}

fun splitHindiWordOnBaseLetter(word: String): MutableList<String> {
    var curWord = ""
    val splitWords: MutableList<String> = mutableListOf()
    for (i in word.indices) {
        // Start a new part at each base letter that does not follow a virama
        if (word[i] in HINDI_LETTERS && i > 0 && word[i - 1] != '्') {
            splitWords.add(curWord)
            curWord = ""
        }
        curWord += word[i]
    }
    splitWords.add(curWord)         // last part
    return splitWords
}

I tested this code on these inputs:

    println(getHindiWordLength("प्रवीण"))
    println(splitHindiWordOnBaseLetter("प्रवीण"))
    
    println(getHindiWordLength("आम"))
    println(splitHindiWordOnBaseLetter("आम"))
    
    println(getHindiWordLength("पेड़"))
    println(splitHindiWordOnBaseLetter("पेड़"))
    
    println(getHindiWordLength("अक्षर"))
    println(splitHindiWordOnBaseLetter("अक्षर"))
    
    println(getHindiWordLength("दिल"))
    println(splitHindiWordOnBaseLetter("दिल"))

This is the output I get:

प
्
र
व
ी
ण
3
[प्र, वी, ण]
आ
म
2
[आ, म]
प
े
ड
़
2
[पे, ड़]
अ
क
्
ष
र
3
[अ, क्ष, र]
द
ि
ल
2
[दि, ल]
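
Since the question is about Python, the same split can be sketched there as well. This is only a rough port under one assumption that differs from the Kotlin code: instead of the hard-coded HINDI_LETTERS list, it treats any code point whose Unicode general category is "Lo" (other letter) as a base letter. The names VIRAMA and split_on_base_letters are made up for this illustration.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant), which marks a half-letter


def split_on_base_letters(word):
    """Split word into parts, each starting at a base letter that does not follow a virama."""
    parts = []
    current = ""
    for i, ch in enumerate(word):
        # Assumption: category "Lo" stands in for the HINDI_LETTERS list
        is_letter = unicodedata.category(ch) == "Lo"
        follows_virama = i > 0 and word[i - 1] == VIRAMA
        if is_letter and not follows_virama and current:
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts


for w in ["प्रवीण", "आम", "अक्षर"]:
    parts = split_on_base_letters(w)
    print(len(parts), parts)
```

The length in base characters is then simply the number of parts. This sketch does not handle zero-width joiners or other scripts, so treat it as a starting point.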
0
On

In Hindi (Devanagari script), what looks like one character on screen is often made up of several Unicode code points, unlike most English letters. For example, वी is not one character but two: the letter व followed by the vowel sign ी.

Python's len() counts code points, so for the word प्रवीण it returns 6, not 3.

w1 = "प्रवीण"
for w in w1:
    print(w)

The output is:

प
्
र
व
ी
ण
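
To see what each of these six code points is, the standard unicodedata module can print its category and official name. This is a small sketch; the helper name describe is made up for this example.

```python
import unicodedata


def describe(word):
    """Return (code point, general category, Unicode name) for each code point in word."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in word
    ]


w1 = "प्रवीण"
print(len(w1))  # number of code points, not visible letters
for row in describe(w1):
    print(*row)
```

Letters are reported with category Lo, while the virama (्) and the vowel sign (ी) are reported as marks (Mn and Mc), which is why they attach to a neighbouring letter on screen instead of standing alone.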
0
On

As @betelgeuse has said, Python's len() counts Unicode code points, not the letters you see on screen. Here is some working Python code that does what you expect, though:

w1 = 'प्रवीण'

def hindi_len(word):
    hindi_letts = 'कखगघङचछजझञटठडढणतथदधनपफबभमक़ख़ग़ज़ड़ढ़फ़यरलळवहशषसऱऴअआइईउऊऋॠऌॡएऐओऔॐऍऑऎऒ'
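    # List of Hindi letters that aren't halves or matras (vowel signs)
    virama = '\u094d'  # the sign that turns the preceding consonant into a half-letter
    count = 0
    for idx, ch in enumerate(word):
        # Count a letter unless the character just before it is a virama.
        # Using the position from enumerate (not word.index) keeps repeated letters
        # and the first character from being checked against the wrong neighbour.
        if ch in hindi_letts and (idx == 0 or word[idx - 1] != virama):
            count += 1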
    return count

print(hindi_len(w1))

This outputs 3. It's up to you to customize it as you'd like, though.

Edit: Make sure you use Python 3.x, or prefix Hindi strings with u in Python 2.x. I have seen errors caused by non-Unicode (byte) strings in Python 2.x before.
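
To illustrate why the string type matters, here is a small Python 3 sketch comparing the number of code points with the number of UTF-8 bytes. A plain (non-u) string literal in Python 2 is a byte string, so, assuming the source file is saved as UTF-8, len() there would report the byte count. The helper name lengths is made up for this example.

```python
def lengths(text):
    """Return (code point count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


w1 = "प्रवीण"
code_points, utf8_bytes = lengths(w1)
print(code_points)  # what len() sees on a Python 3 str
print(utf8_bytes)   # what len() would see on a UTF-8 byte string
```

Each Devanagari code point takes three bytes in UTF-8, so the two numbers differ by a factor of three for this word.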