Split Kannada word into syllabic clusters


1.3k Views Asked by mpsbhat At 30 July 2025 at 06:58

I am wondering whether there is a way to split a Kannada word into its syllabic clusters using JavaScript.

For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split(), the array I actually get is ["ಕ", "ನ", "್", "ನ", "ಡ"].


AudioBubble On 01 June 2017 at 13:18

Consider using the "InSC" property (Indic_Syllabic_Category) defined for Unicode characters, which you can get from the Unicode Character Database. (You might also want to consult the general category, to see whether it is "non-spacing mark".) For instance, "್" has the InSC value "Virama" (see http://graphemica.com/0CCD). To take another example, "ಿ" (KANNADA VOWEL SIGN I) has an InSC of "Vowel_Dependent" (and is also in the "non-spacing mark" category). You could then detect which individual code points need to be combined with others and reassemble complete characters, as follows:

const graphemes = [..."ಕನ್ನಡ"];

console.log("graphemes are", graphemes);

const rebuild = [graphemes[0], graphemes.slice(1, 4).join(''), graphemes[4]];

console.log(rebuild);
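The property lookup described above can be approximated in plain JavaScript. This is only a sketch: JavaScript regular expressions do not expose InSC, so it falls back on the general category (`\p{M}`, `\p{Mn}`) and hard-codes the virama code point. The helper name `describeCodePoints` is made up for illustration.

```javascript
// Hypothetical helper, not part of any library.
// Assumption: U+0CCD (KANNADA SIGN VIRAMA) is the only virama we care about.
const KANNADA_VIRAMA = 0x0CCD;

function describeCodePoints(word) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...word].map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      isVirama: cp === KANNADA_VIRAMA,
      // General category M (any mark) and Mn (non-spacing mark)
      isMark: /^\p{M}$/u.test(ch),
      isNonSpacingMark: /^\p{Mn}$/u.test(ch),
    };
  });
}

const described = describeCodePoints("ಕನ್ನಡ");
console.log(described);
```

Code points flagged as marks are the ones that have to be attached to a preceding base character. The general category alone does not say how a virama behaves, which is the problem discussed next.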

Even if you can make this work, you'll have more work to do. It's unclear to me how you would detect that the three characters "ನ", "್", and "ನ" are to be combined, rather than treated as the two characters "ನ್" and "ನ". The problem is that in this case the virama is used to indicate a consonant cluster, so you would need to identify the X-V-X pattern (where V is virama) and treat that as one combined character. There are probably many, many other such special cases.
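One possible way to catch the X-V-X pattern is a regular expression with Unicode property escapes. The sketch below assumes that consonants are the range U+0C95 to U+0CB9, that any mark (`\p{M}`) attaches to what precedes it, and it ignores nukta placement and ZWJ/ZWNJ. The name `splitKannadaClusters` is hypothetical, and this is not a full implementation of any Unicode segmentation rule.

```javascript
// Sketch only. Three alternatives, tried in order:
//   1. consonant, then zero or more (virama + consonant) pairs, then trailing marks
//   2. any non-mark code point followed by its marks (vowels, digits, other scripts)
//   3. stray marks with no base character
const CLUSTER_RE =
  /[\u0C95-\u0CB9](?:\u0CCD[\u0C95-\u0CB9])*\p{M}*|\P{M}\p{M}*|\p{M}+/gu;

function splitKannadaClusters(word) {
  // match() returns null when nothing matches (e.g. an empty string)
  return word.match(CLUSTER_RE) || [];
}

console.log(splitKannadaClusters("ಕನ್ನಡ"));
```

Because the virama-plus-consonant pairs are consumed before the trailing marks, a virama between two consonants joins them, while a virama at the end of a word simply stays attached to its consonant.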

This might be of interest: https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm. It talks about finding "syllable clusters", in this particular case as a prelude to rendering the characters graphically. You may also want to take a look at http://www.unicode.org/L2/L2003/03068-kannada.pdf.

**bugs_cena** · Accepted Answer

I cannot say that this is a complete solution, but it works to an extent, given some basic understanding of how words are formed:

var k = 'ಕನ್ನಡ';
var arr = [];
for (var i = 0; i < k.length; i++) {
  var s = k.charAt(i);

  // Keep appending while a next char exists and either it is not a
  // swara/vyanjana (outside U+0C85..U+0CB9) or the current char is a virama (U+0CCD)
  while ((i + 1) < k.length &&
         (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) == 0xCCD)) {
    s += k.charAt(i + 1);
    i++;
  }
  arr.push(s);
}
console.log(arr);

As the comments in the code say, we keep appending characters to the current cluster as long as the next character is not a swara or vyanjana, or the character just appended is a virama. You will need to test with different words to make sure you cover different cases. In particular, this code doesn't handle numbers.

For the character codes, you can refer to the Kannada code chart: http://www.unicode.org/charts/PDF/U0C80.pdf
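One way the same loop might be extended to numbers is sketched below. It assumes "numbers" means the Kannada digits U+0CE6 to U+0CEF and treats each digit as the start of its own cluster. The names `startsCluster` and `splitClusters` are made up for this example, and other cases (spaces, punctuation, other scripts) are still not handled.

```javascript
const VIRAMA_CODE = 0x0CCD;

// Swara/vyanjana range used in the answer above (U+0C85..U+0CB9)
function isLetter(code) {
  return code >= 0x0C85 && code <= 0x0CB9;
}

// Assumption: Kannada digits (U+0CE6..U+0CEF) also begin a new cluster
function startsCluster(code) {
  return isLetter(code) || (code >= 0x0CE6 && code <= 0x0CEF);
}

function splitClusters(word) {
  const clusters = [];
  for (let i = 0; i < word.length; i++) {
    let cluster = word.charAt(i);
    while (i + 1 < word.length) {
      const next = word.charCodeAt(i + 1);
      // A letter right after a virama continues the cluster
      const joinsConjunct = word.charCodeAt(i) === VIRAMA_CODE && isLetter(next);
      if (startsCluster(next) && !joinsConjunct) break;
      cluster += word.charAt(i + 1);
      i++;
    }
    clusters.push(cluster);
  }
  return clusters;
}

console.log(splitClusters("ಕನ್ನಡ"));
```

Wrapping the loop in a function also makes it easier to run the same logic against a list of test words, as the answer recommends.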
