I am new to Rust, and I am trying to split Devanagari text into units that keep each vowel, and each two-, three-, or four-consonant conjunct, whole together with its vowel sign and virama. Later I want to map these units to other Indic scripts. I first tried Rust's chars(), which didn't work.
Then I came across grapheme clusters. I have been googling and searching SO about Unicode, UTF-8, grapheme clusters, and complex scripts.
I use grapheme clusters in my current code, but it does not give me the desired output: a consonant followed by a virama comes out as its own cluster instead of staying attached to the rest of the conjunct. I understand that this method may not work for complex scripts like Devanagari or other Indic scripts.
How can I achieve the desired output? I also tried to build a simple cluster splitter by porting a Python answer from Stack Overflow to Rust, but I have had no luck. I have been stuck on this problem for two weeks.
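For reference, this is why chars() alone falls short as far as I can tell: it walks Unicode scalar values, so a conjunct such as न्दी comes back as four separate code points rather than one unit. A minimal sketch (the helper name `code_points` is made up for illustration):

```rust
/// Formats each Unicode scalar value of `text` as `U+XXXX`.
/// `code_points` is a hypothetical helper, not part of any library.
fn code_points(text: &str) -> Vec<String> {
    text.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}
```

Calling it on न्दी lists NA, VIRAMA, DA and VOWEL SIGN II as individual entries.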
Here are the Wikipedia articles on the Devanagari script and its conjuncts:
Devanagari Script: https://en.wikipedia.org/wiki/Devanagari
Devanagari Conjuncts: https://en.wikipedia.org/wiki/Devanagari_conjuncts
Here's what I wrote to split:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    // `true` asks for extended grapheme clusters.
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{} ", i); // separate clusters with a space for readability
    }
}
Current output:
हि न् दी मु ख् य मं त् री हि मं त
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
My other attempt:
I also tried to write a simple grapheme cluster splitter following this SO answer: https://stackoverflow.com/a/6806203/2724286
fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();
    for c in text.chars() {
        // U+0300..=U+036F is the Combining Diacritical Marks block used in the
        // Python answer; Devanagari signs live in U+0900..=U+097F instead.
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}

fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);
}
Output:
["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]
So, how can I get the desired output?
Desired output:
हि न्दी मु ख्य मं त्री हि मं त
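To pin down the rule I am after, here is a minimal sketch that uses only the standard library (not unicode-segmentation). It assumes that a cluster is a base character plus any following Devanagari dependent signs, and that a consonant directly after a virama (U+094D) continues the current cluster. The names `is_devanagari_mark`, `is_devanagari_consonant`, and `split_aksharas` are made up for illustration, the code point ranges are my own reading of the Devanagari block, and ZWJ/ZWNJ and other scripts are not handled.

```rust
const VIRAMA: char = '\u{094D}';

/// Dependent signs in the Devanagari block (assumed ranges): candrabindu,
/// anusvara, visarga, nukta, vowel signs, virama, and accent marks.
fn is_devanagari_mark(c: char) -> bool {
    matches!(
        c as u32,
        0x0900..=0x0903 | 0x093A..=0x093C | 0x093E..=0x094F | 0x0951..=0x0957 | 0x0962..=0x0963
    )
}

/// Devanagari consonants KA..HA plus the nukta forms QA..YYA (a simplification).
fn is_devanagari_consonant(c: char) -> bool {
    matches!(c as u32, 0x0915..=0x0939 | 0x0958..=0x095F)
}

/// Splits `text` into clusters, keeping virama-joined consonants together.
/// Whitespace ends the current cluster and is dropped from the output.
fn split_aksharas(text: &str) -> Vec<String> {
    let mut clusters = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if c.is_whitespace() {
            if !current.is_empty() {
                clusters.push(std::mem::take(&mut current));
            }
            continue;
        }
        // A sign always attaches to the current cluster; a consonant attaches
        // only when the cluster so far ends in a virama.
        let continues = !current.is_empty()
            && (is_devanagari_mark(c)
                || (current.ends_with(VIRAMA) && is_devanagari_consonant(c)));
        if !continues && !current.is_empty() {
            clusters.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    if !current.is_empty() {
        clusters.push(current);
    }
    clusters
}
```

With this rule हिन्दी groups as हि न्दी and संस्कृतम् as सं स्कृ त म्, but I have not checked it against a wider range of text, so I would still like to know the proper way to do this.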
I also checked other SO answers (listed below) dealing with Unicode, graphemes, and UTF-8, but no luck yet.
Combined diacritics do not normalize with unicodedata.normalize (Python)
What is the difference between combining characters and grapheme extenders?