Are combining diacritics counted as letters? As far as I know, they can only combine with other letters in well-formed Unicode.
The ICU function that determines whether a Unicode codepoint is a letter takes only one codepoint, so for any given codepoint it can't know whether that codepoint has been combined with a diacritic, or, if the codepoint is itself a diacritic, what it has been combined with. I'm trying to implement something akin to a Unicode-aware regex, using a construct like
while(is_letter(codepoint))
However, I'm quite concerned about what will happen if codepoint is actually a diacritic, which would be combined with a previous codepoint and possibly with other combining marks.
Is this safe to do? Or will I have to explicitly find and ignore diacritics and other combining marks?
Edit: What I really need to do is iterate characters, not codepoints.
This question is a victim of the XY problem. I need to raise a question about my actual problem.
I'm not totally clear on what you're trying to do, so I apologize in advance if this isn't the answer you're looking for, but:
Broadly speaking, diacritics are counted as "marks" rather than "letters". For example, U+0301 COMBINING ACUTE ACCENT, as in <ś>, is a "nonspacing mark", which is one of three kinds of "mark". However, the "modifier letters", which are counted as "letters", might nonetheless be thought of as diacritics; for example, U+02C0 MODIFIER LETTER GLOTTAL STOP, as in <sˀ>, is a "modifier letter".
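As a quick way to check this distinction, here is a minimal sketch in Python, using the standard unicodedata module as a stand-in for the ICU lookup. The helper names is_letter and is_mark are illustrative assumptions, not ICU functions; they simply test the first character of the general category.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if the code point's general category is a letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def is_mark(ch: str) -> bool:
    """True if the code point's general category is a mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

# The two examples from the text above, plus a plain base letter.
samples = {
    "LATIN SMALL LETTER S": "\u0073",
    "COMBINING ACUTE ACCENT": "\u0301",
    "MODIFIER LETTER GLOTTAL STOP": "\u02C0",
}

for name, ch in samples.items():
    # Print the name, the two-letter general category, and both predicates.
    print(name, unicodedata.category(ch), is_letter(ch), is_mark(ch))
```

So a per-codepoint letter test of this kind would reject U+0301 but accept U+02C0.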
If you look through the main file of the Unicode Character Database (warning: it's a 1.3 MB text file), you can get a sense for which characters are classified as "modifier letters" (Lm) and which as "nonspacing marks" (Mn), "spacing marks" (Mc), or "enclosing marks" (Me).
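If the goal is a loop like while(is_letter(codepoint)) that does not stop at a diacritic, one option is to let each letter absorb any marks that follow it. The following is a rough sketch in Python under that assumption; scan_word, is_letter and is_mark are hypothetical names, and this is not full grapheme-cluster segmentation as defined by Unicode (ICU provides a character break iterator for that).

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True for general categories Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")

def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc, Me."""
    return unicodedata.category(ch).startswith("M")

def scan_word(text: str, start: int = 0) -> int:
    """Return the index just past the run of letters beginning at `start`.

    Each letter may be followed by any number of marks, which are
    treated as part of that letter. A mark with no preceding letter
    does not start a run.
    """
    i = start
    while i < len(text) and is_letter(text[i]):
        i += 1
        # Absorb the marks attached to the letter just consumed.
        while i < len(text) and is_mark(text[i]):
            i += 1
    return i

# "s" + combining acute, "a" + combining diaeresis, "b", space, digit
text = "s\u0301a\u0308b 1"
end = scan_word(text)
print(repr(text[:end]))
```

Here the scan stops at the space, so the marks stay attached to their base letters instead of ending the run early.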