Maximum number of codepoints in a grapheme cluster


I am using the C++ ICU library. I wish to split a UTF-8 string into approximately equal chunks, with the chunks demarcated at grapheme cluster boundaries. For both memory and speed reasons, I do not want to convert my entire string into UTF-16 to do this. Instead, I want to translate a small number of UTF-8 code points close to my estimated chunk boundaries into UTF-16. I can then use ICU's BreakIterator to work out the exact boundaries.
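As a rough illustration of the first step of that plan, the sketch below snaps an estimated byte offset back to the start of the code point that contains it, so that a conversion window never begins in the middle of a multi-byte sequence. The name `snap_to_code_point_start` is hypothetical, the input is assumed to be well-formed UTF-8, and the result is only a code point boundary, not a grapheme cluster boundary.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the start of the code point
// that contains byte `pos`. Assumes `s` holds well-formed UTF-8.
// Offsets at or past the end of the string are clamped to s.size().
std::size_t snap_to_code_point_start(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) {
        return s.size();
    }
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx; step back over them.
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

For the seven-byte string "a", "é", "€", "z", offsets 4 and 5 both fall inside "€" and are snapped back to offset 3, where that character starts.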

Is there a hard upper limit of the number of codepoints that can make up a grapheme cluster? If so, what is it? I need to know this in order to determine the minimum number of code points that I need to translate from UTF-8 to UTF-16.


There are 2 best solutions below

Best answer

Is there a hard upper limit of the number of codepoints that can make up a grapheme cluster?

No. There is no hard upper limit on how many code points a grapheme cluster (i.e. a user-perceived character) can consist of.

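For example, you could keep appending ZERO WIDTH JOINER followed by another joinable character (as in emoji ZWJ sequences), or stack any number of combining marks on a single base character.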
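To make this concrete, here is a minimal sketch that uses only the standard library. It builds a base letter followed by an arbitrary number of U+0301 COMBINING ACUTE ACCENT marks. Under the default rules of UAX #29 there is no break before such combining (Extend) characters, so the whole string should be a single grapheme cluster however many marks are appended. The function names `make_long_cluster` and `count_code_points` are made up for this illustration, and the result has not been checked against ICU here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: "e" followed by `marks` copies of
// U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes 0xCC 0x81).
std::string make_long_cluster(std::size_t marks) {
    std::string s = "e";
    for (std::size_t i = 0; i < marks; ++i) {
        s += "\xCC\x81";
    }
    return s;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Calling `make_long_cluster(1000)` gives a 2001-byte string containing 1001 code points, and nothing prevents a larger argument, so no fixed-size window around an estimated chunk boundary is guaranteed to contain a cluster boundary.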


Just to add an example to the accepted answer.

You can for example create arbitrarily large grapheme clusters using this page:

https://glitchtextgenerator.com/

As an example here is a "letter X" that occupies 73 bytes on disk:

x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠

I also created another one that is close to 10 kilobytes, but it is probably better not to post such monsters here, since they could cause problems. Depending on the software, these render in interesting ways.