I am working on a processor that splits text into blocks with markers:
LOREM IPSUM SED AMED
will be parsed like:
{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}
But I don't want to use "{word}" etc., because it slows the processor down: the marker is itself a string again... I need markers like these:
\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED \E003\0006 \E004\0007
\E002\0008 AMED \E003\0008
- The first code, \E002, is the element type number. Its last bit marks the element's close, so element type numbers increment by 2.
- The second code, \0001, is the element index, used for stacking.
- I picked \E002 arbitrarily for this example.
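To make the numbering concrete, here is a minimal Python sketch of the low-bit convention described in the list above. The constants `WORD` and `SPACE` and the helper names are assumptions for illustration only; the values just mirror the arbitrary \E002 and \E004 from the example.

```python
# Assumed type numbers (illustrative): even values open an element,
# the next odd value closes the same element.
WORD = 0xE002
SPACE = 0xE004


def closing(type_number):
    """Return the closing marker for an element type (sets the last bit)."""
    return type_number | 1


def is_closing(type_number):
    """True if the last bit is set, i.e. the marker closes an element."""
    return type_number & 1 == 1


def base_type(type_number):
    """Strip the close bit to recover the element type."""
    return type_number & ~1
```

Because the close flag occupies the last bit, each new element type has to start two above the previous one, which is why the type numbers go up in steps of 2.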
But \0001 is itself a valid Unicode code point, which brings me back to where I started...
So which Unicode range can I use? \ff0000? Or how else can I solve this?
Thanks!
The Unicode Consortium thought of this. There is a range of Unicode code points that are meant to never represent a displayable character, but meta-codes instead:
You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF is not one of them: it is the byte order mark, ZERO WIDTH NO-BREAK SPACE.) There are several more official noncharacters defined, namely U+FDD0–U+FDEF and the last two code points of every plane, and you can be fairly sure they would not occur in regular text strings.
A few random sequences with predefined definitions, and so highly unlikely to occur in plain text strings are:
Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 (paragraph separator).
Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points can only define an unambiguous character in combination with a certain font. However, these codes may pop up if you get your plain text from a source where the font was included.
As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
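A rough Python sketch of that positional rule, assuming Private Use Area markers and assuming the source text itself contains none of them. The marker values and the `encode`/`decode` names are made up for illustration; this is not the asker's actual processor.

```python
import re

# Assumed marker code points from the Private Use Area (illustrative only).
WORD_OPEN, WORD_CLOSE = "\uE002", "\uE003"
SPACE_OPEN, SPACE_CLOSE = "\uE004", "\uE005"
MARKERS = {WORD_OPEN, WORD_CLOSE, SPACE_OPEN, SPACE_CLOSE}


def encode(text, start=1):
    """Wrap each word / whitespace run as <open><index>token<close><index>."""
    parts = []
    for index, token in enumerate(re.findall(r"\s+|\S+", text), start):
        if token.isspace():
            opener, closer = SPACE_OPEN, SPACE_CLOSE
        else:
            opener, closer = WORD_OPEN, WORD_CLOSE
        # The index is stored as a single code point; it is never read as text.
        parts.append(opener + chr(index) + token + closer + chr(index))
    return "".join(parts)


def decode(marked):
    """Return a list of (marker, index) and ("text", str) events."""
    events, buffer, i = [], [], 0
    while i < len(marked):
        ch = marked[i]
        if ch in MARKERS:
            if buffer:
                events.append(("text", "".join(buffer)))
                buffer = []
            # Whatever follows a marker is the sequence number, by position.
            events.append((ch, ord(marked[i + 1])))
            i += 2
        else:
            buffer.append(ch)
            i += 1
    if buffer:
        events.append(("text", "".join(buffer)))
    return events
```

Since the decoder always consumes the code point after a marker as a number, an index such as 1 (U+0001), or even one that happens to equal a marker value, cannot be mistaken for text or for another marker. One caveat of this sketch: indices in the range 0xD800–0xDFFF would produce lone surrogates, which cannot be encoded as UTF-8, so a real implementation would need to skip or split those values.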
As you see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes noncharacter values), some external functions may fail on it, or silently remove the bad values. In such a case, you'd need to stick rigorously to a system in which you only use 'valid' code points.
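If it helps with choosing, here is a small Python sketch that classifies a code point into the groups discussed above and lists which of them already occur in a given text. The function names are hypothetical; the only library call is the standard `unicodedata.category`, and the noncharacter test follows the Unicode definition (U+FDD0–U+FDEF plus the last two code points of each plane).

```python
import unicodedata


def is_noncharacter(cp):
    """True for the 66 code points Unicode reserves as noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Label a code point as noncharacter, control, private use, or other."""
    if is_noncharacter(cp):
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    return {"Cc": "control", "Co": "private use"}.get(category, "other")


def risky_code_points(text):
    """Code points in text that fall into any candidate marker group."""
    return sorted({ord(ch) for ch in text if classify(ord(ch)) != "other"})
```

Running `risky_code_points` over the input before marking it up shows whether a candidate marker already appears in the text. Note that ordinary tabs and line breaks are reported as control characters too, so those would have to be excluded from the marker set, as mentioned above.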