I want to improve my knowledge of Golang by reading the Golang specification, but English isn't my native language, and I do not fully understand what the following text means:
Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.
With reference to the above text, what do the following parts mean?
- The text is not canonicalized
- Single accented code
- Unqualified term character to refer to a Unicode code point in the source text
If questions of this type are not suitable for this site, please advise a more suitable place to ask such questions.
It's important that you understand a particular facet of the Unicode standard first. There are essentially two ways to represent an accented character like ë. One is the single code point U+00EB (Latin Small Letter E with Diaeresis), and the second is the two-code-point sequence ë, which is the simple code point U+0065 (Latin Small Letter E, a regular letter e) followed by another code point, U+0308 (Combining Diaeresis).
Now in effect, these two characters are the same. They are merely constructed differently. This leads to a concept called Unicode equivalence, under which normalizing (or canonicalizing) text treats those two sequences of code points as equivalent.
The spec says the source text is not canonicalized. This means that the two accented letters ë and ë above are not equivalent in the language spec. The first one is the "single accented code point" U+00EB, and the latter is the letter e combined with a combining diacritic.
As for the "unqualified term character", it's just saying "We're defining for this document only the term 'character' to mean a single Unicode code point." This is for ease of reading, not to define anything in the language itself. "Unqualified" simply means that the bare word "character" is used on its own, without a qualifier such as "Unicode" or "accented".