Wikipedia basically says the following about the four values of the form argument of unicodedata.normalize():
- NFC (Normalization Form Canonical Composition)
  - Characters are decomposed
  - then recomposed by canonical equivalence.
- NFKC (Normalization Form Compatibility Composition)
  - Characters are decomposed by compatibility
  - recomposed by canonical equivalence
- NFD (Normalization Form Canonical Decomposition)
  - Characters are decomposed by canonical equivalence
  - multiple combining characters are arranged in a specific order.
- NFKD (Normalization Form Compatibility Decomposition)
  - Characters are decomposed by compatibility
  - multiple combining characters are arranged in a specific order.
So is each of these choices a two-step transform? normalize() only shows the final result. Is there a way to see the intermediate results?
Wikipedia also says:
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
But I cannot reproduce this with normalize(). Could anybody provide more examples to show how these four options work? What are their differences?
>>> from unicodedata import normalize
>>> print(normalize('NFC', 'A°'))
A°
>>> print(normalize('NFKC', 'A°'))
A°
>>> print(normalize('NFD', 'Å'))
Å
>>> print(normalize('NFKD', 'Å'))
Å
>>> len(normalize('NFC', 'A°'))
2
>>> len(normalize('NFKC', 'A°'))
2
>>> len(normalize('NFD', 'Å'))
2
>>> len(normalize('NFKD', 'Å'))
2
Try the following script (Python 3):
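The sketch below is one way to write such a script; it is a minimal illustration, not a definitive version. The helper names codepoints and normalize_all and the choice of sample strings are assumptions made for this example. It writes every character as an escape, so nothing depends on how the text was typed or pasted, and it prints code points instead of glyphs. Note that the 'A°' in the question appears to contain the degree sign U+00B0, which has no decomposition, rather than the combining ring above U+030A.

```python
import unicodedata

# The four values accepted as the first argument of unicodedata.normalize().
FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def codepoints(text):
    """Return the code points of text as a 'U+XXXX U+XXXX' string."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalize_all(text):
    """Map each normalization form to the normalized version of text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# Illustrative samples, written as escapes to avoid copy/paste ambiguity.
samples = {
    "angstrom sign": "\u212b",
    "A with ring above": "\u00c5",
    "A + combining ring above": "A\u030a",
    "A + degree sign": "A\u00b0",
    "fi ligature": "\ufb01",
}

for label, text in samples.items():
    print(f"{label}: {codepoints(text)}")
    for form, result in normalize_all(text).items():
        print(f"  {form:<4} -> {codepoints(result)}")
```

As for the intermediate results: normalize() has no option to expose them, but the decomposed stage of NFC is what NFD returns, and the decomposed stage of NFKC is what NFKD returns, so calling the decomposition form on the same string shows the first step. The ligature sample is there because only the compatibility forms (NFKC, NFKD) split it into separate letters.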
The output shows a feature (or bug) in WebKit browsers (Chrome, Safari): they normalize form data to NFC. See the picture at the very end (Windows 10 cmd prompt, font DejaVu Sans Mono):