Understanding unistr of unicodedata.normalize()


Wikipedia basically says the following for the four values of the form argument of normalize(form, unistr).

- NFC (Normalization Form Canonical Composition)
    - Characters are decomposed by canonical equivalence,
    - then recomposed by canonical equivalence.
- NFKC (Normalization Form Compatibility Composition)
    - Characters are decomposed by compatibility,
    - then recomposed by canonical equivalence.
- NFD (Normalization Form Canonical Decomposition)
    - Characters are decomposed by canonical equivalence,
    - and multiple combining characters are arranged in a specific order.
- NFKD (Normalization Form Compatibility Decomposition)
    - Characters are decomposed by compatibility,
    - and multiple combining characters are arranged in a specific order.

So each of these choices is a two-step transform? But normalize() only shows the final result. Is there a way to see the intermediate results?
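
A minimal sketch of one way to look at that intermediate step, assuming the decomposition part of NFC is exactly what NFD returns (which is how UAX #15 defines it):

from unicodedata import normalize

s = '\u00c5'                               # Å, precomposed (U+00C5)
decomposed = normalize('NFD', s)           # decomposition step: U+0041 U+030A
recomposed = normalize('NFC', decomposed)  # composition step: back to U+00C5

print([f'U+{ord(c):04X}' for c in s])           # ['U+00C5']
print([f'U+{ord(c):04X}' for c in decomposed])  # ['U+0041', 'U+030A']
print([f'U+{ord(c):04X}' for c in recomposed])  # ['U+00C5']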

Wikipedia also says:

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").

But I cannot reproduce it with normalize(). Could anybody provide more examples to show how these four options work, and what their differences are?

>>> from unicodedata import normalize
>>> print(normalize('NFC', 'A°'))
A°
>>> print(normalize('NFKC', 'A°'))
A°
>>> print(normalize('NFD', 'Å'))
Å
>>> print(normalize('NFKD', 'Å'))
Å
>>> len(normalize('NFC', 'A°'))
2
>>> len(normalize('NFKC', 'A°'))
2
>>> len(normalize('NFD', 'Å'))
2
>>> len(normalize('NFKD', 'Å'))
2
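
A likely reason the composition does not happen in the session above, assuming the '°' typed at the prompt is the ordinary degree sign U+00B0 rather than the combining ring above U+030A: the degree sign is not a combining character, so NFC has nothing to compose. A minimal sketch with U+030A instead, which does reproduce the Wikipedia example:

from unicodedata import normalize

# Degree sign U+00B0 is a standalone character, so nothing composes:
print(len(normalize('NFC', 'A\u00b0')))   # 2  -> stays 'A' + '°'

# Combining ring above U+030A composes with the preceding 'A':
print(len(normalize('NFC', 'A\u030a')))   # 1  -> U+00C5 'Å'

# Angstrom sign U+212B and precomposed U+00C5 both decompose to A + U+030A:
print([f'U+{ord(c):04X}' for c in normalize('NFD', '\u212b')])  # ['U+0041', 'U+030A']
print([f'U+{ord(c):04X}' for c in normalize('NFD', '\u00c5')])  # ['U+0041', 'U+030A']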

There is 1 answer below.

You are asking for examples that show the difference among the four cases.

Try the following script (Python 3):

# -*- coding: utf-8 -*-

import sys
from unicodedata import normalize

def encodeuni(s):
    '''
Returns input string encoded to escape sequences as in a string literal.
Output is similar to
  str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\');
but even every ASCII character is encoded as a \\xNN escape sequence
(except a space character). For instance: 

s = 'A á ř 🌈';
encodeuni(s);       # '\\x41 \\xe1 \\u0159 \\U0001f308'     while 
str(s.encode('unicode_escape')).strip("b'").replace('\\\\','\\');
#                   #    'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch                if ordch == 0x20   else 
                 f"\\x{ordch:02x}" if ordch <= 0xFF   else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
                 
    return ''.join([encodechar(ch) for ch in s]) 

if len(sys.argv) >= 2:
    letters = ' '.join(sys.argv[1:])
    # .\SO\59979037.py  ÅÅÅ
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    #          \u212B                     Å Angstrom Sign
    #                 \u00C5              Å Latin Capital Letter A With Ring Above
    #                        \u0041       A Latin Capital Letter A
    #                              \u030A ̊  Combining Ring Above
    #                                     \U0001f308  Rainbow

print('\t'.join( ['raw' , letters.ljust(10), str(len(letters)), encodeuni(letters),'\n']))
for form in ['NFC','NFKC','NFD','NFKD']:
    letnorm = normalize(form, letters)
    print( '\t'.join( [form, letnorm.ljust(10), str(len(letnorm)), encodeuni(letnorm)]))

The output also shows a feature/bug in WebKit/Blink-based browsers (Chrome, Safari): they normalize form data to NFC. See the screenshot below the listing (Windows 10 cmd prompt, font DejaVu Sans Mono):

.\SO\59979037.py  ÅÅÅ
raw     ÅÅÅ            4       \u212b\xc5\x41\u030a

NFC     ÅÅÅ             3       \xc5\xc5\xc5
NFKC    ÅÅÅ             3       \xc5\xc5\xc5
NFD     ÅÅÅ          6       \x41\u030a\x41\u030a\x41\u030a
NFKD    ÅÅÅ          6       \x41\u030a\x41\u030a\x41\u030a

[Screenshot: the script's output rendered in the Windows 10 cmd prompt, font DejaVu Sans Mono]
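
The script above does not include a case where the compatibility (K) forms differ from the canonical ones, so here is a small additional sketch; the ligature ﬁ (U+FB01) and the circled digit ① (U+2460) are chosen purely as illustrations:

from unicodedata import normalize, name

# Compatibility (K) forms also replace "compatibility" characters such as
# ligatures and circled digits; the canonical forms leave them untouched.
for s in ['\ufb01',    # ﬁ  LATIN SMALL LIGATURE FI
          '\u2460']:   # ①  CIRCLED DIGIT ONE
    print(name(s))
    for form in ['NFC', 'NFKC', 'NFD', 'NFKD']:
        res = normalize(form, s)
        print(' ', form, repr(res), [f'U+{ord(c):04X}' for c in res])

# ﬁ stays U+FB01 under NFC/NFD but becomes 'f' + 'i' under NFKC/NFKD;
# ① stays U+2460 under NFC/NFD but becomes '1' under NFKC/NFKD.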