Python Unicode Normalization Does Not Change '\u0069\u0307' (i̇)


I'm working with Python's unicodedata module to normalize strings, but I'm encountering unexpected behavior with a particular character sequence. My goal is to normalize a string containing "i̇" (LATIN SMALL LETTER I followed by COMBINING DOT ABOVE, i.e. the two code points U+0069 U+0307) using the different normalization forms provided by unicodedata.
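To confirm exactly what the string contains, I also printed the individual code points and their Unicode names:

```python
import unicodedata

test_string = "\u0069\u0307"  # same string, written with explicit escapes
for ch in test_string:
    # unicodedata.name() returns the official Unicode character name
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0069 LATIN SMALL LETTER I
# U+0307 COMBINING DOT ABOVE
```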

Here's the code snippet I'm using:

import unicodedata

test_string = "\u0069\u0307"  # "i̇": i + combining dot above
print("Original length:", len(test_string))
print("NFKC normalized length:", len(unicodedata.normalize('NFKC', test_string)))
print("NFD normalized length:", len(unicodedata.normalize('NFD', test_string)))
print("NFC normalized length:", len(unicodedata.normalize('NFC', test_string)))
print("NFKD normalized length:", len(unicodedata.normalize('NFKD', test_string)))

The output I'm getting is:

Original length: 2
NFKC normalized length: 2
NFD normalized length: 2
NFC normalized length: 2
NFKD normalized length: 2

I expected the length of the normalized string to change under at least some of the normalization forms, especially NFD and NFKD, which typically decompose characters, or NFC/NFKC, which typically compose a base character plus combining mark into a single precomposed code point. However, the length remains 2 in all four forms.
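For comparison, a character like "é" behaves the way I expected: NFD decomposes the single precomposed code point U+00E9 into two code points, and NFC composes them back into one.

```python
import unicodedata

s = "\u00E9"  # "é", LATIN SMALL LETTER E WITH ACUTE (precomposed)
decomposed = unicodedata.normalize('NFD', s)
print(len(decomposed))                              # 2: e + combining acute accent
print(len(unicodedata.normalize('NFC', decomposed)))  # 1: back to the precomposed form
```

So my "i̇" sequence seems to be a special case where neither decomposition nor composition changes anything.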

Can anyone explain why this specific sequence does not change in length under any normalization form? Is there a different approach I should take to normalize such characters in Python?
