Given a list of Unicode code points, how does one split them into a list of Unicode characters?

372 Views Asked by At

I'm writing a lexical analyzer for Unicode text. Many Unicode characters require multiple code points (even after canonical composition). For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I know where the boundary is between two characters? Additionally, I'd like to store the unnormalized version of the text. My input is guaranteed to be Unicode.

So far, this is what I have:

from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []

    for i in range(len(text)):
        character += text[i]

        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]

    if len(character) > 0:
        characters.append(character)

    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))

This incorrectly prints the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

I expect it to print the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']
2

There are 2 best solutions below

1
On BEST ANSWER

The boundaries between perceived characters can be identified with Unicode's Grapheme Cluster Boundary algorithm. Python's unicodedata module doesn't have the necessary data for the algorithm (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg.

0
On

You may want to use the pyuegc library, an implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.

from pyuegc import EGC  # pip install pyuegc

string = 'Puélla in vī́llā vīcī́nā hábitat.'
egc = EGC(string)
print(egc)
# ['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

print(len(string))
# 35
print(len(egc))
# 31