Determine if a Unicode character exists in a Unicode subset

I'd like to find a way to determine whether a Unicode character belongs to a standardized subset of Unicode, specifically Basic Latin and Latin-1. I am using Python 2 and the unicodedata module, but I need a solution that also works in Python 3 because my job will be upgrading soon.

My current thinking is to parse the Unicode Scripts.txt file into some kind of dictionary to search through. The problem is that the code points in that file are written like this:

02B9..02C1

and Unicode code points in Python look like this:

u'\xe6'

I do not know how to compare the two. My guess is that both are hexadecimal, and that Python's escape is just another way of writing the same number.

Are there any existing JSON data sets of Unicode subsets and their characters I can reference? Googling has turned up nothing. Would it be best to just make one from the Wikipedia page since the dataset is relatively small?

There is 1 best solution below

02B9..02C1 is a range of hexadecimal code points. Using unicodedata.name you can get the names of the characters in that range:

import unicodedata

# Python 3; on Python 2 use unichr(i) instead of chr(i).
for i in range(0x02B9, 0x02C1 + 1):
    char = chr(i)
    print(hex(i), char, unicodedata.name(char))


0x2b9 ʹ MODIFIER LETTER PRIME
0x2ba ʺ MODIFIER LETTER DOUBLE PRIME
0x2bb ʻ MODIFIER LETTER TURNED COMMA
0x2bc ʼ MODIFIER LETTER APOSTROPHE
0x2bd ʽ MODIFIER LETTER REVERSED COMMA
0x2be ʾ MODIFIER LETTER RIGHT HALF RING
0x2bf ʿ MODIFIER LETTER LEFT HALF RING
0x2c0 ˀ MODIFIER LETTER GLOTTAL STOP
0x2c1 ˁ MODIFIER LETTER REVERSED GLOTTAL STOP
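To compare a range written this way with a Python character, convert both sides to integers: int(text, 16) for the hexadecimal field and ord(char) for the character. The following is a minimal sketch that should run on both Python 2 and Python 3; the helper names parse_range and in_range are made up for illustration and are not part of unicodedata.

```python
# -*- coding: utf-8 -*-

def parse_range(field):
    """Turn '02B9..02C1' or a single '00AA' into an inclusive (first, last) pair of ints."""
    first, _, last = field.strip().partition('..')
    # A single code point has no '..' part, so it is its own upper bound.
    return int(first, 16), int(last or first, 16)


def in_range(char, field):
    """Return True if the code point of char falls inside the given range field."""
    first, last = parse_range(field)
    return first <= ord(char) <= last


print(in_range(u'\u02bb', '02B9..02C1'))  # True: 0x2BB lies inside the range
print(in_range(u'\xe6', '02B9..02C1'))    # False: 0xE6 lies below the range
```

So u'\xe6' is simply the character whose code point is 0xE6, and the fields in the data files are the same kind of number without the escape syntax.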

If you want to know whether they are part of Latin-1, you can try to encode them in that (or any other) encoding:

# Python 3; on Python 2 use unichr(i) instead of chr(i).
for i in range(0x02B9, 0x02C1 + 1):
    char = chr(i)
    try:
        char.encode('latin1')
    except UnicodeEncodeError:
        # No Latin-1 byte exists for this character.
        print(char, False)
    else:
        print(char, True)

All of them will print False because none of them is in Latin-1.
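The encoding test cannot tell Basic Latin apart from the rest of Latin-1, because the latin1 codec accepts every code point from U+0000 to U+00FF. If the two subsets need to be distinguished, comparing the code point against the block boundaries directly is one option. The sketch below assumes the block ranges as listed in the Unicode Blocks.txt file (Basic Latin is 0000..007F, Latin-1 Supplement is 0080..00FF); the BLOCKS table and the block_of helper are hypothetical names for illustration. It should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

# Inclusive code point ranges for the two blocks of interest.
# Other blocks could be added in the same way.
BLOCKS = {
    'Basic Latin': (0x0000, 0x007F),
    'Latin-1 Supplement': (0x0080, 0x00FF),
}


def block_of(char):
    """Return the name of the listed block containing char, or None."""
    code = ord(char)
    for name, (first, last) in BLOCKS.items():
        if first <= code <= last:
            return name
    return None


print(block_of(u'a'))       # Basic Latin
print(block_of(u'\xe6'))    # Latin-1 Supplement
print(block_of(u'\u02b9'))  # None
```

Because the table is tiny, hard-coding it avoids parsing a data file or building a JSON data set just for these two subsets.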