I'd like to find a way to determine if a Unicode character exists in a standardized subset of Unicode characters, specifically Basic Latin and Latin-1. I am using Python 2 and the unicodedata module, but I need a solution that works in Python 3 as well because my job will be upgrading soon.
My current thinking is to use the Unicode Scripts.txt file and parse it into some kind of dictionary to search through. The problem is that the code points in that file are formatted like this:
02B9..02C1
and Unicode code points in Python look like this:
`u'\xe6'`
I do not know how I'd go about comparing these two things. I guess it's hexadecimal, and Python's representation is just another way of representing hexadecimal.
Are there any existing JSON data sets of Unicode subsets and their characters I can reference? Googling has turned up nothing. Would it be best to just make one from the Wikipedia page since the dataset is relatively small?
`02B9..02C1` is a range of hexadecimal code points. Using `unicodedata.name` you can get the names of the characters in that range.
If you want to know whether they belong to Latin-1, you can `try` to convert them into that (or any other) encoding. A check built this way will return `False` for every character in this range, because none of them is part of Latin-1.
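A minimal sketch of that approach, written to run on both Python 2 and Python 3. The helper names `parse_range` and `in_encoding` are made up for illustration and are not part of `unicodedata`. The sketch assumes each range is given in the `XXXX..YYYY` form quoted from Scripts.txt, and that "being in Latin-1" can be tested by encoding to Python's `latin-1` codec.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import sys
import unicodedata

# unichr exists only on Python 2; on Python 3 chr handles every code point.
if sys.version_info[0] >= 3:
    unichr = chr


def parse_range(text):
    """Turn 'XXXX..YYYY' (or a single 'XXXX') into a list of characters.

    Hypothetical helper: the hex strings are converted with int(..., 16).
    """
    first, _, last = text.partition('..')
    start = int(first.strip(), 16)
    end = int(last.strip(), 16) if last.strip() else start
    return [unichr(cp) for cp in range(start, end + 1)]


def in_encoding(char, encoding='latin-1'):
    """Return True if char can be encoded in the given encoding."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


chars = parse_range('02B9..02C1')
for ch in chars:
    # Code point, official name, and whether the character fits in Latin-1.
    print('U+%04X' % ord(ch), unicodedata.name(ch, '<unnamed>'), in_encoding(ch))

# The character from the question, u'\xe6', is inside Latin-1.
print(in_encoding('\xe6'))
```

The loop reports `False` for each of the nine characters in the range, while `'\xe6'` reports `True`. Passing `'ascii'` as the encoding gives the same kind of test for Basic Latin.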