Get character code of specific encoding from string


I'm trying to get the Shift-JIS character code from a Unicode string. I'm not really that knowledgeable in Python, but here is what I have tried so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from struct import *

data="臍"
udata=data.decode("utf-8")
data=udata.encode("shift-jis").decode("shift-jis")
code=unpack(data, "Q")
print code

But I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u81cd' in position 0: ordinal not in range(128). The string is always a single character.


There are 2 best solutions below


In Python 2, when you write a UTF-8 encoded string literal, you can leave it encoded as a byte string (data = "臍"), or you can have Python decode it into a unicode string when the program is parsed (data = u"臍"). The second option is the normal way to create strings when your source file is UTF-8 encoded.

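When you tried to convert to Shift-JIS, you ended up decoding the Shift-JIS bytes back into a Python unicode string, so there were no encoded bytes left to unpack. And when you tried to unpack, you asked for "Q" (unsigned long long, 8 bytes) when you really want "H" (unsigned short, 2 bytes). Note also that unpack takes the format string as its first argument and the data as its second; passing the unicode string in the format position is most likely where the UnicodeEncodeError comes from.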

The following two samples show how to get the code for the character:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from struct import unpack

# here we have a byte string (str) that holds the utf-8 encoded char
data = "臍"
jis_data = data.decode("utf-8").encode("shift-jis")
# ">H" reads the two bytes as one big-endian unsigned short
code = unpack(">H", jis_data)[0]
print repr(data), repr(jis_data), code, hex(code)

# here python decodes the utf-8 encoded char for us
data = u"臍"
jis_data = data.encode("shift-jis")
code = unpack(">H", jis_data)[0]
print repr(data), repr(jis_data), code, hex(code)

Which results in:

'\xe8\x87\x8d' '\xe4`' 58464 0xe460
u'\u81cd' '\xe4`' 58464 0xe460
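
The ">H" format above assumes the character always encodes to exactly two bytes. Shift-JIS also has single-byte characters (ASCII and half-width katakana), so a slightly more general version might look like the sketch below. The helper name shift_jis_code is made up for illustration, and the code is written so that it should run on both Python 2 and Python 3.

```python
import struct

def shift_jis_code(char):
    # hypothetical helper: returns the Shift-JIS code of one character as an int
    encoded = char.encode("shift_jis")
    if len(encoded) == 1:
        # single-byte character (ASCII, half-width katakana)
        return struct.unpack("B", encoded)[0]
    # double-byte character: read both bytes as one big-endian unsigned short
    return struct.unpack(">H", encoded)[0]

print(hex(shift_jis_code(u"\u81cd")))  # 0xe460
print(hex(shift_jis_code(u"A")))       # 0x41
```

Checking the encoded length first avoids the struct.error that ">H" raises when it is given only one byte.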

That character is represented in Shift-JIS as the two-byte sequence 0xE4 0x60:

>>> data = u'\u81cd'
>>> data_shift_jis = data.encode('shift-jis')
>>> data_shift_jis
'\xe4`'
>>> hex(ord('`'))
'0x60'

So '\xe4\x60' is u'\u81cd' encoded as shift-jis.
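
If only the numeric values are needed, the bytes can also be inspected without struct. The following is a rough sketch that should work on both Python 2 and Python 3; the variable names are arbitrary.

```python
encoded = u"\u81cd".encode("shift_jis")

# bytearray yields each byte as an integer on both Python 2 and Python 3
byte_values = [hex(b) for b in bytearray(encoded)]
print(byte_values)  # ['0xe4', '0x60']

# combine the bytes, most significant first, into a single code
code = 0
for b in bytearray(encoded):
    code = (code << 8) | b
print(hex(code))  # 0xe460
```

Because the loop simply shifts in one byte at a time, it gives the same result for single-byte and double-byte characters.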