Converting UTF-8 to bits - Python


Given a character, how can we transform its UTF-8 encoding to bits in Python?

As an example, a corresponds to 01100001. I am aware of ord, but something like bin(ord('a'))[2:] returns 1100001, omitting the leading 0. Of course, with zfill(8) I can pad it to 8 bits, but I would like to know if there is a more Pythonic way of doing this. For instance, if we do not know in advance how many bits the encoding requires, the zfill(8) approach may no longer work, as it may be 16 bits long.
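For reference, the behavior described above can be reproduced with a quick sketch (a format spec is one alternative to slicing bin() output):

```python
c = 'a'
# bin() prepends '0b' and drops leading zeros, so slicing gives 7 digits.
print(bin(ord(c))[2:])           # 1100001
# Manual padding with zfill restores the leading zero.
print(bin(ord(c))[2:].zfill(8))  # 01100001
# A format spec expresses the fixed width directly.
print(format(ord(c), '08b'))     # 01100001
```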


Python 3 strings contain Unicode code points, not "UTF-8 characters". You can use ord() to get the Unicode code point value, and .encode() to convert it to UTF-8 bytes. Then format each byte as 8-digit binary text, and .join() them together. Example:

# starting and ending code points for 1-, 2-, 3- and 4-byte UTF-8.
s1 = '\x00\x7f\x80\u07ff\u0800\uffff\U00010000\U0010FFFF'

# some printable characters, one in each byte-length range
s2 = 'Aü马\U0001F382'

def utf8_bin(u):
    # format as 8-digit binary, join each byte with space
    return ' '.join([f'{i:08b}' for i in u.encode()])

for u in s1:
    col1 = f'U+{ord(u):04X}' # format Unicode codepoint, leading zeros if <4 digits.
    print(f'{col1:8} {utf8_bin(u)}')

print()

for u in s2:
    col1 = f'U+{ord(u):04X}'
    print(f'{col1:8} {u} {utf8_bin(u)}')

Output:

U+0000   00000000
U+007F   01111111
U+0080   11000010 10000000
U+07FF   11011111 10111111
U+0800   11100000 10100000 10000000
U+FFFF   11101111 10111111 10111111
U+10000  11110000 10010000 10000000 10000000
U+10FFFF 11110100 10001111 10111111 10111111

U+0041   A 01000001
U+00FC   ü 11000011 10111100
U+9A6C   马 11101001 10101001 10101100
U+1F382 🎂 11110000 10011111 10001110 10000010
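Going the other way is also straightforward. The helper below is not part of the answer above, just an illustrative sketch: it parses a space-separated bit string (as produced by utf8_bin) back into bytes and decodes them as UTF-8.

```python
def bits_to_char(bits):
    # Convert each 8-digit binary group to an int, collect into bytes,
    # then decode the byte sequence as UTF-8.
    data = bytes(int(b, 2) for b in bits.split())
    return data.decode('utf-8')

print(bits_to_char('11101001 10101001 10101100'))  # 马
```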