Python: Determining if I have a 16-bit encoded string


I have a UTF-16-BE encoded string:

utf16be = '\x0623\x0631\x0646\x0628'

print repr(utf16be)
> '\x0623\x0631\x0646\x0628'

I need to know whether it uses a 1-byte or a 2-byte encoding. I tried the snippet below:

for c in utf16be:
    c_ord = ord(c)
    if c_ord >= 256:
        print "It's a 2-byte (or more) encoded string"
        break

But that won't work. I expected utf16be[0] to be equal to '\x0623', but it is actually equal to '\x06':

for c in utf16be:
    print repr(c)

> '\x06'
> '2'
> '3'
> '\x06'
> '3'
> '1'
> '\x06'
> '4'
> '6'
> '\x06'
> '2'
> '8'

So what is the best practice for checking whether I have a 2-byte encoded string?


There are 2 best solutions below


Use the chardet package to guess the encoding.
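A minimal sketch of what that could look like, assuming the third-party chardet package is installed (`pip install chardet`). The helper name `guess_encoding` and the sample bytes are made up for illustration. The result is only a statistical guess, and short inputs without a byte order mark are often misidentified.

```python
try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's guess for a byte string, or None if chardet is missing."""
    if chardet is None:
        return None
    # chardet.detect returns a dict that includes 'encoding' and 'confidence'.
    return chardet.detect(data)


# The 'utf-16' codec prepends a byte order mark, which gives the detector a strong hint.
sample = u'\u0623\u0631\u0646\u0628'.encode('utf-16')
print(guess_encoding(sample))
```

Check the reported confidence before trusting the guess.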


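A UTF-16-BE encoded string necessarily has two bytes per code unit (hence the 16 in the name). UTF-8 uses one-byte code units, with one to four of them per character, whereas UTF-16 always uses two-byte code units, with one or two of them per character.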

Your comment suggests you're getting a string and you need to figure out whether it's one, two or more bytes per character but that doesn't make sense. You need to know the encoding of the string to make sense of it - otherwise it's guesswork.
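The output in the question can be explained by how Python parses escapes: `\x` consumes exactly two hex digits, so '\x0623' is the byte 0x06 followed by the characters '2' and '3'. The sketch below is an illustration rather than the asker's actual code. It assumes the intended text is the four code points U+0623, U+0631, U+0646, U+0628 and uses `\u` escapes to build it; the variable names are made up.

```python
# "\x" takes exactly two hex digits, so each "\x06NN" here is three bytes.
raw = b'\x0623\x0631\x0646\x0628'
print(len(raw))  # 12

# "\u" takes four hex digits and produces one code point per escape.
text = u'\u0623\u0631\u0646\u0628'
print(len(text))  # 4

# Encoding is what turns code points into bytes; the byte count depends on the codec.
utf16be = text.encode('utf-16-be')
utf8 = text.encode('utf-8')
print(len(utf16be))  # 8: two bytes per code unit
print(len(utf8))     # 8: these Arabic letters each take two bytes in UTF-8

# Decoding with the known encoding recovers the original text.
print(utf16be.decode('utf-16-be') == text)  # True
```

Both encodings produce 8 bytes for this text, so the byte length alone does not reveal which encoding was used.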