Convert UTF-8 To UTF-7 in Python

275 Views Asked by At

I am trying to find a way to implement this in python: CyberChef Example

I want to convert from UTF-8 to UTF-7.

For example, given the string

<root><test>aaa</test><hel>asd</hel></root>

The encoded output should be the bytes

+ADw-root+AD4-+ADw-test+AD4-aaa+ADw-/test+AD4-+ADw-hel+AD4-asd+ADw-/hel+AD4-+ADw-/root+AD4-

I tried using codec and decode('utf-7') but this just returns the string as is:

>>> "<root><test>aaa</test><hel>asd</hel></root>".encode("utf-7")
b'<root><test>aaa</test><hel>asd</hel></root>'
1

There are 1 best solutions below

1
On

< and > are among what UTF-7 (RFC 2152) calls "optional direct characters". These characters allowed to be either directly encoded as their ASCII equivalents, or be encoded using a Unicode shifted encoding. Python's UTF-7 implementation chooses to use direct encodings for all optional direct characters.

For example, when encoding to UTF-7, Python maps < to its single-byte ASCII representation b'<':

>>> "<".encode("utf-7")
b'<'

When decoding bytes from UTF-7, Python will happily handle either a direct encoding or Unicode shifted encoding for <:

>>> b"+ADw-".decode("utf-7")
'<'
>>> b"<".decode("utf-7")
'<'

If you want to obtain a Unicode shift encoding for < (or for other optional direct characters), you'll need to either a) manually convert the direct-encoded characters to their Unicode shift equivalents, or b) use a different UTF-7 implementation that provides more fine-grained encoding controls.

To manually convert < and > from direct encodings to Unicode shift encodings, you can use bytes.replace:

text = "<root><test>aaa</test><hel>asd</hel></root>"
payload = text.encode("utf-7")
payload = payload.replace(b"<", b"+ADw-")
payload = payload.replace(b">", b"+AD4-")

>>> print(payload)
b'+ADw-root+AD4-+ADw-test+AD4-aaa+ADw-/test+AD4-+ADw-hel+AD4-asd+ADw-/hel+AD4-+ADw-/root+AD4-'

>>> text == payload.decode("utf-7")
True