Python can't encode with surrogateescape


I have a problem encoding Unicode surrogates in Python 3.4:

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

If I'm not mistaken, according to the Python documentation:

'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

The code should just produce the source sequence (b'\xCC'). So why is the exception raised instead?

This is possibly related to my second question:

Changed in version 3.4: The utf-16* and utf-32* encoders no longer allow surrogate code points (U+D800–U+DFFF) to be encoded.

(From https://docs.python.org/3/library/codecs.html#standard-encodings)

As far as I know, some code points cannot be encoded in UTF-16 without surrogate pairs. So what is the reason behind this change?


There are 2 best solutions below

Best answer

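This change was made because the Unicode standard explicitly disallows encoding lone surrogate code points. Characters outside the BMP are still encoded as surrogate pairs; only surrogate code points that appear in the string itself are rejected. See issue #12892 for the details. Apparently the surrogateescape error handler cannot be made to work with UTF-16 or UTF-32, because these codecs are not ASCII-compatible.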

Specifically:

I tested utf_16_32_surrogates_4.patch: surrogateescape with as encoder does not work as expected.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'

=> I expected '[\udc80\udcdc]'.

to which came the response:

Yes, surrogateescape doesn't work with ASCII incompatible encodings and can't. First, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth separated issue (or even PEP) and discussion on Python-Dev.

I believe the surrogateescape handler was meant mainly for UTF-8 data. That decoding from UTF-16 or UTF-32 now works with it too is a nice extra, but apparently it can't work in the other direction.
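
To illustrate the difference, here is a minimal sketch (the variable names are made up for this example, and the sample bytes are an assumption rather than anything from the issue). It shows the round trip that surrogateescape does support with UTF-8, that a non-BMP character still encodes to UTF-16 as a surrogate pair, and that the lone escaped surrogate from the question is rejected by the UTF-16 encoder:

```python
# An undecodable byte in UTF-8 data is escaped to a lone surrogate
# (0xCC -> U+DCCC) and restored to the same byte on encoding.
raw = b'abc\xcc'
text = raw.decode('utf-8', 'surrogateescape')
restored = text.encode('utf-8', 'surrogateescape')

# A character outside the BMP (U+1F600) is still encoded to UTF-16
# as a surrogate pair; this is unaffected by the 3.4 change.
pair = '\U0001F600'.encode('utf-16-be')

# A lone surrogate in the str is what the UTF-16 encoder refuses,
# even when the surrogateescape handler is requested.
try:
    '\udccc'.encode('utf-16-be', 'surrogateescape')
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(repr(text), restored, pair, rejected)
```

So the restriction is on surrogate code points appearing in the string, not on the surrogate pairs that the encoder produces itself.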

Second answer

If you use surrogatepass instead of surrogateescape, encoding lone surrogates to UTF-16 should work on Python 3.

See https://docs.python.org/3/library/codecs.html#codec-base-classes, which says that surrogatepass allows encoding and decoding of surrogate codes (for the UTF-related encodings).
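
A short sketch of what that looks like (variable names are only for illustration). Note that surrogatepass writes the surrogate out as an ordinary two-byte UTF-16 code unit, so it does not give back the single byte b'\xCC' from the question; it only lets the string with the lone surrogate be encoded and decoded again:

```python
# The lone surrogate produced by surrogateescape in the question.
lone = '\udccc'

# surrogatepass encodes the surrogate code point as a normal
# UTF-16 code unit instead of raising UnicodeEncodeError.
encoded = lone.encode('utf-16-be', 'surrogatepass')

# Decoding with the same handler returns the original string.
decoded = encoded.decode('utf-16-be', 'surrogatepass')

print(encoded, repr(decoded))
```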