Python semantics for unicode ranges involving astral planes


What exactly are the intended semantics for character ranges in regular expressions if one or both endpoints of the range are outside the BMP? I've observed that the following input behaves differently in Python 2.7 and 3.5:

import re
bool(re.match(u"[\u1000-\U00021111]", "\u1234"))

In my 2.7 I get False, in 3.5 I get True. The latter makes sense to me. The former is perhaps due to \U00021111 being represented by a surrogate pair \ud844\udd11, but even then I don't understand it since \u1000-\ud844 should include \u1234 just fine.

  • Is this specified somewhere?
  • Is this intended behavior?
  • Does this just depend on the Python version, or also on compile-time flags regarding UTF-16 vs. UTF-32?
  • Is there a way to get consistent behavior without case distinctions?
  • If case distinctions are unavoidable, what exactly are the conditions?

Wiktor Stribiżew On BEST ANSWER

Just use the u prefix with the input string to tell Python it is a Unicode string:

>>> bool(re.match(u"[\u1000-\U00021111]", u"\u1234")) # <= See u"\u1234"
True

In Python 2.7, an unprefixed string literal is a byte string, so you need the u prefix (or an explicit decode to Unicode) on every string you process. In Python 3, all string literals are Unicode by default, as stated in the docs.

MvG On

Here is what I found out so far.

PEP 261, which was accepted for Python 2.2, introduced a compile-time flag to build Unicode support using either a narrow UTF-16 representation or a wide UTF-32 representation of characters. Check hex(sys.maxunicode) or len(u'\U00012345') to distinguish these at runtime: narrow builds will report a maximum of 0xffff and a length of 2, wide builds a maximum of 0x10ffff and a length of 1. PEP 393 for Python 3.3 hides the implementation details of a Unicode string, making all strings appear like UTF-32 (without actually wasting that much space unless necessary). So narrow builds prior to 3.3 will decompose codepoints on astral planes into surrogate pairs, and treat the individual surrogates independently both when constructing the regular expression and in the string to be matched against. Or at least I could find no indication to the contrary.
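To make the runtime check and the surrogate decomposition concrete, here is a minimal sketch. The helper names is_narrow_build and to_surrogate_pair are my own, not part of any library; the arithmetic is the standard UTF-16 encoding, and I am assuming this is what a narrow build effectively does with an astral escape in a literal.

```python
import sys


def is_narrow_build():
    """Return True on a narrow (UTF-16) build, which only exists before 3.3."""
    return sys.maxunicode == 0xFFFF


def to_surrogate_pair(codepoint):
    """Split an astral codepoint into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a codepoint on an astral plane")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x21111)
print("%#x %#x" % (high, low))  # 0xd844 0xdd11
```

This agrees with the pair \ud844\udd11 mentioned in the question, so on a narrow build the character class would presumably be read as the range \u1000-\ud844 plus the single unit \udd11.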

As Wiktor pointed out, my example was plain stupid since I forgot the u prefix on the second string literal. Python 2 therefore treats that literal as a byte string, in which \u1234 is not an escape sequence at all but six literal characters starting with a backslash. That explains why it looked as though the codepoint wasn't included in that range even after surrogate pairs were taken into account.
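A small sketch of what went wrong, written so that it should run on both 2.7 and 3.3+. The raw bytes literal is my stand-in for what Python 2 produces from the unprefixed "\u1234"; the variable names are just for illustration.

```python
import re

# Stand-in for the Python 2 value of the unprefixed literal "\u1234":
# the backslash is kept, so the string is six bytes long.
unprefixed = br"\u1234"
print(len(unprefixed))  # 6

pattern = re.compile(u"[\u1000-\U00021111]")

# re.match only looks at the start of the string, which is a backslash
# (U+005C) and therefore far below the lower end of the range.
first_char = unprefixed.decode("ascii")[0]
print(bool(pattern.match(first_char)))  # False

# With the u prefix the subject really is the single character U+1234.
print(bool(pattern.match(u"\u1234")))  # True
```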

As for intended behavior: since Python 3.3 the distinction based on build type should be obsolete. Treating each codepoint as a unit, no matter the plane, should be the way forward for Python 3. But backwards compatibility on narrow builds poses a conflicting goal for older versions.
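For what it's worth, a quick check of the "each codepoint is a unit" behavior, assuming Python 3.3 or later. The endpoints are the ones from the question; the test characters are ones I picked.

```python
import re

astral_range = re.compile(u"[\u1000-\U00021111]")

# An astral character counts as a single unit of length 1.
print(len(u"\U00012345"))  # 1

# Codepoints inside the range match, whether in the BMP or beyond it.
print(bool(astral_range.match(u"\u1234")))      # True
print(bool(astral_range.match(u"\U00020000")))  # True

# The codepoint just past the upper endpoint does not match.
print(bool(astral_range.match(u"\U00021112")))  # False
```

On a narrow 2.7 build I would expect the astral cases to come out differently, since both the pattern and the subject are split into surrogates there, but I have not verified that on every build.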