Grapheme support in Python regex


I'm using the awesome regex module, trying its \X grapheme support.

First, I try with the plain old .

>>> print regex.match('.', 'Ä').group(0)

>>> print regex.match('..', 'Ä').group(0)
Ä

That went as expected. Moving on to \X:

>>> print regex.match('\X', 'Ä').group(0)

>>> print regex.match('\X\X', 'Ä').group(0)
Ä

Why is it the same as .? Shouldn't a single \X be enough to capture the A-umlaut? Is it:

  • My understanding of grapheme or the meaning of \X is wrong?
  • Some flag/switch I need to turn on first? (I've searched the documentation but couldn't find one)
  • Something with my environment? (Python 2.7.3, pip reports regex==2014.12.24)
  • Bug in the library?
  • Something else?

There are 2 best solutions below

On BEST ANSWER

It works if you pass Ä as a unicode string (note the u prefix):

>>> print regex.match('.', u'Ä').group()
Ä
>>> print regex.match('\X', u'Ä').group()
Ä
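Both patterns print the same thing here most likely because the Ä was entered as a single precomposed code point (U+00C4). The two only differ when the character is stored as a base letter plus a combining mark, which is the case \X is meant for. A minimal sketch of that difference, using only the standard library (re and unicodedata rather than the regex module) and assuming the NFD form as the decomposed input:

```python
import re
import unicodedata

composed = u'\u00c4'                                 # 'Ä' as one code point
decomposed = unicodedata.normalize('NFD', composed)  # 'A' followed by U+0308

print(len(composed))    # 1
print(len(decomposed))  # 2

# '.' consumes a single code point, so on the decomposed form
# it stops after the base letter and leaves the combining mark behind.
print(re.match(u'.', decomposed).group(0) == u'A')  # True
```

On the decomposed form, \X from the regex module is intended to match the base letter and its combining mark together as one grapheme, which is where it stops behaving like the dot.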

The main difference between Python 2 and Python 3 is the basic types that exist to deal with text and bytes. On Python 3 there is one text type, str, which holds Unicode data, and two byte types, bytes and bytearray.

On Python 2, on the other hand, there are two text types: str, which is really a byte string and for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and unicode, which is equivalent to the Python 3 str type. There is also one byte type, bytearray, which was backported from Python 3.

Reference - https://docs.python.org/2/howto/unicode.html#python-2-x-s-unicode-support
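To make the byte/text distinction concrete, here is a small sketch of what the question's session was probably seeing. It assumes the terminal encoded Ä as UTF-8 (consistent with '..' matching it) and uses the standard re module instead of regex, since the dot behaves the same way for this purpose:

```python
import re

text = u'\u00c4'            # 'Ä' as Unicode text
raw = text.encode('utf-8')  # the same character as UTF-8 bytes: b'\xc3\x84'

print(len(text))  # 1
print(len(raw))   # 2

# On bytes, '.' matches one byte, which is only half of the character.
print(re.match(b'.', raw).group(0) == b'\xc3')  # True
print(re.match(b'..', raw).group(0) == raw)     # True

# On text, '.' matches the whole character.
print(re.match(u'.', text).group(0) == text)    # True
```

The lone first byte is not valid UTF-8 by itself, which would explain why the single-dot match in the question printed nothing readable.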

On

The problem is that in Python 2 string literals are byte strings by default, and graphemes make no sense on raw bytes: 'Ä' is seen as two separate bytes rather than one character. If you pass a unicode string instead, it works perfectly.

>>> print(regex.match('\X', 'Ä').group(0))

>>> print(regex.match('\X', u'Ä').group(0))
Ä

In Python 3 the default string type is Unicode; to get a byte string, prefix the literal with b, like this: b"mybytestring".
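If the input might arrive as bytes (for example, read from a file or socket), one option is to decode it before matching. A minimal sketch for Python 3: first_char is a hypothetical helper, not part of re or regex, and UTF-8 is an assumed default encoding:

```python
import re

def first_char(data, encoding='utf-8'):
    """Hypothetical helper: return the first code point of non-empty text.

    Accepts either str or encoded bytes; bytes are decoded first so the
    pattern sees characters rather than individual bytes.
    """
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return re.match('.', data, re.DOTALL).group(0)

print(first_char('\u00c4') == '\u00c4')      # True
print(first_char(b'\xc3\x84') == '\u00c4')   # True
```

The same decode-first approach should apply when the pattern is \X with the regex module, since it also needs text rather than bytes to reason about graphemes.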