Backslashes and escaping chars in Python vs Perl regexes


The goal is to tokenize text for NLP by porting a tokenizer from a Perl script to a Python script.

The main issue is the spurious backslashes that appear when we run the Python port of the tokenizer.

In Perl, the single quote and the ampersand are escaped like this:

my($text) = @_; # Take the text passed in as the first argument

$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize English contractions like "don't" -> "do n't".

$text =~ s/\'/\&apos;/g;  # Escape the single quote so that it suits XML.

Porting the regex literally into Python gives:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the single quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when the "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n\&apos;t funny

Escaping the ampersand somehow put a literal backslash into the output =(

To resolve that, I could do:

>>> escape_singquote = r"\'", r"&apos;" # escape the single quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n&apos;t funny

But seemingly, even without escaping the single quote in the Python pattern, we get the desired result too:

>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> contraction = r"n't", r" n't" # pad a space on the left when the "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the single quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
... 
this ai n't funny
this ai n&apos;t funny

Now that's puzzling...

Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Regexes in Perl and Python are not fully equivalent, right?

Best answer

In both Perl and Python, you have to escape the following regex metacharacters if you want to match them literally outside of a character class [1]:

{}[]()^$.|*+?\
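As a quick check of the list above on the Python side, the following sketch escapes each metacharacter with a backslash and confirms that the resulting pattern matches only that character. The names `METACHARS` and `matches_literally` are made up for illustration; only the standard `re` module is used.

```python
import re

# The metacharacters listed above, written as a plain (non-raw) string
# so that the trailing backslash can be represented.
METACHARS = "{}[]()^$.|*+?\\"

def matches_literally(ch):
    """Return True if backslash + ch is a pattern that matches exactly ch."""
    return re.fullmatch("\\" + ch, ch) is not None

all_ok = all(matches_literally(ch) for ch in METACHARS)

# An unescaped dot matches any character; an escaped dot matches only a dot.
unescaped_dot = re.findall(r".", "a.b")
escaped_dot = re.findall(r"\.", "a.b")

print(all_ok, unescaped_dot, escaped_dot)
```

When the text to match comes from a variable, `re.escape` does the escaping for you.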

Inside a character class, you have to escape metacharacters according to these rules [2]:

Char Perl                          Python
-------------------------------------------------------------
-    unless at beginning or end    unless at beginning or end
]    always                        unless at beginning
\    always                        always
^    only if at beginning          only if at beginning
$    always                        never
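The Python column of the table can be spot-checked with a few one-line patterns. This is only a sketch with made-up variable names; it does not exercise the Perl column.

```python
import re

# '-' at the beginning (or end) of a class is a literal hyphen.
hyphen = re.findall(r"[-a]", "a-b")

# ']' immediately after the opening '[' is a literal bracket in Python.
bracket = re.findall(r"[]x]", "x]y")

# A backslash must always be escaped, even inside a class.
backslash = re.findall(r"[\\]", "a\\b")

# '^' is only special as the first character of a class.
caret = re.findall(r"[a^]", "a^b")

# '$' has no special meaning inside a class.
dollar = re.findall(r"[$]", "a$b")

print(hyphen, bracket, backslash, caret, dollar)
```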

Note that neither the single quote ' nor the ampersand & needs to be escaped, whether inside or outside a character class.

However, both Perl and Python will ignore the backslash if you use it to escape a punctuation character that isn't a metacharacter (e.g. \' is equivalent to ' inside a regex).
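On the Python side, this is easy to confirm: the escaped and unescaped patterns find exactly the same matches. A minimal sketch, with illustrative variable names:

```python
import re

sent = "this ain't funny"

# Inside a regex pattern, \' and ' are the same thing.
plain_quote = re.findall(r"'", sent)
escaped_quote = re.findall(r"\'", sent)

# The same holds for the ampersand.
plain_amp = re.findall(r"&", "fish & chips")
escaped_amp = re.findall(r"\&", "fish & chips")

print(plain_quote == escaped_quote, plain_amp == escaped_amp)
```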


You seem to be getting tripped up by Python's raw strings:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

r"\'" is the string \' (literal backslash, literal single quote), while r'\&apos;' is the string \&apos; (literal backslash, literal ampersand, etc.).

So this:

re.sub(r"\'", r'\&apos;', text)

replaces all single quotes with the literal text \&apos;.
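The difference between the pattern and the replacement can be seen by inspecting the strings directly. The sketch below uses illustrative variable names; the sample sentence is the one from the question after the contraction step.

```python
import re

pattern = r"\'"            # two characters: backslash, single quote
replacement = r"\&apos;"   # backslash followed by the six characters &apos;

pattern_length = len(pattern)
replacement_length = len(replacement)

# In the pattern, the backslash is absorbed by the regex engine.
# In the replacement, \& is not a recognized escape, so the backslash
# is kept and ends up in the output.
result = re.sub(pattern, replacement, "this ai n't funny")

print(pattern_length, replacement_length)
print(result)
```

Without the r prefix, "\'" is just a one-character string containing a single quote, which is why the raw-string form behaves differently from what the Perl original suggests.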


Putting it all together, your Perl substitution is better written:

$text =~ s/'/&apos;/g;

And your Python substitution is better written:

re.sub(r"'", r'&apos;', text)
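Applied to the loop from the question, the corrected substitutions might look like the sketch below. The function name `tokenize_and_escape` and the `SUBSTITUTIONS` list are made up for illustration, and the contraction rule is taken from the question's Python port rather than from the full Perl tokenizer.

```python
import re

# (pattern, replacement) pairs, applied in order.
SUBSTITUTIONS = [
    (r"n't", r" n't"),   # pad a space before the "n't" contraction
    (r"'", r"&apos;"),   # escape the single quote for XML, no backslashes needed
]

def tokenize_and_escape(text):
    """Apply each substitution in turn and return the rewritten text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text

result = tokenize_and_escape("this ain't funny")
print(result)
```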

  1. Python 2, Python 3, and current versions of Perl treat non-escaped curly braces as literal curly braces if they aren't part of a quantifier. However, this will be a syntax error in future versions of Perl, and recent versions of Perl give a warning.

  2. See perlretut, perlre, and the Python docs for the re module.