Regex match invalid Unicode characters


I have strings like this:
ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ
and I want to filter out all the invalid characters that begin with a backslash, which I am trying to do with a regex in Python.

It does work like this:

re.sub(r",\u0f6e,", r",deleted,", s)

But not like this:

re.sub(r",\.{5},", r",deleted,", s)

It should work according to http://pythex.org, so I guess it's because they are invalid characters? How can I match them?

Edit: @metatoaster said my question is ambiguous: The problem seems to arise because the input string s is not a raw string.

>>> s = ' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'


Best answer

It seems you have a string with undefined Unicode codepoints. \u0f6e is a single code point represented as an escape code. Example:

>>> s = 'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> s
'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> print(s)
ꐊ,ꀵ,཮,ⴗ,ꦚ,⵵,ꢯ,⾌,꥽,⩱,ㇴ,⵮,鼺, Ꞁ

Note how printing the string shows these characters as undefined boxes; the repr displays them as escape codes because they are not printable. These code points have a few things in common. According to the Unicode database, they all fall in general category C ("Other"): the \uXXXX ones are unassigned (Cn) and \x00 is a control character (Cc). They also don't have names. A quick way to filter is:

>>> import unicodedata as ud
>>> ''.join(['deleted' if ud.category(c)[0] == 'C' else c for c in s])
'ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,deletedꞀ'
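The same idea can be wrapped in a small helper. This is only a sketch: the function name and the sample string are made up for illustration, and it assumes you really want to drop everything in category C, which also covers format characters (Cf), surrogates (Cs) and private-use code points (Co), not just unassigned ones.

```python
import unicodedata


def replace_other_category(text, replacement="deleted"):
    """Replace every code point whose general category starts with 'C'."""
    return "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in text
    )


# Hypothetical sample: \x00 is a control character (Cc) and
# \uffff is a permanent noncharacter, reported as unassigned (Cn).
sample = "a,\x00,\uffff,b"
categories = [unicodedata.category(ch) for ch in sample]
print(categories)
print(replace_other_category(sample))  # a,deleted,deleted,b
```

If only unassigned code points should go, comparing the full category against 'Cn' instead of checking the first letter may be closer to what you want.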

Second answer

I don't see how your first re.sub statement could have worked if your string really contains literal backslash sequences, as it would if it were defined like this:

>>> s = r' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,\\u0f6e,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'

Note how the first \u0f6e remains: re reads \u0f6e in the pattern as the single code point U+0F6E, which does not occur in this string. In a regex the \ character is itself special, so to match a literal backslash it must be escaped as \\. Now try:

>>> re.sub(r",\\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'
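As a rough illustration of why the two cases behave differently, the sketch below contrasts a normally written string with a raw one. The variable names are made up for this example, and it assumes Python 3.3 or later, where re accepts \uXXXX escapes inside a pattern.

```python
import re

decoded = ",\u0f6e,"   # escape is processed: one code point between the commas
literal = r",\u0f6e,"  # raw string: backslash, 'u' and four hex digits

lengths = (len(decoded), len(literal))
print(lengths)  # (3, 8)

# The pattern \u0f6e means "the code point U+0F6E"; \\u0f6e means
# "a backslash followed by the text u0f6e".
hit_decoded = re.sub(r",\u0f6e,", ",deleted,", decoded)
hit_literal = re.sub(r",\\u0f6e,", ",deleted,", literal)
miss_literal = re.sub(r",\u0f6e,", ",deleted,", literal)
print(hit_decoded, hit_literal, miss_literal == literal)  # ,deleted, ,deleted, True

# \. is a literal dot, so the pattern from the question looks for
# five dots between commas and matches neither string.
dots_decoded = re.search(r",\.{5},", decoded)
dots_literal = re.search(r",\.{5},", literal)
print(dots_decoded, dots_literal)  # None None
```

Which pattern you need therefore depends on whether your data holds the decoded character or the six-character escape text.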

To match the escape sequence itself and no more than necessary, note that the \u sequence is followed by exactly 4 hexadecimal digits (0-9, a-f). Instead of trying to match any 5 characters, be more specific, like:

>>> re.sub(r",\\u[0-9a-f]{4},", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,\\x00Ꞁ'
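The trailing \x00 survives because it has a different shape (two hex digits, and no comma after it). A possible extension, again assuming the string holds literal backslash text, is to match both forms without anchoring on the commas. The names ESCAPE_TEXT and strip_literal_escapes are hypothetical, and the shortened sample string is only for illustration.

```python
import re

# Literal "\u" plus 4 hex digits, or literal "\x" plus 2 hex digits.
ESCAPE_TEXT = re.compile(r"\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{2}")


def strip_literal_escapes(text, replacement="deleted"):
    """Replace literal \\uXXXX and \\xNN escape text in a string."""
    return ESCAPE_TEXT.sub(replacement, text)


s = r"ꐊ,ꀵ,\u0f6e,ⴗ,\x00Ꞁ"
result = strip_literal_escapes(s)
print(result)  # ꐊ,ꀵ,deleted,ⴗ,deletedꞀ
```

Dropping the commas from the pattern means an escape at the very start or end of the string is caught as well, at the cost of also matching escape text embedded inside a longer field.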

Note that this entire answer assumes the information you have given us is accurate and that the escape sequences in your string are literal backslashes followed by text, not already-decoded characters. It would help to update your question with code fragments like the ones here, so that we can copy, paste, and run your code to see what went wrong and correct it more easily.