Regex match invalid Unicode characters

Question

723 Views Asked by J Alan At 29 July 2025 at 05:58

I have strings like this:
ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ
and I want to filter out all the invalid characters that begin with a backslash, which I am trying to do with a regex in Python.

It does work like this:

re.sub(r",\u0f6e,", r",deleted,", s)

But not like this:

re.sub(r",\.{5},", r",deleted,", s)

It should work according to http://pythex.org, so I guess it's because they are invalid characters? How can I match them?

Edit: @metatoaster said my question is ambiguous: The problem seems to arise because the input string s is not a raw string.

>>> s = ' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'

There are 2 best solutions below

metatoaster On 08 November 2018 at 23:29

I don't see how your first re.sub statement would have worked, if your string was truly defined as is.

>>> s = r' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)                                        
' ꐊ,ꀵ,\\u0f6e,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'

Note how the first r'\u0f6e' remains. In regex, the \ character is also special so it must also be escaped. This can be done by using \\ instead. Now try:

>>> re.sub(r",\\u0f6e,", r",deleted,", s)                                       
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\\u2d75,ꢯ,⾌,\\ua97d,⩱,ㇴ,\\u2d6e,鼺,\\x00Ꞁ'
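To make the raw versus non-raw distinction concrete, here is a minimal sketch. The short strings `cooked` and `raw` are made up for illustration and are not the asker's data. It relies on the fact that Python's `re` module interprets `\uXXXX` escapes in patterns itself (Python 3.3 and later), which would explain why the unescaped pattern can match a non-raw string but never a raw one.

```python
import re

# Non-raw literal: Python turns \u0f6e into ONE code point (U+0F6E).
cooked = 'a,\u0f6e,b'
# Raw literal: backslash, 'u', '0', 'f', '6', 'e' stay SIX separate characters.
raw = r'a,\u0f6e,b'

print(len(cooked), len(raw))  # 5 10

# The regex engine resolves \u0f6e to the code point U+0F6E,
# so this pattern matches the non-raw string...
print(re.sub(r",\u0f6e,", ",deleted,", cooked))  # a,deleted,b

# ...but leaves the raw string untouched, because that string
# contains a literal backslash followed by "u0f6e" instead.
print(re.sub(r",\u0f6e,", ",deleted,", raw) == raw)  # True

# Escaping the backslash in the pattern matches the literal text.
print(re.sub(r",\\u0f6e,", ",deleted,", raw))  # a,deleted,b
```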

In order to match the actual expression and not more than necessary, note that the \\u sequence is followed by exactly 4 hexadecimal digits (0-9 and a-f). Instead of trying to match any 5 characters, be more specific. The pattern below uses + for brevity; {4} in its place would enforce the exact length:

>>> re.sub(r",\\u[0-9a-f]+,", r",deleted,", s)                                  
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,\\x00Ꞁ'
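Note that the `\\x00` at the end survives the substitution above. As a hedged variation, still assuming the input really contains literal backslashes, the sketch below pins the `\u` form to exactly four hex digits and adds an alternative for the two-digit `\x` form. The name `ESCAPE` is only illustrative, and the surrounding commas are dropped from the pattern so that an escape not enclosed by commas is caught too.

```python
import re

# Raw string: every escape below is literal text, not a code point.
s = r' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'

# \uXXXX takes exactly four hex digits, \xXX exactly two.
ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}|\\x[0-9a-fA-F]{2}")

cleaned = ESCAPE.sub("deleted", s)
print(cleaned)
```

Every textual escape, including the trailing `\x00`, is replaced, and no backslash remains in `cleaned`.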

Note that this entire answer assumes the information you have given us is correct, and that the escape sequences in your string really are literal text starting with a backslash character. It would be useful to update your question to include code fragments like the ones here, so it is less ambiguous what is being done: we can then copy-paste your code, run it to see what went wrong, and correct it more easily.

**Mark Tolonen** · Accepted Answer

It seems you have a string with undefined Unicode codepoints. \u0f6e is a single code point represented as an escape code. Example:

>>> s = 'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> s
'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> print(s)
ꐊ,ꀵ,཮,ⴗ,ꦚ,⵵,ꢯ,⾌,꥽,⩱,ㇴ,⵮,鼺, Ꞁ

Note how printing the string shows each such character as an undefined box, while the repr displays it as an escape code for debugging purposes. These code points have a few things in common. According to the Unicode database, they all belong to general category C ("Other"): \x00 is a control character (Cc) and the rest are unassigned code points (Cn). They also don't have names. A quick way to filter is:

>>> import unicodedata as ud
>>> ''.join(['deleted' if ud.category(c)[0] == 'C' else c for c in s])
'ꐊ,ꀵ,deleted,ⴗ,ꦚ,deleted,ꢯ,⾌,deleted,⩱,ㇴ,deleted,鼺,deletedꞀ'
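The one-liner above can be wrapped in a small helper and paired with a loop that shows why each character is dropped. This is only a sketch: `replace_other` is a hypothetical name, and the reported categories depend on the Unicode database version bundled with the Python in use. The standard `re` module has no `\p{...}` property classes, which is why the filtering goes through `unicodedata` here and not through a regex.

```python
import unicodedata as ud

# Non-raw string: each \uXXXX and \x00 here is a single code point.
s = 'ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'


def replace_other(text, replacement="deleted"):
    """Replace every code point whose general category starts with 'C'."""
    return "".join(
        replacement if ud.category(c).startswith("C") else c
        for c in text
    )


# List the code points that would be replaced, with category and name.
# ud.name() returns the given default when a code point has no name.
for c in s:
    cat = ud.category(c)
    if cat.startswith("C"):
        print(f"U+{ord(c):04X} {cat} {ud.name(c, '<no name>')}")

print(replace_other(s))
```

Passing a different `replacement`, for example an empty string, removes the characters instead of marking them.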
