I have strings like this:
ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ
and I want to filter out all these invalid characters beginning with a backslash, which I am trying to do with regex in Python.
It does work like this:
re.sub(r",\u0f6e,", r",deleted,", s)
But not like this:
re.sub(r",\.{5},", r",deleted,", s)
It should work according to http://pythex.org, so I guess it's because they are invalid characters? How can I match them?
Edit: @metatoaster said my question is ambiguous:
The problem seems to arise because the input string s is not a raw string, so each \uXXXX escape in the literal is already a single character by the time the regex sees it.
>>> s = ' ꐊ,ꀵ,\u0f6e,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
>>> re.sub(r",\u0f6e,", r",deleted,", s)
' ꐊ,ꀵ,deleted,ⴗ,ꦚ,\u2d75,ꢯ,⾌,\ua97d,⩱,ㇴ,\u2d6e,鼺,\x00Ꞁ'
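To make the difference concrete, here is a minimal sketch comparing a normal and a raw literal. The shortened sample strings and the pattern `,\\.{5},` are assumptions for illustration: the `\.{5}` in the question matches five literal dots, so this guesses that "a backslash followed by five characters" was the intent.

```python
import re

# Non-raw literal: Python turns \u0f6e into ONE code point (U+0F6E).
s = "ꐊ,\u0f6e,ⴗ"
# Raw literal: the same text stays as six characters: \ u 0 f 6 e
raw = r"ꐊ,\u0f6e,ⴗ"

print(len(s), len(raw))  # 5 10

# In regex syntax, \\ is a literal backslash and .{5} is any five characters.
pattern = r",\\.{5},"

# No backslash exists in s, so nothing matches and s is returned unchanged.
print(re.sub(pattern, ",deleted,", s) == s)  # True

# The raw string really contains a backslash, so the pattern matches.
print(re.sub(pattern, ",deleted,", raw))  # ꐊ,deleted,ⴗ
```

So a backslash-based pattern can only work if the data contains literal backslashes, for example when it was read from a file that stores the escapes as text.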
It seems you have a string with undefined Unicode codepoints.
\u0f6e is a single code point represented as an escape code. Printing the string shows the character as an undefined box. It is displayed as an escape code in the string's repr for debugging purposes.
These code points have a few things in common. According to the Unicode database, they are in general category C ("Other"), which covers control (Cc) and unassigned (Cn) code points, among others. They also don't have names. A quick way to filter is:
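One possible sketch, assuming the goal is to drop every code point whose general category starts with "C". It uses the standard-library `unicodedata` module; the helper name `strip_category_c` and the shortened sample string are made up for illustration.

```python
import unicodedata


def strip_category_c(text):
    """Return text without code points whose general category starts with 'C'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C")
    )


# Shortened sample: \u0f6e is expected to be unassigned (Cn), \x00 is a control (Cc).
s = "ꐊ,ꀵ,\u0f6e,ⴗ,\x00Ꞁ"
print(strip_category_c(s))
```

The result depends on the Unicode database bundled with your Python version, since a code point that is unassigned today may be assigned in a later Unicode release. Category C also includes format characters (Cf, such as the zero-width joiner) and private-use code points (Co), so if those need to survive, test for the specific categories `"Cc"` and `"Cn"` instead of the `"C"` prefix.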