How could I replace a section of Unicode?


How could I replace a section of Unicode, such as D800 through DB7F? I've asked the built-in Replit AI for some help, but it's not doing much. This is the best result:

text = "there's more code to link this var to a file, but that's not important"
newtext = text.replace(r'\uE000', '#').replace(r'\uF8FF', '#').replace(r'\uDC00', '#').replace(r'\uDFFF', '#').replace(r'\uD800', '#').replace(r'\uDB7F', '#').replace(r'\uDB80', '#').replace(r'\uDBFF', '#')

I've tried asking AI (Replit's built-in one), asking others in the Replit community, and searching on Google, but I've found nothing.

There is 1 best solution below

Use a regular expression to match a range:

import re

text = 'test\udc00\udc01\udc02\uf8ff\ue000test'

result = re.sub(r'[\ud800-\udfff\ue000\uf8ff]', '#', text)
print(result)

Output:

test#####test
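
Note that the character class above lists U+E000 and U+F8FF as two single characters. If the goal is instead to replace every code point in the ranges named in the question (the surrogates U+D800 to U+DFFF and the private use area U+E000 to U+F8FF), the two ranges are adjacent and can be written as one. A minimal sketch, assuming that is the intent; the sample string is made up for illustration:

```python
import re

# Assumption: the whole block U+D800..U+F8FF should be replaced
# (surrogates U+D800..U+DFFF plus private use area U+E000..U+F8FF).
pattern = re.compile(r'[\ud800-\uf8ff]')

# Hypothetical sample: two surrogates, then two private use characters
text = 'a\ud800\udfffb\ue123\uf000c'

result = pattern.sub('#', text)
print(result)  # a##b##c
```

Characters outside the range, such as U+F900, are left untouched.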

If your text actually contains literal Unicode escapes (a backslash, the letter u, and four hex digits) rather than the characters themselves, the code below first translates the escape codes to Unicode characters:

import re

text = r'test\udc00\udc01\udc02\uf8ff\ue000test' # literal escapes (raw string)
#text = 'test\udc00\udc01\udc02\uf8ff\ue000test'  # Unicode characters

# capture 4-digit hexadecimal escape and convert to Unicode
convert = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), text)

result = re.sub(r'[\ud800-\udfff\uf8ff\ue000]', '#', convert)
print(result)
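
If this is needed in more than one place, the two steps can be wrapped in a single helper. This is only a sketch: the name replace_escaped and its replacement parameter are hypothetical, and it assumes every escape in the input is exactly four hex digits after \u.

```python
import re

# Literal escape: a backslash, 'u', then four hex digits
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')
# Same character class as in the answer above
TARGET = re.compile(r'[\ud800-\udfff\uf8ff\ue000]')

def replace_escaped(text, replacement='#'):
    """Decode literal \\uXXXX escapes, then replace the targeted characters."""
    decoded = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    return TARGET.sub(replacement, decoded)

print(replace_escaped(r'test\udc00\udc01\udc02\uf8ff\ue000test'))  # test#####test
```

Text that already holds the real characters passes through the first step unchanged, so the helper works for both forms of input.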