How to use `regexp` to remove all characters that are not Chinese or English


Given ori_string, how can I use a regexp to remove every character that is not Chinese or English? Thanks!

ori_string<-"没a w t _ 中/国.sz"

The desired result is

  "没awt中国sz"

There are 2 best solutions below

0
JustSeen On BEST ANSWER

I wrote this in Python, since you didn't specify a language. The idea is as follows.

import re

def remove_non_english_chinese(text):
    # Match any character that is not an ASCII letter, a digit,
    # or a CJK Unified Ideograph (U+4E00 to U+9FFF)
    pattern = r'[^a-zA-Z0-9\u4e00-\u9fff]'

    # Delete every matched character, keeping English letters, digits, and Chinese characters
    return re.sub(pattern, '', text)
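
As a quick check against the sample string from the question, here is a minimal sketch of the same whitelist idea. It assumes that "Chinese" means the U+4E00 to U+9FFF range and that digits should be dropped too, since the question asks only for Chinese and English characters. The name `keep_english_chinese` is hypothetical and used only for illustration.

```python
import re

def keep_english_chinese(text):
    # Hypothetical helper: keep only ASCII letters and characters
    # in U+4E00..U+9FFF (an assumption about what counts as Chinese).
    return re.sub(r'[^a-zA-Z\u4e00-\u9fff]', '', text)

ori_string = "没a w t _ 中/国.sz"
result = keep_english_chinese(ori_string)
print(result)  # 没awt中国sz
```

If digits should survive as well, adding `0-9` back into the character class restores that behaviour.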
0
Edward On

It seems you want to remove punctuation and spaces:

> regex <- '[[:punct:][:space:]]+'
> gsub(regex, '', ori_string)
[1] "没awt中国sz"
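
For comparison, a rough Python analogue of this approach might look like the sketch below. Python's `re` module has no POSIX classes such as `[:punct:]`, so this assumes `[\W_]` (any non-word character, plus the underscore) is close enough for the sample input; it is not an exact equivalent of the R pattern.

```python
import re

ori_string = "没a w t _ 中/国.sz"

# [\W_] matches non-word characters and the underscore. For str patterns
# in Python 3, Chinese characters, letters and digits all count as word
# characters, so they are kept.
cleaned = re.sub(r'[\W_]+', '', ori_string)
print(cleaned)  # 没awt中国sz
```

Note the design difference: removing punctuation and spaces keeps digits and letters from other scripts, whereas a whitelist of allowed ranges removes everything outside them.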