I have a large pandas dataframe and would like to perform a thorough text cleaning on it. For this, I have crafted the below code, which evaluates whether a character is an emoji, a number, a Roman numeral, or a currency symbol, and replaces it with its unicode name from the unicodedata package.
The code uses a double for loop, though, and I believe there must be far more efficient solutions than that, but I haven't managed to figure out how I could implement it in a vectorized manner.
My current code is as follows:
from unicodedata import category, name as unicodename

def clean_text(text):
    newtext = ''
    for item in text:
        for char in item:
            # Simple space
            if char == ' ':
                newtext += char
            # Letters
            elif category(char)[0] == 'L':
                newtext += char
            # Other symbols: emojis
            elif category(char) == 'So':
                newtext += f" {unicodename(char)} "
            # Decimal numbers
            elif category(char) == 'Nd':
                newtext += f" {unicodename(char).replace('DIGIT ', '').lower()} "
            # Letterlike numbers e.g. Roman numerals
            elif category(char) == 'Nl':
                newtext += f" {unicodename(char)} "
            # Currency symbols
            elif category(char) == 'Sc':
                newtext += f" {unicodename(char).replace(' SIGN', '').lower()} "
            # Punctuation, invisibles (separator, control chars), maths symbols...
            else:
                newtext += " "
    return newtext
At the moment I am using this function on my dataframe with an apply:
df['Texts'] = df['Texts'].apply(lambda x: clean_text(x))
Sample data:
import pandas as pd

l = [
    "thumbs ups should be replaced: 👍👍👍",
    "hearts also should be replaced: ❤️️❤️️❤️️❤️️",
    "also other emojis: ☺️☺️",
    "numbers and digits should also go: 40/40",
    "Ⅰ, Ⅱ, Ⅲ these are roman numerals, change 'em"
]
df = pd.DataFrame(l, columns=['Texts'])
A good start would be to not do as much work: don't call category() and name() more times than you need to (lru_cache() does that for you).