I have a large pandas dataframe and would like to perform a thorough text cleaning on it. For this, I have crafted the below code, which evaluates whether a character is an emoji, a number, a Roman numeral, or a currency symbol, and replaces it with its unicode name from the unicodedata package.
The code uses a double for loop, though, and I believe there must be far more efficient solutions than that, but I haven't managed to figure out how I could implement it in a vectorized manner.
My current code is as follows:
from unicodedata import category, name as unicodename

def clean_text(text):
    newtext = ''
    for item in text:
        for char in item:
            # Simple space
            if char == ' ':
                newtext += char
            # Letters
            elif category(char)[0] == 'L':
                newtext += char
            # Other symbols: emojis
            elif category(char) == 'So':
                newtext += f" {unicodename(char)} "
            # Decimal numbers
            elif category(char) == 'Nd':
                newtext += f" {unicodename(char).replace('DIGIT ', '').lower()} "
            # Letterlike numbers e.g. Roman numerals
            elif category(char) == 'Nl':
                newtext += f" {unicodename(char)} "
            # Currency symbols
            elif category(char) == 'Sc':
                newtext += f" {unicodename(char).replace(' SIGN', '').lower()} "
            # Punctuation, invisibles (separator, control chars), maths symbols...
            else:
                newtext += " "
    return newtext
At the moment I am using this function on my dataframe with an apply:
df['Texts'] = df['Texts'].apply(lambda x: clean_text(x))
Sample data:
import pandas as pd

l = [
    "thumbs ups should be replaced: 👍👍👍",
    "hearts also should be replaced: ❤️️❤️️❤️️❤️️",
    "also other emojis: ☺️☺️",
    "numbers and digits should also go: 40/40",
    "Ⅰ, Ⅱ, Ⅲ these are roman numerals, change 'em"
]
df = pd.DataFrame(l, columns=['Texts'])
A good start would be to not do as much work: don't call category() and name() more times than you need to (lru_cache() does that for you).