How to improve the performance of pandas.apply() for text cleaning on a large pandas column?


I'm working on a tweet dataset where one column contains the text of the tweet. The following function cleans a tweet: it converts the text to lower case and removes punctuation, emojis, and stopwords. Each of these steps is a small utility function of its own.

def clean_text(text):
    text = text.lower().strip()
    text = remove_punct(text)
    text = remove_emoji(text)
    text = remove_stopwords(text)
    
    return text
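The helpers called by `clean_text` (`remove_punct`, `remove_emoji`, `remove_stopwords`) aren't shown in the question; a minimal sketch of what they might look like, with the punctuation table, emoji pattern, and stopword list all illustrative assumptions, could be:

```python
import re
import string

# Precompiled once at module level so they are not rebuilt for every row.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
STOPWORDS = {"a", "an", "the", "is", "and", "to", "of"}  # toy list for illustration

def remove_punct(text):
    # str.translate runs in C, so this is cheap per row
    return text.translate(PUNCT_TABLE)

def remove_emoji(text):
    # drop characters in common emoji/symbol code-point ranges
    return EMOJI_RE.sub("", text)

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```

A real pipeline would use a fuller stopword list (e.g. NLTK's) and a broader emoji pattern, but the shape is the same.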

I'm creating a new column for the cleaned text:

df['clean_text'] = df['text'].apply(clean_text)

This is becoming painfully slow as the dataset grows. numpy.where() gives a significant performance boost when filtering data, so I hoped something similar would apply here. How do I speed up the apply() call above, whether with map(), numpy.where(), or something else?
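One thing worth noting before parallelizing: numpy.where() won't help here, because these are per-string operations, not elementwise numeric ones. Most of the cost of an apply() like this is Python-level work done per row, so precompiling patterns once and doing a single pass per string often helps. A sketch under that assumption (the stopword list here is a toy one):

```python
import re
import string
import pandas as pd

STOPWORDS = {"the", "a", "is"}  # illustrative; use a real list in practice
STOP_RE = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, STOPWORDS)))
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    text = text.lower().strip()
    text = text.translate(PUNCT_TABLE)  # punctuation removal runs in C
    text = STOP_RE.sub("", text)        # one precompiled regex pass for stopwords
    return " ".join(text.split())       # collapse leftover whitespace

df = pd.DataFrame({"text": ["The Cat!! is here", "A dog"]})
# Series.map has the same per-row semantics as apply here,
# without the extra lambda indirection:
df["clean_text"] = df["text"].map(clean_text)
```

The win comes from compiling the regex and translation table once at module level instead of paying that setup cost inside every call.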

1 Answer
Pieter Geelen

If you don't want to tweak the function itself, you can use pandarallel to parallelize your apply: https://github.com/nalepae/pandarallel. It even gives you a nice progress bar :)
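Usage looks roughly like this. pandarallel is a third-party package (`pip install pandarallel`); this sketch falls back to a plain single-core `.apply()` when it isn't installed, and the `clean_text` here is a stand-in for the question's function:

```python
import pandas as pd

def clean_text(text):
    # stand-in for the question's full cleaning function
    return text.lower().strip()

try:
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)  # one worker process per core

    def run_clean(series):
        return series.parallel_apply(clean_text)
except ImportError:
    def run_clean(series):
        return series.apply(clean_text)  # single-core fallback

df = pd.DataFrame({"text": ["  Hello WORLD ", "Foo Bar"]})
df["clean_text"] = run_clean(df["text"])
```

Since the work is split across processes, this pays a serialization cost per chunk; it pays off when the per-row cleaning is genuinely CPU-heavy, as stopword/emoji stripping on a large corpus usually is.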