I'm working on a tweet dataset where one column holds the text of the tweet. The following function cleans a tweet: lower-casing, stripping whitespace, and removing punctuation, emojis, and stopwords, each handled by a small utility function.
def clean_text(text):
    text = text.lower().strip()
    text = remove_punct(text)
    text = remove_emoji(text)
    text = remove_stopwords(text)
    return text
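For reference, here are simplified stand-ins for those helpers (the stopword set and emoji ranges below are just placeholders, not my real ones):

import re
import string

# Placeholder stopword set; a real one would be much larger
# (e.g. NLTK's English stopword list).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "is", "in", "on", "of"}

# Rough emoji code-point ranges, enough to illustrate the idea.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")

def remove_punct(text):
    # Strip all ASCII punctuation characters.
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_emoji(text):
    return EMOJI_RE.sub('', text)

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w not in STOPWORDS)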
I'm creating a new column for the cleaned text:
df['clean_text'] = df['text'].apply(clean_text)
This is becoming painfully slow as the dataset grows. I know numpy.where() gives a significant performance boost when filtering data. How do I speed up the apply operation above, using map(), numpy.where(), or something else?
Note that numpy.where() won't help here: it vectorizes elementwise conditional selection, not arbitrary Python string functions like these. If you don't want to tweak the function itself, you can use pandarallel to parallelize your apply: https://github.com/nalepae/pandarallel. It even gives you a nice progress bar :)
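A minimal sketch of what that looks like, assuming clean_text is defined as in the question:

from pandarallel import pandarallel

# Initialize once per session; progress_bar=True enables the bars.
pandarallel.initialize(progress_bar=True)

# Same semantics as .apply(clean_text), but the rows are split
# across all available CPU cores.
df['clean_text'] = df['text'].parallel_apply(clean_text)

Since each row is an independent Python-level string operation, the speedup scales roughly with the number of cores.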