I'm working on a tweet dataset where one column holds the text of the tweet. The following function cleans a tweet: lower-casing, stripping whitespace, and removing punctuation, emojis, and stopwords, each handled by a small utility function.
def clean_text(text):
    text = text.lower().strip()
    text = remove_punct(text)
    text = remove_emoji(text)
    text = remove_stopwords(text)
    return text
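For reference, here are simplified stand-ins for those helpers (the stopword set and emoji ranges below are just placeholders, not my real ones):

import re
import string

# Placeholder stopword set; a real one would be much larger
# (e.g. NLTK's English stopword list).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "is", "in", "on", "of"}

# Rough emoji code-point ranges, enough to illustrate the idea.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")

def remove_punct(text):
    # Strip all ASCII punctuation characters.
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_emoji(text):
    return EMOJI_RE.sub('', text)

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w not in STOPWORDS)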
I'm creating a new column for the cleaned text:
df['clean_text'] = df['text'].apply(clean_text)
This is becoming painfully slow as the dataset grows. I know numpy.where() gives a significant performance boost when filtering data. How do I speed up the apply operation above, using map(), numpy.where(), or something else?
Note that numpy.where() won't help here: it vectorizes elementwise conditional selection, not arbitrary Python string functions like these. If you don't want to tweak the function itself, you can use pandarallel to parallelize your apply: https://github.com/nalepae/pandarallel. It even gives you a nice progress bar :)
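A minimal sketch of what that looks like, assuming clean_text is defined as in the question:

from pandarallel import pandarallel

# Initialize once per session; progress_bar=True enables the bars.
pandarallel.initialize(progress_bar=True)

# Same semantics as .apply(clean_text), but the rows are split
# across all available CPU cores.
df['clean_text'] = df['text'].parallel_apply(clean_text)

Since each row is an independent Python-level string operation, the speedup scales roughly with the number of cores.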