I ran the following in a Jupyter notebook and was disappointed to find that the equivalent Pandas code is faster. I'm hoping someone can show a smarter approach in Polars.
POLARS VERSION

```python
import re

import polars as pl

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)  # normalize right single quotes
    sentence = re.sub(r" +", " ", sentence)        # collapse runs of spaces
    return sentence.strip()

# Expr.apply runs the Python UDF once per element
df = df.with_columns(pl.col("text").apply(lambda x: cleanse_text(x)).keep_name())
```
PANDAS VERSION

```python
import re

import pandas as pd

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)  # normalize right single quotes
    sentence = re.sub(r" +", " ", sentence)        # collapse runs of spaces
    return sentence.strip()

df["text"] = df["text"].apply(lambda x: cleanse_text(x))
```
The above Pandas version was 10% faster than the Polars version when I ran this on a dataframe with 750,000 rows of text.
Instead of combining `Series.apply` with `re.sub`, you can chain two instances of `Series.str.replace` and finish with `Series.str.strip`. This will generally be faster (see the end of this answer as to why), but particularly so for `polars`.

Pandas version
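A minimal sketch of the chained approach, assuming a `df` with a `"text"` column as in the question:

```python
import pandas as pd

# Chain vectorized string methods instead of calling a Python UDF per row.
df["text"] = (
    df["text"]
    .str.replace("\u2019", "'", regex=False)  # normalize right single quotes
    .str.replace(r" +", " ", regex=True)      # collapse runs of spaces
    .str.strip()
)
```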
Polars version
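A corresponding sketch in Polars (method names follow recent releases: `str.replace_all` matches every occurrence, and `str.strip_chars` is the current name of the older `str.strip`):

```python
import polars as pl

# One vectorized expression over the whole column, executed in native code.
df = df.with_columns(
    pl.col("text")
    .str.replace_all("\u2019", "'", literal=True)  # normalize right single quotes
    .str.replace_all(r" +", " ")                   # collapse runs of spaces
    .str.strip_chars()
)
```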
Performance comparison
Results of a `timeit` test for each method (the resulting `df`s were checked for equality):
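For reference, a sketch of the kind of harness used; the sample data, `number`, and function names here are assumptions, and actual timings will vary by machine and data:

```python
import timeit

import pandas as pd
import polars as pl

# Hypothetical sample standing in for the asker's 750,000 rows of text.
texts = ["It\u2019s  a   test  "] * 750_000
pd_df = pd.DataFrame({"text": texts})
pl_df = pl.DataFrame({"text": texts})

def pandas_chained():
    return (pd_df["text"]
            .str.replace("\u2019", "'", regex=False)
            .str.replace(r" +", " ", regex=True)
            .str.strip())

def polars_chained():
    return pl_df.with_columns(
        pl.col("text")
        .str.replace_all("\u2019", "'", literal=True)
        .str.replace_all(r" +", " ")
        .str.strip_chars()
    )

print("pandas:", timeit.timeit(pandas_chained, number=10))
print("polars:", timeit.timeit(polars_chained, number=10))
```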
As you can see, both new methods for `pandas` and `polars` are faster than the original methods, and the `polars` method is the clear winner, taking only 13.8% of the time of the new `pandas` method.

So, why is `Series.str.replace` (or `.str.strip`) so much faster than `Series.apply`? The reason is that the former performs an operation on an entire Series (i.e. a "column") all at once ("vectorization"), while the latter calls a Python function for each element in the Series separately. E.g. `lambda x: cleanse_text(x)` means: apply a UDF (user-defined function) to the 1st element in the column, the 2nd element in the column, etc. On larger sets, this makes a huge difference. Cf. also the documentation for `pl.DataFrame.apply`.
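To make the distinction concrete, a small sketch (in newer Polars, `apply` has been renamed `map_elements`; the data here is illustrative):

```python
import polars as pl

s = pl.Series("text", ["a\u2019b", "c  d"])

# Element-wise: the lambda crosses the Python boundary once per element.
elementwise = s.map_elements(lambda x: x.replace("\u2019", "'"), return_dtype=pl.Utf8)

# Vectorized: one expression runs over the whole column in native code.
vectorized = s.str.replace_all("\u2019", "'", literal=True)

assert elementwise.to_list() == vectorized.to_list()
```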