I ran the following in a Jupyter notebook and was disappointed to find that the equivalent Pandas code is faster. I'm hoping someone can show a smarter approach in Polars.
POLARS VERSION

```python
import re

import polars as pl

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)  # normalize right single quotes
    sentence = re.sub(r" +", " ", sentence)        # collapse runs of spaces
    return sentence.strip()

# Expr.apply runs the Python UDF once per element
df = df.with_columns(pl.col("text").apply(lambda x: cleanse_text(x)).keep_name())
```
PANDAS VERSION

```python
import re

import pandas as pd

def cleanse_text(sentence):
    RIGHT_QUOTE = r"(\u2019)"
    sentence = re.sub(RIGHT_QUOTE, "'", sentence)  # normalize right single quotes
    sentence = re.sub(r" +", " ", sentence)        # collapse runs of spaces
    return sentence.strip()

df["text"] = df["text"].apply(lambda x: cleanse_text(x))
```
The above Pandas version was 10% faster than the Polars version when I ran this on a dataframe with 750,000 rows of text.
Instead of combining `Series.apply` with `re.sub`, you can chain two instances of `Series.str.replace` and finish with `Series.str.strip`. This will generally be faster (see the end of this answer as to why), but particularly so for `polars`.

Pandas version
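A minimal sketch of the chained approach, assuming a `df` with a `"text"` column as in the question:

```python
import pandas as pd

# Chain vectorized string methods instead of calling a Python UDF per row.
df["text"] = (
    df["text"]
    .str.replace("\u2019", "'", regex=False)  # normalize right single quotes
    .str.replace(r" +", " ", regex=True)      # collapse runs of spaces
    .str.strip()
)
```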
Polars version
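A corresponding sketch in Polars (method names follow recent releases: `str.replace_all` matches every occurrence, and `str.strip_chars` is the current name of the older `str.strip`):

```python
import polars as pl

# One vectorized expression over the whole column, executed in native code.
df = df.with_columns(
    pl.col("text")
    .str.replace_all("\u2019", "'", literal=True)  # normalize right single quotes
    .str.replace_all(r" +", " ")                   # collapse runs of spaces
    .str.strip_chars()
)
```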
Performance comparison
Results of a `timeit` test for each method (the resulting `df`s were checked for equality):
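For reference, a sketch of the kind of harness used; the sample data, `number`, and function names here are assumptions, and actual timings will vary by machine and data:

```python
import timeit

import pandas as pd
import polars as pl

# Hypothetical sample standing in for the asker's 750,000 rows of text.
texts = ["It\u2019s  a   test  "] * 750_000
pd_df = pd.DataFrame({"text": texts})
pl_df = pl.DataFrame({"text": texts})

def pandas_chained():
    return (pd_df["text"]
            .str.replace("\u2019", "'", regex=False)
            .str.replace(r" +", " ", regex=True)
            .str.strip())

def polars_chained():
    return pl_df.with_columns(
        pl.col("text")
        .str.replace_all("\u2019", "'", literal=True)
        .str.replace_all(r" +", " ")
        .str.strip_chars()
    )

print("pandas:", timeit.timeit(pandas_chained, number=10))
print("polars:", timeit.timeit(polars_chained, number=10))
```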
As you can see, both new methods for `pandas` and `polars` are faster than the original methods, and the `polars` method is the clear winner, taking only 13.8% of the time of the new `pandas` method.

So, why is `Series.str.replace` (or `.str.strip`) so much faster than `Series.apply`? The reason is that the former performs an operation on an entire Series (i.e. a "column") all at once ("vectorization"), while the latter calls a Python function for each element in the Series separately. E.g. `lambda x: cleanse_text(x)` means: apply a UDF (user-defined function) to the 1st element in the column, the 2nd element in the column, etc. On larger sets, this makes a huge difference. Cf. also the documentation for `pl.DataFrame.apply`.
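To make the distinction concrete, a small sketch (in newer Polars, `apply` has been renamed `map_elements`; the data here is illustrative):

```python
import polars as pl

s = pl.Series("text", ["a\u2019b", "c  d"])

# Element-wise: the lambda crosses the Python boundary once per element.
elementwise = s.map_elements(lambda x: x.replace("\u2019", "'"), return_dtype=pl.Utf8)

# Vectorized: one expression runs over the whole column in native code.
vectorized = s.str.replace_all("\u2019", "'", literal=True)

assert elementwise.to_list() == vectorized.to_list()
```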