Pandas resample().apply() with custom function very slow

I have a pandas Series at business-day frequency, and I want to resample it to weekly frequency, taking the product of the 5 daily values in each week.

Some dummy data:

import numpy as np
import pandas as pd

dates = pd.bdate_range('2000-01-01', '2022-12-31')
s = pd.Series(np.random.uniform(size=len(dates)), index=dates)

# randomly assign NaNs (positions are drawn with replacement, so some repeat)
mask = np.random.randint(0, len(dates), round(len(dates)*.9))
s.iloc[mask] = np.nan

Note that the majority of this Series is NaN.

The simple .prod method called after .resample is fast:

%timeit s.resample('W-FRI').prod()
10.2 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But I have to be precise when taking the product: I need to pass min_count=1 to prod, so that a week containing only NaNs yields NaN instead of 1. That's when it becomes very slow:

%timeit s.resample('W-FRI').apply(lambda x: x.prod(min_count=1))
69.1 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
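
To illustrate why min_count=1 matters here, a minimal sketch on made-up data (the `demo` series and its dates are purely illustrative and not part of the data above). The second week contains only NaNs:

```python
import numpy as np
import pandas as pd

# Two business weeks starting Monday 2022-01-03 (illustrative dates).
idx = pd.bdate_range("2022-01-03", periods=10)
# Week 1 holds a single valid value; week 2 is entirely NaN.
demo = pd.Series([np.nan, 0.5, np.nan, np.nan, np.nan] + [np.nan] * 5, index=idx)

# Default behaviour: NaNs are skipped, and the product of "nothing" is 1.0.
default_prod = demo.resample("W-FRI").prod()

# With min_count=1, a week without any valid value becomes NaN instead.
strict_prod = demo.resample("W-FRI").apply(lambda x: x.prod(min_count=1))

print(default_prod)
print(strict_prod)
```

The two results agree for the first week and differ only for the all-NaN week.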

I think the problem is not specific to prod but generalizes to any comparison between the aggregations pandas recognizes natively and custom functions passed to .apply().

How do I achieve performance similar to .resample().prod() while still using min_count=1?

There is 1 answer below

Until I saw Trenton McKinney's comment, I was going to propose:

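def f(rs, min_count=0):
    # rs is the Resampler; .prod() and .count() are both fast built-in aggregations
    res = rs.prod()
    # blank out bins that have fewer than min_count non-NaN values
    res[rs.count() < min_count] = np.nan
    return res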

%timeit f(s.resample('W-FRI'), min_count=1)
# same timing as s.resample('W-FRI').prod()

But Trenton's suggestion is far better:

s.resample('W-FRI').prod(min_count=1)
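
As a sanity check, here is a small sketch comparing the three variants on data built like the question's. The seed, the shorter date range, and the forced all-NaN week are my own assumptions, added only so the check is reproducible and actually exercises the min_count case:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
dates = pd.bdate_range("2020-01-01", "2020-12-31")
s = pd.Series(rng.uniform(size=len(dates)), index=dates)
s.iloc[rng.integers(0, len(dates), round(len(dates) * 0.9))] = np.nan
# Force one week (Mon 2020-03-02 to Fri 2020-03-06) to be entirely NaN.
s.loc["2020-03-02":"2020-03-06"] = np.nan

def f(rs, min_count=0):
    res = rs.prod()
    res[rs.count() < min_count] = np.nan
    return res

rs = s.resample("W-FRI")
builtin = rs.prod(min_count=1)                   # built-in argument
helper = f(rs, min_count=1)                      # prod + count workaround
slow = rs.apply(lambda x: x.prod(min_count=1))   # original .apply() version

# Compare values, treating NaNs in the same positions as equal.
same_as_helper = np.allclose(builtin.to_numpy(dtype=float),
                             helper.to_numpy(dtype=float), equal_nan=True)
same_as_apply = np.allclose(builtin.to_numpy(dtype=float),
                            slow.to_numpy(dtype=float), equal_nan=True)
print(same_as_helper, same_as_apply)
```

If both flags are true, the fast variants can stand in for the .apply() version on this data.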

I'm only mentioning this for other cases where one would be tempted to use .apply(): calling a couple of the resampling object's built-in aggregation methods and combining the results is usually faster.
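
A hedged sketch of that pattern for a different aggregation: a weekly mean that is kept only when a week has at least two valid observations. The threshold of 2, the seed, and the date range are made up for illustration. The idea is to combine two built-in aggregations (mean and count) rather than pass a lambda to .apply():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
dates = pd.bdate_range("2020-01-01", "2020-12-31")
s = pd.Series(rng.uniform(size=len(dates)), index=dates)
s.iloc[rng.integers(0, len(dates), round(len(dates) * 0.9))] = np.nan

rs = s.resample("W-FRI")
min_obs = 2  # hypothetical threshold, chosen for illustration

# Two built-in aggregations, combined with a vectorised mask.
fast = rs.mean().where(rs.count() >= min_obs)

# The equivalent custom function, which runs Python code once per week.
slow = rs.apply(lambda x: x.mean() if x.count() >= min_obs else np.nan)

results_match = np.allclose(fast.to_numpy(dtype=float),
                            slow.to_numpy(dtype=float), equal_nan=True)
print(results_match)
```

Timing both versions with %timeit on your own data is the only reliable way to see how large the difference is.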