I have a pandas Series at business-day frequency, and I want to resample it to weekly frequency by taking the product of the (up to) five business days in each week.
Some dummy data:
import numpy as np
import pandas as pd

dates = pd.bdate_range('2000-01-01', '2022-12-31')
s = pd.Series(np.random.uniform(size=len(dates)), index=dates)
# randomly assign NaN's
mask = np.random.randint(0, len(dates), round(len(dates)*.9))
s.iloc[mask] = np.nan
Notice that the majority of this Series is NaN.
The simple .prod method called after .resample is fast:
%timeit s.resample('W-FRI').prod()
10.2 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But I have to be precise when taking the product: I want to pass min_count=1 when calling .prod(), and that's when it becomes very slow:
%timeit s.resample('W-FRI').apply(lambda x: x.prod(min_count=1))
69.1 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think the problem is not specific to .prod, but generalizes to any comparison between pandas-recognized aggregation functions and custom functions passed to .apply().
How do I achieve performance similar to .resample().prod() while still honoring the min_count=1 argument?
Until I saw Trenton McKinney's comment, I was going to propose:
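A sketch of that approach (reconstructed from the description below, so the exact form may differ from the original): reuse the same resampler twice, once for the fast vectorized product and once for a count that masks out the all-NaN weeks, which is exactly what min_count=1 asks for.

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range('2000-01-01', '2022-12-31')
rng = np.random.default_rng(0)
s = pd.Series(rng.uniform(size=len(dates)), index=dates)
s.iloc[rng.integers(0, len(dates), round(len(dates) * .9))] = np.nan

# Two passes over the same resampler: a fast built-in product,
# then NaN out the weeks that contain no observations at all.
r = s.resample('W-FRI')
weekly = r.prod().where(r.count() > 0)
```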
But Trenton's suggestion is far better:
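Presumably the point of the comment (the original code is not shown here) is that the resampler's own .prod() already accepts min_count, so no .apply() is needed at all:

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range('2000-01-01', '2022-12-31')
rng = np.random.default_rng(0)
s = pd.Series(rng.uniform(size=len(dates)), index=dates)
s.iloc[rng.integers(0, len(dates), round(len(dates) * .9))] = np.nan

# min_count is supported directly by the resampler's .prod(),
# keeping the fast cythonized path.
weekly = s.resample('W-FRI').prod(min_count=1)
```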
I'm only mentioning this for other cases where one would be tempted to use .apply(), but where using the resampling object a couple of times with built-in numpy functions is faster.