I'm running a Monte Carlo simulation. Part of the calculation requires applying a function to a rolling window for each simulation. However, I don't know how to do that efficiently. I'm concerned that this may be a duplicate post, but I was unable to find another like this.
My minimal reproducible example is this:
import pandas as pd
import numpy as np
from scipy.stats import norm
# Number of simulations
trials = 10000
# Generate random variables
df1 = pd.DataFrame(norm.rvs(size = (500, trials)))
f = lambda x: np.sum(x > 0) > 20
# Make a deep copy of df1
df2 = df1.copy(deep = True)
for col in df2.columns:
    df2[col] = df2[col].rolling(window = 30).apply(f)
Is there a way to write this without a for-loop or list comprehension? Since the simulations are columns, df2 would ideally have at least 10,000 columns. Having a data frame that's the transpose of this would also be fine. Within my code, this part takes about 100 times longer than the next longest process in my simulation.
Yours is not exactly a minimal example. So we first see what happens with a minimal example, and then we test performance on the full one.
Data
Min example
Here I both reduce the amount of data and change your function so that it works with less data
where df_min is
Run minimal example
Using apply
and the output is
Avoid apply
Set df_min = df_min_bk.copy(). Using built-in functions, we can rewrite the same function as
This is almost 3x faster than the previous case, and the output is still
which is fine if we remember that the first n-1 rows of a rolling window should be NaN.
Avoid loop columns
Set df_min = df_min_bk.copy() again. We can use the previous function without looping over columns
This is almost 2x faster than the previous case and 6x faster than the apply one. The output is the same as in the previous example.
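The three variants discussed so far can be sketched roughly as follows. This is only an illustration under my own assumptions: the frame size (12 rows, 4 simulations), the window of 3, the threshold of 1 and the names df_apply, df_builtin and df_whole are hypothetical, not the values used above.

```python
import numpy as np
import pandas as pd

# Hypothetical small setup: 12 rows, 4 simulations, window 3, threshold 1
rng = np.random.default_rng(0)
df_min_bk = pd.DataFrame(rng.standard_normal((12, 4)))
window, threshold = 3, 1

# Variant 1: rolling apply, one column at a time
df_apply = df_min_bk.copy()
for col in df_apply.columns:
    df_apply[col] = df_apply[col].rolling(window=window).apply(
        lambda x: float(np.sum(x > 0) > threshold), raw=True
    )

# Variant 2: built-in rolling sum of the positive flags, still one column at a time
df_builtin = pd.DataFrame({
    col: df_min_bk[col].gt(0).astype(float).rolling(window=window).sum().gt(threshold)
    for col in df_min_bk.columns
})

# Variant 3: the same built-in expression applied to the whole frame at once
df_whole = df_min_bk.gt(0).astype(float).rolling(window=window).sum().gt(threshold)

print(df_apply.head())
print(df_whole.head())
```

Note that in the first window - 1 rows the apply variant holds NaN, while the comparison in the other two variants turns those NaN sums into False.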
Full Example
This takes less than a second, while the apply version that loops through the columns takes several minutes.
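A minimal sketch of the whole-frame approach at the size used in the question (500 rows, 10,000 simulations, window of 30, more than 20 positive values). I draw the normals with NumPy's generator instead of scipy.stats.norm to keep the snippet dependency-light; that swap and the fixed seed are my assumptions.

```python
import numpy as np
import pandas as pd

trials = 10000
rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# 500 observations per simulation, one simulation per column
df1 = pd.DataFrame(rng.standard_normal((500, trials)))

# Flag positive values, count them over a 30-row window, compare with 20
df2 = df1.gt(0).astype(float).rolling(window=30).sum().gt(20)

print(df2.shape)
```

Here the first 29 rows of df2 are False rather than NaN, because the comparison with 20 maps the NaN rolling sums to False.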
Timing apply
Compared to the previous method the speedup is 820x.
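To get a feeling for the gap on your own machine, a rough timing comparison could look like the sketch below. It uses only 50 columns so that the apply run stays short; that size, the seed and the helper names with_apply and vectorized are my assumptions, and the ratio you measure will depend on hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Only 50 simulations here (an assumption) to keep the apply run short
df = pd.DataFrame(rng.standard_normal((500, 50)))


def with_apply(frame):
    """Rolling apply, looping over the columns."""
    out = frame.copy()
    for col in out.columns:
        out[col] = out[col].rolling(window=30).apply(
            lambda x: float(np.sum(x > 0) > 20), raw=True
        )
    return out


def vectorized(frame):
    """Built-in rolling sum on the whole frame."""
    return frame.gt(0).astype(float).rolling(window=30).sum().gt(20)


start = time.perf_counter()
res_apply = with_apply(df)
t_apply = time.perf_counter() - start

start = time.perf_counter()
res_vec = vectorized(df)
t_vec = time.perf_counter() - start

print(f"apply: {t_apply:.4f}s, vectorized: {t_vec:.4f}s, ratio: {t_apply / t_vec:.1f}x")
```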
Conclusion
Play first with a small amount of data you can visualize, then with a few full columns, and finally with all the data.