I have a big dataframe, 100,000,000 * 50(about 4G)
I want to calculate rolling window's weighted average like this:
#df shape is (100,000,000 * 50)
from functools import partial
window_size=[1,2,3,4,5,6]
for i in window_size:
df['triangle_mv_%d'%(i)] = df['mid'].diff(1).rolling(i).apply(partial(np.average, weights=range(i)))
I found it quite slow, for one loop, it cost more than 15 min.
I cant understand this, because rolling(i).mean is quite fast, i just call apply weighted_avg, how it can be this slow?
I also goolge a lot, some ref told me to rewrite the weighted_avg function for rolling like this:
import pandas as pd
import numpy as np
from pandas.core.window.rolling import _flex_binary_moment, _Rolling_and_Expanding
def weighted_mean(self, weights, **kwargs):
weights = self._shallow_copy(weights)
window = self._get_window(weights)
def _get_weighted_mean(X, Y):
X = X.astype('float64')
Y = Y.astype('float64')
sum_f = lambda x: x.rolling(window, self.min_periods, center=self.center).sum(**kwargs)
print(X)
print(Y)
return sum_f(X * Y) / sum_f(Y)
return _flex_binary_moment(self._selected_obj, weights._selected_obj,
_get_weighted_mean, pairwise=True)
_Rolling_and_Expanding.weighted_mean = weighted_mean
df = pd.DataFrame(np.reshape(range(25), (5,5)))
print(df[1].rolling(2).weighted_mean(pd.Series([1,2]))) # this is wrong, expected result should have 4 values, but there is only one valid values in output like this [NAN, 4.333, NAN, NAN, NAN]
Can anyone help on this? how can i implement this function fastly? and why apply method is so slow?