why rolling apply is so slow?

263 Views Asked by At

I have a big dataframe, 100,000,000 * 50(about 4G)

I want to calculate rolling window's weighted average like this:

#df shape is (100,000,000 * 50)
from functools import partial
window_size=[1,2,3,4,5,6]
for i in window_size:
    df['triangle_mv_%d'%(i)] = df['mid'].diff(1).rolling(i).apply(partial(np.average, weights=range(i)))

I found it quite slow, for one loop, it cost more than 15 min.

I cant understand this, because rolling(i).mean is quite fast, i just call apply weighted_avg, how it can be this slow?

I also goolge a lot, some ref told me to rewrite the weighted_avg function for rolling like this:

import pandas as pd
import numpy as np
from pandas.core.window.rolling import _flex_binary_moment, _Rolling_and_Expanding

def weighted_mean(self, weights, **kwargs):
    weights = self._shallow_copy(weights)
    window = self._get_window(weights)

    def _get_weighted_mean(X, Y): 
        X = X.astype('float64')
        Y = Y.astype('float64')
        sum_f = lambda x: x.rolling(window, self.min_periods, center=self.center).sum(**kwargs)
        print(X)
        print(Y)
        return sum_f(X * Y) / sum_f(Y)

    return _flex_binary_moment(self._selected_obj, weights._selected_obj,
                               _get_weighted_mean, pairwise=True)

_Rolling_and_Expanding.weighted_mean = weighted_mean

df = pd.DataFrame(np.reshape(range(25), (5,5)))

print(df[1].rolling(2).weighted_mean(pd.Series([1,2])))  # this is wrong, expected result should have 4 values, but there is only one valid values in output like this [NAN, 4.333, NAN, NAN, NAN]

Can anyone help on this? how can i implement this function fastly? and why apply method is so slow?

0

There are 0 best solutions below