Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

86 Views Asked by At

I'm trying to create a rolling function that:

  1. Divides two DataFrames with 3 columns in each df.
  2. Calculate the mean of each row from the output in step 1.
  3. Sums the averages from step 2.

This could be done by using pd.iterrows() hence looping through each row. However, this would be inefficient when working with larger datasets. Therefore, my objective is to create a pd.rolling function that could do this much faster.

What I would need help with is to understand why my approach below returns multiple values while the function I'm using only returns a single value.

EDIT : I have updated the question with the code that produces my desired output.

This is the test dataset I'm working with:

#import libraries
import pandas as pd
import numpy as np 

#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
        'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
        "column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
        }

df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)

    column1  column2  column3
0         7        1        3
1         2        5        6
2         3        2        3
3         1        4        9
4         3        1        7
5         2        5        1
6         5        5        2
7         3        3        3
8         2        1        7
9         4        5        5
10        6        3        4
11        8        5        1
12        1        8        4
13        3        1        2
14        7        6        9
15        3        4        6
16        7        2        5
17        2        3        1
18        6        9        4
19        3        1        1
20        8        4        3
   0  1  2
0  2  3  4
1  3  4  1
2  3  6  1

One method to achieve my desired output by looping through each row:

RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values)-1)*100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
        
        
            
        #printing my desired output values
        print(RunningSum)
[330.42328042328046,
 212.0899470899471,
 152.06349206349208,
 205.55555555555554,
 311.9047619047619,
 209.1269841269841,
 197.61904761904765,
 116.94444444444444,
 149.72222222222223,
 430.0,
 219.51058201058203,
 215.34391534391537,
 199.15343915343914,
 159.6031746031746,
 127.6984126984127,
 326.85185185185185,
 204.16666666666669]

However, this would be timely when working with large datasets. Therefore, I've tried to create a function which applies to a pd.rolling() object.

def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop="True")
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)

The problem here is that my function returns multiple output. How can I solve this?

print(RunningSum)
     column1   column2   column3
0        NaN       NaN       NaN
1        NaN       NaN       NaN
2   3.214286  4.533333  2.277778
3   4.777778  3.200000  2.111111
4   5.888889  4.416667  1.656085
5   5.111111  5.400000  2.915344
6   3.455556  3.933333  5.714286
7   2.866667  2.066667  5.500000
8   2.977778  3.977778  3.063492
9   3.555556  5.622222  1.907937
10  2.750000  4.200000  1.747619
11  1.638889  2.377778  3.616667
12  2.986111  2.005556  5.500000
13  5.333333  3.075000  4.750000
14  4.396825  5.000000  3.055556
15  2.174603  3.888889  2.148148
16  2.111111  2.527778  1.418519
17  2.507937  3.500000  3.311111
18  2.880952  3.000000  5.366667
19  2.722222  3.370370  5.750000
20  2.138889  5.129630  5.666667
1

There are 1 best solutions below

1
On

After reordering of operations, your calculations can be simplified

BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)

print(np.around(result[4:], 3))

Outputs:

4     5.508
5     4.200
6     2.400
7     3.000
...

if you dont want to calculate anything before index 4 then change:

df1.iloc[4:].rdiv(...