How to combine groupby, rolling and multiple columns' creation in Python

670 Views Asked by At

I have an issue that is easily solved with dplyr in R, yet can't seem to find an easy way in Python. I have a df with id(=customerid), s(=store), m(=month) and ttl(=total purchase) as columns. I would like to calculate multiple new columns on id+s - for example last 3 months purchase and minimum purchase.

Example (last two are new columns):

id  s         m   ttl ttl_3 min_id_s
1   A   1/1/2020    7   nan 3
1   A   2/1/2020    3   nan 3
1   A   3/1/2020    7   17  3
1   A   4/1/2020    6   16  3
1   A   5/1/2020    7   20  3
1   A   6/1/2020    7   20  3
1   B   1/1/2020    6   nan 6
1   B   2/1/2020    10  nan 6
1   B   3/1/2020    8   24  6
1   B   4/1/2020    8   26  6
1   B   5/1/2020    10  26  6
1   B   6/1/2020    8   26  6
2   A   1/1/2020    4   nan 1
2   A   2/1/2020    3   15  1
2   A   3/1/2020    10  17  1
2   A   4/1/2020    6   19  1
2   A   5/1/2020    4   20  1
2   A   6/1/2020    1   11  1

I've tried the following:

grp = df.groupby(['id','s'])
df = df.assign(ttl_3 = grp['ttl'].apply(lambda x: x.rolling(window=3)).sum(), min_id_s = grp['ttl'].min())

I get the following error:

Cannot access callable attribute 'assign' of 'DataFrameGroupBy' objects, try using the 'apply' method

I know that it can be solved without assign, but then I'll have to have a line per each new column, and as I need plenty of these, I'm looking for a workaround.

I've also looked into add_columns with pyjanitor, but it doesn't seem to work with groupby.

In R, the following code solves the issue and in the mutate I can proceed adding columns:

df = df %>% group_by(id, s) %>% mutate(ttl_3 = runner(ttl, k=3, f=sum), min_id_s = min(ttl))
1

There are 1 best solutions below

8
Michael Szczesny On

With python syntax and pandas the logic is nearly identical

t = '''
id  s         m   ttl 
1   A   1/1/2020    7 
1   A   2/1/2020    3 
1   A   3/1/2020    7 
1   A   4/1/2020    6 
1   A   5/1/2020    7 
1   A   6/1/2020    7 
1   B   1/1/2020    6 
1   B   2/1/2020    10
1   B   3/1/2020    8 
1   B   4/1/2020    8 
1   B   5/1/2020    10
1   B   6/1/2020    8 
2   A   1/1/2020    4 
2   A   2/1/2020    3 
2   A   3/1/2020    10
2   A   4/1/2020    6 
2   A   5/1/2020    4 
2   A   6/1/2020    1 '''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(t), sep='\s+')

Trying to come as close to dplyr as I could.
Group by id and s and compute the new columns. Multiple rolling columns could be computed with the agg method.
You can use assign, but you also have to write one line per aggregation.

group = df.groupby(['id','s'])['ttl']
df['ttl_3'] = group.rolling(3).sum().reset_index(level=(0,1), drop=True)
df['min_id_s'] = group.transform('min')
#df = df.assign(
#         ttl_3 = group.rolling(3).sum().reset_index(level=(0,1), drop=True),
#         min_id_s = group.transform('min'))
df

Out:

    id  s         m  ttl  ttl_3  min_id_s
0    1  A  1/1/2020    7    NaN         3
1    1  A  2/1/2020    3    NaN         3
2    1  A  3/1/2020    7   17.0         3
3    1  A  4/1/2020    6   16.0         3
4    1  A  5/1/2020    7   20.0         3
5    1  A  6/1/2020    7   20.0         3
6    1  B  1/1/2020    6    NaN         6
7    1  B  2/1/2020   10    NaN         6
8    1  B  3/1/2020    8   24.0         6
9    1  B  4/1/2020    8   26.0         6
10   1  B  5/1/2020   10   26.0         6
11   1  B  6/1/2020    8   26.0         6
12   2  A  1/1/2020    4    NaN         1
13   2  A  2/1/2020    3    NaN         1
14   2  A  3/1/2020   10   17.0         1
15   2  A  4/1/2020    6   19.0         1
16   2  A  5/1/2020    4   20.0         1
17   2  A  6/1/2020    1   11.0         1

Aggregate multiple columns for rolling with agg

group = df.groupby(['id','s'])['ttl']
df[['ttl_3_sum','ttl_3_mean']] = group.rolling(3).agg(['sum','mean']).reset_index(level=(0,1), drop=True)

Out:

    id  s         m  ttl  ttl_3_sum  ttl_3_mean
0    1  A  1/1/2020    7        NaN         NaN
1    1  A  2/1/2020    3        NaN         NaN
2    1  A  3/1/2020    7       17.0    5.666667
3    1  A  4/1/2020    6       16.0    5.333333
4    1  A  5/1/2020    7       20.0    6.666667
5    1  A  6/1/2020    7       20.0    6.666667
6    1  B  1/1/2020    6        NaN         NaN
7    1  B  2/1/2020   10        NaN         NaN
8    1  B  3/1/2020    8       24.0    8.000000
9    1  B  4/1/2020    8       26.0    8.666667
10   1  B  5/1/2020   10       26.0    8.666667
11   1  B  6/1/2020    8       26.0    8.666667
12   2  A  1/1/2020    4        NaN         NaN
13   2  A  2/1/2020    3        NaN         NaN
14   2  A  3/1/2020   10       17.0    5.666667
15   2  A  4/1/2020    6       19.0    6.333333
16   2  A  5/1/2020    4       20.0    6.666667
17   2  A  6/1/2020    1       11.0    3.666667

Aggregating multiple columns from multiple columns for rolling with agg. The columns have to be numerical. transform doesn't support multiple aggregations as far as I know.

Generate a random numerical column for a rolling aggregation

import numpy as np
df['ttl2'] = np.random.rand(len(df))

groupby without selected columns to aggregate from multiple columns. For example with a custom function

group = df.groupby(['id','s'])
df[['ttl_3_sum', 'ttl2_lambda']] = (group.rolling(3)
          .agg({'ttl':'sum', 'ttl2': lambda x: x.sum()/x.min()})
          .reset_index(level=(0,1), drop=True))
df

Out:

    id  s         m  ttl      ttl2  ttl_3_sum  ttl2_lambda
0    1  A  1/1/2020    7  0.032482        NaN          NaN
1    1  A  2/1/2020    3  0.998115        NaN          NaN
2    1  A  3/1/2020    7  0.689431       17.0    52.953016
3    1  A  4/1/2020    6  0.897444       16.0     3.749456
4    1  A  5/1/2020    7  0.484360       20.0     4.276231
5    1  A  6/1/2020    7  0.971768       20.0     4.859138
6    1  B  1/1/2020    6  0.238363        NaN          NaN
7    1  B  2/1/2020   10  0.740311        NaN          NaN
8    1  B  3/1/2020    8  0.641598       24.0     6.797507
9    1  B  4/1/2020    8  0.984911       26.0     3.688945
10   1  B  5/1/2020   10  0.043379       26.0    38.495141
11   1  B  6/1/2020    8  0.700253       26.0    39.847293
12   2  A  1/1/2020    4  0.437082        NaN          NaN
13   2  A  2/1/2020    3  0.465313        NaN          NaN
14   2  A  3/1/2020   10  0.698976       17.0     3.663777
15   2  A  4/1/2020    6  0.430087       19.0     3.707103
16   2  A  5/1/2020    4  0.949536       20.0     4.832976
17   2  A  6/1/2020    1  0.904986       11.0     5.311974