How to add groups to groupby

1.3k Views Asked by At

My hypothetical dataframe is

df = pd.DataFrame({'col1':[91,91,91,91,92,92,92,92],
                  'col2':[91,92]*4,'value':[10]*8})
df

   col1 col2 value
0   91  91  10
1   91  92  10
2   91  91  10
3   91  92  10
4   92  91  10
5   92  92  10
6   92  91  10
7   92  92  10

Grouping over the two columns produces those groups:

grouped = df.groupby(['col1','col2'])
grouped.groups
{(91, 91): Int64Index([0, 2], dtype='int64'),
 (91, 92): Int64Index([1, 3], dtype='int64'),
 (92, 91): Int64Index([4, 6], dtype='int64'),
 (92, 92): Int64Index([5, 7], dtype='int64')}

I want to expand this set of groups, so that I can aggregate over the extended selection of groups.
Let’s say I want to add groups produced by

groupedall = df.groupby(['col1'])
groupedall.groups
{91: Int64Index([0, 1, 2, 3], dtype='int64'),
 92: Int64Index([4, 5, 6, 7], dtype='int64')}

This is what I try: I substitute 99 for col2 value (where 99 will mean ‘any’),

groupedall.groups[(91, 99)] = groupedall.groups.pop(91)
groupedall.groups[(92, 99)] = groupedall.groups.pop(92)

And then add those new groups to my original group dict.

grouped.groups.update(groupedall.groups)
grouped.groups
{(91, 91): Int64Index([0, 2], dtype='int64'),
 (91, 92): Int64Index([1, 3], dtype='int64'),
 (91, 99): Int64Index([0, 1, 2, 3], dtype='int64'),
 (92, 91): Int64Index([4, 6], dtype='int64'),
 (92, 92): Int64Index([5, 7], dtype='int64'),
 (92, 99): Int64Index([4, 5, 6, 7], dtype='int64')}

But when I try to group over the grouped object, those newly added groups are omitted.

grouped.sum()
               value
col1    col2    
91      91      20
        92      20
92      91      20
        92      20

And I want the output to include groups that I’ve just added:

               value
col1    col2    
91      91      20
        92      20
        99      40
92      91      20
        92      20
        99      40

What am I missing here?

2

There are 2 best solutions below

2
On BEST ANSWER

Option 1

df.append(df.assign(col2=99)).groupby(['col1', 'col2']).sum()

           value
col1 col2       
91   91       20
     92       20
     99       40
92   91       20
     92       20
     99       40

Option 2

dummy_series = pd.Series(99, df.index, name='col2')

pd.concat([
    df.groupby(['col1', 'col2']).sum(),
    df.groupby(['col1', dummy_series])[['value']].sum()
]).sort_index()

           value
col1 col2       
91   91       20
     92       20
     99       40
92   91       20
     92       20
     99       40
2
On

the key thing here seems to be that you want to manually add a group to a GroupByDataFrame.

This appears to work when you look at grouped.groups, but when you look at any other attributes of grouped it's clear the new group isn't being considered as a group.

It doesn't appear to be possible to change a groupbydataframe in this way, but using the link provided by @QuickBeam2k1, you can get the data you need by doing:

df.pivot_table(
    index='col1',
    columns='col2',
    values='value',
    aggfunc='sum',
    margins=True
)

which returns:

col2    91      92      All
col1            
91      20.0    20.0    40.0
92      20.0    20.0    40.0
All     40.0    40.0    80.0