pandas pivot table: custom aggfunc with margins

709 Views Asked by At

I am using pandas to create pivot tables. My data looks usually contains a lot of numeric values which can easily be aggregated with np.mean (e.g. question1), but there is one exception - Net Promoter Score (notice Total 0.00 both for EU and NA)

    responseId  country region  nps question1
0   1           Germany EU      11  3.2
1   2           Germany EU      10  5.0
2   3           US      NA      7   4.3
3   4           US      NA      5   4.8
4   5           France  EU      5   3.2
5   6           France  EU      5   5.0
6   7           France  EU      11  5.0
region                           EU               NA
country    France   Germany   Total   US       Total
nps        -33.33   100.0     0.00    -100.00   0.00
question1  4.40     4.1       4.25    4.55      4.55

For NPS I use a custom aggfunc

def calculate_nps(column):    
    detractors = [1,2,3,4,5,6,7]
    passives = [8,9]
    promoters = [10,11]
    
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    
    return (percent_promoters - percent_detractors) * 100

aggfunc = {
    "nps": calculate_nps,
    "question1": np.mean
}

pd.pivot_table(data=df,columns=["region","country"],values=["nps","question1"],aggfunc=aggfunc,margins=True,margins_name="Total",sort=True)

This aggfunc works fine for regular columns, but fails for margins ("Total" columns), because pandas passes data already aggregated. For regular fields calculate_nps receives columns like this

4     5
5     5
6    11
Name: nps, dtype: int64

but for margins the data looks like this

region  country
EU      France     -33.333333
        Germany    100.000000
Name: nps, dtype: float64

calculate_nps cannot deal with such data and returns 0. In this case column.mean() should be applied which I solved like this (notice if column.index.names != [None])

def calculate_nps(column):
    if column.index.names != [None]:
        return column.mean()
    
    detractors = [1,2,3,4,5,6,7]
    passives = [8,9]
    promoters = [10,11]
    
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    
    return (percent_promoters - percent_detractors) * 100

Now the pivot table is correct

region                           EU                  NA
country    France   Germany   Total   US          Total
nps        -33.33   100.0     33.33   -100.00   -100.00
question1  4.40     4.1        4.25   4.55         4.55

Question

Is there a proper / better way to determine what kind of data is passed to aggfunc? I'm not sure that my solution will work for all scenarios

0

There are 0 best solutions below