I am using pandas to create pivot tables. My data looks usually contains a lot of numeric values which can easily be aggregated with np.mean (e.g. question1), but there is one exception - Net Promoter Score (notice Total 0.00 both for EU and NA)
responseId country region nps question1
0 1 Germany EU 11 3.2
1 2 Germany EU 10 5.0
2 3 US NA 7 4.3
3 4 US NA 5 4.8
4 5 France EU 5 3.2
5 6 France EU 5 5.0
6 7 France EU 11 5.0
region EU NA
country France Germany Total US Total
nps -33.33 100.0 0.00 -100.00 0.00
question1 4.40 4.1 4.25 4.55 4.55
For NPS I use a custom aggfunc
def calculate_nps(column):
detractors = [1,2,3,4,5,6,7]
passives = [8,9]
promoters = [10,11]
counts = column.value_counts(normalize=True)
percent_promoters = counts.reindex(promoters).sum()
percent_detractors = counts.reindex(detractors).sum()
return (percent_promoters - percent_detractors) * 100
aggfunc = {
"nps": calculate_nps,
"question1": np.mean
}
pd.pivot_table(data=df,columns=["region","country"],values=["nps","question1"],aggfunc=aggfunc,margins=True,margins_name="Total",sort=True)
This aggfunc works fine for regular columns, but fails for margins ("Total" columns), because pandas passes data already aggregated. For regular fields calculate_nps receives columns like this
4 5
5 5
6 11
Name: nps, dtype: int64
but for margins the data looks like this
region country
EU France -33.333333
Germany 100.000000
Name: nps, dtype: float64
calculate_nps cannot deal with such data and returns 0. In this case column.mean() should be applied which I solved like this (notice if column.index.names != [None])
def calculate_nps(column):
if column.index.names != [None]:
return column.mean()
detractors = [1,2,3,4,5,6,7]
passives = [8,9]
promoters = [10,11]
counts = column.value_counts(normalize=True)
percent_promoters = counts.reindex(promoters).sum()
percent_detractors = counts.reindex(detractors).sum()
return (percent_promoters - percent_detractors) * 100
Now the pivot table is correct
region EU NA
country France Germany Total US Total
nps -33.33 100.0 33.33 -100.00 -100.00
question1 4.40 4.1 4.25 4.55 4.55
Question
Is there a proper / better way to determine what kind of data is passed to aggfunc? I'm not sure that my solution will work for all scenarios