I have a dataframe in pandas of the following form:
# df name: cust_sim_data_product_agg:
yearmo region products revenue
0 201711 CN ['Auto', 'Flood', 'Home', 'Liability', 'Life',... 690
1 201711 CN ['Auto', 'Flood', 'Home', 'Liability', 'Life'] 610
2 201711 CN ['Auto', 'Flood', 'Home', 'Liability'] 560
3 201711 CN ['Auto', 'Flood', 'Home', 'Life', 'Liability',... 690
4 201711 CN ['Auto', 'Flood', 'Home', 'Life', 'Mortgage', ... 690
I would like to roll it up into a nested json of the form:
{
yearmo: '201711'
data: [
{
name: 'SE',
value: 18090, # sum of all the values in the level below
children: [
{
name: '['Auto', 'Flood', 'Home',...], # this is product from the dataframe
value: 690 . # this is the revenue value
},
{
name: '['Flood', 'Home', 'Life'...],
value: 690
},
...
},
{
name: 'NE',
value: 16500, # sum of all the values in the level below
children: [
{
name: '['Auto', 'Home',...],
value: 210
},
{
name: '['Life'...],
value: 450
},
...
}
},
yearmo: '201712'
data: [
{
name: 'SE',
value: 24050,
children: [ ... ] # same format as above
},
{
name: 'NE',
value: 22400,
children: [ ... ] # same format as above
}
]
}
So each yearmo would have an element in the top level of the json. Within data, there would be an entry for each region where value is the sum of the values from the level directly underneath it. Children is an array of dicts where each dict a map of product -> name and revenue -> value from the row level data in the pandas DF.
My best attempt so far looks like this:
def roll_yearmo_rev(d):
x1 = [{'name': n, 'value': v} for n,v in zip(d.products, d.revenue)]
x2 = {'children': x1, 'value': sum(d.revenue)}
return x2
def roll_yearmo(d):
x1 = [{'name': n, 'children': c} for n,c in zip(d.region, d.children)]
x2 = {'children': x1, 'value': sum(d.value)}
return x2
cust_sim_data_product_agg_dict = cust_sim_data_product_agg.groupby(['yearmo', 'region'])\
.apply(roll_yearmo_rev)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
cust_sim_data_product_agg_dict.columns = ['yearmo' , 'region', 'children']
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.groupby(['yearmo'])\
.apply(roll_yearmo)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
Which fails because the last rollup throws the following error:
AttributeError: 'DataFrame' object has no attribute 'value'
The whole thing looks messy to me. I read up on split-apply-combine which inspired the use of groupby() and apply() but I could really use a second opinion on the approach because I am pretty sure there is a better way. Any advice would be appreciated.