Convert Pandas DF to Nested JSON

191 Views Asked by At

I have a dataframe in pandas of the following form:

# df name: cust_sim_data_product_agg:
yearmo  region  products    revenue
0   201711  CN  ['Auto', 'Flood', 'Home', 'Liability', 'Life',...   690
1   201711  CN  ['Auto', 'Flood', 'Home', 'Liability', 'Life']  610
2   201711  CN  ['Auto', 'Flood', 'Home', 'Liability']  560
3   201711  CN  ['Auto', 'Flood', 'Home', 'Life', 'Liability',...   690
4   201711  CN  ['Auto', 'Flood', 'Home', 'Life', 'Mortgage', ...   690

I would like to roll it up into a nested json of the form:

{
  yearmo: '201711'
  data: [
    {
      name: 'SE',
      value: 18090, # sum of all the values in the level below
      children: [
        {
          name: '['Auto', 'Flood', 'Home',...], # this is product from the dataframe
          value: 690 . # this is the revenue value
        },
        {
          name: '['Flood', 'Home', 'Life'...],
          value: 690
        },
        ...
      },
      {
      name: 'NE',
      value: 16500, # sum of all the values in the level below
      children: [
        {
          name: '['Auto', 'Home',...],
          value: 210
        },
        {
          name: '['Life'...],
          value: 450
        },
        ...
      }
    },
  yearmo: '201712'
  data: [
    {
      name: 'SE',
      value: 24050,
      children: [ ... ] # same format as above
    },
    {
      name: 'NE',
      value: 22400,
      children: [ ... ] # same format as above
    }
  ]
}

So each yearmo would have an element in the top level of the json. Within data, there would be an entry for each region where value is the sum of the values from the level directly underneath it. Children is an array of dicts where each dict a map of product -> name and revenue -> value from the row level data in the pandas DF.

My best attempt so far looks like this:

def roll_yearmo_rev(d):
    x1 = [{'name': n, 'value': v}  for n,v in zip(d.products, d.revenue)]
    x2 = {'children': x1, 'value': sum(d.revenue)}
    return x2

def roll_yearmo(d):
    x1 = [{'name': n, 'children': c} for n,c in zip(d.region, d.children)]
    x2 = {'children': x1, 'value': sum(d.value)}
    return x2

cust_sim_data_product_agg_dict = cust_sim_data_product_agg.groupby(['yearmo', 'region'])\
    .apply(roll_yearmo_rev)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
cust_sim_data_product_agg_dict.columns = ['yearmo' , 'region', 'children']


cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.groupby(['yearmo'])\
    .apply(roll_yearmo)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()

Which fails because the last rollup throws the following error:

AttributeError: 'DataFrame' object has no attribute 'value'

The whole thing looks messy to me. I read up on split-apply-combine which inspired the use of groupby() and apply() but I could really use a second opinion on the approach because I am pretty sure there is a better way. Any advice would be appreciated.

0

There are 0 best solutions below