Pandas: output csv data to nested json with sorting

50 Views Asked by At

Given the sample CSV data below, in a pandas DataFrame, how can I output to_json as follows

as_of
  category
     type
        subtype
           log: [
            #sorted by timestamp
            {timestamp: 1618879229, action: add, stale_timestamp: true},
            {timestamp: 1608879229, action: remove, stale_timestamp: None},
           ]

20210415
  apples
     A
        big
           log: [
            {timestamp: 1618879229, action: add, stale_timestamp: None},
           ]
        small
           log: [
            {timestamp: 1618879229, action: add, stale_timestamp: None},
            {timestamp: 1608879229, action: remove, stale_timestamp: None},
            {timestamp: 1518879229, action: add, stale_timestamp: None},
           ]
     B
        big
           log: [
            {timestamp: 1618879229, action: add, stale_timestamp: None},
           ]

Bonus if you can also help me go back to a DataFrame from the nested json!

as_of category type sub_type action timestamp stale_timestamp
20210415 apples A big add 1618879229.6703315
20210415 apples A small add 1618879229.6703315
20210415 apples B big add 1618879229.6703315
20210415 apples B small add 1618879229.6703315
20210415 apples C big add 1618879229.6703315
20210415 apples C small add 1618879229.6703315
202103 oranges sweet add 1616892142.6703315
202103 oranges sweet remove 1616632942.6703315
202103 oranges sweet add 1616200942.6703315
202103 grapes sweet add 1616200942.6703315
202102 oranges sweet add 1616200942.6703315
202102 grapes sweet add 1616200942.6703315
20210115 apples A big add 1611103342.6703315
20210115 apples A small add 1611103342.6703315
20210115 apples B big add 1611103342.6703315
20210115 apples B small add 1611103342.6703315
20210115 apples C big add 1611103342.6703315
20210115 apples C small add 1611103342.6703315
202101 oranges sweet add 1608424942.6703315
202101 grapes sweet add 1608424942.6703315
202012 oranges sweet add 1608424942.6703315
202012 grapes sweet add 1608424942.6703315
202011 oranges sweet add 1608424942.6703315
202011 grapes sweet add 1608424942.6703315
20201015 apples A big add 1608424942.6703315 True
20201015 apples A small add 1608424942.6703315 True
20201015 apples B big add 1608424942.6703315 True
20201015 apples B small add 1608424942.6703315 True
20201015 apples C big add 1608424942.6703315 True
20201015 apples C small add 1608424942.6703315 True
202010 oranges sweet add 1608424942.6703315 True
202010 grapes sweet add 1608424942.6703315 True
1

There are 1 best solutions below

0
On

First I converted the table into a CSV:

as_of,category,type,sub_type,action,timestamp,stale_timestamp
20210415,apples,A,big,add,1618879230,
20210415,apples,A,small,add,1618879230,
20210415,apples,B,big,add,1618879230,
20210415,apples,B,small,add,1618879230,
20210415,apples,C,big,add,1618879230,
20210415,apples,C,small,add,1618879230,
202103,oranges,sweet,,add,1616892143,
202103,oranges,sweet,,remove,1616632943,
202103,oranges,sweet,,add,1616200943,
202103,grapes,sweet,,add,1616200943,
202102,oranges,sweet,,add,1616200943,
202102,grapes,sweet,,add,1616200943,
20210115,apples,A,big,add,1611103343,
20210115,apples,A,small,add,1611103343,
20210115,apples,B,big,add,1611103343,
20210115,apples,B,small,add,1611103343,
20210115,apples,C,big,add,1611103343,
20210115,apples,C,small,add,1611103343,
202101,oranges,sweet,,add,1608424943,
202101,grapes,sweet,,add,1608424943,
202012,oranges,sweet,,add,1608424943,
202012,grapes,sweet,,add,1608424943,
202011,oranges,sweet,,add,1608424943,
202011,grapes,sweet,,add,1608424943,
20201015,apples,A,big,add,1608424943,TRUE
20201015,apples,A,small,add,1608424943,TRUE
20201015,apples,B,big,add,1608424943,TRUE
20201015,apples,B,small,add,1608424943,TRUE
20201015,apples,C,big,add,1608424943,TRUE
20201015,apples,C,small,add,1608424943,TRUE
202010,oranges,sweet,,add,1608424943,TRUE
202010,grapes,sweet,,add,1608424943,TRUE

The missing entries cause problems later in the JSON. That either needs to be fixed in the input file or in the Python converter. Also, some of the dates seem to be missing characters.

Because there isn't a pre-defined orient option in Pandas that suits this requirement, I wrote a custom dictionary and then converted the dictionary to JSON.

import pandas
import json
df = pandas.read_csv('sheet1.csv',header=None, dtype=str)

mydic = {}
for unique_col0 in df[0].unique():
    mydic[unique_col0] = {}
    
    sub_df = df[df[0]==unique_col0]
    for unique_col1 in sub_df[1].unique():
        mydic[unique_col0][unique_col1] = {}
        
        sub_sub_df = sub_df[sub_df[1]==unique_col1]
        for unique_col2 in sub_sub_df[2].unique():
            mydic[unique_col0][unique_col1][unique_col2] = {}
            
            sub_sub_sub_df = sub_sub_df[sub_sub_df[2]==unique_col2]
            for unique_col3 in sub_sub_sub_df[3].unique():
                mydic[unique_col0][unique_col1][unique_col2][unique_col3] = {'log':[]}
                
                for index in range(sub_sub_sub_df.shape[0]):
                    this_dict = {'timestamp': list(sub_sub_sub_df[5])[index],
                                 'action': list(sub_sub_sub_df[4])[index],
                                 'stale_timestamps': list(sub_sub_sub_df[6])[index]}
                    mydic[unique_col0][unique_col1][unique_col2][unique_col3]['log'].append(this_dict)
                
with open('output.json','w') as file_handle:
    json.dump(mydic,file_handle,indent=2)

The sample output provided by the asker isn't consistent with what actually gets produced by the Python implementation.