Is there a pandas function to unpack a column that's a dict into a long format?

215 Views Asked by At

So the data has the column [outbreakMap] as a pandas series. I recreated a snippet of the data as a df below:

Initial Data: enter image description here

data = {0: {'id': '53545',
  'latitude': -1.5168500000000003,
  'longitude': 105.2985,
  'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
    'oieReference': 'ob_102795',
    'outbreakStartDate': '2022-04-12'},
   'ob_102796': {'nationalObReference': '404/KPTS/05/22',
    'oieReference': 'ob_102796',
    'outbreakStartDate': '2022-04-22'}}},
 1: {'id': '53709',
  'latitude': 36.42056,
  'longitude': 139.9085,
  'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
    'outbreakStartDate': '2022-04-12'},
   'ob_101644': {'oieReference': 'ob_101644',
    'outbreakStartDate': '2022-04-14'},
   'ob_102290': {'oieReference': 'ob_102290',
    'outbreakStartDate': '2022-04-21'},
   'ob_100121': {'oieReference': 'ob_100121',
    'outbreakStartDate': '2022-03-24'},
   'ob_102949': {'oieReference': 'ob_102949',
    'outbreakStartDate': '2022-05-09'}}}}

import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')

Goal: Unpack the dictionary into a long format like-so:

enter image description here

Thank you.

4

There are 4 best solutions below

1
On

[EDIT]

I think this is what you want.

def into_long(outbreakMap):
    '''
    Extracting key, oieReference, and outbreakStartDate
    from outbreakMap
    '''
    keys = outbreakMap.keys()
    # return list of tuples
    return [(
        key,
        outbreakMap[key]['oieReference'],
        outbreakMap[key]['outbreakStartDate']
    ) for key in keys]

df['outbreakMap_tuple'] = df['outbreakMap'].apply(lambda x: into_long(x))

# list to multiple rows
# further reading: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe
df = df.explode('outbreakMap_tuple')

# tuples to multiple columns
# further reading: https://stackoverflow.com/questions/29550414/how-can-i-split-a-column-of-tuples-in-a-pandas-dataframe
df[['outbreakMap', 'oieReference', 'outbreakStartDate']] = pd.DataFrame(df['outbreakMap_tuple'].tolist(), index=df.index)

df.drop('outbreakMap_tuple', axis=1, inplace=True)

# this is optional
df.reset_index(drop=True, inplace=True)

The outcome looks like

enter image description here

I am sure that there are better ways. This is just one of them.

2
On

You can use list comprehension:

lst_3d = [
[
    [data[n]['id'], 
     data[n]['latitude'], 
     data[n]['longitude'], 
     data[n]['outbreakMap'][i]['oieReference'],
     data[n]['outbreakMap'][i]['outbreakStartDate']
] 
    for i in data[n]['outbreakMap']]
for n in data]

#convert to 2d list
lst = [elem for subl in lst_3d for elem in subl]
df = pd.DataFrame(lst, columns=['id', 'latitude', 'longitude', 'oieReference', 'outbreakStartDate'])
df
Out[1]: 
      id  latitude  longitude oieReference outbreakStartDate
0  53545  -1.51685   105.2985    ob_102795        2022-04-12
1  53545  -1.51685   105.2985    ob_102796        2022-04-22
2  53709  36.42056   139.9085    ob_101531        2022-04-12
3  53709  36.42056   139.9085    ob_101644        2022-04-14
4  53709  36.42056   139.9085    ob_102290        2022-04-21
5  53709  36.42056   139.9085    ob_100121        2022-03-24
6  53709  36.42056   139.9085    ob_102949        2022-05-09
2
On

While you can build the DataFrame and work on outbreakMap column afterwards, it might be easier and more efficient if you modify data to include the desired keys to the inner dicts, then construct the DataFrame and explode the added columns:

for v in data.values():
    outbreakMap = v.pop('outbreakMap')
    v['outbreak'] = list(outbreakMap)
    v['outbreak_information'] = list(outbreakMap.values())
    
out = pd.DataFrame.from_dict(data, orient='index').explode(['outbreak','outbreak_information'])

Output:

      id  latitude  longitude   outbreak                               outbreak_information
0  53545  -1.51685   105.2985  ob_102795  {'nationalObReference': '403/KPTS/05/22', 'oie...
0  53545  -1.51685   105.2985  ob_102796  {'nationalObReference': '404/KPTS/05/22', 'oie...
1  53709  36.42056   139.9085  ob_101531  {'oieReference': 'ob_101531', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_101644  {'oieReference': 'ob_101644', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_102290  {'oieReference': 'ob_102290', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_100121  {'oieReference': 'ob_100121', 'outbreakStartDa...
1  53709  36.42056   139.9085  ob_102949  {'oieReference': 'ob_102949', 'outbreakStartDa...
2
On

Since some of the data is duplicated across multiple rows, I think its easier to write a python table and then convert that to a dataframe. Dataframes need to know the full number of rows when they are first created - so pandas would need to make an intermediate list anyway.

Its just a nested for loop that puts rows into a master list, duplicating the data from the outer loop in the inner loop. The data is pulled from the dictionaries via column name lists so we get the right order for each row.

I noticed that outbreakMap appears to be duplicated so skipped it and nationalObReference isn't in all of the data. I also put the code into a function which is an easy way to discard intermediate data that isn't needed after the calculation is done.

data = {0: {'id': '53545',
  'latitude': -1.5168500000000003,
  'longitude': 105.2985,
  'outbreakMap': {'ob_102795': {'nationalObReference': '403/KPTS/05/22',
    'oieReference': 'ob_102795',
    'outbreakStartDate': '2022-04-12'},
   'ob_102796': {'nationalObReference': '404/KPTS/05/22',
    'oieReference': 'ob_102796',
    'outbreakStartDate': '2022-04-22'}}},
 1: {'id': '53709',
  'latitude': 36.42056,
  'longitude': 139.9085,
  'outbreakMap': {'ob_101531': {'oieReference': 'ob_101531',
    'outbreakStartDate': '2022-04-12'},
   'ob_101644': {'oieReference': 'ob_101644',
    'outbreakStartDate': '2022-04-14'},
   'ob_102290': {'oieReference': 'ob_102290',
    'outbreakStartDate': '2022-04-21'},
   'ob_100121': {'oieReference': 'ob_100121',
    'outbreakStartDate': '2022-03-24'},
   'ob_102949': {'oieReference': 'ob_102949',
    'outbreakStartDate': '2022-05-09'}}}}

import pandas as pd

def data_to_df(data):
    common_cols = ["id", "latitude", "longitude"]
    outbreak_cols = ["nationalObReference", "oieReference", "outbreakStartDate"]
    table = []
    for index, common in data.items():
        common_row = [common[name] for name in common_cols]
        for ob, ref in common["outbreakMap"].items():
            row = common_row + [ref.get(name, None) for name in outbreak_cols]
            table.append(row)
    df = pd.DataFrame(table, columns=common_cols+outbreak_cols)
    return df

df = data_to_df(data)