More pythonic (and faster) way to extract list of dicts from field, parse, and add as rows to DF

48 Views Asked by At

I have a DF with a bunch of fields (calls) that contains lists of JSON information.

NOTE: That for some of these pd.json_normalize() works, but for others I have to parse out the data manually (don't know why json_normalize can't work with these. Perhaps because they are lists of JSON, versus just JSON.)

CODE

# This works on the first field, which is just JSON, not a list of JSONs
expanded_df = pd.concat([method_df, 
                pd.json_normalize(method_df.reportableinformation_json)], 
                axis=1).drop('reportableinformation_json', axis = 1)

# But the other fields, I do this...
def parseSections(row, prefix):
    columns = row.index
    
    _df = pd.DataFrame(row).transpose()

    for section in row[prefix]:
        _df = pd.concat([_df, pd.json_normalize(section)])

    for col in columns:
        _df[col].fillna(method='pad', inplace=True)

    _df = _df.rename(columns={  'ID': f'{prefix}.ID', 
                                'Name': f'{prefix},.Name',
                                'Tables': f'{prefix}.Tables',
                                'Sections': f'{prefix}.Sections',
                                'Parameters': f'{prefix}.Parameters',
                       })    

    return _df

# This calls parseSections with just one row. 
# Ultimately, I'd like to call this with either _df.apply() or using `for idx, row in _df.iterrow()`
parseSections(_df.iloc[0], prefix = "MethodDescription.Sections")

Original

id resultset_id resultsetrevision_id revision name methodtype identifier created createdby lastmodified lastmodifiedby audittrail_json approvalstate flavor instrumenttechnique filename path lastsaved Id Ver MethodDescription.ID MethodDescription.Name MethodDescription.Tables MethodDescription.Sections MethodDescription.Parameters
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info []

AFTER

id resultset_id resultsetrevision_id revision name methodtype identifier created createdby lastmodified lastmodifiedby audittrail_json approvalstate flavor instrumenttechnique filename path lastsaved Id Ver MethodDescription.ID MethodDescription.Name MethodDescription.Tables MethodDescription.Sections MethodDescription.Parameters MethodDescription.Sections.ID MethodDescription.Sections,.Name MethodDescription.Sections.Tables MethodDescription.Sections.Sections MethodDescription.Sections.Parameters
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info [] nan nan nan nan nan
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info [] Method_Information Method Information [] [] [{'ID': 'ID_Method_O
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info [] DefaultChain Method Parameters [] [{'ID': 'SignalAlign []
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info [] Method_CalibrationCu Calibration curves [{'ID': 'Calibration [] []
0 44547fcf-a26e-49a1-a 62abc4ca-bf13-4f27-b 30cc3932-ef93-4aec-b 2022-0722-1502-14947 NAME hydrogen quant pmx Method 00000000-0000-0000-0 2019-03-27 22:01:27. SYSTEM 2022-10-12 04:17:32. Username [] Generic GC/LC Quantitative None NAME hydrogen quant /project/path/ 2022-10-14 05:25:42. a8581436-4e34-4a17-b 0 None Method [] [{'ID': 'Method_Info [] Method_ReferenceChro Reference chromatogr [{'ID': 'ReferenceCh [] []
1

There are 1 best solutions below

0
Laurent On

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "A": [
            1,
            2,
            3,
        ],
        "B": ["a", "b", "c"],
        "C": [{"U": 60, "V": 94}, {"U": 72, "V": 18}, {"U": 78, "V": 86}],
        "D": [
            [{"W": 88, "X": 21}, {"Y": 12, "Z": 63}],
            [{"W": 11, "X": 46}, {"Y": 49, "Z": 22}],
            [{"W": 15, "X": 9}, {"Y": 89, "Z": 17}],
        ],
    }
)

print(df)
# Output

   A  B                   C                                         D
0  1  a  {'U': 60, 'V': 94}  [{'W': 88, 'X': 21}, {'Y': 12, 'Z': 63}]
1  2  b  {'U': 72, 'V': 18}  [{'W': 11, 'X': 46}, {'Y': 49, 'Z': 22}]
2  3  c  {'U': 78, 'V': 86}   [{'W': 15, 'X': 9}, {'Y': 89, 'Z': 17}]

Here is another way to do it:

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1)

Then:

   A  B   U   V     W     X     Y     Z
0  1  a  60  94  88.0  21.0   NaN   NaN
1  1  a  60  94   NaN   NaN  12.0  63.0
2  2  b  72  18  11.0  46.0   NaN   NaN
3  2  b  72  18   NaN   NaN  49.0  22.0
4  3  c  78  86  15.0   9.0   NaN   NaN
5  3  c  78  86   NaN   NaN  89.0  17.0