More pythonic (and faster) way to extract list of dicts from field, parse, and add as rows to DF

Question

More pythonic (and faster) way to extract list of dicts from field, parse, and add as rows to DF

48 Views Asked by RightmireM At 25 May 2023 at 12:33

I have a DF with a bunch of fields (calls) that contains lists of JSON information.

NOTE: That for some of these pd.json_normalize() works, but for others I have to parse out the data manually (don't know why json_normalize can't work with these. Perhaps because they are lists of JSON, versus just JSON.)

CODE

# This works on the first field, which is just JSON, not a list of JSONs
expanded_df = pd.concat([method_df, 
                pd.json_normalize(method_df.reportableinformation_json)], 
                axis=1).drop('reportableinformation_json', axis = 1)

# But the other fields, I do this...
def parseSections(row, prefix):
    columns = row.index
    
    _df = pd.DataFrame(row).transpose()

    for section in row[prefix]:
        _df = pd.concat([_df, pd.json_normalize(section)])

    for col in columns:
        _df[col].fillna(method='pad', inplace=True)

    _df = _df.rename(columns={  'ID': f'{prefix}.ID', 
                                'Name': f'{prefix},.Name',
                                'Tables': f'{prefix}.Tables',
                                'Sections': f'{prefix}.Sections',
                                'Parameters': f'{prefix}.Parameters',
                       })    

    return _df

# This calls parseSections with just one row. 
# Ultimately, I'd like to call this with either _df.apply() or using `for idx, row in _df.iterrow()`
parseSections(_df.iloc[0], prefix = "MethodDescription.Sections")

Original

	id	resultset_id	resultsetrevision_id	revision	name	methodtype	identifier	created	createdby	lastmodified	lastmodifiedby	audittrail_json	approvalstate	flavor	instrumenttechnique	filename	path	lastsaved	Id	Ver	MethodDescription.ID	MethodDescription.Name	MethodDescription.Tables	MethodDescription.Sections	MethodDescription.Parameters
0	44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	0	None	Method	[]	[{'ID': 'Method_Info	[]

AFTER

id	resultset_id	resultsetrevision_id	revision	name	methodtype	identifier	created	createdby	lastmodified	lastmodifiedby	audittrail_json	approvalstate	flavor	instrumenttechnique	filename	path	lastsaved	Id	MethodDescription.ID	MethodDescription.Name	MethodDescription.Tables	MethodDescription.Sections	MethodDescription.Parameters	MethodDescription.Sections.ID	MethodDescription.Sections,.Name	MethodDescription.Sections.Tables	MethodDescription.Sections.Sections	MethodDescription.Sections.Parameters
44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	None	Method	[]	[{'ID': 'Method_Info	[]	nan	nan	nan	nan	nan
44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	None	Method	[]	[{'ID': 'Method_Info	[]	Method_Information	Method Information	[]	[]	[{'ID': 'ID_Method_O
44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	None	Method	[]	[{'ID': 'Method_Info	[]	DefaultChain	Method Parameters	[]	[{'ID': 'SignalAlign	[]
44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	None	Method	[]	[{'ID': 'Method_Info	[]	Method_CalibrationCu	Calibration curves	[{'ID': 'Calibration	[]	[]
44547fcf-a26e-49a1-a	62abc4ca-bf13-4f27-b	30cc3932-ef93-4aec-b	2022-0722-1502-14947	NAME hydrogen quant	pmx Method	00000000-0000-0000-0	2019-03-27 22:01:27.	SYSTEM	2022-10-12 04:17:32.	Username	[]	Generic	GC/LC Quantitative	None	NAME hydrogen quant	/project/path/	2022-10-14 05:25:42.	a8581436-4e34-4a17-b	None	Method	[]	[{'ID': 'Method_Info	[]	Method_ReferenceChro	Reference chromatogr	[{'ID': 'ReferenceCh	[]	[]

Original Q&A

There are 1 best solutions below

**Laurent** · Answer 1 · 2023-05-27T15:58:56.340000

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "A": [
            1,
            2,
            3,
        ],
        "B": ["a", "b", "c"],
        "C": [{"U": 60, "V": 94}, {"U": 72, "V": 18}, {"U": 78, "V": 86}],
        "D": [
            [{"W": 88, "X": 21}, {"Y": 12, "Z": 63}],
            [{"W": 11, "X": 46}, {"Y": 49, "Z": 22}],
            [{"W": 15, "X": 9}, {"Y": 89, "Z": 17}],
        ],
    }
)

print(df)
# Output

   A  B                   C                                         D
0  1  a  {'U': 60, 'V': 94}  [{'W': 88, 'X': 21}, {'Y': 12, 'Z': 63}]
1  2  b  {'U': 72, 'V': 18}  [{'W': 11, 'X': 46}, {'Y': 49, 'Z': 22}]
2  3  c  {'U': 78, 'V': 86}   [{'W': 15, 'X': 9}, {'Y': 89, 'Z': 17}]

Here is another way to do it:

for col in df.columns:
    # Deal with columns containing lists of json
    if df[col].apply(lambda x: isinstance(x, list)).all():
        df = df.explode(col, ignore_index=True)
    # Deal with json
    if not (new_cols := pd.json_normalize(df[col])).empty:
        df = pd.concat([df.drop(columns=col), new_cols], axis=1)

Then:

   A  B   U   V     W     X     Y     Z
0  1  a  60  94  88.0  21.0   NaN   NaN
1  1  a  60  94   NaN   NaN  12.0  63.0
2  2  b  72  18  11.0  46.0   NaN   NaN
3  2  b  72  18   NaN   NaN  49.0  22.0
4  3  c  78  86  15.0   9.0   NaN   NaN
5  3  c  78  86   NaN   NaN  89.0  17.0

More pythonic (and faster) way to extract list of dicts from field, parse, and add as rows to DF

There are 1 best solutions below

Related Questions in JSON

Related Questions in PYTHON-3.X

Related Questions in PANDAS

Related Questions in JSON-NORMALIZE

Trending Questions

Popular # Hahtags

Popular Questions