Given:
Sample Pandas DataFrame with dictionaries in each cell:
import numpy as np
import pandas as pd

user_token_df = pd.DataFrame(
    {
        "usr": ["u1", "u2", "u3", "u4", "u7", "u9", "u5", "u8"],
        "colA": [
            {'a': 7, 'b': 0, 'c': 1, 'd': 0.1, 'e': 0.8},
            {'a': 0, 'b': 1, 'c': 0.1, 'd': 0, 'e': 7},
            {'a': 5, 'b': 2, 'c': 0, 'd': 0.1, 'e': 0.3},
            np.nan,
            {'a': 1.2, 'b': 2.6, 'c': 2.2, 'd': 0.3, 'e': 5.3},
            {'a': 1, 'b': 0, 'c': 0, 'd': 0.5, 'e': 2.1},
            {'a': 0.1, 'b': 0.7, 'c': 3, 'd': 1.9, 'e': 0.4},
            np.nan,
        ],
        "colB": [
            {'a': 1, 'b': 2, 'c': 0.8, 'd': 0, 'e': 0.6},
            np.nan,
            {'a': 0.6, 'b': 5.2, 'c': 0.1, 'd': 0, 'e': 2.7},
            {'a': 0, 'b': 2, 'c': 3.0, 'd': 0.1, 'e': 6.3},
            {'a': 1.2, 'b': 2.6, 'c': 2.2, 'd': 0.3, 'e': 5.3},
            np.nan,
            {'a': 1, 'b': 0.3, 'c': 0, 'd': 0.5, 'e': 2.1},
            {'a': 0, 'b': 0.5, 'c': 0.6, 'd': 0.9, 'e': 0},
        ],
        "colC": [
            np.nan,
            {'a': 1.2, 'b': 12, 'c': 1.2, 'd': 0.6, 'e': 0},
            {'a': 0.6, 'b': 5.2, 'c': 0.1, 'd': 0, 'e': 2.7},
            {'a': 0.3, 'b': 2, 'c': 3.0, 'd': 0.1, 'e': 0.3},
            {'a': 0, 'b': 0.6, 'c': 0.2, 'd': 0.3, 'e': 5},
            np.nan,
            {'a': 1, 'b': 0.3, 'c': 0, 'd': 0.5, 'e': 2.1},
            {'a': 4.3, 'b': 0.5, 'c': 0.6, 'd': 0.9, 'e': 0},
        ],
    }
)
Goal:
I'd like to create a new column col_result that contains, for each row, a dictionary whose values are the key-wise accumulated (summed) values from the other columns.
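For example, in the first row (usr "u1") colC is NaN, so col_result should be the key-wise sum of colA and colB alone:

colA:       {'a': 7, 'b': 0, 'c': 1, 'd': 0.1, 'e': 0.8}
colB:       {'a': 1, 'b': 2, 'c': 0.8, 'd': 0, 'e': 0.6}
col_result: {'a': 8.0, 'b': 2.0, 'c': 1.8, 'd': 0.1, 'e': 1.4}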
Working but Inefficient Solution:
Right now, I have a working solution for the small sample DataFrame using the apply method and Counter from the collections module:
from collections import Counter
from typing import Dict
import functools

import pandas as pd

def sum_dict_values(df: pd.Series, vb: Dict[str, int]) -> Dict[str, float]:
    # "df" here is a single row (a Series); drop the columns whose cell is NaN
    df = df.dropna()
    # fall back to an all-zero dict for any column that was NaN in this row
    dict_colA = dict(Counter(df.colA)) if "colA" in df else dict.fromkeys(vb.keys(), 0.0)
    dict_colB = dict(Counter(df.colB)) if "colB" in df else dict.fromkeys(vb.keys(), 0.0)
    dict_colC = dict(Counter(df.colC)) if "colC" in df else dict.fromkeys(vb.keys(), 0.0)
    # Counter.update adds values key-wise, so the reduce accumulates the three dicts
    res = dict(functools.reduce(lambda a, b: a.update(b) or a, [dict_colA, dict_colB, dict_colC], Counter()))
    return res
init_bow = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}  # to deal with columns with NaN cells
%timeit -r 10 -n 10000 user_token_df["col_result"] = user_token_df.apply(lambda row: sum_dict_values(df=row, vb=init_bow), axis=1)
# 2.12 ms ± 145 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)
Is there a better, more efficient approach for dealing with larger dictionaries (len() > 100K) and a bigger DataFrame (more than 20K rows)?
Cheers,

Try with json_normalize:
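A minimal sketch of that idea (the helper name accumulate_dict_columns and the exact steps are my own, not necessarily the answer's code; it assumes every dict shares the same keys, as in the sample): expand each dict column into numeric columns with json_normalize, treat NaN cells as all-zero rows, sum the expanded frames, and pack the result back into one dict per row.

import pandas as pd

def accumulate_dict_columns(df: pd.DataFrame, dict_cols) -> pd.Series:
    # Hypothetical helper: expand each dict column into a numeric frame
    # with one column per key, then sum the frames.
    parts = []
    for col in dict_cols:
        cells = df[col].apply(lambda x: x if isinstance(x, dict) else {})  # NaN -> empty dict
        parts.append(pd.json_normalize(cells.tolist()).fillna(0))          # missing keys -> 0
    # Element-wise sum across the expanded frames (aligned by key/column label),
    # then pack each summed row back into a dict.
    summed = sum(parts)
    return pd.Series(summed.to_dict(orient="records"), index=df.index)

user_token_df["col_result"] = accumulate_dict_columns(user_token_df, ["colA", "colB", "colC"])

Because the key-wise sums run on whole numeric columns rather than row by row, the Python-level loop over rows disappears; for very wide dictionaries (100K+ keys) the cost of building one column per key in json_normalize is still worth benchmarking against the apply version.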