Pandas DataFrame with dictionary cells | accumulate values across columns | inefficient apply method

Given:

Sample Pandas Dataframe with dictionaries in each cell:

import numpy as np
import pandas as pd

user_token_df = pd.DataFrame(
    {
      "usr": ["u1", "u2", "u3", "u4", "u7", "u9", "u5", "u8"],
      "colA":   [
            {'a': 7, 'b': 0,'c': 1, 'd':0.1, 'e': 0.8},
            {'a': 0, 'b': 1,'c': 0.1, 'd':0, 'e': 7},
            {'a': 5, 'b': 2,'c': 0, 'd':0.1, 'e': 0.3},
            np.nan,
            {'a': 1.2, 'b': 2.6,'c': 2.2, 'd':0.3, 'e': 5.3},
            {'a': 1, 'b': 0,'c': 0, 'd':0.5, 'e': 2.1},
            {'a': .1, 'b': 0.7,'c': 3, 'd':1.9, 'e': 0.4},
            np.nan,
          ],
      "colB": [
            {'a': 1, 'b': 2,'c': .8, 'd':0, 'e': 0.6},
            np.nan,
            {'a': 0.6, 'b': 5.2,'c': 0.1, 'd':0, 'e': 2.7},
            {'a': 0, 'b': 2,'c': 3.0, 'd':0.1, 'e': 6.3},
            {'a': 1.2, 'b': 2.6,'c': 2.2, 'd':0.3, 'e': 5.3},
            np.nan,
            {'a': 1, 'b': 0.3,'c': 0, 'd':0.5, 'e': 2.1},
            {'a': 0, 'b': 0.5,'c': 0.6, 'd':0.9, 'e': 0},
          ],
      "colC": [
            np.nan,
            {'a': 1.2, 'b': 12,'c': 1.2, 'd':0.6, 'e': 0},
            {'a': 0.6, 'b': 5.2,'c': 0.1, 'd':0, 'e': 2.7},
            {'a': 0.3, 'b': 2,'c': 3.0, 'd':0.1, 'e': 0.3},
            {'a': 0, 'b': .6,'c': .2, 'd':.3, 'e': 5},
            np.nan,
            {'a': 1, 'b': 0.3,'c': 0, 'd':0.5, 'e': 2.1},
            {'a': 4.3, 'b': 0.5,'c': 0.6, 'd':0.9, 'e': 0},
          ],
    }
)

Goal:

I'd like to create a new column col_result containing dictionaries whose values are the key-wise sums of the dictionaries in the other columns.
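For example, for the first row (usr == "u1"), colC is NaN, so col_result should hold the key-wise sum of the colA and colB dictionaries alone: {'a': 8, 'b': 2, 'c': 1.8, 'd': 0.1, 'e': 1.4}.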

Working but Inefficient Solution:

Right now, I have a working solution for the small sample DataFrame using the apply method and Counter from the collections module:

from typing import Dict
from collections import Counter
import functools

import pandas as pd


def sum_dict_values(row: pd.Series, vb: Dict[str, float]) -> Dict[str, float]:
    # `row` is a single DataFrame row (a Series of dicts / NaN) passed in by apply(..., axis=1).
    row = row.dropna()  # drop the columns whose cell is NaN in this row
    # Fall back to an all-zero dict (keys taken from `vb`) when a column was dropped.
    dict_colA = dict(row.colA) if "colA" in row else dict.fromkeys(vb, 0.0)
    dict_colB = dict(row.colB) if "colB" in row else dict.fromkeys(vb, 0.0)
    dict_colC = dict(row.colC) if "colC" in row else dict.fromkeys(vb, 0.0)
    # Counter.update adds counts, so the reduce sums the three dicts key by key.
    return dict(functools.reduce(lambda a, b: a.update(b) or a,
                                 [dict_colA, dict_colB, dict_colC], Counter()))


init_bow = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}  # only the keys matter: they seed the all-zero dict used when a cell is NaN

%timeit -r 10 -n 10000 user_token_df["col_result"] = user_token_df.apply(lambda row: sum_dict_values(row=row, vb=init_bow), axis=1)
# 2.12 ms ± 145 µs per loop (mean ± std. dev. of 10 runs, 10000 loops each)

Is there a better, more efficient approach for a much larger dictionary (len() > 100K keys) and DataFrame (more than 20K rows)?

Cheers,


1 Answer

Answer from not_speshal:

Try with json_normalize:

user_token_df["col_result"] = (pd.json_normalize(user_token_df["colA"]).fillna(0)
                               .add(pd.json_normalize(user_token_df["colB"]).fillna(0))
                               .add(pd.json_normalize(user_token_df["colC"]).fillna(0))
                               .to_dict("records")
                               )
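
The same pattern can be extended to any number of dictionary-valued columns. This is only a sketch (not part of the original answer), assuming the column names from the sample above and reusing the answer's json_normalize call:

import functools
import pandas as pd

cols = ["colA", "colB", "colC"]  # assumed: the dictionary-valued columns to accumulate
parts = (pd.json_normalize(user_token_df[c]).fillna(0) for c in cols)
summed = functools.reduce(lambda acc, part: acc.add(part, fill_value=0), parts)
user_token_df["col_result"] = summed.to_dict("records")

Note that for very large vocabularies (len() > 100K keys) the intermediate wide DataFrames may dominate memory, so this trade-off is worth measuring on the real data.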