Convert dataframe of dictionary entries to dataframe of all entries based on existing dictionary


I have a pandas dataframe in which each row holds an id and a string of counts for encoded words, written as comma-separated code:count pairs. For instance:

Original = pd.DataFrame(data=[[1,'1:2,2:3,3:1'],[2,'2:2,4:3']], columns=['id','words'])

I have a dictionary that has the mapping to the actual words, for instance:

WordDict = {1:'A',2:'B',3:'C',4:'D'}

What I would like to do is create a new dataframe that maps the counts to columns for all possible words, so it would look something like:

Final = pd.DataFrame(data=[[1,2,3,1,0],[2,0,2,0,3]], columns=['id','A','B','C','D']).set_index('id')

I know I can split the 'words' column of the original into separate columns, and can create a dataframe from WordDict so that it has all possible columns, but could not figure out how to create the mapping.

mozway On BEST ANSWER

You can use a regex, a list comprehension, and the DataFrame constructor:

import re

Final = pd.DataFrame([{WordDict.get(int(k), None): v
                      for k,v in re.findall('([^:,]+):([^:,]+)', s)}
                      for s in Original['words']], index=Original['id']
                     ).fillna(0).astype(int)

Or with split:

Final = pd.DataFrame([{WordDict.get(int(k), None): v
                       for x in s.split(',')
                       for k,v in [x.split(':')]}
                      for s in Original['words']], index=Original['id']
                     ).fillna(0).astype(int)

Or ast.literal_eval:

from ast import literal_eval

Final = pd.DataFrame([{WordDict.get(k, None): v for k,v in literal_eval(f'{{{s}}}').items()}
                      for s in Original['words']], index=Original['id']
                     ).fillna(0).astype(int)

Output:

    A  B  C  D
id            
1   2  3  1  0
2   0  2  0  3
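
One caveat: the constructions above only create columns for codes that actually occur in Original['words'], so a word in WordDict that never appears would get no column. A minimal sketch of one way to guarantee a column for every dictionary entry, assuming you want them in dictionary order, is to reindex the columns afterwards. The extra word 'E' below is hypothetical and only there to show the effect:

```python
import pandas as pd

# Sample data from the question, plus a hypothetical extra word 'E'
# (code 5) that never occurs in the 'words' column.
Original = pd.DataFrame(data=[[1, '1:2,2:3,3:1'], [2, '2:2,4:3']],
                        columns=['id', 'words'])
WordDict = {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}

# Split-based construction, with the counts converted to int up front.
Final = pd.DataFrame([{WordDict[int(k)]: int(v)
                       for k, v in (x.split(':') for x in s.split(','))}
                      for s in Original['words']],
                     index=Original['id'])

# Force one column per dictionary entry; words that never occur become 0.
Final = (Final.reindex(columns=list(WordDict.values()))
              .fillna(0).astype(int))
print(Final)
```

With this data, 'E' shows up as an all-zero column instead of being silently dropped.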

Bimesh Perera On

You can do it this way:

import pandas as pd

Original = pd.DataFrame(data=[[1, '1:2,2:3,3:1'], [2, '2:2,4:3']], columns=['id', 'words'])

WordDict = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}

final_dict = {}

for k, v in WordDict.items():
    final_dict[v] = []
final_dict['id'] = []

for _, row in Original.iterrows():
    id_ = row['id']
    words = row['words']
    
    inner_dict = {v: 0 for k, v in WordDict.items()}
    inner_dict['id'] = id_
    
    for word_count in words.split(','):
        word, count = map(int, word_count.split(':'))
        inner_dict[WordDict[word]] = count

    for k, v in inner_dict.items():
        final_dict[k].append(v)

Final = pd.DataFrame(final_dict).set_index('id')

print(Final)
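
For comparison, the same table can probably be built without an explicit Python loop. This is only a sketch of an alternative, not the approach used in the answer above; the intermediate names (pairs, pair, code, count, word) are made up for illustration. It splits each string into one row per code:count pair, maps the codes to words, and pivots back to one row per id:

```python
import pandas as pd

Original = pd.DataFrame(data=[[1, '1:2,2:3,3:1'], [2, '2:2,4:3']],
                        columns=['id', 'words'])
WordDict = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}

# One row per 'code:count' pair; ignore_index gives a clean 0..n-1 index.
pairs = (Original.assign(pair=Original['words'].str.split(','))
                 .explode('pair', ignore_index=True))

# Split each pair into an integer code and an integer count.
pairs[['code', 'count']] = pairs['pair'].str.split(':', expand=True).astype(int)

# Translate codes to words using the dictionary.
pairs['word'] = pairs['code'].map(WordDict)

# Back to one row per id, one column per word; missing combinations become 0.
Final = (pairs.pivot(index='id', columns='word', values='count')
              .reindex(columns=list(WordDict.values()))
              .fillna(0).astype(int))
print(Final)
```

Note that pivot raises an error if the same code appears twice for one id; that case does not occur in the sample data.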