Auto-extracting columns from nested dictionaries in pandas

506 Views Asked by At

So I have this nested multiple dictionaries in a jsonl file column as below:

    `df['referenced_tweets'][0]` 

producing (shortened output)

  'id': '1392893055112400898',
  'public_metrics': {'retweet_count': 0,
   'reply_count': 1,
   'like_count': 2,
   'quote_count': 0},
  'conversation_id': '1392893055112400898',
  'created_at': '2021-05-13T17:22:37.000Z',
  'reply_settings': 'everyone',
  'entities': {'annotations': [{'start': 65,
     'end': 77,
     'probability': 0.9719000000000001,
     'type': 'Person',
     'normalized_text': 'Jill McMillan'}],
   'mentions': [{'start': 23,
     'end': 36,
     'username': 'usasklibrary',
     'protected': False,
     'description': 'The official account of the University Library at USask.',
     'created_at': '2019-06-04T17:19:12.000Z',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'url': '*removed*',
         'expanded_url': 'http://library.usask.ca',
         'display_url': 'library.usask.ca'}]}},
     'name': 'University Library',
     'url': '....',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
     'verified': False,
     'id': '1135959197902921728',
     'public_metrics': {'followers_count': 365,
      'following_count': 119,
      'tweet_count': 556,
      'listed_count': 9}}]},
  'text': 'Wonderful session with @usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!', 
...

My intention is to create a function that would auto extract specific columns (e.g. text,type) in the entire dataframe (not just a row). So I wrote the function:

### x = df['referenced_tweets']

def extract_TextType(x):
    dic = {}
    for i in x:
        if i != " ":
            new_df= pd.DataFrame.from_dict(i)
            dic['refd_text']=new_df['text']
            dic['refd_type'] = new_df['type']
        else:
            print('none')
    return dic     

However running the function:

df['referenced_tweets'].apply(extract_TextType)

produces an error:

ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.

The whole point is to extract these two nested columns (texts & type) from the original "referenced tweets" column and match them to the original rows.

What am I doing wrong pls?

P.S. The original df is shotgrabbed below: original DF containing the target column

1

There are 1 best solutions below

0
On

A couple things to consider here. referenced_tweets holds a list so this line new_df= pd.DataFrame.from_dict(i) is most likely not parsing correctly the way you are entering it.

Also, because it's possible there are multiple tweets in that list you are correctly iterating over it but you don't need put it into a df to do so. This will also create a new dictionary in each cell as you are using a .apply(). If that's what you want that is ok. If you really just want a new dataframe, you can adapt the following. I don't have access to referenced_tweets so I'm using entities as an example. Here's my example:

ents = df[df.entities.notnull()]['entities']

dict_hold_list = []
for ent in ents:
    # print(ent['hashtags'])
    for htag in ent['hashtags']:
        # print(htag['text'])
        # print(htag['indices'])
        dict_hold_list.append({'text': htag['text'], 'indices': htag['indices']})
df_hashtags = pd.DataFrame(dict_hold_list)

Because you have not provided a good working json or dataframe, I can't test this, but your solution could look like this

refs = df[df.referenced_tweets.notnull()]['referenced_tweets']

dict_hold_list = []
for ref in refs:
    # print(ref)
    for r in ref:
        # print(r['text'])
        # print(r['type'])
        dict_hold_list.append({'text': r['text'], 'type': r['type']})
df_ref_tweets = pd.DataFrame(dict_hold_list)