So I have this nested multiple dictionaries in a jsonl file column as below:
`df['referenced_tweets'][0]`
producing (shortened output)
'id': '1392893055112400898',
'public_metrics': {'retweet_count': 0,
'reply_count': 1,
'like_count': 2,
'quote_count': 0},
'conversation_id': '1392893055112400898',
'created_at': '2021-05-13T17:22:37.000Z',
'reply_settings': 'everyone',
'entities': {'annotations': [{'start': 65,
'end': 77,
'probability': 0.9719000000000001,
'type': 'Person',
'normalized_text': 'Jill McMillan'}],
'mentions': [{'start': 23,
'end': 36,
'username': 'usasklibrary',
'protected': False,
'description': 'The official account of the University Library at USask.',
'created_at': '2019-06-04T17:19:12.000Z',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'url': '*removed*',
'expanded_url': 'http://library.usask.ca',
'display_url': 'library.usask.ca'}]}},
'name': 'University Library',
'url': '....',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1278828446026629120/G1w7t-HK_normal.jpg',
'verified': False,
'id': '1135959197902921728',
'public_metrics': {'followers_count': 365,
'following_count': 119,
'tweet_count': 556,
'listed_count': 9}}]},
'text': 'Wonderful session with @usasklibrary Graduate Writing Specialist Jill McMillan who is walking SURE students through the process of organizing/analyzing a literature review! So grateful to the library -- our largest SURE: Student Undergraduate Research Experience partner!',
...
My intention is to create a function that would auto extract specific columns (e.g. text,type) in the entire dataframe (not just a row). So I wrote the function:
### x = df['referenced_tweets']
def extract_TextType(x):
dic = {}
for i in x:
if i != " ":
new_df= pd.DataFrame.from_dict(i)
dic['refd_text']=new_df['text']
dic['refd_type'] = new_df['type']
else:
print('none')
return dic
However running the function:
df['referenced_tweets'].apply(extract_TextType)
produces an error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
The whole point is to extract these two nested columns (texts & type) from the original "referenced tweets" column and match them to the original rows.
What am I doing wrong pls?
A couple things to consider here.
referenced_tweets
holds a list so this linenew_df= pd.DataFrame.from_dict(i)
is most likely not parsing correctly the way you are entering it.Also, because it's possible there are multiple tweets in that list you are correctly iterating over it but you don't need put it into a df to do so. This will also create a new dictionary in each cell as you are using a
.apply()
. If that's what you want that is ok. If you really just want a new dataframe, you can adapt the following. I don't have access toreferenced_tweets
so I'm usingentities
as an example. Here's my example:Because you have not provided a good working json or dataframe, I can't test this, but your solution could look like this