Pandas convert list of dictionaries (GA output) into dataframe(s) that makes sense

208 Views Asked by At

I have been struggling with this logic. I am getting data from google analytics in this format, per user per site. (So these are all the activities that one user performed on the site) I cannot change the format in which I receive the data.

PROBLEM: I am running a loop through all of the users and get this output for each user. I want to put this data into a data frame in order to use it later. The problem I have, is the 'activities':[{.....},{......}] part, I cannot figure out to store all of this data in a way that makes sense.

{'sampleRate': 1,
 'sessions': [{'activities': [{'activityTime': '2020-01-08T16:00:44.399101Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/thepath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'},
                              {'activityTime': '2020-01-08T15:58:43.077293Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/theotherpath',
                                            'pageTitle': 'thecurrentwebpage'},
                               'source': '(direct)'}],
               'dataSource': 'web',
               'deviceCategory': 'desktop',
               'platform': 'Windows',
               'sessionDate': '2020-01-08',
               'sessionId': '1578491x03d'},
              {'activities': [{'activityTime': '2019-12-28T21:58:48.993944Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',.....

EXPECTED OUTPUT:

For each user's data to be stored in tables organised as such: Lucid chart brainstorming of ERD layouts

If there is some logic error in the picture, I am happy to change what I have. I just need the data to work.

PS: I need to use SQL and ERD's in LucidChart, I have never before had to manipulate data in this format. Any help, to get the data -that is structured like the example above- into a dataframe(s).

EDITED:

Example of two different types of activities (the activity is always classified as either 'pageview' or 'event'):

{'activityTime':
                               # Pageview activity
                               '2020-01-08T15:48:38.012671Z',
                               'activityType': 'PAGEVIEW',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'pageview': {'pagePath': '/login',
                                            'pageTitle': 'titleofthepage'},
                               'source': '(direct)'},

                              # Event activity
                              {'activityTime': '2020-01-08T15:48:37.915105Z',
                               'activityType': 'EVENT',
                               'campaign': '(not set)',
                               'channelGrouping': 'Direct',
                               'customDimension': [{'index': 1}],
                               'event': {'eventAction': 'Successfully Logged '
                                                        'In',
                                         'eventCategory': 'Auth',
                                         'eventCount': '1',
                                         'eventLabel': '(not set)'},
                               'hostname': 'company.domain.com',
                               'keyword': '(not set)',
                               'landingPagePath': '/login',
                               'medium': '(none)',
                               'source': '(direct)'}]
1

There are 1 best solutions below

1
On BEST ANSWER

For example you can do it like this:

import pandas as pd
import json
str = """{"sampleRate": 1,
 "sessions": [{"activities": [{"activityTime": "2020-01-08T16:00:44.399101Z",
                               "activityType": "PAGEVIEW",
                               "campaign": "(not set)",
                               "channelGrouping": "Direct",
                               "customDimension": [{"index": 1}],
                               "hostname": "company.domain.com",
                               "keyword": "(not set)",
                               "landingPagePath": "/login",
                               "medium": "(none)",
                               "pageview": {"pagePath": "/thepath",
                                            "pageTitle": "thecurrentwebpage"},
                               "source": "(direct)"},
                              {"activityTime": "2020-01-08T15:48:37.915105Z",
                               "activityType": "EVENT",
                               "campaign": "(not set)",
                               "channelGrouping": "Direct",
                               "customDimension": [{"index": 1}],
                               "event": {"eventAction": "Successfully Logged In",
                                         "eventCategory": "Auth",
                                         "eventCount": "1",
                                         "eventLabel": "(not set)"},
                               "hostname": "company.domain.com",
                               "keyword": "(not set)",
                               "landingPagePath": "/login",
                               "medium": "(none)",
                               "source": "(direct)"}],
               "dataSource": "web",
               "deviceCategory": "desktop",
               "platform": "Windows",
               "sessionDate": "2020-01-08",
               "sessionId": "1578491x03d"}]}"""


data = json.loads(str)

session_keys = "sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id".split(",")
event_keys = "activityTime,eventCategory,eventCount,eventLabel,eventAction".split(",")
pageview_keys = "activityTime,pageTitle,pagePath".split(",")

sessions = {k:[] for k in session_keys}
events = {k:[] for k in event_keys}
pageviews = {k:[] for k in pageview_keys}
activities = {"sessionId":[],"activityTime":[]}

for session in data["sessions"]:
    for easy_key in session_keys[:5]:
        sessions[easy_key] += [session[easy_key]]
    for activity in session["activities"]:
        activity_time = activity["activityTime"]
        activities["sessionId"] += [session["sessionId"]]
        activities["activityTime"] += [activity_time]
        if activity["activityType"] == "PAGEVIEW":
            pageviews["activityTime"] += [activity_time]
            pageviews["pageTitle"] += [activity["pageview"]["pageTitle"]]
            pageviews["pagePath"] += [activity["pageview"]["pagePath"]]
        elif activity["activityType"] == "EVENT":
            events["activityTime"] += [activity_time]
            events["eventAction"] += [activity["event"]["eventAction"]]
            events["eventCategory"] += [activity["event"]["eventCategory"]]
            events["eventCount"] += [activity["event"]["eventCount"]]
            events["eventLabel"] += [activity["event"]["eventLabel"]]
        else:
            print("Unknown Activity: {}".format(activity["activityType"]))

    sessions["DB_id"] += [0]

df_session = pd.DataFrame.from_dict(sessions)
df_session.set_index('sessionId', inplace=True)
df_event = pd.DataFrame.from_dict(events)
df_event.set_index('activityTime', inplace=True)
df_pageview = pd.DataFrame.from_dict(pageviews)
df_pageview.set_index('activityTime', inplace=True)
df_activities = pd.DataFrame.from_dict(activities)

Output each DF:

#df_session:

            dataSource deviceCategory platform sessionDate  DB_id
sessionId                                                        
1578491x03d        web        desktop  Windows  2020-01-08      0



#df_activities:
     sessionId                 activityTime
0  1578491x03d  2020-01-08T16:00:44.399101Z
1  1578491x03d  2020-01-08T15:48:37.915105Z



#df_event:
                            eventCategory eventCount eventLabel             eventAction
activityTime                                                                           
2020-01-08T15:48:37.915105Z          Auth          1  (not set)  Successfully Logged In



#df_pageview:
                                     pageTitle  pagePath
activityTime                                            
2020-01-08T16:00:44.399101Z  thecurrentwebpage  /thepath

Output example join

#As example for a join, I only want the event data
df_sa = df_activities.join(df_session, on="sessionId").join(df_event,on="activityTime",how="right") 

print(df_sa)
     sessionId                 activityTime dataSource deviceCategory platform sessionDate  DB_id eventCategory eventCount eventLabel             eventAction
1  1578491x03d  2020-01-08T15:48:37.915105Z        web        desktop  Windows  2020-01-08      0          Auth          1  (not set)  Successfully Logged In

Schema

It is the same as you specified above with 2 changes:

  1. The Table session doesn't have a column activities anymore.

  2. The Table Activity has an additional column sessionId.

SQL

How to execute SQL in pandas Dataframe you can look up online probably to much to cover here. See here for example: Executing an SQL query over a pandas dataset

How to get the data

Some examples: (but in the end you have to figure it out yourself if you want something specific, I don't make an SQL course here)

  • If you only want session data: Query df_session
  • If you want all activities: Query df_event and df_pageview
  • If you want all activities and combined with sessions: join df_session with df_activities then join with df_event and df_pageview

I don't want a Dataframe... I need MYSQL Database (or something else)

Nothing easier than that. The dataframe are in a "properly" database format.

Example for Session:

for index, row in df_sessions.iterrows():
    # for event and pageview the index would be activityTime
    # the df activities don't have a specific index
    sessionId = index 
    dataSource = row['dataSource']
    deviceCategory = row['deviceCategory']
    platform = row['platform']
    sessionDate = row['sessionDate']
    DB_id = row['DB_id']
    # function to save a row in a SQL DB basically:
    # INSERT INTO session (sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id) VALUES(x,x,x,x,x,x)
    save_to_sql(sessionId,dataSource,deviceCategory,platform,sessionDate,DB_id)

the save_to_sql is your own implementation depending on what database you are using. And it would not fit this question to explain that to you.

Comments

  1. DB_id don't know the origin of this value. I set it to 0.