I have been struggling with this logic. I am getting data from google analytics in this format, per user per site. (So these are all the activities that one user performed on the site) I cannot change the format in which I receive the data.
PROBLEM: I am running a loop through all of the users and get this output for each user. I want to put this data into a data frame in order to use it later. The problem I have, is the 'activities':[{.....},{......}] part, I cannot figure out to store all of this data in a way that makes sense.
{'sampleRate': 1,
'sessions': [{'activities': [{'activityTime': '2020-01-08T16:00:44.399101Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/thepath',
'pageTitle': 'thecurrentwebpage'},
'source': '(direct)'},
{'activityTime': '2020-01-08T15:58:43.077293Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/theotherpath',
'pageTitle': 'thecurrentwebpage'},
'source': '(direct)'}],
'dataSource': 'web',
'deviceCategory': 'desktop',
'platform': 'Windows',
'sessionDate': '2020-01-08',
'sessionId': '1578491x03d'},
{'activities': [{'activityTime': '2019-12-28T21:58:48.993944Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',.....
EXPECTED OUTPUT:
For each user's data to be stored in tables organised as such:
If there is some logic error in the picture, I am happy to change what I have. I just need the data to work.
PS: I need to use SQL and ERD's in LucidChart, I have never before had to manipulate data in this format. Any help, to get the data -that is structured like the example above- into a dataframe(s).
EDITED:
Example of two different types of activities (the activity is always classified as either 'pageview' or 'event'):
{'activityTime':
# Pageview activity
'2020-01-08T15:48:38.012671Z',
'activityType': 'PAGEVIEW',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'pageview': {'pagePath': '/login',
'pageTitle': 'titleofthepage'},
'source': '(direct)'},
# Event activity
{'activityTime': '2020-01-08T15:48:37.915105Z',
'activityType': 'EVENT',
'campaign': '(not set)',
'channelGrouping': 'Direct',
'customDimension': [{'index': 1}],
'event': {'eventAction': 'Successfully Logged '
'In',
'eventCategory': 'Auth',
'eventCount': '1',
'eventLabel': '(not set)'},
'hostname': 'company.domain.com',
'keyword': '(not set)',
'landingPagePath': '/login',
'medium': '(none)',
'source': '(direct)'}]
For example you can do it like this:
Output each DF:
Output example join
Schema
It is the same as you specified above with 2 changes:
The Table session doesn't have a column activities anymore.
The Table Activity has an additional column sessionId.
SQL
How to execute SQL in pandas Dataframe you can look up online probably to much to cover here. See here for example: Executing an SQL query over a pandas dataset
How to get the data
Some examples: (but in the end you have to figure it out yourself if you want something specific, I don't make an SQL course here)
df_session
df_event
anddf_pageview
df_session
withdf_activities
then join withdf_event
anddf_pageview
I don't want a Dataframe... I need MYSQL Database (or something else)
Nothing easier than that. The dataframe are in a "properly" database format.
Example for Session:
the
save_to_sql
is your own implementation depending on what database you are using. And it would not fit this question to explain that to you.Comments
DB_id
don't know the origin of this value. I set it to 0.