I am trying to parse a log file .which contains the structure like given below i want to do it with python and want to store extracted data in database how can i do this ?
i am able to parse simple key value pair but facing some problem.
1: How can i parse nested structure for example context field in the sample file is nested in main group?
2: How to tackle with condition if separator comes as a string . like for key:value pair separator is colon (:) and in the "site" key there is a key:value pair site_url:http://something.com here url also contains colon (:) which gives the wrong answer.
{
"username": "lavania",
"host": "10.105.22.32",
"event_source": "server",
"event_type": "/courses/XYZ/CS101/2014_T1/xblock
/i4x:;_;_XYZ;_CS101;_video;_d333fa637a074b41996dc2fd5e675818/handler/xmodule_handler/save_user_state",
"context": {
"course_id": "XYZ/CS101/2014_T1",
"course_user_tags": {},
"user_id": 42,
"org_id": "XYZ"
},
"time": "2014-06-20T05:49:10.468638+00:00",
"site":"http://something.com",
"ip": "127.0.0.1",
"event": "{\"POST\": {\"saved_video_position\": [\"00:02:10\"]}, \"GET\": {}}",
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0",
"page": null
}
{
"username": "rihana",
"host": "10.105.22.32",
"event_source": "server",
"event_type": "problem_check",
"context": {
"course_id": "XYZ/CS101/2014_T1",
"course_user_tags": {},
"user_id": 40,
"org_id": "XYZ",
"module": {
"display_name": ""
}
},
"time": "2014-06-20T06:43:52.716455+00:00",
"ip": "127.0.0.1",
"event": {
"submission": {
"i4x-XYZ-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
"input_type": "choicegroup",
"question": "",
"response_type": "multiplechoiceresponse",
"answer": "MenuInflater.inflate()",
"variant": "",
"correct": true
}
},
"success": "correct",
"grade": 1,
"correct_map": {
"i4x-XYZ-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {
"hint": "",
"hintmode": null,
"correctness": "correct",
"npoints": null,
"msg": "",
"queuestate": null
}
},
"state": {
"student_answers": {},
"seed": 1,
"done": null,
"correct_map": {},
"input_state": {
"i4x-XYZ-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": {}
}
},
"answers": {
"i4x-XYZ-CS101-problem-33e4aac93dc84f368c93b1d08fa984fc_2_1": "choice_0"
},
"attempts": 1,
"max_grade": 1,
"problem_id": "i4x://XYZ/CS101/problem/33e4aac93dc84f368c93b1d08fa984fc"
},
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
"page": "x_module"
}
{
"username": "troysa",
"host": "localhost",
"event_source": "server",
"event_type": "/courses/XYZ/CS101/2014_T1/instructor_dashboard/api/list_instructor_tasks",
"context": {
"course_id": "XYZ/CS101/2014_T1",
"course_user_tags": {},
"user_id": 6,
"org_id": "XYZ"
},
"time": "2014-06-20T05:49:26.780244+00:00",
"ip": "127.0.0.1",
"event": "{\"POST\": {}, \"GET\": {}}",
"agent": "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:29.0) Gecko/20100101 Firefox/29.0",
"page": null
}
As has been pointed out this is a JSON data structure. I wrote some quick code that will read your log file line by line and attempt to find complete multi-line json objects. Once all the lines are read it is finished. I use pprint on the objects so that the output is human readable to ensure the dict that is returned looks correct.
This code is a bit of a kludge. It reads a line and sees if we have a complete parseable object. If not it reads the next line and concatenates it with the last. This continues until a parseable object can be loaded. It then does this over and over until all the lines are consumed and all objects have been found.
At this point in the code you have read a complete JSON object into
json_data
:I pprint the dict out but it is a standard python dictionary that can be processed for data as using normal dict traversal. For example you could retrieve the
course_id
with something like:or the
host
via: