Why do consecutive event JSONs fall on the same line in some archive files in githubarchive?


In the GitHub Archive (http://www.githubarchive.org/) that Ilya Grigorik provides, I found that in many gz files some consecutive events are logged on the same line.

One example is 2011-03-15-21.json.gz.

To get that file, run: wget http://data.githubarchive.org/2011-03-15-21.json.gz

In this file, if you search for id 1484832, you will find two consecutive events (JSON objects) on the same line; see http://codebeautify.org/jsonviewer/2cb891

That line is the concatenation of two JSON objects: http://codebeautify.org/jsonviewer/c7e18e and http://codebeautify.org/jsonviewer/945d56

What is the impact? I was reading the file line by line and passing each line to Python's json.loads (Python because I find it comfortable for working with JSON), and it rejected this line as invalid because the line holds two JSON objects back to back.

Question :

1) How did you solve this kind of bug when you processed the GitHub Archive data?

2) I already have the data locally, so how can I work around this problem? Should I write code specific to this case? The code I wrote was like:

jsonlist = line.split('}{')
json.loads(jsonlist[0] + '}', "ISO-8859-1") # load and navigate through this json 
json.loads('{' + jsonlist[1], "ISO-8859-1") # load and navigate through this json

There are 1 best solutions below


I found the solution here:

1) How did you solve this kind of bug when you processed the GitHub Archive data?

https://github.com/vadasg/githubarchive-parser/blob/master/src/FixGitHubArchiveDelimiters.rb

This script fixes the problem of two or more events appearing on the same line. After running it, each JSON object is on its own line.

2) I already have the data locally, so how can I work around this problem? Should I write code specific to this case?

The same script makes the code I mentioned above unnecessary.
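For anyone who would rather stay in Python, the same cleanup can be sketched with the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and returns the index where it stopped. This is not what the Ruby script does internally, only an alternative under the assumption that each line holds zero or more complete JSON objects back to back. The helper name split_events is made up for illustration.

```python
import json

def split_events(line):
    """Return every JSON value found back to back on one line.

    Hypothetical helper: the name and approach are illustrative only.
    """
    decoder = json.JSONDecoder()
    events = []
    pos = 0
    end = len(line)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and line[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse one value starting at pos; the second item of the
        # returned tuple is the index just past that value.
        event, pos = decoder.raw_decode(line, pos)
        events.append(event)
    return events

# The first object contains "}{" inside a string, which would confuse
# a plain line.split('}{').
line = '{"id": 1, "msg": "a}{b"}{"id": 2}\n'
for event in split_events(line):
    print(event["id"])
```

Unlike splitting on '}{', this handles any number of objects per line and is not confused when that character sequence appears inside a string value such as a commit message.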

Note :

Related issues in the GitHub Archive project on GitHub:

  1. https://github.com/igrigorik/githubarchive.org/issues/53

  2. https://github.com/igrigorik/githubarchive.org/issues/17

WARNING :

When I ran this script I got an encoding error. By default, the Yajl::Parser.parse(jsonInputFile) call checks that the characters it parses are valid UTF-8 and raises an error if they are not. Since the GitHub data also contains non-UTF-8 characters, the error is raised in our case too. To bypass the problem (or maybe fix it), I changed the call to

Yajl::Parser.parse(jsonInputFile, :check_utf8 => false)

If in doubt, refer to the docs: http://rdoc.info/github/brianmario/yajl-ruby/Yajl/Parser.parse
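The same encoding issue shows up when reading the raw bytes in Python. A minimal sketch, assuming the stray bytes are ISO-8859-1 (the encoding the question's code passed to json.loads), is to try UTF-8 first and fall back only when decoding fails. The helper name decode_line is hypothetical.

```python
def decode_line(raw):
    """Decode bytes as UTF-8, falling back to ISO-8859-1.

    Hypothetical helper. ISO-8859-1 maps every byte to a character,
    so the fallback itself never raises; whether it yields the
    intended text depends on the assumption above.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

print(decode_line(b"caf\xc3\xa9"))  # valid UTF-8 input
print(decode_line(b"caf\xe9"))      # not valid UTF-8, decoded via the fallback
```

The decoded string can then be passed to json.loads as usual.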