In http://www.githubarchive.org/ that Ilya Grigorik has provided ,I found that in many gz files , some consecutive events are logged to same file .
for example in 2011-03-15-21.json.gz
To get the above do : wget http://data.githubarchive.org/2011-03-15-21.json.gz
In this gz for example if you search for id 1484832 , you can find that the 2 consecutive events(jsons) are in same line see http://codebeautify.org/jsonviewer/2cb891
the two jsons in same line is a combination of
http://codebeautify.org/jsonviewer/c7e18e
and
http://codebeautify.org/jsonviewer/945d56
.
What is the impact ? when I was loading each line and loading it with python's(why python ? because I felt python is comfortable in dealing with jsons) json.loads it said it was invalid as it was a combination of two jsons .
Question :
1) How did you solve these kind of bugs when you processed that github archive data ?
2) I already have the data in my local . so how can I overcome this problem . Shall I write code specific to this case to overcome ? the code i wrote was like
jsonlist = line.split('}{')
json.loads(jsonlist[0] + '}', "ISO-8859-1") # load and navigate through this json
json.loads('{' + jsonlist[1], "ISO-8859-1") # load and navigate through this json
I got the solution here
1) How did you solve these kind of bugs when you processed that github archive data ? https://github.com/vadasg/githubarchive-parser/blob/master/src/FixGitHubArchiveDelimiters.rb
. This script removes the problems of two or more events appearing on the same line . so now after running this script the jsons appear in different lines .
2) I already have the data in my local . so how can I overcome this problem . Shall I write code specific to this case to overcome ? the code i wrote was like This script removes the necessity to write the code I mentioned above .
Note :
Related issues found on the github archive project in github
https://github.com/igrigorik/githubarchive.org/issues/53
https://github.com/igrigorik/githubarchive.org/issues/17
WARNING :
When I was running this script I got an error related to the encoding used . Because by default the
Yajl::Parser.parse(jsonInputFile)
line checks if characters it parses adheres to UTF-8 encoding ,if not it will throw errors . As github data also contains non UTF-8 characters , this error will be thrown in our case too. So to bypass that problem(or may be a fix) I put it asfor doubts refer docs: http://rdoc.info/github/brianmario/yajl-ruby/Yajl/Parser.parse