Why do consecutive event JSONs fall on the same line in some packages in githubarchive?

Question


67 Views Asked by Harish Kayarohanam At 29 July 2025 at 12:10

In the data at http://www.githubarchive.org/ that Ilya Grigorik has provided, I found that in many gz files some consecutive events are logged on the same line.

One example is 2011-03-15-21.json.gz.

To get that file, run: wget http://data.githubarchive.org/2011-03-15-21.json.gz

In this gz file, if you search for id 1484832, you will find two consecutive events (JSON objects) on the same line; see http://codebeautify.org/jsonviewer/2cb891

That line is the concatenation of these two JSON objects:

http://codebeautify.org/jsonviewer/c7e18e

and

http://codebeautify.org/jsonviewer/945d56

What is the impact? I read the file line by line and passed each line to Python's json.loads (why Python? Because I find Python comfortable for dealing with JSON). For this line, json.loads reported invalid JSON, because the line was a combination of two JSON objects.

Questions:

1) How did you solve this kind of bug when you processed the GitHub Archive data?

2) I already have the data locally, so how can I overcome this problem? Should I write code specific to this case? The code I wrote looks like this:

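import json

# Python 2: the second argument to json.loads is the input encoding.
jsonlist = line.split('}{')  # assumes exactly two events on the line
json.loads(jsonlist[0] + '}', "ISO-8859-1")  # load and navigate through this json
json.loads('{' + jsonlist[1], "ISO-8859-1")  # load and navigate through this json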
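Splitting on '}{' only handles two events per line and breaks if that character pair occurs inside a string value. A minimal sketch of a more general approach, assuming Python 3 and a line that has already been decoded to text, is to let the standard library's json.JSONDecoder.raw_decode consume one object at a time. The function name iter_json_objects and the sample line are illustrative and not part of the original question.

```python
import json


def iter_json_objects(line):
    """Yield every JSON value that appears back to back in one line of text."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(line)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and line[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it stopped.
        obj, pos = decoder.raw_decode(line, pos)
        yield obj


# Made-up sample: two events on one line, with '}{' inside a string value.
sample = '{"id": 1, "note": "a }{ b"}{"id": 2}\n'
events = list(iter_json_objects(sample))
```

Because the decoder tracks string boundaries itself, this also works for three or more events on a line. A line that contains a single event simply yields one object, so the same loop can be used for every line in the file.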

Original Q&A


**Harish Kayarohanam** · Answer 1

I found the solution; here it is.

1) How did you solve this kind of bug when you processed the GitHub Archive data?

https://github.com/vadasg/githubarchive-parser/blob/master/src/FixGitHubArchiveDelimiters.rb

This script fixes the problem of two or more events appearing on the same line. After running it, each JSON object appears on its own line.

2) I already have the data locally, so how can I overcome this problem? Should I write code specific to this case?

The script above removes the need to write the code I mentioned in the question.

Note:

Related issues have been reported on the GitHub Archive project on GitHub.

WARNING:

When I ran this script I got an error related to encoding. By default, the Yajl::Parser.parse(jsonInputFile) call checks that the characters it parses are valid UTF-8 and raises an error if they are not. Because the GitHub data also contains non-UTF-8 characters, the error is raised in our case too. To bypass that problem (or perhaps fix it), I changed the call to:

Yajl::Parser.parse(jsonInputFile, :check_utf8 => false)

If in doubt, refer to the docs: http://rdoc.info/github/brianmario/yajl-ruby/Yajl/Parser.parse
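For readers who stay in Python rather than running the Ruby script, the same encoding problem shows up when the gz file is opened as text. The sketch below is one possible workaround, not the method used by the script above: it assumes Python 3 and decodes with ISO-8859-1 (the encoding named in the question), which maps every byte to a character and so never raises a decode error. The helper name read_lines and the demo file are hypothetical.

```python
import gzip
import json
import os
import tempfile


def read_lines(path, encoding="ISO-8859-1"):
    """Return the non-empty lines of a gzipped text file, decoded as text."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return [line for line in handle if line.strip()]


# Demo on a made-up one-line file that contains a byte invalid in UTF-8.
raw = b'{"actor": "caf\xe9"}\n'
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(raw)
    decoded = [json.loads(line) for line in read_lines(demo_path)]
```

The trade-off is that any text that really was UTF-8 will come out garbled for non-ASCII characters. If that matters, opening the file with encoding="utf-8" and errors="replace" is another option, at the cost of losing the invalid bytes.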
