How to pull GitHub timeline data from BigQuery


I am having trouble accessing the GitHub timeline from BigQuery.

I was using the following query:

SELECT repository_name, actor_attributes_company, payload_ref_type,
       payload_action, type, created_at
FROM githubarchive:github.timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'

and everything was working great. Now it looks like the githubarchive:github.timeline table is no longer available. Looking around, I found another table:

SELECT repository_name, actor_attributes_company, payload_ref_type,
       payload_action, type, created_at
FROM publicdata:samples.github_timeline
WHERE repository_organization = 'foo'
AND created_at > '2014-07-01'

This query runs but returns zero rows. When I remove the created_at restriction it does return results, but only a few rows from 2012, so it looks like this table holds just sample data.

Does anyone know how to pull live timeline data from GitHub?


There are 2 best solutions below

BEST ANSWER

Indeed, publicdata:samples.github_timeline has only sample data.

For the real GitHub Archive documentation, look at http://www.githubarchive.org/

I wrote an article yesterday about querying it:

Sample query:

SELECT repo.name,
       JSON_EXTRACT_SCALAR(payload, '$.action') action,
       COUNT(*) c
FROM [githubarchive:month.201606]
WHERE type IN ('IssuesEvent')
AND repo.name IN ('kubernetes/kubernetes', 'docker/docker', 'tensorflow/tensorflow')
GROUP BY 1,2
ORDER BY 2 DESC

As Mikhail points out, there's also another dataset with all of GitHub's code:

SECOND ANSWER

Check out the githubarchive BigQuery project.
It has three datasets (day, month and year) holding the daily, monthly and yearly tables respectively.
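
As a rough sketch of how those table names fit together: the only table name confirmed in this thread is [githubarchive:month.201606], so the YYYYMMDD suffix for day and the YYYY suffix for year below are assumptions extrapolated from it, and table_ref is a hypothetical helper name, not part of any library. It only builds the legacy-SQL table reference as a string, so check the result against the dataset listing in the BigQuery UI before relying on it.

```python
from datetime import date

# ASSUMPTION: suffix formats extrapolated from [githubarchive:month.201606].
# Only the monthly pattern appears in this thread; day and year are guesses
# that should be checked against the actual dataset listing.
_SUFFIX_FORMATS = {"day": "%Y%m%d", "month": "%Y%m", "year": "%Y"}


def table_ref(granularity: str, when: date) -> str:
    """Build a legacy-SQL table reference for the githubarchive project.

    Hypothetical helper for illustration only.
    """
    if granularity not in _SUFFIX_FORMATS:
        raise ValueError(f"unknown granularity: {granularity!r}")
    suffix = when.strftime(_SUFFIX_FORMATS[granularity])
    return f"[githubarchive:{granularity}.{suffix}]"


# Reproduces the table used in the sample query from the accepted answer.
print(table_ref("month", date(2016, 6, 1)))
```

The returned string can be dropped into the FROM clause of a query like the sample one. Picking the smallest granularity that covers the date range you need should keep the amount of data scanned down.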

For more details, see https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html