I want to retrieve the latest infos about a repository using Google Big Query on the github archive timeline dataset.
I tried to join on max(created_at) but i get vastly incomplete informations. Here is the query for the rails repo :
SELECT *
FROM [githubarchive:github.timeline] a
JOIN EACH
(
SELECT MAX(created_at) as max_created, repository_url
FROM [githubarchive:github.timeline]
GROUP EACH BY repository_url
) b
ON
b.max_created = a.created_at and
b.repository_url = a.repository_url
WHERE payload_ref_type="repository" AND a.repository_name = 'rails'
I know this dataset doesn't include data before 2011, but it should contain informations about recently active repositories
I don't understand the results of this query :
- It only return events of type 'CreateEvent', which are always older than 'PushEvent'
- It doesn't return the main rails repository : https://github.com/rails/rails
- Github search reports 147149 repositories with the name "rails", the query only return 476 repositories
Is my query false in some way, why don't it return PushEvents ? Is there another trick about the github archive dataset ?
If you want to retrieve the latest version of a row, you need to use Window functions.
You can get the order of the modifications based on the
created_at
timestamp column using.seq_num=1
means it will grab the first entry from that specific partition, and since we ordered descending that will be the most recent entry.Keeping
payload_ref_type="repository
in the query would return only 1050 repos, but if you remove that part you will have more.In the same style you could also leverage data from day and monthly events. Those are in separate table and can be used with table wildcard functions.