I am trying to get various Github repo metrics in Github Archive through Big Query(doc here). However, when I try to count the number of forks, the number I am getting is very different from the number of forks specified in the Github UI. For instance when I run this sql script:
SELECT repo.url,repo.name , COUNT(*) fork_count,
FROM [githubarchive:year.2011],
[githubarchive:year.2012],
[githubarchive:year.2013],
[githubarchive:year.2014],
[githubarchive:year.2015],
[githubarchive:year.2016],
[githubarchive:year.2017],
[githubarchive:year.2018],
[githubarchive:month.201901]
WHERE type='ForkEvent'
and repo.url like 'https://github.com/python/cpython'
GROUP BY 1,2
I get a result of:
Row repo_url repo_name fork_count
1 https://github.com/python/cpython cpython 177
However when I go to the URL 'https://github.com/python/cpython' I see that there are 8,198 forks. What is the reason for this discrepancy?
EDIT:
Felipe pointed out below that there could be multiple URL's for the same repo.
However, even with multiple URLS, the number was not a exact match to the UI and this time was substantially larger then the UI's number. Is there any way to get an exact match?
What are you querying for? Notice you'll get different results depending if you go for the repo id, name, or url:
If you want to know "when?":
EDIT:
GitHub only lists one fork per user - so if you want to remove duplicates do COUNT(DISTINCT actor.id) which brings it down to ~9k.