I'd like to get a snapshot of the "active" git commits for a directory tree, meaning commits that really are part of the build, not commits that have been fully superseded by newer commits.
I can do this by running `git blame` on every file and extracting the commits that way, but it's too slow to be practical on a large repo.
What `git blame` does is pretty much the only way to find the information you're looking for. However, you can simplify the action somewhat, and that might be both sufficient for your purposes and fast enough.

Remember, every commit has a full snapshot of every file. A branch name identifies the last commit in some chain of commits. So when you have a chain of commits ending in `F`, `G`, and then `H`, with the name `branch` pointing to `H`, the name `branch` holds the raw hash ID of commit `H`. In commit `H`, there are many files, each of which has many lines. Those files are in the form they have in commit `H`, and that's all there is to it, except that commit `H` contains the hash ID of earlier commit `G`.

You can use this hash ID to locate commit `G` and extract all of its files, and when the file in `G` completely matches the file in `H`, that means that (in `git blame` terms at least) all the lines in the file in `G` are attributable to `G`, if not to some earlier commit. So files that are different in `G` and `H` should be attributed to `H`. The `git blame` command works on a line-by-line basis, attributing individual lines to commit `H` if they differ, but perhaps for your purposes, attributing the entire file to `H` suffices.

Should you decide that the file should perhaps be attributed to commit `G`, it is now time to extract commit `F`'s hash ID from commit `G`, and use that to read all the files from commit `F`. If any given file in `F` matches the copy in `G`, the attribution moves back to `F`; otherwise it remains at `G`.

You must repeat this process until you run entirely out of commits, that is, all the way back to the first commit, `A`, if necessary.
Since commit `A` has no parent, any files in `A` that are unchanged all the way through the last commit are to be attributed to commit `A`. You can, however, stop traversing backwards as soon as you have completely attributed all files that exist in `H` to some commit later in the chain. Compare this to `git blame`, which must keep looking backwards as long as at least one line is attributed to some earlier commit: you'll probably stop long before `git blame` must.

Moreover, because of Git's internal data structures, it is very fast to tell whether a file in some earlier commit exactly matches a file of the same name in some later one: every file in every commit is represented by a hash ID. If the hash ID is the same, the file's contents are bit-for-bit identical in the two commits. If not, they're not.
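The whole-file attribution walk described above can be sketched roughly as follows. This is only an illustration, not an existing Git feature: the function names (`read_snapshot`, `first_parent_history`, `attribute_files`) are made up for this example, the walk sidesteps the merge question by following first parents only (via `git rev-list --first-parent`), and it assumes file names decode as text. Blob hash IDs come from `git ls-tree -r`, so no file contents are ever compared.

```python
import subprocess
from typing import Dict, Iterable, Iterator, Optional, Tuple

Snapshot = Dict[str, str]  # path -> blob hash ID


def git_output(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def read_snapshot(commit: str, tree_path: Optional[str] = None) -> Snapshot:
    """Map every file in `commit` (optionally under tree_path) to its blob hash."""
    args = ["ls-tree", "-r", "-z", commit]
    if tree_path is not None:
        args.append(tree_path)
    snapshot: Snapshot = {}
    for record in git_output(*args).split("\0"):
        if not record:
            continue
        # Each record is "<mode> <type> <hash>\t<path>".
        meta, path = record.split("\t", 1)
        _mode, kind, object_id = meta.split(" ")
        if kind == "blob":  # skip submodule entries
            snapshot[path] = object_id
    return snapshot


def first_parent_history(tip: str = "HEAD",
                         tree_path: Optional[str] = None
                         ) -> Iterator[Tuple[str, Snapshot]]:
    """Yield (commit, snapshot) pairs, newest first, following first parents only."""
    for commit in git_output("rev-list", "--first-parent", tip).splitlines():
        yield commit, read_snapshot(commit, tree_path)


def attribute_files(history: Iterable[Tuple[str, Snapshot]]) -> Dict[str, str]:
    """Attribute each file in the newest snapshot to the oldest commit in an
    unbroken run of commits where its blob hash is unchanged."""
    history = iter(history)
    first = next(history, None)
    if first is None:
        return {}
    tip_id, tip = first
    owner = {path: tip_id for path in tip}  # tentative attribution
    pending = set(tip)                      # files that may still move back
    for commit, snapshot in history:
        for path in list(pending):
            if snapshot.get(path) == tip[path]:
                owner[path] = commit        # identical here: move back
            else:
                pending.discard(path)       # changed or absent: settled
        if not pending:
            break                           # everything attributed, stop early
    return owner
```

With those assumptions, the set of "active" commits would be something like `set(attribute_files(first_parent_history("HEAD")).values())`. Reading a full snapshot per commit is the simplest way to show the idea; on a very large tree you might prefer to compare subtree hash IDs and descend only where they differ.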
There is no convenient in-Git command to do exactly what you want,¹ and if you do intend to traverse the history like this, you must decide what to do with merges. Remember that a merge commit has a snapshot, but unlike a non-merge, has two or more parents. Suppose, for instance, that merge commit `M` has parents `K` and `L`.
Which commit(s) should you follow, if the file in `M` matches one or more of the files in `K` and/or `L`? The `git log` command has its own method of doing this: `git log <start-point> -- <path>` will simplify history by following only one parent, chosen arbitrarily from the set of parents that have the same hash ID for the given file.

Note that you can use `git rev-list`, perhaps with `--parents`, to produce the set of hash IDs that you can choose to examine. The rev-list command is the workhorse for most other Git commands, including `git blame` itself, for following history like this. (Note: the `git log` command is built from the same source as `git rev-list`, with some minor command-line-option differences and different default outputs.)

¹ While `git log <start-point> -- <path>` is useful here, it will be too slow to run this once for each path, and it's not effective to run it without giving individual paths.
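As a rough illustration of the `git rev-list --parents` suggestion in the answer: each output line holds a commit's hash ID followed by the hash IDs of its parents, so the output can be turned into a parent map, and a merge is any commit with more than one parent. The helper names below (`parse_parents`, `load_parents`, `parents_with_same_file`) are hypothetical, and the last one shows just one possible merge policy (keep every parent whose copy of the file has the same blob hash), not what `git log` or `git blame` actually do.

```python
import subprocess
from typing import Dict, List, Optional


def parse_parents(rev_list_output: str) -> Dict[str, List[str]]:
    """Parse `git rev-list --parents` output into {commit: [parent, ...]}."""
    parents: Dict[str, List[str]] = {}
    for line in rev_list_output.splitlines():
        fields = line.split()
        if fields:
            # First field is the commit itself; the rest are its parents.
            parents[fields[0]] = fields[1:]
    return parents


def load_parents(tip: str = "HEAD") -> Dict[str, List[str]]:
    """Run git in the current directory and build the parent map for `tip`."""
    result = subprocess.run(["git", "rev-list", "--parents", tip],
                            check=True, capture_output=True, text=True)
    return parse_parents(result.stdout)


def parents_with_same_file(blob_in_commit: str,
                           blob_in_parent: Dict[str, Optional[str]]
                           ) -> List[str]:
    """Return the parents whose copy of a file has the same blob hash ID.

    blob_in_parent maps each parent commit to that file's blob hash in the
    parent, or None if the file does not exist there.
    """
    return [parent for parent, blob in blob_in_parent.items()
            if blob == blob_in_commit]
```

An empty result from `parents_with_same_file` means the file differs from every parent, so under the whole-file scheme it would be attributed to the merge itself. When more than one parent matches, you still have to choose whether to follow one of them or all of them.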