I'd like to get a snapshot of the "active" git commits for a directory tree, meaning git commits that really are part of the build and not commits that have been fully superseded by newer commits.
I can do this by running git blame on every file and extracting the commits that way, but it's too slow to be practical on a large repo.
What `git blame` does is pretty much the only way to find the information you're looking for. However, you can simplify the action somewhat, and that might be both sufficient for your purposes and fast enough.
Remember, every commit has a full snapshot of every file. A branch name identifies the last commit in some chain of commits. So picture a chain of commits ending in commit `H`, with earlier commits `G`, `F`, and so on behind it, and a branch name, `branch`, pointing at the end of that chain.
The name `branch` holds the raw hash ID of commit `H`. In commit `H`, there are many files, each of which has many lines. Those files are in the form they have in commit `H`, and that's all there is to it, except that commit `H` contains the hash ID of earlier commit `G`.
You can use this hash ID to locate commit `G` and extract all of its files. When the file in `G` completely matches the file in `H`, that means that, in `git blame` terms at least, all the lines in the file in `G` are attributable to `G`, if not to some earlier commit. So files that are different in `G` and `H` should be attributed to `H`. The `git blame` command works on a line-by-line basis, attributing individual lines to commit `H` if they differ, but perhaps for your purposes, attributing the entire file to `H` suffices.
Should you decide that the file should perhaps be attributed to commit `G`, it is now time to extract commit `F`'s hash ID from commit `G`, and use that to read all the files from commit `F`. If any given file in `F` matches the copy in `G`, the attribution moves back to `F`; otherwise it remains at `G`.
You must repeat this process until you run entirely out of commits, that is, until you reach the very first commit in the chain; call it `A`.
Since commit `A` has no parent, any files in `A` that are unchanged all the way through the last commit are to be attributed to commit `A`. You can, however, stop traversing backwards as soon as you have completely attributed all files that exist in `H` to some commit later in the chain. Compare this to `git blame`, which must keep looking backwards as long as at least one line is attributed to some earlier commit: you'll probably stop long before `git blame` must.
Moreover, because of Git's internal data structures, it is very fast to tell whether a file in some earlier commit exactly matches a file of the same name in some later one: every file in every commit is represented by a hash ID. If the hash ID is the same, the file's contents are bit-for-bit identical in the two commits. If not, they're not.
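
The file-level walk described above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not a tested tool: it follows first parents only (so it sidesteps the question of merges), it shells out to `git rev-list --first-parent` and `git ls-tree -r -z`, and the names `attribute_files`, `first_parent_chain`, and `snapshot` are made up for illustration.

```python
import subprocess


def attribute_files(chain, snapshot_of):
    """Attribute each file in the newest commit to the oldest commit that is
    reachable through an unbroken run of identical blob hashes.

    chain: commit IDs, newest first (for example H, G, F, ..., A).
    snapshot_of: callable mapping a commit ID to {path: blob_hash}.
    """
    if not chain:
        return {}
    tip = snapshot_of(chain[0])
    settled = {}                              # path -> commit, final answers
    candidate = dict.fromkeys(tip, chain[0])  # path -> best guess so far
    for older in chain[1:]:
        if not candidate:
            break  # every file is attributed: stop walking early
        older_files = snapshot_of(older)
        for path in list(candidate):
            if older_files.get(path) == tip[path]:
                candidate[path] = older       # same blob: attribution moves back
            else:
                settled[path] = candidate.pop(path)  # differs or absent: done
    settled.update(candidate)  # files unchanged all the way back to the end
    return settled


def first_parent_chain(tip="HEAD"):
    """Commit IDs from tip backwards, following first parents only."""
    out = subprocess.run(["git", "rev-list", "--first-parent", tip],
                         check=True, capture_output=True, text=True).stdout
    return out.split()


def snapshot(commit, directory="."):
    """Map path -> blob hash for one commit (assumes UTF-8 path names)."""
    out = subprocess.run(
        ["git", "ls-tree", "-r", "-z", commit, "--", directory],
        check=True, capture_output=True, text=True).stdout
    files = {}
    for entry in out.split("\0"):
        if entry:
            meta, path = entry.split("\t", 1)  # "<mode> <type> <hash>\t<path>"
            files[path] = meta.split()[2]
    return files
```

Run from the top of a work-tree, `set(attribute_files(first_parent_chain("HEAD"), snapshot).values())` would then give the set of "active" commits at file granularity. Spawning one `git ls-tree` per commit is itself not free, so this may still need tuning on a very large repository.
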
There is no convenient in-Git command to do exactly what you want,[1] and if you do intend to traverse the history like this, you must decide what to do with merges. Remember that a merge commit has a snapshot but, unlike a non-merge commit, has two or more parents. Picture a merge commit `M` whose parents are `K` and `L`.
Which commit(s) should you follow if the file in `M` matches one or more of the files in `K` and/or `L`? The `git log` command has its own method of doing this: `git log <start-point> -- <path>` will simplify history by following just one parent, chosen arbitrarily from the set of parents that have the same hash ID for the given file.
Note that you can use `git rev-list`, perhaps with `--parents`, to produce the set of hash IDs that you can choose to examine. The rev-list command is the workhorse for most other Git commands, including `git blame` itself, for following history like this. (Note: the `git log` command is built from the same source as `git rev-list`, with some minor command-line-option differences and different default outputs.)
[1] While `git log <start-point> -- <path>` is useful here, it will be too slow to run this once for each path, and it's not effective to run it without giving individual paths.
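
One possible way to handle merges, sketched in Python: parse the output of `git rev-list --parents` into a parent map, then, for a given path, keep stepping to a parent whose copy of the file has the same hash ID. Picking the first such parent is an assumption made here for simplicity, not something the answer above prescribes, and the names `parse_parents`, `attribute_path`, and `blob_of` are hypothetical.

```python
def parse_parents(rev_list_output):
    """Parse `git rev-list --parents` output: each line holds a commit ID
    followed by its parent IDs (none for a root commit)."""
    parents = {}
    for line in rev_list_output.splitlines():
        ids = line.split()
        if ids:
            parents[ids[0]] = ids[1:]
    return parents


def attribute_path(path, tip, parents, blob_of):
    """Return the commit to which `path` is attributed, starting from `tip`.

    parents: mapping of commit ID -> list of parent IDs.
    blob_of: callable (commit, path) -> blob hash, or None if absent.
    Assumes `path` exists in `tip`.
    """
    commit = tip
    wanted = blob_of(commit, path)
    while True:
        same = [p for p in parents.get(commit, [])
                if blob_of(p, path) == wanted]
        if not same:
            return commit   # no parent has this exact file: stop here
        commit = same[0]    # assumed policy: follow the first matching parent
```

Here `blob_of` could be built on top of `git ls-tree` output; it is left as a parameter so that the walking policy stays separate from how the hash IDs are obtained.
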