Understanding git rev-list


While looking for git hook examples, I came across the following post: https://github.com/Movidone/git-hooks/blob/master/pre-receive and I wanted to understand the following command:

git rev-list $new_list --not --all 

where new_list is obtained from:

NULL_SHA1="0000000000000000000000000000000000000000" # 40 0's
new_list=
any_deleted=false
while read oldsha newsha refname; do
    case $oldsha,$newsha in
        *,$NULL_SHA1) # it's a delete
            any_deleted=true;;
        $NULL_SHA1,*) # it's a create
            new_list="$new_list $newsha";;
        *,*) # it's an update
            new_list="$new_list $newsha";;
    esac
done

I figured that rev-list shows commits in reverse chronological order.

But, can someone share more insight on what the --not and --all options are meant for?

As per the documentation:

--not
Reverses the meaning of the ^ prefix (or lack thereof) for all following revision specifiers, up to the next --not.
--all
Pretend as if all the refs in refs/ are listed on the command line as <commit>. 

I am not able to completely understand these options.

[Update] After doing some test commits, I figured out that if I don't use the --not and --all options, git rev-list lists all the commits on the branch, not just the ones I intend to push.

However, I wanted to understand why it doesn't print the SHA values on the terminal when the --all option is passed.

Best answer

The git rev-list command is a very complicated, very central command in Git, as what it does is walk the graph. The word graph here refers to both the commit graph itself, and in some cases, the next level down (Git objects reachable from commits).

I figured that rev-list shows commits in reverse chronological order.

Not exactly, but close:

  • The order is changeable. The default is reverse-chronological.
  • The default is to walk some commits, but you can get rev-list to go deeper so as to include tree and blob objects and even tag objects. This is for programs like git fetch and git push, which invoke git pack-objects, and for git pack-objects itself. I plan to ignore this possibility entirely here, but I feel I should at least mention it.

So the default is to list some commits in reverse chronological order. It is both important, and a little bit tricky, to specify exactly which parts of the graph we will have git rev-list walk: the some in some commits.

But, can someone share more insight on what --not and --all options are meant for?

As VonC notes, the effect here is to list commits that are new to the receiving repository. This depends on the fact that this git rev-list command is running in a pre-receive hook. It generally doesn't do anything useful outside this particular hook. Thus, as you can see, a hook's run-time environment, in Git, is often at least a little bit special. (This is true for more than just the pre-receive hook: one must think about each hook's activation context.)

More about --not --all

The --all option does just what you quoted from the documentation:

Pretend as if all the refs in refs/ are listed on the command line ...

So this does the equivalent of a git for-each-ref refs: it loops over each reference. That includes branch names (master or main, develop, feature/tall, and so on, all of which are really in refs/heads/), tag names (v1.2 which is really refs/tags/v1.2), remote-tracking names (origin/develop which is really refs/remotes/origin/develop), replacement refs (in refs/replace/), the stash (refs/stash), bisection refs, Gerrit refs if you're using Gerrit, and so on. Note that it does not loop over reflog entries.
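To make the "loops over each reference" idea concrete, here is a minimal sketch in a throwaway repository. The branch and tag names are invented for the demo, and it assumes HEAD is attached to a branch, so that HEAD adds nothing beyond what the refs already cover.

```shell
set -e
# Throwaway repository; every name below is made up for this demo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "one"
git tag v1.0
git commit -q --allow-empty -m "two"
git branch develop
git commit -q --allow-empty -m "three"

# The ref names that --all will pick up.
git for-each-ref --format='%(refname)' refs

# Walking from --all ...
from_all=$(git rev-list --all | sort)
# ... versus walking from every ref's hash ID, spelled out by hand.
from_refs=$(git rev-list $(git for-each-ref --format='%(objectname)' refs) | sort)
```

Both walks should visit the same set of commits, since the second command just spells out what --all supplies.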

The --not prefix is a simple boolean operation. In the gitrevisions syntax—see the gitrevisions documentation—we can write things like develop, meaning I tell you to start from develop and work backwards and include these commits, but also things like ^develop, meaning I tell you to start from develop and work backwards and exclude these commits. So if I write:

git rev-list feature1 feature2 ^main

I am asking Git to walk commits reachable from the commits identified by the names feature1 and feature2, but to exclude commits reachable from the commits identified by main. For (much) more about the general idea of reachability and graph-walking, see Think Like (a) Git.

The --not operator effectively flips the ^ on each ref:

git rev-list --not feature1 feature2 ^main

is shorthand, as it were, for:

git rev-list ^feature1 ^feature2 main

This walks the list of commits reachable from main, but excludes those reachable from either feature1 or feature2.
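A small experiment can confirm the equivalence. This is only a sketch in a scratch repository: the three branch names match the ones used above, but the commits are empty placeholders.

```shell
set -e
# Scratch repository with three branches that share one base commit.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "base"
git branch feature1
git branch feature2
git commit -q --allow-empty -m "main only"
git checkout -q feature1
git commit -q --allow-empty -m "feature1 only"
git checkout -q feature2
git commit -q --allow-empty -m "feature2 only"

# Explicit ^ prefixes ...
explicit=$(git rev-list ^feature1 ^feature2 main)
# ... and the same request written with a leading --not.
flipped=$(git rev-list --not feature1 feature2 ^main)

main_tip=$(git rev-parse main)
echo "$explicit"
echo "$flipped"
```

Both commands should print only the tip of main, because the shared base commit is reachable from the two excluded feature branches.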

Usually all commits are findable with --all

If you are using Git in the normal everyday way, and don't have a "detached HEAD" at the moment (detached HEAD mode is not exactly abnormal, but it's not the usual way to work), the --all option to git rev-list tells it to include all commits, because every commit is reachable from at least one reference. So --not --all effectively excludes all commits, and adding --not --all to any git rev-list that would otherwise list some commits has the effect of inhibiting the list. The output is empty: why did we bother?

If you are in detached HEAD mode and have made several new commits (this can happen when you are in the middle of an interactive or conflicted rebase, for instance), those commits are reachable from HEAD but not from any branch name. Be careful with --all here, though: current Git documents --all as covering all the refs in refs/ along with HEAD, so git rev-list HEAD --not --all comes out empty even in this state. To list those commits, exclude only the branch names instead, with git rev-list HEAD --not --branches. In that rebase, for instance, that would be just those commits that you have copied so far.

So "detached HEAD" mode would be one place where this kind of exclusion could be useful from the command line. But for the situation you're examining, a pre-receive hook, we're not really on the command line.
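Here is a rough sketch of both situations in a scratch repository. For the detached case it excludes with --not --branches rather than --not --all, on the assumption that the installed Git counts HEAD itself among --all, as recent documentation describes.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "on main"

# On a branch: every commit is reachable from some ref, so nothing is left.
on_branch=$(git rev-list HEAD --not --all)

# Detach HEAD and add a commit that no branch name can reach.
git checkout -q --detach
git commit -q --allow-empty -m "detached work"
detached_only=$(git rev-list HEAD --not --branches)

echo "on a branch: [$on_branch]"
echo "detached:    [$detached_only]"
```

The first list should be empty, and the second should hold just the one commit made while detached.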

Pre-receive hooks

When someone uses git push to send commits to your own Git, your Git:

  • sets up a quarantine area to hold any new objects (new commits and blobs and so on);[1]
  • negotiates with the sender to decide what the sender should send;
  • receives these objects; and
  • takes a list of ref update requests. These update requests essentially just say make this name hold this hash ID.[2]

Before actually doing any of the requested updates, your Git:

  1. Feeds the entire list to the pre-receive hook. That hook can say "no"; if so, the entire push, as a whole, is rejected.
  2. If that says "ok", feeds the list, one request at a time, to the update hook. When that hook says "ok", does the update. If the hook says "no", your Git rejects the one update, but goes on to examine others.
  3. After all updates are accepted or rejected in step 2, feeds the accepted list to the post-receive hook.

Objects that are needed, that were added to some ref in step 2, are moved from quarantine to Git's object database. Those that were rejected are not.
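The order of the three hooks can be observed with a local push. The following is only a sketch: the server.git and client directories and the hooks.log file are names invented here, and the hooks do nothing except record that they ran. It relies on the documented behaviour that push-time hooks run inside the receiving repository's $GIT_DIR.

```shell
set -e
work=$(mktemp -d)
server="$work/server.git"
log="$server/hooks.log"
git init -q --bare "$server"
mkdir -p "$server/hooks"

# pre-receive and post-receive read "old new refname" lines on stdin.
for hook in pre-receive post-receive; do
    cat > "$server/hooks/$hook" <<'EOF'
#!/bin/sh
while read -r old new ref; do
    echo "$(basename "$0") $ref" >> hooks.log
done
EOF
done

# update runs once per ref and gets refname, old and new as arguments.
cat > "$server/hooks/update" <<'EOF'
#!/bin/sh
echo "update $1" >> hooks.log
EOF
chmod +x "$server/hooks/pre-receive" "$server/hooks/update" "$server/hooks/post-receive"

git init -q "$work/client"
cd "$work/client"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"
git commit -q --allow-empty -m "first"
git branch develop

# Push two branch names in one go, so each phase sees two requests.
git push -q "$server" main develop
cat "$log"
```

The log should show both pre-receive lines first, then one update line per ref, then both post-receive lines.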

Now, think about a typical git push. We get some new commit(s) and a request: create a new branch name feature/short, or we get some new commit(s) and a request: update existing branch name develop to include these new commits, along with the old ones.

In step 1 above, we have a single new hash ID. We ran a loop to read all the ref names, and their current and proposed-new hash IDs, and the loop ran only once, because only one name was being git push-ed. That hash ID refers to the new commit or commits, that will either be added to this existing branch, or be the tip and other commits that are exclusive to the new branch.
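The loop in question can be tried without Git at all. This sketch feeds it three made-up request lines of the form a pre-receive hook reads on stdin; the function name classify_updates and the hash IDs are invented for illustration.

```shell
NULL_SHA1="0000000000000000000000000000000000000000"

# classify_updates reads "old new refname" lines, as a pre-receive hook
# receives them on stdin, and prints one summary line per request.
classify_updates() {
    while read -r oldsha newsha refname; do
        case $oldsha,$newsha in
            *,$NULL_SHA1) echo "delete $refname" ;;
            $NULL_SHA1,*) echo "create $refname" ;;
            *,*)          echo "update $refname" ;;
        esac
    done
}

# Placeholder hash IDs standing in for real ones.
a=1111111111111111111111111111111111111111
b=2222222222222222222222222222222222222222

summary=$(printf '%s\n' \
    "$NULL_SHA1 $a refs/heads/feature/short" \
    "$a $b refs/heads/develop" \
    "$b $NULL_SHA1 refs/heads/obsolete" | classify_updates)
echo "$summary"
```

A single-branch push, as described above, would deliver just one such line, so the loop body would run once.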

We'd now like to inspect these commits, and not any of the existing commits that are reachable from any existing branch. For simplicity, rather than $new_list in my other answer, let's suppose we have just the one new hash ID, $new, and the old hash ID for the branch name, $old: all-zeros if the branch is all-new, or some valid existing commit if it's an existing branch name.

If the new commits are on a completely new branch, then:

git rev-list $new ^master ^develop ^feature/short ^feature/tall

would cover them, for instance, if we knew that the only existing branches were these four (and that there are no tags etc to worry about). But what if they're being added to, say, develop? Then we'd like to exclude the commits that are currently on develop. We could use the $old hash ID to do that:

git rev-list $new ^master ^$old ^feature/short ^feature/tall

That would again list only the new commits that whoever is running git push origin develop wants to add to our develop.

But think about $old. This is a hash ID. Where did Git get it? Git got this hash ID from the name develop. This is a pre-receive hook; the name develop has not been updated yet. So the name develop is a name for the old hash ID $old. That means:

git rev-list $new ^master ^develop ^feature/short ^feature/tall

will also do the job.

If git rev-list $new followed by "and not all existing" will do the job, then:

git rev-list $new --not --branches

will do the job. That's almost what we have here.

The bug with just using --branches is that it doesn't get any tags, or other refs. We could use --not --branches --tags but --not --all is shorter and also gets all other refs.
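The difference shows up as soon as some commit is reachable only from a tag. In this sketch, made in a scratch repository with invented names, $new stands in for an incoming commit that no ref points to yet.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "base"
git checkout -q -b release
git commit -q --allow-empty -m "tagged release"
git tag v1.2

# A commit on top of the tag that no ref points to, like a pushed tip.
git checkout -q --detach
git commit -q --allow-empty -m "incoming work"
new=$(git rev-parse HEAD)

# Drop the branch, so only the tag still reaches "tagged release".
git checkout -q main
git branch -q -D release

not_branches=$(git rev-list --count "$new" --not --branches)
not_all=$(git rev-list --count "$new" --not --all)
echo "--not --branches: $not_branches"
echo "--not --all:      $not_all"
```

With --branches alone the already-present tagged commit is counted as new; with --all only the incoming commit is.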

So this is where --not --all comes from: it depends on the special case of a pre-receive hook. We list the new hash IDs, as proposed by whoever is running a git push, that our Git has passed to us as a list of lines. We have git rev-list walk the proposed-to-be-updated commit graph, looking at the new commits in the quarantine area, but excluding all the commits that are already in our repository. The rev-list command produces these hash IDs, one per line, which we then read in a shell loop, and do whatever we like to inspect each commit.
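Putting the pieces together, here is a rough sketch of such a hook tried out against a local bare repository. It is a simplified variant of the linked script rather than a copy of it: the server.git and client directories and the new-commits.log file are names invented for the demo, and the hook merely records the hash IDs instead of inspecting them.

```shell
set -e
work=$(mktemp -d)
server="$work/server.git"
log="$server/new-commits.log"
git init -q --bare "$server"
mkdir -p "$server/hooks"

cat > "$server/hooks/pre-receive" <<'EOF'
#!/bin/sh
NULL_SHA1="0000000000000000000000000000000000000000"
new_list=
while read -r oldsha newsha refname; do
    case $oldsha,$newsha in
        *,$NULL_SHA1) ;;                      # a delete brings no new commits
        *) new_list="$new_list $newsha" ;;    # a create or an update
    esac
done
# Nothing to list if every request was a delete.
[ -n "$new_list" ] || exit 0
# Commits reachable from the proposed tips but from no existing ref.
git rev-list $new_list --not --all >> new-commits.log
EOF
chmod +x "$server/hooks/pre-receive"

git init -q "$work/client"
cd "$work/client"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "first"
git push -q "$server" main
: > "$log"                                    # forget what the first push logged

git commit -q --allow-empty -m "second"
git commit -q --allow-empty -m "third"
git push -q "$server" main
cat "$log"
```

After the second push the log should hold only the two newest commits. The first commit is already reachable from the server's main, which has not moved yet when the hook runs.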


[1] The quarantine area was new in Git 2.11. Prior to that, new objects could remain in the repository for a while, even if the push is rejected. The quarantine area isn't really that big a deal for most people, but for big servers like GitHub, it can save them a lot of disk space.

[2] The request can be forced or not-forced, and if forced, could be a force-with-lease, or not. This information is not available in the pre-receive hook (nor in the update hook), which is, um, let's just say not so great, but there are compatibility issues with adding it. It's all livable, mostly, though. The hook can tell if it's a create new ref or delete existing ref request because if so, one of the two hash IDs (old or new) will be the all-zeros "null hash" (which is reserved; no hash ID is allowed to be all-zeros).

Another answer

It means:

  • List commits that are reachable by following the parent links from the given commit(s), here $new_list, the proposed new tips of the created or updated refs (deletions add nothing to the list)
  • but exclude commits that are reachable from the one(s) given with a ^ in front of them, here "all", that is, every commit reachable from an existing ref such as a branch head or a tag.

That limits the rev-list to only the new commits received, and not all the commits (received and already present in the receiving repository).


Note that the same limitation can now be applied to stdin with pseudo-opts:

With Git 2.42 (Q3 2023), the set-up code for the get_revision() API now allows feeding options like --all and --not in the --stdin mode.

See commit c40f0b7, commit af37a20, commit cc80450 (15 Jun 2023) by Patrick Steinhardt (pks-t).
(Merged by Junio C Hamano -- gitster -- in commit 812907d, 04 Jul 2023)

revision: handle pseudo-opts in --stdin mode

Signed-off-by: Patrick Steinhardt

While both git-rev-list and git-log support --stdin, it only accepts commits and files.
Most notably, it is impossible to pass any of the pseudo-opts like --all, --glob= or others via stdin.

This makes it hard to use this function in certain scripted scenarios, like when one wants to support queries against specific revisions, but also against reference patterns.
While this is theoretically possible by using arguments, this may run into issues once we hit platform limits with sufficiently large queries.
And because --stdin cannot handle pseudo-opts, the only alternative would be to use a mixture of arguments and standard input, which is cumbersome.

Implement support for handling pseudo-opts in both commands to support this usecase better.
One notable restriction here is that --stdin only supports "stuck" arguments in the form of --glob=foo.
This is because "unstuck" arguments would also require us to read the next line, which would add quite some complexity to the code.
This restriction should be fine for scripted usage though.

rev-list-options now includes in its man page:

In addition to getting arguments from the command line, read them from standard input as well.
This accepts commits and pseudo-options like --all and --glob=. When a -- separator is seen, the following input is treated as paths and used to limit the result.

This is now possible:

printf '%s\n' --all --not --branches | git rev-list --stdin

With Git 2.43 (Q4 2023), "git rev-list --stdin" learned to take non-revisions (like "--not") recently from the standard input, but the way such a "--not" was handled was quite confusing, which has been rethought.
This is potentially a change that breaks backward compatibility.

See commit f97c8b1 (25 Sep 2023) by Patrick Steinhardt (pks-t).
(Merged by Junio C Hamano -- gitster -- in commit 3029189, 04 Oct 2023)

revision: make pseudo-opt flags read via stdin behave consistently

Signed-off-by: Patrick Steinhardt
Reported-by: Christian Couder

When reading revisions from stdin via git-rev-list(1)'s --stdin option then these revisions never honor flags like --not which have been passed on the command line.
Thus, an invocation like e.g. git rev-list --all --not --stdin will not treat all revisions read from stdin as uninteresting.
While this behaviour may be surprising to a user, it's been this way ever since it has been introduced via 42cabc3 ("Teach rev-list an option to read revs from the standard input.", 2006-09-05, Git v1.4.3-rc1 -- merge).

With that said, in c40f0b7 ("revision: handle pseudo-opts in --stdin mode", 2023-06-15, Git v2.42.0-rc0 -- merge listed in batch #7) we have introduced a new mode to read pseudo opts from standard input where this behaviour is a lot more confusing.
If you pass --not via stdin, it will:

  • Influence subsequent revisions or pseudo-options passed on the command line.
  • Influence pseudo-options passed via standard input.
  • Not influence normal revisions passed via standard input.

This behaviour is extremely inconsistent and bound to cause confusion.

While it would be nice to retroactively change the behaviour for how --not and --stdin behave together, chances are quite high that this would break existing scripts that expect the current behaviour that has been around for many years by now.
This is thus not really a viable option to explore to fix the inconsistency.

Instead, we change the behaviour of how pseudo-opts read via standard input influence the flags such that the effect is fully localized.
With this change, when reading --not via standard input, it will:

  • Not influence subsequent revisions or pseudo-options passed on the command line, which is a change in behaviour.
  • Influence pseudo-options passed via standard input.
  • Influence normal revisions passed via standard input, which is a change in behaviour.

Thus, all flags read via standard input are fully self-contained to that standard input, only.

While this is a breaking change as well, the behaviour has only been recently introduced with Git v2.42.0.
Furthermore, the current behaviour can be regarded as a simple bug.
With that in mind it feels like the right thing to retroactively change it and make the behaviour sane.

rev-list-options now includes in its man page:

When used on the command line before --stdin, the revisions passed through stdin will not be affected by it. Conversely, when passed via standard input, the revisions passed on the command line will not be affected by it.

rev-list-options now includes in its man page:

Flags like --not which are read via standard input are only respected for arguments passed in the same way and will not influence any subsequent command line arguments.