how to detect a potentially evil git commit (containing the inverse of some other commit)?

201 Views Asked by At

I have several times come across the following situation: Based on A, I push some commit B. Someone else pulls the commit B, but for some reason, he does not revert the buffer of his editor, and thus continues editing A. In the end, he pushes a commit C which reverts B's changes, and introduces some new ones. In this way, the effect of my commit B is lost. Often, I have noticed the problem only by chance (and then I could easily reintroduce B by cherry-picking it). However, I am worried that sometimes I did not notice it at all. Is there an easy way of detecting (the potential of) such a situation? (I am happy with obtaining some false positives.) One somehow needs to detect whether the inverse of B is contained in some other commit.

I have created a test repo that shows such a commit: https://github.com/tillmo/test

This question is related to, but different from How do you detect an evil merge in git?. Note that there, the evil commit is a merge commit, while here it is not.

3

There are 3 best solutions below

2
On

It's clearly impossible in general, but it may be easy enough to hit enough cases to make it worthwhile. The following is basically pure theory: there's nothing built in to Git to do this.


Think of commits as snapshots that can be converted to deltas (because they are). Your task is, given a sequence of commits:

...--A--B--C--D--...

(we'll somewhat arbitrarily limit this to linear commits—the mods to the algebra for doing merges are obvious, albeit messy) we're going to compute ∆B, ∆C, and ∆D. ∆B is just B-A, ∆C is C-B, and so on. Of course the result of "subtraction" is a diff-set (or changeset) rather than a simple number. These deltas, however, are what you see when you run git show (or git diff). (You may also want to avoid rename detection here if you try to do this for real and make it robust.)

Next, we want to see whether ∆C "includes" -∆B, or whether ∆D "includes" -∆B. A normal git revert is the negative delta, so we're looking for a changeset that isn't a revert itself, yet still contains a revert.

When we look at the raw form of git diff output, we find that, for strictly modified files, the diff just a lot of boilerplate-y @@ and hunk-context lines around the change, which expressed exclusively as - and + lines: remove old, insert new.

Reversion consists of making the same change in reverse: add the removed lines, while removing the added lines, with the same context, in the same order.

A change that contains a reversion is, at least for the easily detected cases, simply a change that includes the reversion—i.e., adds back removed lines while removing added lines, with the same context, in the same order—while also making some other separate change.

In other words, if ∆C = -∆B+∂C, or ∆D = -∆B+∂D, then we will suspect one of these deltas to be "incorrect": it should have just been ∂C or just ∂D.

The git patch-id program can compute a patch ID for any diff hunk, so to find out whether some big-delta ∆X is equal to the negative of some previous ∆B plus some smaller ∂X, you could just split up the original ∆B into its component hunks, negate each one, and get its patch ID. [Edit: use André Sassi's suggestion—simplify this by computing the reverse diff initially, either by using B B^ as the two inputs to git diff, or using the -R flag, which is available in both git diff and git show.] Put these in a master list L. Then, for any subsequent commit X, starting with l=L, walk through the component hunks of ∆X. Compute that hunk's patch ID, and see if that corresponds to the next value that is in your list. If so, drop one item from l. If you drop the last item from l, you have just found that ∆X includes -∆B. Otherwise, move on to the next hunk in ∆X. If you reach the end of ∆X without dropping all values from l, you have concluded that ∆X does not include -∆B.

Repeat for all suspect commits and you'll see if you can automatically find these "accidental reversions". Note that a deliberate reversion will have nothing but -∆B, and will also have a commit message that says "revert ...".

1
On

The simplest you can do is use git log -S. Say, you introduced the line

FILE* = fopen(config_file, "r");

and somebody else removed it.

Then the following command will list the commits that changes the number of occurrences of a particular string in the code base:

git log -S'fopen(config_file,'

The first commit listed is the one that removed the line, the second commit the one that introduced it.

Note that commits that move the substring around (even to a different file) are not listed. Additionally, if the string was removed in a merge such that the version from one of the parents was taken literally, the merge will not be listed. (This can cause some confusion, option -m can help in the latter case.)

3
On

This is actually somewhat difficult to reproduce. Say we have contributors Alice (A) and Bob (B) working on the repository. The current base commit is X:

          master
            ↓
* --- * --- X

Now, both A and B have that commit checked out and work on it in the same file. A is first, commits and pushes the change:

                master
                  ↓
* --- * --- X --- Y

Now, B is still based on X. There are now two possible situations this can continue:

  1. B commits their changes. A push will fail since Y was pushed in the meantime, so B has to pull first, likely resulting in a merge conflict (since both A and B are working in the same file). B is notified of that situation and has to resolve the conflict before they can go on.

    So any reversal of the changes introduced with Y are deliberately made by B.

  2. B does not commit the changes but attempts to pull first (maybe because B knows that there are new changes on the remote). Since the pull would attempt to edit the file that B is working on (and has changes in), Git will refuse to actually update the working directory with the message that “local changes would be overwritten”.

    So B again has to resolve those conflicts first before the changes can even be pulled in. Any reversal of Y are again deliberate.

So basically, if this happens, you really have to blame B for undoing A’s work. Outside of these two situations, there is not really a way for the situation you explained to happen.

The only thing I can actually think of is a situation similar to (2) but where B never saved their changes. So when Y is pulled in, Git has a clean working directory (although there are unsaved changes in the file) and correctly updates the local branch. Then, afterwards, B can save the file resulting in only those changes, ignoring whatever happened in Y.

But, this again means that when B finally commits, that they include those undoing changes deliberately – or B simply does not check at all what changes are included in the commit which is a dangerous workflow for many other reasons.

So you should solve this by educating B to actually take care when creating commits, to check what changes will be included in the commit and to make sure that nothing unrelated is added. In doubt, disallow direct pushing to your repository and use a pull request or code review based workflow to ensure that the changes that end up in the repository are actually fine.