Remove useless merges (those without any 'non-mainline' commits) after filter-branch

I've performed a git filter-branch --index-filter 'git rm --cached --ignore-unmatched badfiles/ badfiles2/' --prune-empty (per here) to remove a bunch of files in preparation for moving the remaining files to another repository. --prune-empty gets rid of any resulting empty commits, but it doesn't act on merges, which makes sense.

Now the history for this particular repo looks pretty ugly: there are a bunch of merges that don't actually add anything, and some merges that are just merges of other merges that didn't add any changes (in the rewritten history, that is; they may have been 'useful' before the filter-branch).

Consider this annotated snippet (generated with git log --graph --oneline --shortstat):

*   575e3b5 Merge pull request #68 from chris/feature # KEEP THIS MERGE!
|\  
| * 5dbc3f1 Actual feature changes
| |  2 files changed, 2 insertions(+), 2 deletions(-)
| * 35abc98 Cleanup/prep
|/  
|    2 files changed, 22 insertions(+), 16 deletions(-)
*   c3b3d86 Merge pull request #46 from org/topic_branch-mods # USELESS-C
|\  
* \   892de05 Merge pull request #47 from org/topic_branch # USELESS-B
|\ \  
| |/  
|/|   
| *   e738d4b Merge branch 'master' into topic_branch # USELESS-A
| |\  
| |/  
|/|   
* | 4182dac CommitMsg #40 #SQUASHED-PR
| |  2 files changed, 15 insertions(+), 6 deletions(-)
* | 3b42762 CommitMsg
|/  
|    2 files changed, 29 insertions(+), 14 deletions(-)
* c4e62ba CommitMsg
|  2 files changed, 39 insertions(+), 16 deletions(-)
* c2bb13f CommitMsg
   4 files changed, 241 insertions(+)

I'd like to shorten this to (obviously with different IDs as appropriate):

*   575e3b5 Merge pull request #68 from chris/feature # KEEP THIS MERGE!
|\  
| * 5dbc3f1 Actual feature changes
| |  2 files changed, 2 insertions(+), 2 deletions(-)
| * 35abc98 Cleanup/prep
|/  
|    2 files changed, 22 insertions(+), 16 deletions(-) 
* 4182dac CommitMsg #40 #SQUASHED-PR
|  2 files changed, 15 insertions(+), 6 deletions(-)
* 3b42762 CommitMsg
|  2 files changed, 29 insertions(+), 14 deletions(-)
* c4e62ba CommitMsg
|  2 files changed, 39 insertions(+), 16 deletions(-)
* c2bb13f CommitMsg
   4 files changed, 241 insertions(+)

So I'd like to get rid of the 'USELESS' merges, which are all 'empty' merges (no merge changes), but I'd like to preserve the history/grouping associated with the also-'empty' KEEP merge at the top, which groups those commits together into one 'changeset'.

Or looking at another example in the traditional simplified-sideways-history:

A -- B -- C -- D   ==>  A -- B --- D'
 \----\--/   /                \-E-/
       \----E 

I have tried solutions to remove 'empty' merges (like this), but those remove all empty merges, and I want to keep the 'useful' empty merges as displayed in the examples...

As far as I can tell, the 'useless' empty merges don't contain any commits that aren't all the way to the left/top in the history. Is there a way to filter those out cleanly? I guess I don't really even know how to describe/define those...

Note that the given example was intentionally simple. For what it's worth, later in the history this repo looks like this, all of which I'd like to prune:

*   3d37e42 Merge pull request #239 from jim/topic-dev
|\  
| *   05eaf9e Merge pull request #7 from org/master
| |\  
| |/  
|/|  
* |   1576482 Merge pull request #193 from john/master
|\ \  
| * \   187100e Merge branch 'master' of github.com:org/repo into master
| |\ \  
| * \ \   067cc55 Merge branch 'master' of github.com:org/repo into master
| |\ \ \  
| * \ \ \   a69e3d2 Merge branch 'master' of github.com:org/repo into master
| |\ \ \ \  
| | |/ / /  
* | | | |   0ce6813 Merge pull request #212 from jim/feature
|\ \ \ \ \  
| | |_|_|/  
| |/| | |   
| * | | |   0f5352e Merge pull request #5 from org/master
| |\ \ \ \  
| |/ / / /  

Best answer

OK, I don't think this is perfect, but it does solve the problem in this particular case; there are cases where it doesn't quite clean up as much as it perhaps could, but it's a step if anyone is interested:

git filter-branch --commit-filter '
if ! git rev-parse --verify "$GIT_COMMIT^2" 1>/dev/null 2>&1 ||
  [ "$(git log --no-merges "$GIT_COMMIT^2" "^$GIT_COMMIT^1" --oneline | wc -l)" -gt 0 ];
then
  #echo take $GIT_COMMIT >&2
  # Pick one:
  git_commit_non_empty_tree "$@" # Drop empty commits
  #git commit-tree "$@" # Keep empty commits
else
  #echo "breakup $GIT_COMMIT ($*)" >&2
  skip_commit "$1" "$2" "$3" # (quietly) only keep the first parent
fi' -f HEAD

The commit is kept when either 1) it has no second parent (git rev-parse returns an error if the referenced commit, $GIT_COMMIT^2, doesn't exist), or 2) the second parent ($GIT_COMMIT^2) contains non-merge commits that the first parent ($GIT_COMMIT^1) does not (see here). A kept commit is still dropped if it is empty; use git commit-tree instead if you want to keep empties. If the second parent exists and doesn't add anything useful, we skip the commit and intentionally pass only the first parent. I'm not sure this is 'legit', but it drops the second parent from the history, and it worked in my case (see caveats below).
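To try the same test outside filter-branch (for example, to preview how a given merge would be classified), the condition can be wrapped in a small shell function. This is only a sketch: the name merge_adds_commits and its exit-status convention are made up for illustration, and it uses git rev-list --count in place of git log ... | wc -l.

```shell
# Hypothetical helper (not part of git or filter-branch).
# Exit status: 0 = merge whose second parent brings in non-merge commits,
#              1 = merge whose second parent adds nothing new,
#              2 = not a merge (no second parent).
merge_adds_commits() {
  commit=$1
  # "^2" names the second parent; rev-parse fails if there is none.
  git rev-parse --verify --quiet "$commit^2" >/dev/null || return 2
  # Count non-merge commits reachable from parent 2 but not from parent 1.
  count=$(git rev-list --count --no-merges "$commit^2" "^$commit^1")
  [ "$count" -gt 0 ]
}
```

Running merge_adds_commits on each of the merges in the graph is a cheap way to check the outcome before rewriting anything.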

From the bottom-up:

*   575e3b5 Merge pull request #68 from chris/feature # KEEP THIS MERGE!
|\  
| * 5dbc3f1 Actual feature changes
| |  2 files changed, 2 insertions(+), 2 deletions(-)
| * 35abc98 Cleanup/prep
|/  
|    2 files changed, 22 insertions(+), 16 deletions(-)
*   c3b3d86 Merge pull request #46 from org/topic_branch-mods # USELESS-C
|\  
* \   892de05 Merge pull request #47 from org/topic_branch # USELESS-B
|\ \  
| |/  
|/|   
| *   e738d4b Merge branch 'master' into topic_branch # USELESS-A
| |\  
| |/  
|/|   
* | 4182dac CommitMsg #40 #SQUASHED-PR
| |  2 files changed, 15 insertions(+), 6 deletions(-)
* | 3b42762 CommitMsg
|/  
|    2 files changed, 29 insertions(+), 14 deletions(-)
* c4e62ba CommitMsg
|  2 files changed, 39 insertions(+), 16 deletions(-)
* c2bb13f CommitMsg
   4 files changed, 241 insertions(+)

It kept everything through SQUASHED-PR (note that commit ID 4182dac and its parents are retained, as their history didn't change). It decided USELESS-A should stick around because its second parent (4182dac) contains commits that its first parent (c4e62ba) does not. But then it looked at USELESS-B, whose second parent (which includes USELESS-A) doesn't add anything useful, so it dropped it (again, including USELESS-A). Then USELESS-C was just useless, so it got dropped, and KEEP had 'something useful' in its second parent, so it was retained. So you end up with:

*   63b4d39 Merge pull request #68 from chris/feature # KEEP THIS MERGE!
|\  
| * 9a5570d Actual feature changes
| |  2 files changed, 2 insertions(+), 2 deletions(-)
| * a251317 Cleanup/prep
|/  
|    2 files changed, 22 insertions(+), 16 deletions(-) 
* 4182dac CommitMsg #40 #SQUASHED-PR
|  2 files changed, 15 insertions(+), 6 deletions(-)
* 3b42762 CommitMsg
|  2 files changed, 29 insertions(+), 14 deletions(-)
* c4e62ba CommitMsg
|  2 files changed, 39 insertions(+), 16 deletions(-)
* c2bb13f CommitMsg
   4 files changed, 241 insertions(+)

Important Caveats

  • This only works for simple histories where merges only ever have two parents, because we explicitly pass "$1" "$2" "$3" to skip_commit, leaving off "$4" "$5", which would otherwise be included in "$@". If a commit has more than two parents, you'll have to adjust this to account for that. It shouldn't be too hard, but I'm not fixing it right now for a hypothetical; you may want to choose specific parents to drop.
  • If there were a 'useful' commit after USELESS-A before it got merged into USELESS-B (which arguably wouldn't be useless then), USELESS-A would not get pruned/dropped, so you'd still have some ugliness.
  • There are likely other scenarios where this doesn't work or could be improved. Please add suggestions in the comments (as usual) if you find any!
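
For merges with more than two parents, one way to approach the first caveat would be to test every parent after the first rather than only $GIT_COMMIT^2. A rough sketch, assuming the same notion of 'useful' as the filter above; the function name extra_parents_add_commits is hypothetical:

```shell
# Hypothetical helper (not part of git): generalises the "^2" test to
# merges with any number of parents.
# Exit status: 0 = some parent after the first brings in non-merge commits
#                  that the first parent does not already contain,
#              1 = no extra parent adds anything,
#              2 = not a merge.
extra_parents_add_commits() {
  # "rev-list --parents -n 1" prints: <commit> <parent1> <parent2> ...
  set -- $(git rev-list --parents -n 1 "$1")
  [ "$#" -ge 3 ] || return 2   # need the commit ID plus at least two parents
  shift                        # drop the commit's own ID
  first=$1
  shift                        # "$@" now holds the second and later parents
  for parent in "$@"; do
    n=$(git rev-list --count --no-merges "$parent" "^$first")
    if [ "$n" -gt 0 ]; then
      return 0
    fi
  done
  return 1
}
```

This only changes the decision. Whether every extra parent should be dropped, or just specific ones, is still a choice you have to make when calling skip_commit.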

Second answer

This is the heart of the problem:

I guess I don't really even know how to describe/define those...

Git is, at its heart, a graph-manipulation program, designed to build DAGs (Directed Acyclic Graphs) where each node in the graph is a commit. The fact that each commit carries a source snapshot as a sort of data payload is irrelevant to this process. (It's of course highly relevant to Git eventually being useful.)

You want to take the existing (post-filtering) DAG and build a different DAG. You'll need to define an algorithm for transforming the unwanted DAG into the wanted DAG. You don't necessarily have to use git filter-branch to achieve the transformation, but if you intend to do so, you'll have to further refine this transformation into an algorithm that works with "so-far" knowledge. Your filter can see the hash ID of the commit that filter-branch is proposing to copy; that's in $GIT_COMMIT. It can read that commit (using Git plumbing commands), and it can look up what already-copied commits were mapped to using the shell function map, as described in the git filter-branch documentation.

I, too, don't know quite how to define "useful merge". I think the most obvious algorithm, though, is one that is not (at least directly) suited to filter-branch: it's an iterative relaxation algorithm in which you start with the complete graph and repeatedly pluck out merge nodes, connecting their parents to their children, whenever those nodes are not useful. (It's still up to you to define not useful.) In the end, you have a list of nodes to keep and nodes to delete. That list is useful to a filter you write for filter-branch: you would now run git filter-branch with a --commit-filter that either runs git commit-tree as usual, or the provided skip_commit function as described in the documentation. The decision "keep" or "skip" is based on the list you generated with your relaxation algorithm.
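
As a rough illustration of the "build a list first, then filter" approach, the sketch below makes a single pass over the merges reachable from a revision and prints the ones it considers not useful. It is not the full iterative relaxation described above, and it assumes one particular definition of "not useful" (the second parent contributes no non-merge commits beyond the first parent), borrowed from the other answer rather than prescribed here. The function name list_skippable_merges is made up.

```shell
# Hypothetical helper (not part of git): print the IDs of merge commits
# reachable from $1 (default HEAD) whose second parent contributes no
# non-merge commits beyond what the first parent already has.
# Only the second parent is examined, and the graph is scanned once.
list_skippable_merges() {
  git rev-list --merges "${1:-HEAD}" | while read -r c; do
    n=$(git rev-list --count --no-merges "$c^2" "^$c^1")
    if [ "$n" -eq 0 ]; then
      printf '%s\n' "$c"
    fi
  done
}
```

The output could be saved to a file, and a --commit-filter could then call skip_commit when $GIT_COMMIT appears in that file and git commit-tree otherwise. Use an absolute path for the file, since filter-branch runs its filters from a temporary directory.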