Is it possible to get a list of all git object hashes of blobs which have been added to the repository by a given commit hash using the git command line tools?
I already tried achieving this with the git plumbing tool git-diff-tree. Maybe it's the wrong approach. Below is the best result I could get so far, but the (very long) man page didn't help me work out exactly how the output has to be interpreted.
$ git diff-tree --no-commit-id 2b53d04dbb7cd35d030ddc59b13c0836a87daeb7
:100644 100644 03f15b592c7d776da37e3d4372c215b14ff8820f 6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8 M file1.ts
:100644 100644 b5083bdb9c31005ebd16835a0f49dc848d3f387a 4b7f9e6624a66fec0510d76823303017e224c9d7 M file2.ts
:100644 100644 368d64862e6aa2a0110f201c8a5193d929e2956d 0e51626a9866a8a3896489f497fbd745a5f4a9f2 M file3.ts
:040000 040000 c332b1e576af0dbb93cc875106bc06c3de6b74c8 f7f3478a9b0eaac85719699d97e323563a1b102b M some_folder
Do the first and second hashes on each line show the old and new blob objects for the modified file, respectively? In the worst case I could get that information by parsing the output.
My primary goal was to find a command line which works as below:
$ git <command> <option1> <option2> 368d64862e6aa2a0110f201c8a5193d929e2956d
6e0ed0b1ed56e9a35a3be52a9de261c8ffcccae8
4b7f9e6624a66fec0510d76823303017e224c9d7
0e51626a9866a8a3896489f497fbd745a5f4a9f2
Edit below in response to @torek
In response to @torek's answer, I want to be clearer about my intentions, because he is absolutely right to point out that new isn't necessarily new.
I am planning to use git rev-list --reverse <branch> to get a list of all commits on that branch in commit order. Then I want to visit every commit in this order and collect, per commit, the blob hashes that are seen for the first time on this branch.
The end result should be something like the following:
C:368d64862e6aa2a0110f201c8a5193d929e2956d
B:03f15b592c7d776da37e3d4372c215b14ff8820f
B:4b7f9e6624a66fec0510d76823303017e224c9d7
B:c332b1e576af0dbb93cc875106bc06c3de6b74c8
C:5521a02ce1bc4f147d0fa39a178512476764dd66
B:e5fa44f2b31c1fb553b6021e7360d07d5d91ff5e
B:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
C:a3db5c13ff90a36963278c6a39e4ee3c22e2a436
B:4888920a568af4ef2d2f4866e75b4061112a39ea
.
.
.
C: commit
B: blob
If this isn't easily done, it would be OK to do two passes. In the first pass, blobs can be mentioned multiple times in different commits for the reasons you have pointed out:
- adding a file with the same content as another file
- a file ending up with the same content as before after it has been modified
I could then do a second pass, piping the file through awk '!x[$0]++', which will remove any duplicates. This wouldn't be very efficient but would get the result I want.
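To make the idea concrete, here is a rough sketch of the two passes combined into one pipeline. The function name list_first_seen_blobs is made up for illustration, and I am assuming that the fourth field of each raw git diff-tree line is the new-side hash; I have not checked this against a large repository.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative only): print "C:<commit>" for
# every commit reachable from the given branch, oldest first, followed by
# "B:<blob>" for each new-side blob hash in that commit's raw diff.
list_first_seen_blobs() {
    # --topo-order guarantees that parents are listed before their children.
    git rev-list --reverse --topo-order "$1" | while read -r commit; do
        echo "C:$commit"
        # -r            recurse into subtrees, so blobs are listed, not trees
        # --root        diff the initial commit against the empty tree
        # -m            also diff merge commits, once against each parent
        # --no-renames  keep rename detection off
        # Raw line layout: :<oldmode> <newmode> <oldhash> <newhash> <status>
        # Only regular files (100xxx) and symlinks (120000) on the new side
        # are kept, which skips deletions and submodule entries.
        git diff-tree -r --root -m --no-commit-id --no-renames "$commit" |
            awk '$2 ~ /^1(00|20)/ { print "B:" $4 }'
    done | awk '!x[$0]++'   # second pass: keep only the first occurrence
}
```

Called as list_first_seen_blobs <branch>, this should print the C:/B: listing shown above, with every blob hash appearing only under the first commit that mentions it.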
I hope I made my intentions clear now. Any thoughts?
Yes and/or no: you have to define precisely what you mean by added to the repository.
Suppose, for instance, that I start with a totally empty repository. I create README.md, git add it, and commit. README.md is stored as a blob, and its hash ID begins with 43b18ad. Later, I write a new file and commit that as well.
If we look at this commit, we'll see the new file. If we look at it with git show --raw, we'll see it in the git diff-tree format. This seems like a blob that's been added to the repository, but wait, there's something awfully familiar about 43b18ad. Yes, that's the same hash ID as README.md. It's one blob, but two files. Is that really newly added?
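You can check the "one blob, two files" effect yourself. This is only a small sketch: it works in a throwaway directory, and the file name copy.md and the file content are made up for the demonstration.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
git init -q .

printf 'hello\n' > README.md
cp README.md copy.md    # second file, byte-for-byte identical content

# git hash-object prints the blob ID that a file's content would get.
id_readme=$(git hash-object README.md)
id_copy=$(git hash-object copy.md)

echo "README.md: $id_readme"
echo "copy.md:   $id_copy"

if [ "$id_readme" = "$id_copy" ]; then
    echo "same blob, two files"
fi
```

The blob ID depends only on the content, not on the file name, so both files map to the same object.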
If your answer to the above is "yes, it's new, even though it's old", that might answer this second question. If your answer is "no, it's not new", what about a commit that reintroduces a blob that was removed in a previous commit? Or, if two commits I and J, made in parallel on two branches, both introduce the same blob, which one actually adds it as all-new, and which one merely duplicates the other?
In general, if you want all new, you'll have to walk the entire commit graph, inspecting each commit's tree (see git ls-tree -r), and select which commits first introduce a blob object ID that is not already in some earlier (parent-wise and/or date-and-time-wise) commit object. If you want "newly added as a new file in this commit", inspect the commit and its parent(s), perhaps using git diff-tree or similar. Note that an all-new file has an all-zero mode and hash on the parent side and a status letter of A (added), while a file modified from its parent has a status letter of M (modified) and two nonzero hashes. A file nominally deleted (one that existed in the parent but no longer does in the child) has a status letter of D (deleted). If you enable rename detection, you'll get R statuses and similarity index values; you may want to disable this, or at least force the similarity testing to 100%.
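Both interpretations can be sketched in a few lines of shell. Treat this as a starting point rather than a finished tool: the function names first_introduced_blobs and added_in_commit are invented for illustration, "earlier" is taken to mean "earlier in a parents-first walk of one branch", and the first function re-lists every tree, so it will be slow on a large history.

```shell
#!/bin/sh
# Hypothetical helper: walk the commits reachable from "$1", parents first,
# and print "C:<commit>" followed by "B:<blob>" for every blob ID that no
# earlier commit in this walk contained.
first_introduced_blobs() {
    git rev-list --reverse --topo-order "$1" | while read -r commit; do
        echo "C:$commit"
        # ls-tree -r lists every entry of the commit's tree recursively:
        #   <mode> <type> <hash><TAB><path>
        # Submodule entries have type "commit" and are skipped here.
        git ls-tree -r "$commit" | awk '$2 == "blob" { print "B:" $3 }'
    done | awk '/^C:/ || !seen[$0]++'   # keep only the first sighting of a blob
}

# Hypothetical helper: blob IDs of the files that commit "$1" adds as new
# paths relative to its parent (status letter A), with rename detection off.
# Merge commits print nothing here unless -m or similar is added.
added_in_commit() {
    git diff-tree -r --root --no-commit-id --no-renames --diff-filter=A "$1" |
        awk '{ print $4 }'   # fourth field of a raw diff line: new-side hash
}
```

Note that the two can disagree: a copy of an existing file shows up in added_in_commit (it is a new path) but not in first_introduced_blobs (the blob was already there), which is exactly the "new, even though it's old" case.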