What is the format of a Git tree object's content?
The content of a blob object is blob [size of string] NUL [string], but what is it for a tree object?
The format of a tree object:
tree [content size]\0[Entries having references to other trees and blobs]
The format of each entry having references to other trees and blobs:
[mode] [file/folder name]\0[SHA-1 of referencing blob or tree]
I wrote a script that inflates (decompresses) tree objects. Its output looks like this:
tree 192\0
40000 octopus-admin\0 a84943494657751ce187be401d6bf59ef7a2583c
40000 octopus-deployment\0 14f589a30cf4bd0ce2d7103aa7186abe0167427f
40000 octopus-product\0 ec559319a263bc7b476e5f01dd2578f255d734fd
100644 pom.xml\0 97e5b6b292d248869780d7b0c65834bfb645e32a
40000 src\0 6e63db37acba41266493ba8fb68c76f83f1bc9dd
A mode whose first character is 1 indicates a reference to a blob, i.e. a file (submodule entries, which use mode 160000 and point to a commit, are an exception). In the example above, pom.xml is a blob and the others are trees.
Note that I added newlines and spaces after \0 for the sake of pretty printing. Normally the content contains no newlines. I also converted the 20 bytes (i.e. the SHA-1 of the referenced blob or tree) into a hex string to make them easier to read.
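The script itself is not shown, so here is a minimal Python sketch of the same idea, assuming the object has already been decompressed. The function name `parse_tree` and the two-entry sample are made up for illustration; the sample reuses two hashes from the listing above. Treating every mode other than 40000 as a blob is a simplification.

```python
def parse_tree(raw: bytes):
    """Split an uncompressed tree object into (mode, kind, sha1_hex, name) tuples."""
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    if obj_type != b"tree" or int(size) != len(body):
        raise ValueError("not a well-formed tree object")

    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)        # end of the octal mode
        nul = body.index(b"\0", space + 1)   # end of the name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8", "replace")
        sha1 = body[nul + 1:nul + 21]        # 20 raw bytes, not hex
        if len(sha1) != 20:
            raise ValueError("truncated entry")
        # Simplification: anything that is not a tree is reported as a blob.
        kind = "tree" if mode == "40000" else "blob"
        entries.append((mode, kind, sha1.hex(), name))
        pos = nul + 21
    return entries


# Hypothetical two-entry tree, built only to exercise the parser.
body = (
    b"100644 pom.xml\0" + bytes.fromhex("97e5b6b292d248869780d7b0c65834bfb645e32a")
    + b"40000 src\0" + bytes.fromhex("6e63db37acba41266493ba8fb68c76f83f1bc9dd")
)
raw = b"tree %d\0" % len(body) + body
for entry in parse_tree(raw):
    print(*entry)
```

Walking the bytes by index rather than splitting on newlines matters here, because the binary SHA-1 may contain any byte value, including 0x0a and 0x00.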
Expressed as a BNF-like pattern, a git tree contains data of the form
(?<tree> tree (?&SP) (?&decimal) \0 (?&entry)+ )
(?<entry> (?&octal) (?&SP) (?&strnull) (?&sha1bytes) )
(?<strnull> [^\0]+ \0)
(?<sha1bytes> (?s: .{20}))
(?<decimal> [0-9]+)
(?<octal> [0-7]+)
(?<SP> \x20)
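That is, a git tree begins with a header made of the literal string tree, a space (i.e., the byte 0x20), and the ASCII-encoded decimal length of the content.
After a NUL (i.e., the byte 0x00) terminator, the tree contains one or more entries of the form: ASCII-encoded octal mode, a space, the name, a NUL, and the SHA-1 hash encoded as 20 raw bytes.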
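For readers who do not use Perl, the same grammar can be flattened into one Python regular expression. This is only a sketch: the names `TREE_RE` and `looks_like_tree` are invented here, and, like the pattern above, it checks the shape of the data but not that the declared length matches.

```python
import re

# Flat equivalent of the grammar: header, NUL, then one or more entries.
TREE_RE = re.compile(
    rb"tree [0-9]+\0"                # "tree", SP, decimal length, NUL
    rb"(?:[0-7]+ [^\0]+\0.{20})+",   # octal mode, SP, name, NUL, 20 raw bytes
    re.DOTALL,                       # let . match any byte, including \n
)


def looks_like_tree(raw: bytes) -> bool:
    """True if raw has the overall shape of an uncompressed tree object."""
    return TREE_RE.fullmatch(raw) is not None
```

`re.DOTALL` plays the role of Perl's `(?s: ...)` group: without it, a hash containing the byte 0x0a would make the match fail.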
Git then feeds the tree data to zlib’s deflate for compact storage.
Remember that git blobs are anonymous. Git trees associate names with SHA1 hashes of other content that may be blobs, other trees, and so on.
To demonstrate, consider the tree associated with git’s v2.7.2 tag, which you may want to browse on GitHub.
$ git rev-parse v2.7.2^{tree}
802b6758c0c27ae910f40e1b4862cb72a71eee9f
The code below requires the tree object to be in “loose” format. I do not know of a way to extract a single raw object from a packfile, so I first ran git unpack-objects on the pack files from my clone into a new repository. Be aware that this expanded a .git directory that began at around 90 MB to some 1.8 GB.
UPDATE: Thanks to max630 for showing how to unpack a single object.
#! /usr/bin/env perl

use strict;
use warnings;

use subs qw/ git_tree_contents_pattern read_raw_tree_object /;

use Compress::Zlib;

my $treeobj = read_raw_tree_object;
my $git_tree_contents = git_tree_contents_pattern;

die "$0: invalid tree" unless $treeobj =~ /^$git_tree_contents\z/;

die "$0: unexpected header" unless $treeobj =~ s/^(tree [0-9]+)\0//;
print $1, "\n";

# e.g., 100644 SP .gitattributes \0 sha1-bytes
while ($treeobj) {
  # /s is important so . matches any byte!
  if ($treeobj =~ s/^([0-7]+) (.+?)\0(.{20})//s) {
    my($mode,$name,$bytes) = (oct($1),$2,$3);
    printf "%06o %s %s\t%s\n",
      $mode, ($mode == 040000 ? "tree" : "blob"),
      unpack("H*", $bytes), $name;
  }
  else {
    die "$0: unexpected tree entry";
  }
}

sub git_tree_contents_pattern {
  qr/
  (?(DEFINE)
    (?<tree> tree (?&SP) (?&decimal) \0 (?&entry)+ )
    (?<entry> (?&octal) (?&SP) (?&strnull) (?&sha1bytes) )
    (?<strnull> [^\0]+ \0)
    (?<sha1bytes> (?s: .{20}))
    (?<decimal> [0-9]+)
    (?<octal> [0-7]+)
    (?<SP> \x20)
  )

  (?&tree)
  /x;
}

sub read_raw_tree_object {
  # $ git rev-parse v2.7.2^{tree}
  # 802b6758c0c27ae910f40e1b4862cb72a71eee9f
  #
  # NOTE: extracted using git unpack-objects
  my $tree = ".git/objects/80/2b6758c0c27ae910f40e1b4862cb72a71eee9f";

  open my $fh, "<", $tree or die "$0: open $tree: $!";
  binmode $fh or die "$0: binmode: $!";

  local $/;
  my $treeobj = uncompress <$fh>;
  die "$0: uncompress failed" unless defined $treeobj;

  $treeobj
}
Watch our poor man’s git ls-tree in action. The output is identical except that it also prints the tree marker and length.
$ diff -u <(cd ~/src/git; git ls-tree 802b6758c0) <(../rawtree)
--- /dev/fd/63 2016-03-09 14:41:37.011791393 -0600
+++ /dev/fd/62 2016-03-09 14:41:37.011791393 -0600
@@ -1,3 +1,4 @@
+tree 15530
 100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
 100644 blob 1c2f8321386f89ef8c03d11159c97a0f194c4423 .gitignore
 100644 blob e5b4126bec557db55924b7b60ed70349626ea2c4 .mailmap
@lemiorhan's answer is correct but misses a small, important detail. The tree entry format is:
[mode] [file/folder name]\0[SHA-1 of referencing blob or tree]
What matters is that [SHA-1 of referencing blob or tree] is in binary form, not in hex. This Python 3 snippet parses the uncompressed content of a tree object into entries:
import re

# body: the uncompressed tree content as bytes
entries = [
    (mode.decode(), name.decode(errors='replace'), sha1.hex())
    for mode, name, sha1 in
    # re.DOTALL lets . match any byte of the binary SHA-1, including \n
    re.findall(rb'(\d+) (.*?)\0(.{20})', body, re.DOTALL)
]
As suggested, Pro Git explains the structure well. To show a tree pretty-printed, use:
git cat-file -p 4c975c5f5945564eae86d1e933192c4a9096bfe5
To show the same tree in its raw but uncompressed form, use:
git cat-file tree 4c975c5f5945564eae86d1e933192c4a9096bfe5
The structure is essentially the same, with hashes stored as binary and null-terminated filenames.
I will try to elaborate a bit more on @lemiorhan's answer by means of a test repo.
Create a test repo
Create a test project in an empty folder:
That is:
Create the local Git repo:
The last command returns the hash of the top level tree.
Read a tree content
To print the content of a tree in human readable format use:
In this case 0b6e66 are the first six characters of the top tree's hash. You can do the same for folder1.
To get the same content but in raw format use:
The content is similar to the one physically stored as a file in compressed format, but it misses the initial string:
To get the actual content, we need to uncompress the file storing the c1f4bf tree object. The file we want is, given the 2/38 path format:
This file is compressed with zlib, therefore we obtain its content with:
We learn the tree content size is 67.
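The steps above (locate the file via the 2/38 layout, inflate it, read the header) can also be sketched in Python. The function names are hypothetical, and the sketch assumes a full 40-character hash, so an abbreviated one such as c1f4bf would first have to be expanded, for example with git rev-parse.

```python
import os
import zlib


def loose_object_path(git_dir: str, sha1_hex: str) -> str:
    """2/38 layout: the first two hex digits name the directory, the rest the file."""
    return os.path.join(git_dir, "objects", sha1_hex[:2], sha1_hex[2:])


def read_loose_object(git_dir: str, sha1_hex: str):
    """Return (type, declared_size, content) for a loose object."""
    with open(loose_object_path(git_dir, sha1_hex), "rb") as fh:
        raw = zlib.decompress(fh.read())
    # The header "<type> <size>" ends at the first NUL byte.
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    return obj_type.decode("ascii"), int(size), content
```

Called as `read_loose_object(".git", full_hash)`, the second element of the result is the declared content size discussed above. The call fails with `FileNotFoundError` if the object lives in a packfile rather than as a loose file.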
Note that, since the terminal is not made for printing binaries, it might eat some part of the string or show other weird behaviour. In this case pipe the commands above with | od -c or use the manual solution in the next section.
Generate manually the tree object content
To understand the tree generation process we can generate it ourselves starting from its human readable content, e.g. for the top tree:
Each object ASCII SHA-1 hash is converted and stored in binary format. If what you need is just a binary version of the ASCII hashes, you can do it with:
So the blob 887ae9333d92a1d72400c210546e28baa1050e44 is converted to
If we want to create the whole tree object, here is an awk one-liner:
The function bsha converts the SHA-1 ASCII hashes to binaries. The tree content is first put into the variable t and then its length is calculated and printed in the END{...} section.
As observed above, the console is not very suitable for printing binaries, so we might want to replace them with their \x## format equivalent:
The output should be a good compromise for understanding the tree content structure. Compare the output above with the general tree content structure
where each Object Entry is like:
Modes are a subset of UNIX filesystem modes. See Tree Objects on Git manual for more details.
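As a cross-check of the structure described above, here is a rough Python sketch that rebuilds the raw tree bytes from a human-readable listing and prints them with non-printable bytes as \x##. It is not the awk one-liner from this answer; `build_tree`, `escaped` and the file name `file1` are made up for illustration, and only the blob hash is taken from the text. The sketch keeps entries in the order given, which matches Git as long as the listing comes from Git itself.

```python
def build_tree(listing: str) -> bytes:
    """Turn `git cat-file -p`-style lines into the raw tree object bytes."""
    body = b""
    for line in listing.splitlines():
        meta, name = line.split("\t", 1)      # "<mode> <type> <sha1>\t<name>"
        mode, _type, sha1_hex = meta.split()
        mode = mode.lstrip("0")               # printed as 040000, stored as 40000
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha1_hex)
    return b"tree %d\0" % len(body) + body


def escaped(data: bytes) -> str:
    """Printable ASCII as-is; every other byte (and the backslash) as \\x##."""
    return "".join(
        chr(b) if 32 <= b < 127 and b != 0x5C else "\\x%02x" % b
        for b in data
    )


# Hypothetical one-entry listing; only the hash comes from the text above.
listing = "100644 blob 887ae9333d92a1d72400c210546e28baa1050e44\tfile1"
print(escaped(build_tree(listing)))
```

`bytes.fromhex` does the hex-to-binary conversion of each hash, and the length in the header is computed only after all entries have been concatenated.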
We need to make sure that the results are consistent. To this end, we might compare the checksum of the awk generated tree with the checksum of the Git stored tree.
As for the latter:
As for the home made tree:
The checksum is the same.
Calculate the tree object checksum
The more or less official way to get it is:
To calculate it manually, we need to pipe the content of the script-generated tree into the shasum command. Actually we have already done this above (to compare the generated and stored content). The result was:
and is the same as with git mktree.
Packed objects
You might find that, for your repo, you are unable to find the files .git/objects/XX/XXX... storing the Git objects. This happens because some or all "loose" objects have been packed into one or more .git/objects/pack/*.pack files.
To unpack the repo, first move the pack files away from their original position, then git-unpack the objects.
To repack when you are done with experiments: