CAUTION: readers of this article may experience rad skill gain

Git is simple (on the inside)

Ben Straub

Sometimes you actually do need to know how it works. GitHubber Ben Straub shows how learning Git’s internals helps explain its odder qualities.

The way we use most computer systems is through a
metaphor.

When you’re using a word processor, you don’t
think about the byte-by-byte representation of each character, or
the control sequences that determine which words are italic and
which are bold-face. The software abstracts that away, presenting a
clean metaphor – inked characters on a sheet of paper. You write
your words, decide on their font and style, and when you’re ready
for your words to be on actual paper, you print.

When you’re adjusting the exposure of a photo,
you don’t want to think about the mathematical acrobatics necessary to
change the R, G, and B bytes of every pixel just so, or the
resampling algorithm needed to adjust the size or rotation. The
metaphor is something like a darkroom, and you adjust exposure and
brightness, paint away red-eye, and erase the red wine stain on the
wedding dress.

And with most version control systems, you don’t
want to know how the data is stored and retrieved, or how the bytes
are ordered during a network transmission. You just want to write
your code, and every once in a while save a snapshot to someplace
safe, so the metaphor is defined on that level. The underlying data
model is complicated, and you rarely (if ever) need to know what it
is, because the UI is pretty effective at abstracting those details
away.

With Git, the story is a little different. The
metaphor it presents is a directed acyclic graph (DAG) of commit
nodes, which is to say there is no metaphor – the data model
actually is a DAG. If you try to re-use the metaphor you learned
from another system, you’re going to run into trouble.

The good news is that this data model is easy to
understand, and that figuring it out will make you better at using
Git.

Objects

Just about everything in a Git repository is
either an object or a
ref.

Objects are how Git stores content. They’re
stored in .git/objects, which is sometimes called the object
database or ODB. Objects in the ODB are immutable; once you create
one, you can’t change it. This is because Git uses an object’s
SHA-1 hash to identify and find it, and if you were to change the
content of an object, its hash would change.
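You can watch this happen from the command line. The following is a minimal sketch, assuming a Git that hashes with SHA-1 (the default at the time of writing) and that the coreutils `sha1sum` tool is installed; the sample text is made up for illustration. An object's ID is the hash of a short header plus the content, so any change to the content gives a different ID.

```shell
# Ask Git for the ID of a blob containing "hello world\n".
# (No repository is needed, and nothing is written to disk.)
git_id=$(printf 'hello world\n' | git hash-object --stdin)

# Recompute it by hand: SHA-1 over the header "blob <size>\0" plus the content.
# The content is 12 bytes long, including the trailing newline.
manual_id=$(printf 'blob 12\0hello world\n' | sha1sum | cut -d' ' -f1)

echo "$git_id"
echo "$manual_id"   # same value as the line above

# Change a single character and the ID no longer matches.
other_id=$(printf 'hello World\n' | git hash-object --stdin)
echo "$other_id"
```

Since the ID is derived from the content, "editing" an object really means creating a new object with a new ID.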

Objects come in four flavors: blobs, trees,
commits, and tag annotations. Blobs are chunks of
data that Git doesn’t interpret with any kind of structure, and
they’re how Git stores your files. Objects are actually pretty easy to
inspect (listing 1).

Listing 1: A blob object. One line truncated for space.

# Print the object's type
$ git cat-file -t d7abd6
blob

# Print the first 5 lines of the object's content
$ git cat-file -p d7abd6 | head -n 5
<!DOCTYPE html>
<!--[if IEMobile 7 ]><html class="no-js iem7"><![endif]-->
<!--[if lt IE 9]><html class="no-js lte-ie8"><![endif]-->
<!--[if (gt IE 8)|(gt IEMobile 7)| [...] -->
<head>


Note that there’s no filename
here. Git expects file renames to be fairly common, and if the
filename were embedded with the content, you’d have to keep lots of
copies of that object where the only difference is the
filename.

You use git cat-file
because Git has optimized the storage of objects. They’re
zlib-compressed, and Git sometimes strings lots of them together into
big packfiles, so if you look in .git/objects, you may not see
anything you recognize as an object.
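To get a feel for what is in that directory, here is a small sketch that writes one blob into a throwaway repository and then finds it on disk. It assumes the object has not been packed yet (a freshly written object normally is not); the temporary directory and the sample text are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write a blob into the object database and capture its ID.
id=$(printf 'hello world\n' | git hash-object -w --stdin)

# A freshly written ("loose") object lives in a file named after its ID:
# the first two hex digits form a directory name, the rest is the filename.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
ls -l ".git/objects/$dir/$file"

# The file on disk is compressed, so read it back through Git instead.
content=$(git cat-file -p "$id")
echo "$content"
```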

The second type of object is called a
tree, and it’s how Git stores directory
structures (listing 2).

Listing 2: Inside a tree object. SHA-1 hashes truncated for space.

$ git cat-file -t 8f5b65
tree

$ git cat-file -p 8f5b65 | head -n 5
100644 blob 08b8e...493c0 after_footer.html
100644 blob 11517...fce19 archive_post.html
100644 blob 8ad5a...5b988 article.html
040000 tree 5c216...c8810 asides
040000 tree 52deb...e3dad custom


Only blobs are unstructured; Git expects a
fairly specific format for all the others. Each line of a tree
object contains the file’s permission flags, what type it is
(blobs are files, trees are subdirectories), the SHA-1 hash of the
object, and a filename. So the tree type is responsible for the
names and locations of things, and the blob type is responsible for
their contents.
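If you want to build a tree yourself, something like the following should do it. This is only a sketch in a throwaway repository; the file names and contents are invented for the example. It stages two files, asks Git to write the staged state out as tree objects, and lists the result.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical content: one file at the top level, one in a subdirectory.
mkdir asides
echo 'top-level file' > article.html
echo 'nested file'    > asides/note.html

# Stage the files, then write the staged state out as tree objects.
git add article.html asides/note.html
root=$(git write-tree)

# The root tree has one entry per name: a blob for the file,
# and a tree for the directory.
listing=$(git ls-tree "$root")
echo "$listing"

# The subdirectory is a tree object of its own, with its own entries.
sub_listing=$(git ls-tree "$root:asides")
echo "$sub_listing"
```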

The third type of object is a
commit. This is how Git represents a
snapshot in history (listing 3).

Listing 3: Inside a commit object.

$ git cat-file -t e365b1
commit

$ git cat-file -p e365b1
tree 58c796e7717809c2ca2217fc5424fdebdbc121b1
parent d4291dfddfae86cfacec789133861098cebc67d4
author Ben Straub <bs@github.com> 1380719530 -0700
committer Ben Straub <bs@github.com> 1380719530 -0700

Fix typo, remove false statement


A commit has exactly one tree reference, which
is the root directory of the commit. It has zero or more parents,
which are references to other commits, and it has some metadata
about the commit – who made it, when it was made, and what it’s
about.
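The "zero or more parents" part is easy to check for yourself. Here is a rough sketch in a scratch repository (the identity, file name, and messages are placeholders): the very first commit has no parent line at all, and the second one names the first as its parent.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'first' > notes.txt
git add notes.txt
git commit -q -m 'First commit'
first=$(git rev-parse HEAD)

echo 'second' >> notes.txt
git commit -q -a -m 'Second commit'

# The root commit has a tree line but no parent line.
root_parents=$(git cat-file -p "$first" | grep -c '^parent ' || true)
echo "$root_parents"

# The second commit names exactly one tree and one parent.
git cat-file -p HEAD
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
tree=$(git cat-file -p HEAD | sed -n 's/^tree //p')
```

A merge commit would simply carry more than one parent line.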

There’s just one more type of object, and it’s not used
very often. It’s called a tag annotation, and it’s
used to make a tag with comments.

Listing 4: Inside a tag annotation object.

$ git cat-file -t 849a5e34a
tag

$ git cat-file -p 849a5e34a
object a65fedf39aefe402d3bb6e24df4d4f5fe4547750
type commit
tag hard_tag
tagger Ben Straub <bs@github.com> Fri May 11 11:47:58 2012 -0700

Tag on tag


I’ll say more about how these work later; for
now, just note the object SHA that’s stored in there.

That’s it! You can count the number of object types on one
hand! See how simple this is?

Refs

References (or refs)
are nothing more than pointers to objects or other refs. They
consist of two pieces of information: the name of the ref, and
where it points. If a ref points to an object, it’s called a
direct ref; if it points to another ref, it’s
called a symbolic ref.

Most refs are direct. To confirm this, check the contents
of anything under
.git/refs/heads;
they’re all plain-text files whose content is the SHA hash of the
commit they point to.

$ cat .git/refs/heads/master
2b67270f960563c55dd6c66495517bccc4f7fb17

Git also keeps around a few symbolic refs for
specific purposes. The most commonly useful one is
HEAD, which usually points to the branch you
have checked out:

$ cat .git/HEAD
ref: refs/heads/master
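Reading the files directly works, but Git also has plumbing commands that answer the same questions wherever the refs happen to be stored. A small sketch, using a scratch repository whose branch is explicitly named master to match the examples here (the identity and file are placeholders):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the branch is called master
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'content' > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# HEAD is symbolic: it names another ref.
head_target=$(git symbolic-ref HEAD)
echo "$head_target"       # prints refs/heads/master

# The branch is direct: it resolves to a commit ID.
branch_commit=$(git rev-parse refs/heads/master)
echo "$branch_commit"

# Resolving HEAD follows the chain and lands on the same commit.
head_commit=$(git rev-parse HEAD)
echo "$head_commit"
```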

Now that we know how refs work, let’s take
another stab at that tag annotation object we saw earlier. Remember
that refs are basically just names for locations; there’s no
commentary associated with them, and you can change them at any
time. Tag annotations solve both of these issues by putting
ref-like information into the ODB (making it immutable, and
allowing it to have more content), then by making it findable by
attaching a regular tag ref to it. The whole scheme looks like
this:

tag (ref) --> tag_ann (odb) --> commit
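You can walk that chain by hand. The sketch below uses a scratch repository with made-up tag names and messages; it creates one annotated tag and one plain tag, then checks what each ref actually points at.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'content' > file.txt
git add file.txt
git commit -q -m 'Initial commit'

# -a creates a tag object in the ODB, plus a ref (refs/tags/v1.0) pointing at it.
git tag -a v1.0 -m 'First release'

tag_obj=$(git rev-parse v1.0)              # the ref points at the tag object,
tag_type=$(git cat-file -t "$tag_obj")     # whose type is "tag",
target=$(git rev-parse 'v1.0^{commit}')    # which in turn points at a commit.
echo "$tag_obj ($tag_type) -> $target"

# A plain tag, by contrast, is only a ref pointing straight at the commit.
git tag v1.0-light
light_type=$(git cat-file -t "$(git rev-parse v1.0-light)")
echo "$light_type"
```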

Note that this opens up a whole new universe of
possibility: refs don’t have to point to commits. They can point to
any kind of object, which means you could
technically set up something like this (though it’s not clear why
you’d want to):

branch --> tag --> tag_ann_a --> tag_ann_b --> blob
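A much shorter version of the same idea is a ref that points straight at a blob. This is a sketch only, in a scratch repository with an invented tag name; note that no commit or tree is involved at any point.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store some content as a blob.
blob=$(printf 'just some bytes\n' | git hash-object -w --stdin)

# Create a plain tag (just a ref) that points directly at the blob.
git tag blob-tag "$blob"

ref_type=$(git cat-file -t blob-tag)
ref_content=$(git cat-file -p blob-tag)
echo "$ref_type: $ref_content"
```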

Three Trees

Tree-type objects in the ODB aren’t the only
tree that Git likes to think about. During your day-to-day work,
you’ll deal with three of them: HEAD, the index, and the work
tree.

HEAD is the last commit that
was made, and it’s the parent of your next commit. Technically,
it’s a symbolic ref that points to the branch you’re on, which
points to the last commit, but for the purposes of this section
we’ll simplify a bit.

The index is a
proposal for the next commit. When you do a checkout, Git copies
the HEAD tree to the index, and when you type
git commit -m 'foo', it’s going to take what’s in the
index and store it in the ODB as the new commit’s tree.

The work tree is a
sandbox. You can safely create, update, and delete files at will in
this area, knowing that Git has backups of everything. This is how
Git makes your content available to the rest of the universe of
tools.

There are a few commands whose
job is mainly to deal with these three trees.

  • git checkout – Copies content from
    HEAD into the index and work tree. It can also move HEAD
    first.

  • git add – Copies content from the
    work tree to the index.

  • git commit – Copies content from the index
    to HEAD.
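One way to see the three trees drift apart and come back together is to compare them with git diff after each step. A minimal sketch, in a scratch repository with placeholder file contents: plain `git diff` compares the work tree with the index, and `git diff --cached` compares the index with HEAD.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'first version' > foo.txt
git add foo.txt
git commit -q -m 'Add foo'

# Edit the file: only the work tree changes.
echo 'second version, somewhat longer' > foo.txt
unstaged_before=$(git diff --name-only)            # work tree vs. index
staged_before=$(git diff --cached --name-only)     # index vs. HEAD

# git add copies the work-tree version into the index.
git add foo.txt
unstaged_after_add=$(git diff --name-only)
staged_after_add=$(git diff --cached --name-only)

# git commit records the index as a new commit, so all three agree again.
git commit -q -m 'Update foo'
staged_after_commit=$(git diff --cached --name-only)

echo "before add:   unstaged=[$unstaged_before] staged=[$staged_before]"
echo "after add:    unstaged=[$unstaged_after_add] staged=[$staged_after_add]"
echo "after commit: staged=[$staged_after_commit]"
```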

Reset is Easier Now

Now that we know a few things, some of Git’s
seemingly-strange commands start to make sense. One example
is git reset, one of the most hated and
feared commands in all of Git. Reset generally performs three
steps:

  1. Move HEAD (and the branch it points to) to point
    to a different commit

  2. Update the index to have the contents from
    HEAD

  3. Update the work tree to have the contents from
    the index

And, through some oddly-named command-line
options, you can choose where it stops.

  • git reset --soft will stop after
    step 1. HEAD and the checked-out branch are moved, but that’s
    all.

  • git reset --mixed stops after step
    2. The work tree isn’t touched at all, but HEAD and the index are.
    This is actually reset’s default; the --mixed argument is
    optional.

  • git reset --hard performs all three
    steps. After the first two, the work tree is overwritten with
    what’s in the index.
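Here is a sketch that runs all three modes against the same pair of commits, so you can see where each one stops. It uses a scratch repository, and the file contents and commit messages are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'one' > foo.txt
git add foo.txt
git commit -q -m 'First'
echo 'two, with more text' > foo.txt
git commit -q -a -m 'Second'
second=$(git rev-parse HEAD)

# --soft: only HEAD moves. The index and work tree still hold the second
# version, so the change shows up as staged.
git reset -q --soft HEAD~1
soft_staged=$(git diff --cached --name-only)

# --mixed (the default): the index is reset too, so the change is now unstaged.
git reset -q --soft "$second"              # go back to the starting point
git reset -q --mixed HEAD~1
mixed_staged=$(git diff --cached --name-only)
mixed_unstaged=$(git diff --name-only)

# --hard: the work tree is overwritten as well.
git reset -q --hard "$second"              # go back to the starting point
git reset -q --hard HEAD~1
hard_content=$(cat foo.txt)

echo "soft:  staged=[$soft_staged]"
echo "mixed: staged=[$mixed_staged] unstaged=[$mixed_unstaged]"
echo "hard:  foo.txt contains [$hard_content]"
```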

If you use reset with a path, Git actually skips
the first step, since moving HEAD is a whole-repository operation.
Only the second step applies: the index is updated with HEAD’s
version of the file, and the work tree is left alone. Git refuses
--soft and --hard when you give it a path. To also overwrite the
file in the work tree, effectively trashing any modifications
you’ve made to it since it was checked out, use
git checkout HEAD -- <path>, which updates both the index and the
work tree.
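A quick sketch of path-limited reset in a scratch repository (the file name and contents are placeholders). It assumes a Git where `git reset` with a path accepts only the default mode, so the work-tree step is done with `git checkout -- <path>`, which copies the index version of a file onto disk.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'committed version' > foo.txt
git add foo.txt
git commit -q -m 'Add foo'

# Edit the file and stage the edit.
echo 'edited version, not yet committed' > foo.txt
git add foo.txt
head_before=$(git rev-parse HEAD)

# Path-limited reset: HEAD stays put, the index entry goes back to HEAD's version.
git reset -q -- foo.txt
head_after=$(git rev-parse HEAD)
staged=$(git diff --cached --name-only)    # empty: the index matches HEAD again
unstaged=$(git diff --name-only)           # foo.txt: the edit survives on disk

# To discard the edit on disk too, copy the index version into the work tree.
git checkout -- foo.txt
restored=$(cat foo.txt)
echo "$restored"
```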

The Everyday Pattern

Let’s take a look at a really common usage
pattern with our new Git X-ray goggles on.

$ git checkout -b bug-2345 master

Git creates a new branch called bug-2345, and points it at the same
commit master points to. Then it moves HEAD to point to
bug-2345, and updates the index and work tree to match HEAD.

You do some work, changing the files in the work tree, and
now you’re ready to make a commit.

$ git add foo.txt
$ git add -p bar.html

Git updates the index to match the contents of
the work tree. You can even update it with only
some of the changes from a file.

$ git commit -m 'Update foo and bar'

Git converts the index into a series of linked
objects in the ODB. Blobs and trees whose contents match are
re-used, and the files and directories that changed have new blobs
and trees generated for them. Git then creates a new commit which
points to the new root tree, and (since HEAD points to a
branch) bug-2345 is moved to point to the
new commit.
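The re-use is something you can verify directly. In this sketch (scratch repository, invented file contents), only foo.txt changes between the two commits, so bar.html should resolve to the very same blob in both, while the branch moves and master stays put.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the branch is called master
git config user.name  "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'foo v1' > foo.txt
echo 'bar v1' > bar.html
git add foo.txt bar.html
git commit -q -m 'Initial commit'

git checkout -q -b bug-2345 master
echo 'foo v2, now fixed' > foo.txt
git add foo.txt
git commit -q -m 'Update foo'

# bar.html did not change, so both commits point at the same blob.
bar_old=$(git rev-parse 'HEAD~1:bar.html')
bar_new=$(git rev-parse 'HEAD:bar.html')

# foo.txt did change, so it got a new blob.
foo_old=$(git rev-parse 'HEAD~1:foo.txt')
foo_new=$(git rev-parse 'HEAD:foo.txt')

# The branch moved to the new commit; master stayed where it was.
git rev-parse bug-2345 master
```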

Go Forth

With most version-control systems, you’re
encouraged to just get to know the UI layer, and the details will
be safely abstracted away. Git is different; its basic data model
is at a high enough level that it’s pretty easy to understand, and
its UI layer is so thin that you’ll find yourself learning the
internals whether you want to or not – you’ll need to for
everything except the bare minimum of usage. I hope this article
has convinced you that there isn’t really that much to know, and
that you earn many new abilities by going through this
process.

Your new understanding has made you powerful. Please use
your new powers for good.

Ben
Straub has always been a programmer, and loves making things work.
He hacks on things at GitHub.