Picturing Git: Conceptions and Misconceptions

Usually, I write on this blog about matters directly related to developing for iOS on a Mac. I typically talk about Xcode, the Swift language, and of course iOS itself (in particular, aspects of Cocoa Touch and related frameworks).

Today I’d like to switch gears and talk about another development-related topic that’s near and dear to my heart — Git.

If you’re a developer and within the sound of my voice, you’re likely to use Git. You are unlikely to work without some kind of version control, and the most commonly used form of version control is Git. There are three primary reasons for using a version control tool such as Git:

  • Safety: Git lets you save the state of your work at any moment (hence the name, version control). That means you can feel free to experiment as you develop; if something doesn’t work out, you can always return to an earlier saved state where things were okay.

  • Backup: Git lets you synchronize your work from your computer to a computer somewhere off in cyberspace. That way, there’s an offsite copy in case anything bad happens to your computer (or to you).

  • Collaboration: The same Git feature that lets you synchronize to a computer over the Internet also lets other people synchronize to that same computer; and the synchronization works both ways. So other people can synchronize with you, and multiple people can work together on a project.

Misunderstanding Git

There’s an odd thing I’ve noticed about the way many developers use Git. Often, they don’t really have an accurate mental model of what Git is and what it does. It’s surprisingly easy to get used to employing a few basic Git commands without having any real idea of what they mean.

I like to use the analogy of driving a car. Driving a car is a cybernetic activity; the car acts as an extension of your body, because your mental picture of how the car behaves in relation to your physical input is adequate. You have a general idea that when you step on the brake the car slows, and that when you turn the wheel to the right the car turns right; and your experience, plus the live feedback from the car (and the world around it) while you’re doing those things, reinforces that idea.

And yet the way people use Git is often not at all cybernetic in that way. Considering, for instance, how often one has to say git add or git commit, or both, you’d think one would have some reasonably accurate mental image of what those commands make Git do. But many people don’t. They know it’s a rote incantation that one is supposed to invoke, and they just routinely invoke it and move on.

When I say “reasonably accurate”, I am not at all talking about what happens deep inside Git, “under the hood.” You don’t need to know that sort of detail, in general, in order to understand the business of driving Git, any more than you need to know what a brake caliper is or what a rack-and-pinion is in order to understand the business of driving a car. We reason (or intuit) about the car analogically, perhaps I should even say metaphorically; but the analogy or metaphor that our mind employs does in fact work.

The problem with how people use Git, I’m suggesting, is that their analogical or metaphorical conception of Git doesn’t work: it doesn’t fit the way Git actually behaves, if indeed the conception exists at all. The trouble shows up as soon as an unusual or unaccustomed situation arises. The user then becomes confused and doesn’t know how to proceed, can’t guess what will happen if a certain Git action is taken, or can’t understand what did happen when a certain Git action was taken.

The purpose of this article is to present a simple way of looking at what Git really is and what it really does. I do not claim that this way of looking at Git represents absolute “facts” in any hard and fast or literal sense. But I contend that if you conceive of Git in the way that I’m going to suggest, if you substitute these conceptions of Git for any misconceptions you might have now, you’ll be a much happier and more fluid Git user. When faced with a puzzle as to what happened, what will happen, or what you should do in order to make a certain thing happen, the answer might suddenly be obvious, where previously it wasn’t.

That, indeed, has been my own personal experience. I myself used Git for years while thoroughly misconceiving what it does and what it is. It is only relatively recently that my thinking about Git has shifted in the way I’m going to suggest here. This new way of thinking about Git has been a huge help to me, and it might be useful to you as well. Plus, it’s just a good idea in general to understand what you’re doing when you use Git.

The Git repository

We may tend to think of a folder full of our stuff as the repository on our computer. We might call this our “project” or our “Git folder” or even our “Git repository”. But what we see is not the repository. The repository is inside an invisible folder called .git. And the actual location of that invisible folder is crucial.

Let’s say you have a folder of three files under Git control. The hierarchy might look like this:

...
    /myFolder
        a.txt
        b.txt
        c.txt
        .git [invisible!]

The repository is not myFolder; it’s .git. And the position of the invisible .git folder here is key. By being located just inside the folder myFolder, the .git folder, by default, grants Git nominal control over all the other files inside myFolder, to any depth.

The phrase “to any depth” implies that this, too, would be a viable organization:

...
    /myFolder
        a.txt
        b.txt
        .git [invisible!]
        /myInnerFolder
            c.txt

Git takes control of the whole hierarchy downward from the folder containing the .git folder, including in this case the files in myFolder and any files in folders inside myFolder as well (meaning, also files in myInnerFolder).

But this is not quite such a viable organization:

...
    /myFolder
        a.txt
        b.txt
        .git [invisible!]
        /myInnerFolder
            c.txt
            .git [invisible!]

Weird things can happen when you do that. It’s an advanced configuration, and not something most Git users would want to get involved with. So don’t create that kind of organization by mistake!

Another mistake I see beginners make a lot is that they say git init without thinking where they are (in the command-line world) when they say it. For example, don’t do this (on a Mac):

  1. Start the Terminal
  2. Say git init

That’s a huge mistake. You now (on a Mac) have an organization like this:

/Users
   /yourname
       .git [invisible!]
       /Applications
       /Desktop
       /Documents
       /Downloads
       /Library
       /Movies
       /Music
       /Pictures

Do you see what’s happened? You’ve just put your whole hard disk hierarchy, from your Home folder on down, under Git control! That’s an absolutely terrible idea. Give the wrong command at this moment, and Git will happily erase that whole region of your hard disk hierarchy, permanently.

If you get yourself into that situation, the best thing to do is remove the unwanted .git folder, immediately. On a Mac, you can do that in the Finder; press Shift-Command-Period to make invisible files and folders visible, spot the problematic .git folder, and throw it in the trash so that Git loses control of your hard disk.

Want to know if a certain folder is nominally under Git control already? Just cd into that folder, and then ask Git:

$ git rev-parse --absolute-git-dir

Git will tell you the path, if any, to the .git repository that the current folder “belongs” to.
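Here’s a minimal sketch of how that check might be used as a guard before saying git init. The scratch folder made with mktemp is purely an assumption for illustration; in real life you would cd into your own project folder instead.

```shell
# Scratch folder purely for illustration; substitute your own project folder.
dir=$(mktemp -d) && cd "$dir"

# Outside any repository, rev-parse fails with a non-zero exit status.
if git rev-parse --absolute-git-dir >/dev/null 2>&1; then
    before="already under Git control"
else
    before="not under Git control"
fi
echo "Before git init: $before"

git init -q

# Now the same question has an answer: the path to the new .git folder.
after=$(git rev-parse --absolute-git-dir)
echo "After git init: $after"
```

If the first answer is ever “already under Git control” when you didn’t expect it, that’s your cue to stop and look at the reported path before doing anything else.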

One last thing. How did the .git repository get there in the first place? There are two main ways: most likely, either you said git init or you said git clone. Either of those commands, one way or another, means in part: “Create a .git folder.” The difference is that git init typically creates a brand new .git folder as the repository, whereas git clone typically copies an existing .git repository, which might contain quite a bit of stuff already, from somewhere on the Internet.

The files you see

Not only are the files you see not the repository; they aren’t even your files! Here’s what I mean. Let’s say you’ve got this organization:

...
    /myFolder
        a.txt
        b.txt
        c.txt
        .git [invisible!]

And now suppose you cd into myFolder and say:

$ git add .
$ git commit -m 'here you go, Git'

You have just said that the three files a.txt, b.txt, and c.txt are no longer yours. In a very real sense, they are not even the “real” files any more. The real files are now inside the repository! They are in the invisible .git folder (in a form that makes them quite hard to see, but don’t worry about that).

What you are now seeing is basically some copies of the files in the repository. I like to say that these are copies that Git lends you. Why does Git lend you these copies? So that you can edit them and make another commit! In other words, Git lends you copies of files that the repository contains, in order to give you a way to work on those files. That is why this entire area, all the contents of myFolder on down to any depth (except inside the .git folder itself), is called the working tree (or the working directory) corresponding to this repository.
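To make the “lending” idea concrete, here’s a small sketch: throw away one of the working tree files, then ask Git for a fresh copy. The scratch folder, the file contents, and the placeholder identity are my assumptions for the demo; git restore needs Git 2.23 or later.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'

# Throw away the working tree copy of b.txt.
rm b.txt

# Ask Git to lend us a fresh copy from what it has stored.
git restore b.txt
restored=$(cat b.txt)
echo "$restored"
```

The file comes back with its committed contents, because the copy in the working tree was never the only one.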

Commits

Despite the impression I may have given so far, Git doesn’t actually traffic in files. It also doesn’t traffic in folders. Furthermore, Git doesn’t traffic in changes. (A lot of people seem to think Git is about changes.) No!

Git traffics in commits.

What is a commit, then? A commit is effectively a snapshot of the current state of your working tree — the entire current state of the entire working tree. (I’m over-simplifying here, but I’ll correct that in the next section, so just bear with me.)

Here’s what I’m getting at. Let’s go back to the situation I posited a moment ago. You’ve got this organization:

...
    /myFolder
        a.txt
        b.txt
        c.txt
        .git [invisible!]

And you cd into myFolder and say:

$ git add .
$ git commit -m 'here you go, Git'

You have now made a commit, and that commit contains the current state of all three files — a.txt, b.txt, and c.txt.

Okay, fine. And now let’s say you edit just one of the files — let it be b.txt. You edit it and then you say:

$ git add .
$ git commit -m 'another commit'

Or else (this is pretty much the same thing):

$ git commit -a -m 'another commit'

Now: what did you just commit? What exactly is in this new commit that you just made? A lot of people would say: “It’s b.txt.” Or they might even say: “It’s the changes in b.txt.” No! That is wrong!

What is in this commit is exactly the same sort of thing that was in the previous commit. This new commit, once again, contains the current state of all three files — a.txt, b.txt, and c.txt.

So a commit is basically a photograph, as it were, of the entire working tree. I like to think of the commits, in fact, as constituting a family photograph album. Imagine that a family gets together for a reunion every year, and a photographer takes a group photo of everyone. And suppose that in one of these photos we see that one of the people has grown a beard since the previous photo. You would not say that this photo is only of that one person, just because only that person has changed! They are still all group photos of the whole family.

One reason this is hard for people to realize or believe is that it is hard to see what’s in a commit. You can’t just peek into the .git folder. But there’s a way! To demonstrate, I’ll start all over once again. We have a folder containing three files, a.txt, b.txt, and c.txt. I cd into that folder and I do this:

$ git init
$ git add .
$ git commit -m 'here you go, Git'

Then I edit b.txt and say:

$ git add .
$ git commit -m 'another commit'

Now watch this little move:

$ git ls-tree --name-only HEAD

HEAD is a reference to the most recent commit, the one I just made. I’m asking Git what that commit contains. And here is what Git says:

a.txt
b.txt
c.txt

You see? The commit contains all the files, not just the “changes” caused by editing b.txt.
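As a further sketch, you can list both commits side by side and then ask Git to compare them. The notation HEAD~1 (meaning “the commit before HEAD”) is not something I’ve introduced above, and the scratch folder and identity are assumptions for the demo.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'
echo 'edited' >> b.txt
git add .
git commit -q -m 'another commit'

# What each of the two commits contains.
first=$(git ls-tree --name-only HEAD~1)
second=$(git ls-tree --name-only HEAD)

# Which files differ between the two snapshots.
changed=$(git diff --name-only HEAD~1 HEAD)

echo "first commit:";  echo "$first"
echo "second commit:"; echo "$second"
echo "differs: $changed"
```

Both commits list all three files. The “change” to b.txt only appears when you ask Git to compare one snapshot with the other.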

(Another reason why people don’t realize what’s in a commit is that they rely on git status to tell them, without knowing what the output of git status actually means. I’ll talk more about that a bit later.)

The index

The index (also, alas, called various other things like the stage, the staging area, or the cache) is a terribly important place in the world of Git. It’s effectively invisible, but we can take a look at it anyway!

Let’s go back to the little dance I did right at the beginning. We’re starting all over once again. Let’s arrive fresh into our folder. The folder has three files, a.txt, b.txt, and c.txt. I cd into that folder and I do this:

$ git init
$ git add .

OK, stop. I didn’t commit yet. I just used the add command. In particular, as one so often does, I said to add everything (that’s what the dot means). So… add everything to what? To the index!

You might say: Prove it! Well, once again, I’ve got a trick up my sleeve. Let’s start all over yet again. I cd into the folder and I do this:

$ git init
$ git ls-files

That’s how we peek into the index! There is no output, of course, because we only just created the repository and then we stopped, so the index is empty. But now I’ll peek into the index again after saying git add, like this:

$ git add .
$ git ls-files

This time Git responds:

a.txt
b.txt
c.txt

That’s the contents of the index! We’ve copied the three files, a.txt, b.txt, and c.txt, into the index.

Cool, so why did we do that? So that we can now say:

$ git commit -m 'here you go, Git'

That makes a commit — a snapshot. But what snapshot? A moment ago, I seemed to say that it was a snapshot of the working tree. But that’s not right! The snapshot that constitutes a commit, in general, is the index. That’s what the index is — it’s the material that will make up the next commit if you now say git commit. When you commit, in general, you commit the index.
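Here’s a sketch that shows the difference. I add one version of b.txt to the index, then edit the file again without adding it, and then commit. The file contents, scratch folder, and identity are assumptions for illustration.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'

echo 'staged version' > b.txt
git add b.txt                      # this version goes into the index
echo 'unstaged version' > b.txt    # this version stays in the working tree only
git commit -q -m 'another commit'

committed=$(git show HEAD:b.txt)   # b.txt as stored in the new commit
working=$(cat b.txt)               # b.txt as it sits in the working tree
echo "in the commit:       $committed"
echo "in the working tree: $working"
```

The commit holds the version that was in the index, not the later edit sitting in the working tree.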

I like to say that there are actually three worlds of Git:

  • The repository. I’m imagining the repository here as consisting (primarily) of stored commits. These are snapshots reflecting past versions of your work.

  • The index. This is where you configure what you want to go into the next commit.

  • The working tree. These are files that Git lends you from the repository, so that you can edit them and then add them to the index, in order to tell Git what the index (and hence the next commit) should look like. The working tree is the only part of this triad that you can see directly.

The index is actually the reason why a commit is, by default, a snapshot of all the files. Once a file is in the index, it stays in the index until you take rather drastic measures to take it out again (I’m not going to describe what those drastic measures are). Therefore, that file will continue to be part of every commit you make from now on (unless you take those drastic measures).

I’ll demonstrate. Remember how, a moment ago, I made an initial commit by saying:

$ git commit -m 'here you go, Git'

Okay, so having made that commit, what’s in the index now? We committed everything from the index; is the index now empty? No! Let’s ask again what’s in the index:

$ git ls-files

And what do you think Git says?

a.txt
b.txt
c.txt

Those three files are still there! And that means that they will still be there (unless we take drastic measures) when we make the next commit — and so on. That is how a commit becomes a snapshot of all the files; it’s because the index, by default, just keeps hanging on to the files that have been put into it.

Understanding git status

We are now in a position to understand the mysterious output of git status. I’m afraid that this one command is, all by itself, responsible for a lot of the prevalent misunderstandings of what Git is. That’s partly because the output of git status omits so much. In fact, it’s fair to say that it omits almost everything! It’s also because the output of git status talks in terms of “changes”, even though Git itself is not about changes.

To demonstrate, let’s pick up where we just left off; we’ve just committed all three files. Now we edit b.txt and stop. Let’s pause to think what the situation is. The index contains a.txt, b.txt, and c.txt. Meanwhile, we’ve also edited the copy of b.txt in the working tree. What does git status have to say about this situation?

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   b.txt

no changes added to commit (use "git add" and/or "git commit -a")

When people see that, their eyes glaze over. It’s a lot of information. But they see the word “changes”, they see the mention of b.txt, and they think Git is all about changes. But it isn’t. A commit consists of all the files. The index contains all the files. But git status omits from its output any mention of files that didn’t change.

In other words, git status doesn’t answer the question “What will the next commit consist of?” It answers the question, “What’s new?” But what goes into the next commit is not what’s new; what goes into the next commit is everything.

Okay, so let’s accept that. Then where does git status get its idea of what’s new? I just said that git status omits from its output any mention of files that didn’t change. How does Git know what changed? Changed from what?

It’s simple. git status works by comparing the three worlds! In the above output, git status is reporting that the most recent commit and the index are exactly the same as one another, but the working tree is different from the index — because we edited b.txt but we didn’t add that edited version to the index. That’s how git status knows that the working tree b.txt has been modified: it no longer matches the b.txt in the index.

Okay, let’s add our modified b.txt to the index and see what git status says then:

$ git add .
$ git status

Here’s the output now:

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   b.txt

That means that the index and the working tree are now exactly the same as one another, but the index is different from the most recent commit. The index’s version of b.txt is our modified version, and that’s different from the b.txt that went into the last commit. Thus, if we said git commit now, the b.txt in the new commit would be different from the b.txt in the previous commit.

To make the git status report even more interesting, let’s now edit a.txt and not add it to the index, so that we get both kinds of difference in one report. Remember, we edited b.txt and added it to the index, and now we’ve edited a.txt but we didn’t add it to the index. Now here’s the output of git status:

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   b.txt

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   a.txt

You should be able to interpret that output, in terms of the three worlds of Git:

  • There’s a description of how the index differs from the most recent commit: “Changes to be committed”. b.txt in the index is different from b.txt in the most recent commit, and if you said git commit right now, b.txt in the new commit would be different from b.txt in the previous commit.

  • There’s a description of how the index differs from the working tree: “Changes not staged for commit”. a.txt in the working tree is different from a.txt in the index, and its current state in the working tree won’t go into the next commit (unless you git add it).

You could actually perform, yourself, the same comparison that git status is performing here:

  • To find out how the index differs from the working tree, say git diff.

  • To find out how the index differs from the most recent commit, say git diff --cached.
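As a sketch of those two comparisons, here is the same situation (b.txt edited and added, a.txt edited but not added) rebuilt in a scratch folder. The --name-only option just shortens the output to file names; the folder, contents, and identity are assumptions for the demo.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'

echo 'edited b' > b.txt
git add b.txt                 # b.txt: index now differs from the last commit
echo 'edited a' > a.txt       # a.txt: working tree now differs from the index

unstaged=$(git diff --name-only)            # index vs. working tree
staged=$(git diff --cached --name-only)     # most recent commit vs. index
echo "Changes not staged for commit: $unstaged"
echo "Changes to be committed:       $staged"
```

The two lists line up with the two sections of the git status report shown above.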

Another common thing to want to do is to get a printout of what a particular file looks like in the most recent commit (or indeed, any commit) — not as a diff, but the whole text of the file, similar to saying cat for a file in the working tree. To do so, use git show with <commit>:<path> notation. The most recent commit is HEAD, which you can abbreviate as @. For example, to see the contents of b.txt in the most recent commit:

$ git show @:b.txt
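The same notation can be stretched to look at a file in all three worlds at once. In this sketch, the form :b.txt (a colon with no commit in front of it) asks for the index’s copy; that form is my addition, not something used above. The contents, scratch folder, and identity are assumptions.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'

echo 'staged' > b.txt
git add b.txt
echo 'working' > b.txt

in_commit=$(git show @:b.txt)    # the most recent commit's copy
in_index=$(git show :b.txt)      # the index's copy
in_tree=$(cat b.txt)             # the working tree's copy
echo "commit: $in_commit / index: $in_index / working tree: $in_tree"
```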

Another feature of the git status output is that it advises you about what you might want to do next. A lot of people seem quite blind to this information, which can be very useful and should not be ignored!

  • We are told that the modified state of b.txt has been added to the index, but if we wanted to take that state back out of the index, so that it doesn’t become part of the next commit, we would say git restore --staged b.txt. (You recall that the index is sometimes called the stage?)

  • We are told that a.txt has been modified in the working tree (you recall that the working tree is sometimes called the working directory?), and if we wanted to add it to the index, so that it is ready to become part of the next commit, we would say git add a.txt.

  • We are also told that if we regret having edited a.txt, we can undo our edits by saying git restore a.txt. Be careful! That kind of undo action is not, itself, undoable. If you have not committed a certain state of a certain file, then that state of that file, if undone or overwritten, can never be recovered. So if you cared about the edits you have made in a.txt and you said git restore a.txt, you would be filled with regret when you lost those edits, permanently!

The parent chain

An intuition that every Git user probably has is that Git preserves history. That, after all, is what makes it possible to “undo” back to any earlier commit. And you can see a display of the history by saying git log.

However, history is not, of itself, a “thing” in Git. There are just commits. So how does Git turn a collection of commits into a history?

Well, a commit has a feature I haven’t told you about yet: it has a parent, which is just a pointer to another commit. The only commit that has no parent is the very first commit in the repository, which is called the root commit.

When you make a commit (other than the root commit), Git does a really clever, elegantly simple thing. Recall that HEAD is a name for the most recent commit. So when you make a new commit, Git configures the new commit so that its parent is the HEAD commit — and then it immediately changes HEAD itself to point at the new commit, which makes sense because it is now the most recent commit.

Let’s enact that little game in a diagram. Start all over, where we cd into a folder and say git init. We then say:

$ git add .
$ git commit -m 'here you go, Git'

and we end up with a parentless root commit, which I will call A:

  A
  |
HEAD

Notice that HEAD here is just a name for the most recent commit. Now we edit a file (it could be b.txt) and say

$ git add .
$ git commit -m 'another commit'

Git creates another commit, which I will call B, and sets its parent as HEAD, which is A:

  A <- B
  |
HEAD

But you will never see the repository in that state, because, in the very same move, Git immediately rips the name HEAD away from A and sets it to B instead — because B is now the most recent commit:

  A <- B
       |
     HEAD

Do you see how the left-pointing arrow works in that diagram? Time marches to the right, as it were; B is newer than A. But the parentage points to the left; B is pointing to A as its parent, meaning, “This is the commit that came before me.”

Okay, but you see where this is going, don’t you? Just keep playing that same game: edit, add, commit; edit, add, commit; edit, add, commit. After making a whole bunch of commits, we’ve got something that looks like this:

  A <- B <- C <- D <- E <- F <- G
                                |
                              HEAD

Now we can turn to Git and say: “Show me my history!” Typically, we will say git log. What does Git do? Git doesn’t actually know anything about any history. All it knows is where HEAD points to. But that’s all it needs to know, because every commit has a pointer to its parent! Thus, knowing that HEAD is G, Git knows that G has a parent F. And knowing that, Git knows that F has a parent E. And so on, as far back as you care to go, potentially all the way to the root commit.

Whenever you look at a machine-made diagram of the history of your repository, that’s all that Git is doing: it’s just walking back through the parent chain, one parent at a time. But this is a really simple and fast thing to do, so Git effectively does it instantly, making it look as if it magically “knows” the whole history.
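You can walk the parent chain by hand, too. What follows is only a sketch of the idea, not a description of how git log is implemented. The suffix ^ means “the parent of”; the loop, the variable names, and the scratch setup are my own assumptions.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
for msg in 'first' 'second' 'third'; do
    echo "$msg" >> b.txt
    git add .
    git commit -q -m "$msg"
done

# Start at HEAD and keep asking each commit for its parent.
commit=$(git rev-parse HEAD)
count=0
while [ -n "$commit" ]; do
    git log -1 --format='%h %s' "$commit"
    count=$((count + 1))
    # The root commit has no parent, so rev-parse fails quietly and we stop.
    commit=$(git rev-parse --verify -q "$commit^") || commit=""
done
echo "walked $count commits"
```

The loop prints the commits newest first and stops at the root commit, because that is the one commit with no parent to ask for.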

Commits can’t change

An important corollary of this architecture is that a commit, once created, can basically never be modified. That makes sense, because if it could, you’d be violating the history, and thus violating the whole purpose of Git.

There are a number of commands that may look as if they alter a commit. For example, you may be aware that you can amend the commit you just made, in order to change its commit message. But that’s an illusion! When you amend a commit, you are actually creating a new commit. The new commit may have the same contents as the old commit, but it is not, in fact, the old commit.

There is a direct way to prove that: look at the unique identifier number that is attached to every commit. Here, I’ll do an entire dance that demonstrates. Once again we start with a folder and we cd into that folder:

$ git init
$ git add .
$ git commit -m 'Here you go, Git'
# edit b.txt...
$ git add .
$ git commit -m 'another commit'
$ git log --oneline

Git says:

8fcb682 (HEAD -> master) another commit
101eb67 Here you go, Git

Okay, now I’ll amend the commit message of that last commit:

$ git commit --amend -m 'a lovely commit'
$ git log --oneline

Now Git says:

e676370 (HEAD -> master) a lovely commit
101eb67 Here you go, Git

You see? That second commit has a different identifier from the old second commit. That’s because e676370 is an altered copy of 8fcb682.
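Here’s a sketch that checks both halves of that claim: the identifier changes, but the snapshot does not. The notation HEAD^{tree} asks for the identifier of the snapshot stored in a commit; that notation, the variable names, and the scratch setup are assumptions I’m adding for the demo.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'Here you go, Git'
echo 'edited' >> b.txt
git add .
git commit -q -m 'another commit'

old=$(git rev-parse HEAD)
old_tree=$(git rev-parse 'HEAD^{tree}')

git commit -q --amend -m 'a lovely commit'

new=$(git rev-parse HEAD)
new_tree=$(git rev-parse 'HEAD^{tree}')

echo "commit:   $old -> $new"
echo "snapshot: $old_tree -> $new_tree"
```

The commit identifier is different after the amend, while the snapshot identifier is the same: same contents, new commit.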

Branches

In Git, branches are extremely important. What is a branch in Git? A branch is just a name for a commit, rather like HEAD.

When you say git init and then add-and-commit to create the root commit, that root commit gets the branch name master by default; you can change that default, and the Git folks themselves will probably change it in the near future, but let’s just assume that’s the name. So the situation actually looks like this:

  A
  |
master
  |
HEAD

The word master here, I repeat, is just a name — a name with a pointer. And what it’s pointing at is A. But after you’ve made a lot of commits, the situation looks like this:

  A <- B <- C <- D <- E <- F <- G
                                |
                              master
                                |
                              HEAD

Do you see what’s happened? After every add-and-commit, it isn’t just the HEAD pointer that gets repointed at the most recent commit; the name master, too, gets repointed at the most recent commit. That is what a branch really is: it’s a name for a commit — a name that can move automatically to point to the newest commit.
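A quick sketch makes the moving name visible. I ask Git for the current branch name rather than assuming it is master, since the default may differ on your machine; the variable names and scratch setup are assumptions.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'

branch=$(git symbolic-ref --short HEAD)   # the branch name HEAD points to
before=$(git rev-parse "$branch")         # the commit that name points at now

echo 'edited' >> b.txt
git commit -q -a -m 'another commit'

after=$(git rev-parse "$branch")          # the name has moved
head=$(git rev-parse HEAD)
parent=$(git rev-parse "$branch^")        # parent of the commit it points at now

echo "$branch: $before -> $after"
```

After the second commit, the branch name and HEAD agree on the new commit, and the new commit’s parent is the commit the name pointed at before.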

Some of the most pervasive misconceptions about Git have to do with what a branch is. There is a tendency, for example, to think of a branch as a sort of topological thing. In the diagram above, for instance, you might think that there’s a thing called “the master branch” consisting of all the commits from A to G.

But that’s really not the case! There is the name master, pointing at G; that’s the branch, and that’s all the branch is. So it is really not right to say, for example, that F is “on the master branch”, tempting as that way of thinking and speaking may be. What does exist is the parent chain as I described it in the previous section. The correct thing to say is that F is reachable from master — meaning that by working our way backward along the parent chain from the one master commit (which is G), we can get to F. And E is reachable from master too, and so is D, and so on.

And of course you can make more branches. For example, you can say:

$ git branch otherbranch

When you give that command, you are saying: “Create a new branch name otherbranch and point that name at this commit.” The commit designated by “this commit” is HEAD by default, but you can specify any existing commit as the one that otherbranch should point at.
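Here’s a sketch of both forms. The name olderbranch is hypothetical, made up for this example, and HEAD~2 means “two commits before HEAD”; the scratch setup is likewise an assumption.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
for msg in 'first' 'second' 'third'; do
    echo "$msg" >> b.txt
    git add .
    git commit -q -m "$msg"
done

git branch otherbranch            # no commit given: points at HEAD
git branch olderbranch HEAD~2     # hypothetical name: points two commits back

git log --oneline --decorate
```

In the log output, otherbranch is listed next to the newest commit and olderbranch next to the oldest. Neither command made a commit; each just attached a name.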

If you have multiple branches and you make commits on both of them, you might end up with a topology like this:

                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master
                                |
                              HEAD

How did that happen? Back at C, we made a new branch called otherbranch; it originally pointed at C, but then we made three commits on that branch, moving the name otherbranch from C to X and then to Y and then to Z.

Again, there is a tendency to think that X, Y, and Z somehow constitute the branch otherbranch. But that is not true! In the diagram, C is the parent of X as well as being the parent of D, and so C is reachable from otherbranch, every bit as much as C is reachable from master. So what “branch” do you want to claim C is on? The question makes no sense, and you should not confuse yourself by asking it. Think in terms of names and commits and parent chains and reachability, and you’ll be fine.
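You can ask Git about reachability directly. This sketch builds a smaller version of the diagram (A, B, C, then D on the original branch and X on otherbranch) and asks which branch names can reach C and which can reach X. The helper function commit_as and the scratch setup are hypothetical; git switch needs Git 2.23 or later.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt

# Hypothetical helper: edit b.txt and commit, using the label as the message.
commit_as() {
    echo "$1" >> b.txt
    git add .
    git commit -q -m "$1"
}

commit_as A; commit_as B; commit_as C
c=$(git rev-parse HEAD)
git branch otherbranch            # otherbranch starts out pointing at C
commit_as D                       # the original branch moves on to D
git switch -q otherbranch
commit_as X                       # otherbranch moves on to X
x=$(git rev-parse HEAD)

# Which branch names can reach each commit?
reach_c=$(git branch --format='%(refname:short)' --contains "$c")
reach_x=$(git branch --format='%(refname:short)' --contains "$x")
echo "C is reachable from:"; echo "$reach_c"
echo "X is reachable from:"; echo "$reach_x"
```

C is reachable from both names, while X is reachable only from otherbranch.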

The life (and death) of commits

Branches are useful in part just because a name is easy to remember and use. But they serve another important function: they keep commits alive. In the diagram above, every commit shown is reachable from either the branch name otherbranch or the branch name master (or both). That fact is what preserves all these commits!

To illustrate, if you were now to delete the branch otherbranch, what would happen? All you are doing is erasing the name otherbranch; Git will discourage you from doing that, but it will happily let you do it nonetheless.

But why does Git discourage you? It’s because once you’ve done that, Z is not reachable from any name, nor is Y, nor is X. (C, on the other hand, is still reachable, namely from master.) And therefore, unless you take measures to the contrary, X and Y and Z will eventually be destroyed, in a natural process called garbage collection.
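Here’s a sketch of what that discouragement looks like in practice. The lowercase -d option is the polite request, which Git refuses here; the uppercase -D option insists. The variable names and scratch setup are assumptions for the demo.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'here you go, Git'
main=$(git symbolic-ref --short HEAD)

git branch otherbranch
git switch -q otherbranch
echo 'edited' >> b.txt
git commit -q -a -m 'X'
x=$(git rev-parse HEAD)           # reachable only from otherbranch
git switch -q "$main"

# Polite deletion: refused, because X would become unreachable.
if git branch -d otherbranch 2>/dev/null; then
    outcome="deleted"
else
    outcome="refused"
fi
echo "git branch -d: $outcome"

# Forced deletion: the name goes away...
git branch -q -D otherbranch
# ...but the commit itself has not been destroyed yet.
kind=$(git cat-file -t "$x")
echo "X is still a $kind"
```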

Moreover, that is the only way in which commits in a repository are ever destroyed, in the normal course of things: a commit is slated for destruction when, and only when, it becomes unreachable from any name.

I’ll illustrate with an earlier example. Remember when I did a commit --amend to change a commit’s message? First I had this:

8fcb682 (HEAD -> master) another commit
101eb67 Here you go, Git

And then, after the amend, I had this:

e676370 (HEAD -> master) a lovely commit
101eb67 Here you go, Git

The question now is: what happened to 8fcb682? The answer is: Nothing, yet. But eventually it will be permitted to go out of existence, which makes sense, as it serves no purpose; it has been replaced by e676370. The commit 8fcb682 is now unreachable from any branch name, and therefore it is slated for eventual destruction.
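A sketch to confirm both points: right after the amend, the old commit still exists, yet no branch name can reach it. The variable names and the scratch setup are assumptions for the demo, and your commit identifiers will differ from mine.

```shell
# Scratch repository for illustration; the identity below is a placeholder.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'alpha' > a.txt; echo 'beta' > b.txt; echo 'gamma' > c.txt
git add .
git commit -q -m 'Here you go, Git'
echo 'edited' >> b.txt
git add .
git commit -q -m 'another commit'

old=$(git rev-parse HEAD)                   # the commit about to be replaced
git commit -q --amend -m 'a lovely commit'

kind=$(git cat-file -t "$old")              # does the old commit still exist?
holders=$(git branch --contains "$old")     # which branch names can reach it?
echo "old commit is still a: $kind"
echo "branches that reach it: ${holders:-none}"
```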

A branch is not the only kind of name that keeps reachable commits alive, but it is far and away the most important kind. (Another is something called a tag.) So branches are crucial in keeping your history alive.

Working on a branch

I’ve said that you should not think of a commit in your history as being “on” any particular branch. But it does make sense to use the notion of being “on” a branch — not with regard to past commits, but with regard to you! Under normal circumstances, when you are working within a Git repository’s working tree, you, yourself, are always working on some branch.

Just what does that mean? Technically, it’s simply a matter of where HEAD is pointing. When we say that you are on a certain branch, we just mean that HEAD is pointing to that branch name at this moment.

Looking at the diagram I showed earlier of a repository that has two branches, let’s examine the difference between you being on one branch and you being on another branch. Here, you are on master:

                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master
                                |
                              HEAD

And here, you are on otherbranch:

                      HEAD
                        |
                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master

From a topological perspective, they are exactly the same. The only difference is what HEAD is pointing to.
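If you want to see for yourself what HEAD is pointing to, you can ask Git directly. This is a small sketch in a scratch repository (the file name is invented); git symbolic-ref reports the name HEAD points to, and the same information sits as plain text in the file .git/HEAD:

```shell
# Scratch repository; the file name is an illustrative assumption.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo one > a.txt && git add a.txt && git commit -q -m "A"
git branch otherbranch              # a second name for the same commit

before=$(git symbolic-ref HEAD)     # what HEAD points to while on master
git switch -q otherbranch
after=$(git symbolic-ref HEAD)      # what HEAD points to now

# HEAD is stored as an ordinary text file inside the repository.
head_file=$(cat .git/HEAD)
echo "$before -> $after (file says: $head_file)"
```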

From a practical perspective, however, it makes all the difference in the world what branch you are on. That’s why it’s so important to know what branch you are on — and that’s why it’s the first thing git status tells you:

$ git status
On branch master

Technically, that means HEAD is pointing to master. Practically, it tells you (among other things) what will happen if you make a commit right now. If you do make a new commit right now, its parent will be the commit that is now master, and master will then point to the new commit.

So, earlier I said that a branch is “a name that can move automatically to point to the newest commit.” But I didn’t tell you what branch name is going to behave that way. The answer is: the branch you are on!

This, then, is a meaningful sense of the notion “on a branch.” You can say you are working “on branch master” to mean: HEAD points to master, so master is the branch name that automatically gets moved every time I make a commit.
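To make that concrete, here is a sketch in a scratch repository with two branch names pointing at the same commit (the file name and messages are invented). After one commit made while on master, only master has moved, and the new commit’s parent is the commit master used to point to:

```shell
# Scratch repository; names are illustrative assumptions.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo one > a.txt && git add a.txt && git commit -q -m "first"
git branch otherbranch                  # second name, same commit

master_before=$(git rev-parse master)
other_before=$(git rev-parse otherbranch)

# HEAD points to master, so this commit moves master and only master.
echo two >> a.txt && git add a.txt && git commit -q -m "second"

master_after=$(git rev-parse master)
other_after=$(git rev-parse otherbranch)
parent_of_new=$(git rev-parse 'master^')
echo "master moved: $master_before -> $master_after; otherbranch stayed at $other_after"
```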

Switching branches

Think now, please, about how we might get from the first situation in the above pair of diagrams to the second situation — from this:

                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master
                                |
                              HEAD

To this:

                      HEAD
                        |
                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master

As I said before, all that has happened technically to the topology is that HEAD has changed what it points to. But how?

There are two main ways: you can use git checkout or git switch. The latter is preferable nowadays, as the former, git checkout, is an overloaded command. However, I do give commands like git checkout master all the time, so I can understand if you do too. Either way, let’s say that you are switching branches.

What happens when you switch branches? How does Git respond when you tell it to switch branches? It’s very important to understand this, because switching branches causes your little universe to change quite dramatically.

So far, I have talked about commits as things you assemble and create. You say git add and then git commit and you have basically put copies of a bunch of files into a commit. Now, however, I am going to reverse that directionality and talk about getting copies of files out of a commit.

That is what happens when you switch branches. When you switch branches from master to otherbranch, Git whips the HEAD pointer away from pointing at master and points it at otherbranch instead. But that is not all that happens! Git also takes aim at your entire working tree. Git effectively throws away the contents of your working tree, and replaces those contents with a copy of the contents of the commit currently pointed to by otherbranch. (That’s commit Z in the diagram.)

I will demonstrate. Before switching branches, let’s see what’s in each branch — what files are in the commits pointed at by the branch names. I’ve configured all this beforehand to make things particularly obvious:

$ git ls-tree --name-only master
a.txt
b.txt
$ git ls-tree --name-only otherbranch
c.txt

OK, so master has a file a.txt and a file b.txt, while otherbranch has a file c.txt. I am currently on branch master:

$ git status
On branch master

And sure enough, my working tree contains a file a.txt and a file b.txt:

$ ls -1
a.txt
b.txt

Now I’ll switch branches and see what my working tree contains now:

$ git switch otherbranch
$ ls -1
c.txt

Oh, my golly! My files a.txt and b.txt are gone! They’ve been completely replaced by this other file c.txt. Run in circles, scream and shout!!

But no. There’s no need to panic. What Git is doing makes perfect sense, because we have just told it that we would like to be “working on” the branch otherbranch. We want the working tree to look like otherbranch, so that we can edit in the working tree and make new commits that make sense in the context of being on otherbranch.

And after all, the files a.txt and b.txt are not actually “gone” at all. They are safely inside the master commit, where they were all along. Remember, the working tree is just some files that Git lends you; the working tree is a representation so that you can work on a branch. The real files are tucked safely in the repository. Don’t worry, be happy.

And there’s more. When you switch branches from master to otherbranch, Git also removes the contents of the index and replaces them with the contents of the otherbranch commit too! And that makes sense as well. Right after you switch branches, we expect to be in a neutral state: generally speaking, git status should come up empty, ready for us to start working. So the working tree should look like the index and the index should look like HEAD, which is otherbranch.
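You can check the index as well as the working tree. The sketch below rebuilds a setup like mine in a scratch repository (the way I construct the two branches here is just one assumption about how to get there), switches to otherbranch, and then compares the working tree, the index (via git ls-files), and git status:

```shell
# Scratch repository arranged so the two branches hold different files.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo a > a.txt && echo b > b.txt && git add . && git commit -q -m "master files"
git switch -q -c otherbranch
git rm -q a.txt b.txt
echo c > c.txt && git add c.txt && git commit -q -m "otherbranch files"
git switch -q master

# Now the switch we care about.
git switch -q otherbranch
working_tree=$(ls)                   # files Git has lent us
index=$(git ls-files)                # files recorded in the index
status=$(git status --porcelain)     # empty output means a neutral state
echo "working tree: $working_tree; index: $index; status: [$status]"
```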

Now, you may be asking (and I hope you are): Is switching branches dangerous? In particular: What if I have active work lying around uncommitted in my working tree and possibly my index, and I switch branches at that moment?

In most situations, it’s okay. In real life, for instance, it very often happens that you start to edit some files and then suddenly realize that you really should be making a new branch so you can commit this work on that branch. That is generally fine to do. You create a branch and switch to it, and nothing dramatic happens at all, because the new branch looks like the branch you were already on. The edited files are still there in your working tree, ready to be added and committed on the new branch.

However, in broader terms, if you have edited files and you then switch to another branch that already exists, there is in fact some danger that the version of a file in that branch will be different from the version of the file that you have just edited. Switching branches thus threatens to wipe out your edits, overwriting them with the version from the branch you are about to switch to.

Git is supposed to detect this situation and stop you from switching branches in that case; it generally does, but every once in a while one hears horror stories of how Git permitted the switch and work was lost. On the whole, it’s probably a pretty good idea to switch branches only when Git is in a neutral state, where all your edits have been added and committed and git status reports that the working tree is clean.
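Here is the simple, well-behaved version of that protection, sketched in a scratch repository (file contents and messages are invented). A file differs between the two branches, I edit it without committing, and then I try to switch; in this straightforward case Git refuses and leaves my edit alone:

```shell
# Scratch repository; a.txt differs between the two branches.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo "master version" > a.txt && git add a.txt && git commit -q -m "on master"
git switch -q -c otherbranch
echo "other version" > a.txt && git add a.txt && git commit -q -m "on otherbranch"
git switch -q master

# Edit the file but do not add or commit.
echo "my uncommitted edit" > a.txt

# Switching would overwrite the edit, so Git should refuse.
if git switch otherbranch 2>/dev/null; then result=switched; else result=refused; fi
current=$(git symbolic-ref --short HEAD)
contents=$(cat a.txt)
echo "switch $result; still on $current; a.txt says: $contents"
```

This covers only the plain case of a tracked file that differs between the branches; it is not a demonstration that work can never be lost.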

Merging

Merging is probably the most far-reaching and elaborate thing that Git knows how to do. And for that very reason, there are lots of misconceptions about it!

So what is merging? There are many variants on what can happen when you merge, but here’s what we might think of as the canonical case:

  1. You are on a branch, usually a primary branch of some sort; let’s say it’s master.

  2. You say git merge otherbranch (or whatever the name of some other branch is).

  3. Git now creates, out of whole cloth, a completely new commit combining the contributions from both branches. Moreover:

    • You are working on master, so this commit is on master; the master branch name pointer is advanced to point to this new commit.

    • The otherbranch pointer is not advanced.

    • This new commit has a remarkable feature: it has two parents, the master commit you were on before you said merge, and the commit pointed to by otherbranch, in that order.

So, for instance, suppose we are in this situation:

                    otherbranch
                        |
              X <- Y <- Z
             /
  A <- B <- C <- D <- E <- F <- G
                                |
                              master
                                |
                              HEAD

If you now say git merge otherbranch, you get this:

                    otherbranch
                        |
              X <- Y <- Z <--------\
             /                      \
  A <- B <- C <- D <- E <- F <- G <- M
                                     |
                                   master
                                     |
                                   HEAD

The newly minted commit, created entirely by Git, is M. And M is now master. And it has two parents! The first parent is G, which was master previously. The second parent is Z, which was otherbranch previously (and still is).
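You can verify the two parents, and their order, yourself. In this sketch (a scratch repository with a much shorter history than the diagram; file names are invented), HEAD^1 and HEAD^2 name the first and second parents of the merge commit:

```shell
# Scratch repository with two diverged branches that touch different files.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo c > c.txt && git add c.txt && git commit -q -m "C"
git switch -q -c otherbranch
echo z > z.txt && git add z.txt && git commit -q -m "Z"
git switch -q master
echo g > g.txt && git add g.txt && git commit -q -m "G"

g=$(git rev-parse master)
z=$(git rev-parse otherbranch)

# Create the merge commit M; --no-edit accepts the default message.
git merge -q --no-edit otherbranch

first_parent=$(git rev-parse 'HEAD^1')    # expected to be G
second_parent=$(git rev-parse 'HEAD^2')   # expected to be Z
other_after=$(git rev-parse otherbranch)  # otherbranch itself has not moved
echo "parents of M: $first_parent $second_parent"
```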

(At this point, I could talk about variants of merges, such as fast-forwarding and squashing; about the logic of how Git creates a merge commit based on “the contributions from both branches”; and about what happens when that logic is insufficient and Git turns to you for assistance — unfortunately known as a “conflict”, a term that has instilled unnecessary fear and misunderstanding in far too many Git users. But those are subjects for another day.)

In many situations, the purpose of a secondary branch all along was (assuming things panned out successfully) to be merged eventually into the main branch. You created the extra branch in order to try to implement some feature or experiment with some line of development; you succeeded, and now you want your work to be contributed back onto the main branch. Therefore, after merging, you might as well delete the secondary branch. A branch is just a name, after all, so that’s all that gets deleted: the name! If you deleted otherbranch right now, you’d get this:

              X <- Y <- Z <--------\
             /                      \
  A <- B <- C <- D <- E <- F <- G <- M
                                     |
                                   master
                                     |
                                   HEAD

The topology is unchanged; the name otherbranch is gone, and that’s all. Deleting the name otherbranch has, indeed, no important consequences, assuming (as I do assume here) that we were never going to use it for anything further. After all, the merge commit, with its two parents, preserves the history of what happened — the two parent chains that led up to the merge. All the commits in the diagram are reachable from master. So the name otherbranch is not needed in order to preserve anything.

At the same time, even with the name otherbranch gone and forgotten, G retains a kind of primacy over Z as a parent of M, because it is the first parent of M, and Git knows that fact (because the parents are recorded together with their order in M). That can make a difference to the way Git describes and displays the topology.
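Both points can be seen in a scratch repository. This sketch (again with a shortened, invented history) deletes the merged branch using the safe form of deletion, confirms that its commit is still reachable from master, and then counts commits with and without --first-parent, which is one place where the parent order shows up:

```shell
# Scratch repository: C, then Z on otherbranch and G on master, then a merge.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
echo c > c.txt && git add c.txt && git commit -q -m "C"
git switch -q -c otherbranch
echo z > z.txt && git add z.txt && git commit -q -m "Z"
z=$(git rev-parse HEAD)
git switch -q master
echo g > g.txt && git add g.txt && git commit -q -m "G"
git merge -q --no-edit otherbranch

# The safe delete succeeds now, because Z is reachable from master.
git branch -q -d otherbranch
remaining=$(git branch --list otherbranch)

# Z is still part of master's history.
if git merge-base --is-ancestor "$z" master; then z_reachable=yes; else z_reachable=no; fi

# All reachable commits versus only those found by following first parents.
total=$(git rev-list --count master)
mainline=$(git rev-list --count --first-parent master)
echo "total=$total mainline=$mainline z_reachable=$z_reachable"
```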

Remotes and fetch

I have mentioned Git’s ability to synchronize between your local copy of a repository and an online copy (typically at some place like GitHub or Bitbucket). This mechanism, too, is fraught with opportunities for misconception. Let’s try to straighten out some of them.

First of all, how does your local repository know where the online repository is? It isn’t magic! In the first instance, Git knows because you tell it the online repository’s URL. It is perfectly legal to give commands like fetch by explicitly specifying the URL where the online repository is to be found, like this:

$ git fetch git@github.com:mattneub/myCoolRepo.git

However, you can imagine that having to enter the entire URL of a remote repo, every time you want to synchronize with it, could get really old, really fast. So Git lets you give a URL a name. That name is called a remote. A remote is basically just a name — a name for a URL.

If you obtained your copy of this repo by cloning from an online copy to begin with, your copy comes with a remote already configured with the correct URL, with the default name origin. (Of course that’s just a default name, and you can change it.)

If your local repository has no remote — perhaps because you created it by saying git init — you can give it a remote “by hand”, by using the git remote add command to provide a name and a URL. But then the remote repository at that URL had better exist already! Merely declaring a remote in your local repository doesn’t cause any remote repository to come into existence.
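As a sketch of that, here is git remote add at work. To keep the example self-contained I am using a bare repository in a local directory as a stand-in for the online repository; the paths, and the second remote name nowhere, are invented for illustration:

```shell
work=$(mktemp -d)

# Stand-in for the online repository: a bare repository on the local disk.
git init -q --bare "$work/server.git"

# A local repository created by git init has no remote at all.
git init -q "$work/local" && cd "$work/local"
before=$(git remote)

# Give the URL a name.
git remote add origin "$work/server.git"
after=$(git remote)
url=$(git remote get-url origin)

# Declaring a remote does not create anything at the other end:
# this URL points at a repository that does not exist, yet the add succeeds.
git remote add nowhere "$work/does-not-exist.git"
if git fetch -q nowhere 2>/dev/null; then fetch_nowhere=ok; else fetch_nowhere=failed; fi
echo "remotes before: [$before]; after: [$after]; origin is $url; fetch nowhere: $fetch_nowhere"
```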

Once origin is a name for the remote repository’s URL, instead of giving the URL in the fetch command, you can say:

$ git fetch origin

Or even more briefly, because origin is the default:

$ git fetch

Now let’s talk about what happens when you actually do say that. Under normal circumstances, git fetch means: Contact the Git located at the origin URL, and ask for copies of all commits reachable from all branches in that repository. That includes the remote repository’s branch names themselves. Thus, what you are saying is: Bring me up to date with the remote repository!

The actual transfer of commits over the network is as efficient as possible. Every commit has a unique identifier, so it’s easy to ascertain whether we already have a copy of a particular commit, meaning that that commit doesn’t have to be transferred. What’s transferred is solely what we don’t already have. (Moreover, the commits that are transferred are compressed to minimize bandwidth; but you don’t need to know about that.)

Remote-tracking branches

The mechanics of “me” in the notion “bring me up to date” are particularly interesting. After a git fetch, nothing in the working tree changes, even though you may have just fetched new commits from the remote version of the same branch you are on. And if you switch branches after a git fetch, you’ll find that nothing about any of those branches has changed either. So where did all the fetched commits go?

The answer involves special branches called remote-tracking branches. Remote-tracking branches are special in that they can’t be switched to or worked on directly. The job of a remote-tracking branch is to do the very thing we just said: to capture the result of synchronizing with the remote repository. Basically, a remote-tracking branch is a local copy of the parent chain of commits reachable from the corresponding branch in the remote repository.

Thanks to the remote-tracking branches in your repository, you can say git fetch without fear, because your branches will be completely untouched. Plus, your local Git knows, at all times, quite a lot about the state of the remote copy of this repository, without having to talk to that remote copy over the network.

For example, you can ask how your master branch compares with the remote master branch. Your master branch might point to a newer commit than the remote master branch, because you’ve done some add-and-commit work on your master branch. Or the remote master branch might point to a newer commit than your master branch, because someone created commits on the remote machine, or synchronized up to the remote machine. The point is that this question is very easy and efficient to answer, because it doesn’t involve doing any networking!

Indeed, git status automatically answers that very question:

$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.

What’s that origin/master? It’s the remote-tracking branch that acts as a copy of the origin remote’s master branch. Its parent chain is a copy of the remote branch’s parent chain. So your local Git knows instantly what commits are reachable from the remote branch!

Of course, this remote-tracking branch might not itself have the most current information about what’s happened at the remote repository. And it won’t have the most current information until you say git fetch again! That’s the point. Git will never talk to the remote repository automatically; it talks to the remote repository only when you explicitly ask it to do so.
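Here is a sketch of the whole idea, using local directories as stand-ins: server.git plays the online repository, while mine and theirs are two clones of it (all of these names are invented). Someone else pushes a commit; my origin/master knows nothing about it until I fetch, and even then my local master does not move:

```shell
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d) && cd "$work"

# Build a one-commit repository and publish it as a bare "server" repository.
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master
echo one > a.txt && git add a.txt && git commit -q -m "init"
cd "$work"
git clone -q --bare seed server.git
git clone -q server.git mine
git clone -q server.git theirs

# Someone else commits and pushes.
cd "$work/theirs"
echo two >> a.txt && git commit -q -am "edited by someone else"
git push -q origin master
pushed=$(git rev-parse HEAD)

# Back in my clone: nothing changes until I fetch.
cd "$work/mine"
local_before=$(git rev-parse master)
tracking_before=$(git rev-parse origin/master)
git fetch -q origin
local_after=$(git rev-parse master)
tracking_after=$(git rev-parse origin/master)
echo "master: $local_before -> $local_after"
echo "origin/master: $tracking_before -> $tracking_after"
```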

Updating local branches

So now we know that git fetch does not cause any of your real branches to be brought up to date. It just brings your remote-tracking branches up to date. But what if you do want one of your real branches to be brought up to date? How can you get the up-to-dateness from a remote-tracking branch into a real branch?

Before I answer that question, I want to fix my terminology. The “real” branches I’m talking about are actually called local branches, as opposed to the remote-tracking branches. So I’m going to call them that. Keep in mind, though, that remote-tracking branches are actually local too — they and their reachable commits are on your computer, not on the remote.

So now let’s talk about updating a local branch from a remote-tracking branch. There are two possible cases to consider.

Case 1: You have a remote-tracking branch but no corresponding local branch, and you want a local branch so you can work on it. In this case, you can create a local branch based on the remote-tracking branch.

To do so, you basically ask to switch to the remote-tracking branch. A nice notation is:

$ git switch --track origin/somebranch

You cannot really switch to a remote-tracking branch, so Git interprets this as creating a local branch. It’s trivially easy for Git to do that; it just makes a local branch name and points it initially at the same commit as the remote-tracking branch. (You are not forced to name your new branch with the same name as the remote-tracking branch, but it is conventional to do so.)
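One way to spell this out in full, naming both the new local branch and the remote-tracking branch it starts from, is git switch -c somebranch origin/somebranch. The sketch below tries that form against a local stand-in for the remote (the directory names are invented, and it assumes Git’s default configuration); afterward the new local branch points at the same commit as origin/somebranch and is associated with it:

```shell
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d) && cd "$work"

# A "server" repository that has two branches, master and somebranch.
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master
echo one > a.txt && git add a.txt && git commit -q -m "init"
git branch somebranch
cd "$work"
git clone -q --bare seed server.git

# A fresh clone gets a local master, plus remote-tracking branches for both.
git clone -q server.git mine && cd mine
locals_before=$(git branch --format='%(refname:short)')

# Create a local branch that starts at the remote-tracking branch.
git switch -q -c somebranch origin/somebranch

local_tip=$(git rev-parse somebranch)
tracking_tip=$(git rev-parse origin/somebranch)
upstream=$(git rev-parse --abbrev-ref 'somebranch@{upstream}')
echo "local branches before: $locals_before; somebranch now tracks $upstream"
```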

Case 2: You already have this branch as a local branch, as well as the remote-tracking branch, and you want to bring your existing local branch up to date to match the remote-tracking branch. In this case, a very simple approach is just to merge the remote-tracking branch into the local branch!

(It’s so common to fetch and then merge the remote-tracking branch into the local branch, that Git provides a shortcut command: git pull. This, by default, means fetch and merge in one move. However, git pull can be quite tricky to use, because it has hidden configuration options, can have unforeseen consequences, and gives no opportunity for reflection and planning. Adept Git users prefer to fetch and then decide whether to merge.)
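Here is a sketch of the fetch, look, then merge routine, once more with local directories standing in for the remote and for a collaborator (all names invented). Between the fetch and the merge there is room to ask how far apart master and origin/master are, and what is about to come in:

```shell
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d) && cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master
echo one > a.txt && git add a.txt && git commit -q -m "init"
cd "$work"
git clone -q --bare seed server.git
git clone -q server.git mine
git clone -q server.git theirs

# A collaborator pushes one commit.
cd "$work/theirs"
echo two >> a.txt && git commit -q -am "edited by someone else"
git push -q origin master

# Step 1: fetch. Only the remote-tracking branch is updated.
cd "$work/mine"
git fetch -q origin

# Step 2: look. No networking is involved in any of these.
ahead=$(git rev-list --count origin/master..master)    # commits only I have
behind=$(git rev-list --count master..origin/master)   # commits only they have
incoming=$(git log --format=%s master..origin/master)  # their commit messages

# Step 3: decide to merge the remote-tracking branch into the local branch.
git merge -q --no-edit origin/master
echo "ahead=$ahead behind=$behind incoming: $incoming"
```

In this particular run my master has no commits of its own, so the merge simply moves master forward to where origin/master is.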

You’re probably wondering now how to tell whether you’ve got remote-tracking branches and what their names are. When you say simply git branch, which asks for a list of branches, remote-tracking branches are not listed! To see the remote-tracking branches, a good way is to say:

$ git branch --all -vv

In the output, remote-tracking branches are listed like this:

remotes/origin/master      8d10113 some commit message
remotes/origin/otherbranch 3b097b7 some other commit message

That tells you the name of the remote-tracking branch, preceded by the prefix remotes/, along with some information about the commit that that remote-tracking branch name points to. Local branches are listed like this:

* master                   8d10113 [origin/master] some commit message
  otherbranch              3b097b7 [origin/otherbranch] some other commit message

Note that each local branch is listed in company with the remote-tracking branch with which it is associated — if it has a remote-tracking branch associated with it.

(Confusingly, the association between a local branch and a remote-tracking branch is also called tracking. Using that terminology, the local master is tracking the remote-tracking origin/master, and the local otherbranch is tracking the remote-tracking origin/otherbranch. Sigh.)

Push and the “upstream” of a branch

The opposite of fetch is push. When you push, you are asking Git to send to the remote repository some commits that you’ve got but the remote repository doesn’t, bringing the remote into sync with you.

Most commonly, you’ll push a single branch where you’ve done some add-and-commit cycles:

$ git push origin master

By default, Git assumes that the branch name up at the remote repository is the same as the branch name locally. So git push origin master means, by default, to synchronize the local master branch up to the origin remote’s master branch. (You can push a local branch to a remote branch that has a different name, but I’m not going to explain how.)

If a local branch is already associated with a remote-tracking branch, then pushing to the corresponding remote branch also updates the remote-tracking branch. That makes sense, because pushing involves networking, so clearly Git can take advantage of this opportunity to make sure the remote-tracking branch reflects the remote repository correctly after the push.

As a notational shortcut, if a local branch is already associated with a remote-tracking branch, and you are on that local branch, you can just say:

$ git push

The association means that Git can make some obvious assumptions about what branch of what remote you want to push to, and if those assumptions are correct, saying git push will do the right thing.

But what if a local branch is not associated with a remote-tracking branch? To illustrate, I’ll make an entirely new branch, thirdbranch; I’ll switch to it and do an add-and-commit of some edited material. Then I’ll say:

$ git push

Whoops! Git doesn’t know what to do, and replies:

fatal: The current branch thirdbranch has no upstream branch.

That “upstream” is another of the many unfortunately overloaded Git terms. Basically, Git just means here that thirdbranch is not associated with any remote-tracking branch, so more information is needed as to what remote repository, and what branch of that repository, to push to. We can silence Git’s worries by being more explicit:

$ git push origin thirdbranch

That succeeds, and it also creates a remote-tracking branch, as we can discover by asking:

$ git branch --all -vv
  master                     8d10113 [origin/master] some commit message
  otherbranch                3b097b7 [origin/otherbranch] some other commit message
* thirdbranch                befe828 still another commit message
  remotes/origin/master      8d10113 some commit message
  remotes/origin/otherbranch 3b097b7 some other commit message
  remotes/origin/thirdbranch befe828 still another commit message

But as you can see, even though we created a remote-tracking branch remotes/origin/thirdbranch (listed in the last line of the output), our local thirdbranch (listed in the third line of the output) is still not actually associated with it! So we still cannot subsequently say plain and simple git push when we are working on thirdbranch.

If we want to be able to do that, we need to set origin/thirdbranch as the “upstream” of thirdbranch. We can do so now, by saying (while still on thirdbranch):

$ git branch --set-upstream-to origin/thirdbranch

Now the association has been formed (as you can see in the third line of the output):

$ git branch --all -vv
  master                     8d10113 [origin/master] some commit message
  otherbranch                3b097b7 [origin/otherbranch] some other commit message
* thirdbranch                befe828 [origin/thirdbranch] still another commit message
  remotes/origin/master      8d10113 some commit message
  remotes/origin/otherbranch 3b097b7 some other commit message
  remotes/origin/thirdbranch befe828 still another commit message

More commonly, you’ll probably associate the remote-tracking branch with the local branch in the same command that creates the remote-tracking branch — that is, when you first push the local branch. That syntax is:

$ git push -u origin thirdbranch

The -u is short for --set-upstream.
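As a sketch of what -u buys you, here is the sequence in a scratch setup where a bare repository in a local directory stands in for the remote (directory names are invented, and Git’s default push settings are assumed). After the first push with -u, the association exists, and a plain git push from thirdbranch goes to the right place:

```shell
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d) && cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master
echo init > a.txt && git add a.txt && git commit -q -m "init"
cd "$work"
git clone -q --bare seed server.git
git clone -q server.git mine && cd mine

# A brand-new local branch with one commit on it.
git switch -q -c thirdbranch
echo e >> a.txt && git commit -q -am "e"

# Push and set the upstream in one command.
git push -q -u origin thirdbranch
upstream=$(git rev-parse --abbrev-ref 'thirdbranch@{upstream}')

# Another commit, then a plain push with no arguments.
echo ee >> a.txt && git commit -q -am "ee"
git push -q

# Compare my branch with the branch as stored in the stand-in remote.
local_tip=$(git rev-parse thirdbranch)
remote_tip=$(git --git-dir="$work/server.git" rev-parse thirdbranch)
echo "upstream=$upstream local=$local_tip remote=$remote_tip"
```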

Push is picky

It’s quite easy, especially (though not exclusively) when you are collaborating with others, to find yourself in a situation where you will ask to push, and Git will communicate with the remote repository and will come back to you and slap your hand:

hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

What has happened here? Basically, on the remote repository, the branch name that we are trying to push to points to a commit that you don’t have locally.

To demonstrate, I have set up an artificial situation in order to make Git complain. My local thirdbranch looks like this:

f89effe eee
db5c32e ee
befe828 e
683c844 init

But the remote repository’s thirdbranch looks like this:

e4ce90b edited by someone else
db5c32e ee
befe828 e
683c844 init

As my commit message for e4ce90b is meant to imply, someone else has done some editing behind our backs and has pushed or created a commit on the remote repository. The commit e4ce90b, which is the remote repository’s thirdbranch, is a commit that I don’t have locally.

How is such a situation to be resolved? Well, if you’ve been paying attention, you know one answer: git merge! But here’s the thing: Git is not going to perform an automatic git merge at the remote repository. The rule is that if any merging is to be done, you have to do it. In this situation, therefore, it is not possible to push. We need to do the merge locally. Then, and only then, will we be allowed to push.

So we need to fetch and merge before we can push! Here we go (assuming I’m already on thirdbranch):

$ git fetch origin thirdbranch
$ git merge origin/thirdbranch

The merged topology we just created looks like this:

$ git log --graph --oneline
*   2f2c5ad (HEAD -> thirdbranch) Merge remote-tracking branch 'origin/thirdbranch' into thirdbranch
|\  
| * e4ce90b (origin/thirdbranch) edited by someone else
* | f89effe eee
|/  
* db5c32e ee
* befe828 e
* 683c844 init

As you can see, we now have commit e4ce90b, immediately followed by a new commit that the remote repository doesn’t have (a merge commit). And now we can say:

$ git push origin thirdbranch

From what I’ve said, you may be thinking that collaborating with someone else on the same branch is something of a race condition. And that’s exactly true. Just to give an example: suppose, while we were doing our merge locally, someone pushed another commit onto thirdbranch at the remote repository. Then we wouldn’t have that commit, and we still wouldn’t be allowed to push! We’d have to fetch and merge again and try to push again.

But that’s just the price of collaboration. In real life, with decent communication between collaborators, it should not pose too much of a problem.
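The whole cycle of rejection, fetch, merge, and push can be reproduced in miniature. This is a sketch only: the remote is a bare repository in a local directory, the collaborator is a second clone called theirs, and I use master rather than thirdbranch to keep the setup short (all of those choices are mine, for illustration):

```shell
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
work=$(mktemp -d) && cd "$work"
git init -q seed && cd seed
git symbolic-ref HEAD refs/heads/master
echo init > a.txt && git add a.txt && git commit -q -m "init"
cd "$work"
git clone -q --bare seed server.git
git clone -q server.git mine
git clone -q server.git theirs

# Someone else pushes first.
cd "$work/theirs"
echo theirs > theirs.txt && git add theirs.txt && git commit -q -m "edited by someone else"
git push -q origin master

# I commit too, unaware of their commit, and try to push.
cd "$work/mine"
echo mine > mine.txt && git add mine.txt && git commit -q -m "eee"
if git push -q origin master 2>/dev/null; then first_push=ok; else first_push=rejected; fi

# Fetch their commit, merge it locally, and push again.
git fetch -q origin master
their_tip=$(git rev-parse origin/master)
git merge -q --no-edit origin/master
second_parent=$(git rev-parse 'HEAD^2')
if git push -q origin master 2>/dev/null; then second_push=ok; else second_push=rejected; fi

server_tip=$(git --git-dir="$work/server.git" rev-parse master)
echo "first push: $first_push; second push: $second_push"
```

The second parent of my merge commit is the other person’s commit, which is what makes the second push acceptable: the remote’s commit is now reachable from what I am pushing.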

Where to go from here

There is obviously far more to Git than I’ve outlined here. My purpose in this article, however, has not been to teach Git or even to introduce it, but to provide a mental orientation that makes Git comprehensible and usable.

I’ll end by listing some further topics you might be curious about at this point; this article’s general conception of Git should make them easier to understand:

  • Reset. A branch, as I have been at pains to emphasize, is just a name pointing at a commit. As you work “on a branch”, the branch name moves automatically to point at the latest commit. But you also might have reason to want to grab that name and repoint it at some other specific commit yourself, manually. You can! That is what git reset is. By clever use of git reset, you can rewrite your chain of commits in various interesting ways; for example, you can reduce a chain of multiple commits to a single commit, to make the history read more cleanly.

  • Diff. I’ve emphasized that Git does not store changes; it stores commits, which are snapshots. But you may be wanting to retort, “What about git diff? It shows changes!” Yes, of course. Clearly there are many situations where it is valuable to know what would have to be done (or what evidently was done) to the contents of one commit in order to end up with the contents of a different commit. git diff and other related commands do show that. But they are not showing something Git knows, but something it deduces. Differences between commits are a derived, secondary concept.

  • Merge logic. Earlier, I described what a merge is, but I avoided any details of how Git creates a new merge commit based on the commits you have told it to merge. Git’s reasoning here is what I call merge logic. Merge logic is fundamental to many of the more advanced and interesting abilities of Git — not just merge, but also cherry pick, rebase, and revert. This is a big topic and I’m saving it for a subsequent article.

  • Cherry pick and rebase. These are ways of making a copy of a commit that has a different parent from the original. Commits are uniquely identified and effectively immutable, so it follows that you can’t change the parentage of a commit itself! Cherry pick and rebase both manufacture new commits. Of course, there is plenty to know about the details of how they do that, and how you specify what you want Git to do.

  • Pull requests. You might be using a remote repository hosting service, such as GitHub, that offers (through the browser) an ability to make and resolve pull requests (also called merge requests). This enables a topology where you do exactly what I said earlier you cannot do: you merge commits, possibly making a new merge commit, at the remote repository. This should be impossible, and in fact it is impossible! It looks possible only because of some clever trickery on the part of the hosting service. Pull requests are not a Git feature; they are a feature of the hosting service. How they work and how to use them is a completely separate can of worms.
