Picturing Git: Conceptions and Misconceptions
Matt Neuburg
Written on May 13, 2021

Usually, I write on this blog about matters directly related to developing for iOS on a Mac. I typically talk about Xcode, the Swift language, and of course iOS itself (in particular, aspects of Cocoa Touch and related frameworks).
Today I’d like to switch gears and talk about another development-related topic that’s near and dear to my heart — Git.
If you’re a developer and within the sound of my voice, you’re likely to use Git. You are unlikely to work without some kind of version control, and the most commonly used form of version control is Git. There are three primary reasons for using a version control tool such as Git:
- Safety: Git lets you save the state of your work at any moment (hence the name, version control). That means you can feel free to experiment as you develop; if something doesn’t work out, you can always return to an earlier saved state where things were okay.
- Backup: Git lets you synchronize your work from your computer to a computer somewhere off in cyberspace. That way, there’s an offsite copy in case anything bad happens to your computer (or to you).
- Collaboration: The same Git feature that lets you synchronize to a computer over the Internet also lets other people synchronize to that same computer; and the synchronization works both ways. So other people can synchronize with you, and multiple people can work together on a project.
Misunderstanding Git
There’s an odd thing I’ve noticed about the way many developers use Git. Often, they don’t really have an accurate mental model of what Git is and what it does. It’s surprisingly easy to get used to employing a few basic Git commands without having any real idea of what they mean.
I like to use the analogy of driving a car. Driving a car is a cybernetic activity; the car acts as an extension of your body, because your mental picture of how the car behaves in relation to your physical input is adequate. You have a general idea that when you step on the brake the car slows, and that when you turn the wheel to the right the car turns right; and your experience, plus the live feedback from the car (and the world around it) while you’re doing those things, reinforces that idea.
And yet the way people use Git is often not at all cybernetic in that way. Considering, for instance, how often one has to say git add or git commit, or both, you’d think one would have some reasonably accurate mental image of what those commands make Git do. But many people don’t. They know it’s a rote incantation that one is supposed to invoke, and they just routinely invoke it and move on.
When I say “reasonably accurate”, I am not at all talking about what happens deep inside Git, “under the hood.” You don’t need to know that sort of detail, in general, in order to understand the business of driving Git, any more than you need to know what a brake caliper is or what a rack-and-pinion is in order to understand the business of driving a car. We reason (or intuit) about the car analogically, perhaps I should even say metaphorically; but the analogy or metaphor that our mind employs does in fact work.
The problem with how people use Git, I’m suggesting, is that their analogical or metaphorical conception of Git doesn’t work — it doesn’t fit the way Git actually behaves — if, indeed, the conception exists at all. This phenomenon can manifest itself as problematic as soon as an unusual or unaccustomed situation arises. The user then becomes confused and doesn’t know how to proceed, can’t guess what will happen if a certain Git action is taken, or can’t understand what did happen when a certain Git action was taken.
The purpose of this article is to present a simple way of looking at what Git really is and what it really does. I do not claim that this way of looking at Git represents absolute “facts” in any hard and fast or literal sense. But I contend that if you conceive of Git in the way that I’m going to suggest, if you substitute these conceptions of Git for any misconceptions you might have now, you’ll be a much happier and more fluid Git user. When posed with a puzzle as to what happened, what will happen, what you should do in order to make a certain thing happen, the answer might suddenly be obvious, where previously it wasn’t.
That, indeed, has been my own personal experience. I myself used Git for years while thoroughly misconceiving what it does and what it is. It is only relatively recently that my thinking about Git has shifted in the way I’m going to suggest here. This new way of thinking about Git has been a huge help to me, and it might be useful to you as well. Plus, it’s just a good idea in general to understand what you’re doing when you use Git.
The Git repository
We may tend to think of a folder full of our stuff as the repository on our computer. We might call this our “project” or our “Git folder” or even our “Git repository”. But what we see is not the repository. The repository is inside an invisible folder called .git. And the actual location of that invisible folder is crucial.
Let’s say you have a folder of three files under Git control. The hierarchy might look like this:
...
/myFolder
    a.txt
    b.txt
    c.txt
    .git [invisible!]
The repository is not myFolder; it’s .git. And the position of the invisible .git folder here is key. By being located just inside the folder myFolder, the .git folder, by default, grants Git nominal control over all the other files inside myFolder, to any depth.
The phrase “to any depth” implies that this, too, would be a viable organization:
...
/myFolder
    a.txt
    b.txt
    .git [invisible!]
    /myInnerFolder
        c.txt
Git takes control of the whole hierarchy downward from the folder containing the .git folder, including in this case the files in myFolder and any files in folders inside myFolder as well (meaning, also files in myInnerFolder).
But this is not quite such a viable organization:
...
/myFolder
    a.txt
    b.txt
    .git [invisible!]
    /myInnerFolder
        c.txt
        .git [invisible!]
Weird things can happen when you do that. It’s an advanced configuration, and not something most Git users would want to get involved with. So don’t create that kind of organization by mistake!
Another mistake I see beginners make a lot is that they say git init without thinking where they are (in the command-line world) when they say it. For example, don’t do this (on a Mac):
- Start the Terminal
- Say git init
That’s a huge mistake. You now (on a Mac) have an organization like this:
/Users
    /yourname
        .git [invisible!]
        /Applications
        /Desktop
        /Documents
        /Downloads
        /Library
        /Movies
        /Music
        /Pictures
Do you see what’s happened? You’ve just put your whole hard disk hierarchy, from your Home folder on down, under Git control! That’s an absolutely terrible idea. Give the wrong command at this moment, and Git will happily erase that whole region of your hard disk hierarchy, permanently.
If you get yourself into that situation, the best thing to do is remove the unwanted .git folder, immediately. On a Mac, you can do that in the Finder; press Shift-Command-Period to make invisible files and folders visible, spot the problematic .git folder, and throw it in the trash so that Git loses control of your hard disk.
Want to know if a certain folder is nominally under Git control already? Just cd into that folder, and then ask Git:
$ git rev-parse --absolute-git-dir
Git will tell you the path, if any, to the .git repository that the current folder “belongs” to.
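Here is a minimal sketch of that check, run from inside a nested folder. The scratch location made with mktemp is an assumption for illustration, and git rev-parse --show-toplevel is an extra query I am adding here to report the top of the working tree.

```shell
set -e
cd "$(mktemp -d)"                 # scratch area (assumption: any empty folder will do)
mkdir -p myFolder/myInnerFolder
cd myFolder
git init -q                       # creates myFolder/.git
cd myInnerFolder                  # now ask from one level further down

gitdir=$(git rev-parse --absolute-git-dir)   # path of the repository itself
toplevel=$(git rev-parse --show-toplevel)    # top of the working tree it controls
echo "repository:        $gitdir"
echo "working tree root: $toplevel"
```

Outside of any repository, the same command fails with a "not a git repository" error, which is also a useful answer.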
One last thing. How did the .git repository get there in the first place? There are two main ways: most likely, either you said git init or you said git clone. Either of those commands, one way or another, means in part: “Create a .git folder.” The difference is that git init typically creates a brand new .git folder as the repository, whereas git clone typically copies an existing .git repository, which might contain quite a bit of stuff already, from somewhere on the Internet.
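To make the difference concrete, here is a small sketch. It clones from a local path rather than from the Internet, purely so that it runs offline; the folder names original and copy and the identity values are placeholders of mine.

```shell
set -e
cd "$(mktemp -d)"

# git init: a brand new, empty repository
git init -q original
git -C original config user.name "Example"            # placeholder identity
git -C original config user.email "example@example.com"
echo 'hello' > original/a.txt
git -C original add .
git -C original commit -q -m 'here you go, Git'

# git clone: a copy of an existing repository, commits and all
git clone -q original copy

original_commits=$(git -C original rev-list --count HEAD)
copy_commits=$(git -C copy rev-list --count HEAD)
echo "original: $original_commits commit(s), copy: $copy_commits commit(s)"
```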
The files you see
Not only are the files you see not the repository; they aren’t even your files! Here’s what I mean. Let’s say you’ve got this organization:
...
/myFolder
    a.txt
    b.txt
    c.txt
    .git [invisible!]
And now suppose you cd into myFolder and say:
$ git add .
$ git commit -m 'here you go, Git'
You have just said that the three files a.txt, b.txt, and c.txt are no longer yours. In a very real sense, they are not even the “real” files any more. The real files are now inside the repository! They are in the invisible .git folder (in a form that makes them quite hard to see, but don’t worry about that).
What you are now seeing is basically some copies of the files in the repository. I like to say that these are copies that Git lends you. Why does Git lend you these copies? So that you can edit them and make another commit! In other words, Git lends you copies of files that the repository contains, in order to give you a way to work on those files. That is why this entire area, all the contents of myFolder on down to any depth (except inside the .git folder itself), is called the working tree (or the working directory) corresponding to this repository.
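One way to convince yourself that the visible files are only borrowed copies is to throw one away and ask for it back. This is a sketch under assumptions: a scratch folder, placeholder identity values, and a single file for brevity.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'contents of b' > b.txt
git add .
git commit -q -m 'here you go, Git'

rm b.txt              # discard the copy in the working tree
git restore b.txt     # Git recreates it from what it has stored inside .git
restored=$(cat b.txt)
echo "$restored"
```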
Commits
Despite the impression I may have given so far, Git doesn’t actually traffic in files. It also doesn’t traffic in folders. Furthermore, Git doesn’t traffic in changes. (A lot of people seem to think Git is about changes.) No!
Git traffics in commits.
What is a commit, then? A commit is effectively a snapshot of the current state of your working tree — the entire current state of the entire working tree. (I’m over-simplifying here, but I’ll correct that in the next section, so just bear with me.)
Here’s what I’m getting at. Let’s go back to the situation I posited a moment ago. You’ve got this organization:
...
/myFolder
    a.txt
    b.txt
    c.txt
    [.git]
And you cd into myFolder and say:
$ git add .
$ git commit -m 'here you go, Git'
You have now made a commit, and that commit contains the current state of all three files — a.txt, b.txt, and c.txt.
Okay, fine. And now let’s say you edit just one of the files — let it be b.txt. You edit it and then you say:
$ git add .
$ git commit -m 'another commit'
Or else (this is pretty much the same thing):
$ git commit -a -m 'another commit'
Now: what did you just commit? What exactly is in this new commit that you just made? A lot of people would say: “It’s b.txt.” Or they might even say: “It’s the changes in b.txt.” No! That is wrong!
What is in this commit is exactly the same sort of thing that was in the previous commit. This new commit, once again, contains the current state of all three files — a.txt, b.txt, and c.txt.
So a commit is basically a photograph, as it were, of the entire working tree. I like to think of the commits, in fact, as constituting a family photograph album. Imagine that a family gets together for a reunion every year, and a photographer takes a group photo of everyone. And suppose that in one of these photos we see that one of the people has grown a beard since the previous photo. You would not say that this photo is only of that one person, just because only that person has changed! They are still all group photos of the whole family.
One reason this is hard for people to realize or believe is that it is hard to see what’s in a commit. You can’t just peek into the .git folder. But there’s a way! To demonstrate, I’ll start all over once again. We have a folder containing three files, a.txt, b.txt, and c.txt. I cd into that folder and I do this:
$ git init
$ git add .
$ git commit -m 'here you go, Git'
Then I edit b.txt and say:
$ git add .
$ git commit -m 'another commit'
Now watch this little move:
$ git ls-tree --name-only HEAD
HEAD is a reference to the most recent commit, the one I just made. I’m asking Git what that commit contains. And here is what Git says:
a.txt
b.txt
c.txt
You see? The commit contains all the files, not just the “changes” caused by editing b.txt.
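The same point can be checked by comparing the two commits side by side. In this sketch, HEAD~1 is standard Git notation for the parent of HEAD; the file contents and identity values are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
for f in a b c; do echo 'first version' > "$f.txt"; done
git add .
git commit -q -m 'here you go, Git'
echo 'second version' > b.txt
git add .
git commit -q -m 'another commit'

first=$(git ls-tree --name-only HEAD~1)      # what the first commit contains
second=$(git ls-tree --name-only HEAD)       # what the second commit contains
changed=$(git diff --name-only HEAD~1 HEAD)  # what differs between the two
echo "first commit:";  echo "$first"
echo "second commit:"; echo "$second"
echo "differs: $changed"
```

Both listings name all three files; only the comparison between the commits singles out b.txt.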
(Another reason why people don’t realize what’s in a commit is that they rely on git status to tell them, without knowing what the output of git status actually means. I’ll talk more about that a bit later.)
The index
The index (also, alas, called various other things like the stage, the staging area, or the cache) is a terribly important place in the world of Git. It’s effectively invisible, but we can take a look at it anyway!
Let’s go back to the little dance I did right at the beginning. We’re starting all over once again. Let’s arrive fresh into our folder. The folder has three files, a.txt, b.txt, and c.txt. I cd into that folder and I do this:
$ git init
$ git add .
OK, stop. I didn’t commit yet. I just used the add command. In particular, as one so often does, I said to add everything (that’s what the dot means). So… add everything to what? To the index!
You might say: Prove it! Well, once again, I’ve got a trick up my sleeve. Let’s start all over yet again. I cd into the folder and I do this:
$ git init
$ git ls-files
That’s how we peek into the index! There is no output, of course, because we only just created the repository and then we stopped, so the index is empty. But now I’ll peek into the index again after saying git add, like this:
$ git add .
$ git ls-files
This time Git responds:
a.txt
b.txt
c.txt
That’s the contents of the index! We’ve copied the three files, a.txt, b.txt, and c.txt, into the index.
Cool, so why did we do that? So that we can now say:
$ git commit -m 'here you go, Git'
That makes a commit: a snapshot. But what snapshot? A moment ago, I seemed to say that it was a snapshot of the working tree. But that’s not right! The snapshot that constitutes a commit, in general, is the index. That’s what the index is: it’s the material that will make up the next commit if you now say git commit. When you commit, in general, you commit the index.
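A quick experiment shows that it really is the index, not the working tree, that gets committed. This is a sketch with placeholder file contents and identity values: I edit b.txt, add it, edit it again without adding, and then commit.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'original' > b.txt
git add .
git commit -q -m 'here you go, Git'

echo 'staged version' > b.txt
git add b.txt                    # this state goes into the index
echo 'later edit' > b.txt        # this state exists only in the working tree
git commit -q -m 'another commit'

committed=$(git show HEAD:b.txt) # b.txt as recorded in the new commit
working=$(cat b.txt)             # b.txt as it sits in the working tree
echo "commit has: $committed"
echo "folder has: $working"
```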
I like to say that there are actually three worlds of Git:
- The repository. I’m imagining the repository here as consisting (primarily) of stored commits. These are snapshots reflecting past versions of your work.
- The index. This is where you configure what you want to go into the next commit.
- The working tree. These are files that Git lends you from the repository, so that you can edit them and then add them to the index, in order to tell Git what the index (and hence the next commit) should look like. The working tree is the only part of this triad that you can see directly.
The index is actually the reason why a commit is, by default, a snapshot of all the files. Once a file is in the index, it stays in the index until you take rather drastic measures to take it out again (I’m not going to describe what those drastic measures are). Therefore, that file will continue to be part of every commit you make from now on (unless you take those drastic measures).
I’ll demonstrate. Remember how, a moment ago, I made an initial commit by saying:
$ git commit -m 'here you go, Git'
Okay, so having made that commit, what’s in the index now? We committed everything from the index; is the index now empty? No! Let’s ask again what’s in the index:
$ git ls-files
And what do you think Git says?
a.txt
b.txt
c.txt
Those three files are still there! And that means that they will still be there (unless we take drastic measures) when we make the next commit — and so on. That is how a commit becomes a snapshot of all the files; it’s because the index, by default, just keeps hanging on to the files that have been put into it.
Understanding git status
We are now in a position to understand the mysterious output of git status. I’m afraid that this one command is, all by itself, responsible for a lot of the prevalent misunderstandings of what Git is. That’s partly because the output of git status omits so much. In fact, it’s fair to say that it omits almost everything! It’s also because the output of git status talks in terms of “changes”, even though Git itself is not about changes.
To demonstrate, let’s pick up where we just left off; we’ve just committed all three files. Now we edit b.txt and stop. Let’s pause to think what the situation is. The index contains a.txt, b.txt, and c.txt. Meanwhile, we’ve also edited the copy of b.txt in the working tree. What does git status have to say about this situation?
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   b.txt

no changes added to commit (use "git add" and/or "git commit -a")
When people see that, their eyes glaze over. It’s a lot of information. But they see the word “changes”, they see the mention of b.txt, and they think Git is all about changes. But it isn’t. A commit consists of all the files. The index contains all the files. But git status omits from its output any mention of files that didn’t change.
In other words, git status doesn’t answer the question “What will the next commit consist of?” It answers the question, “What’s new?” But what goes into the next commit is not what’s new; what goes into the next commit is everything.
Ok, so let’s accept that. Then where does git status get its idea of what’s new? I just said that git status omits from its output any mention of files that didn’t change. How does Git know what changed? Changed from what?
It’s simple. git status works by comparing the three worlds! In the above output, git status is reporting that the most recent commit and the index are exactly the same as one another, but the working tree is different from the index, because we edited b.txt but we didn’t add that edited version to the index. That’s how git status knows that the working tree b.txt has been modified: it no longer matches the b.txt in the index.
Okay, let’s add our modified b.txt to the index and see what git status says then:
$ git add .
$ git status
Here’s the output now:
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   b.txt
That means that the index and the working tree are now exactly the same as one another, but the index is different from the most recent commit. The index’s version of b.txt is our modified version, and that’s different from the b.txt that went into the last commit. Thus, if we said git commit now, the b.txt in the new commit would be different from the b.txt in the previous commit.
To make the git status report even more interesting, let’s now edit a.txt and not add it to the index, so that we get both kinds of difference in one report. Remember, we edited b.txt and added it to the index, and now we’ve edited a.txt but we didn’t add it to the index. Now here’s the output of git status:
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   b.txt

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   a.txt
You should be able to interpret that output, in terms of the three worlds of Git:
- There’s a description of how the index differs from the most recent commit: “Changes to be committed”. b.txt in the index is different from b.txt in the most recent commit, and if you said git commit right now, b.txt in the new commit would be different from b.txt in the previous commit.
- There’s a description of how the index differs from the working tree: “Changes not staged for commit”. a.txt in the working tree is different from a.txt in the index, and its current state in the working tree won’t go into the next commit (unless you git add it).
You could actually perform, yourself, the same comparison that git status is performing here:
- To find out how the index differs from the working tree, say git diff.
- To find out how the index differs from the most recent commit, say git diff --cached.
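The two comparisons can be sketched like this, reproducing the situation where b.txt is edited and added while a.txt is edited but not added. The --name-only option, which I am using to keep the output short, lists just the file names instead of the full diff; file contents and identity values are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
for f in a b c; do echo 'first version' > "$f.txt"; done
git add .
git commit -q -m 'here you go, Git'

echo 'edited' > b.txt
git add b.txt                  # index now differs from the most recent commit
echo 'edited' > a.txt          # working tree now differs from the index

unstaged=$(git diff --name-only)           # working tree vs. index
staged=$(git diff --cached --name-only)    # index vs. most recent commit
echo "Changes not staged for commit: $unstaged"
echo "Changes to be committed:       $staged"
```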
Another common thing to want to do is to get a printout of what a particular file looks like in the most recent commit (or indeed, any commit): not as a diff, but the whole text of the file, similar to saying cat for a file in the working tree. To do so, use git show with <commit>:<path> notation. The most recent commit is HEAD, which you can abbreviate as @. For example, to see the contents of b.txt in the most recent commit:
$ git show @:b.txt
Another feature of the git status output is that it advises you about what you might want to do next. A lot of people seem quite blind to this information, which can be very useful and should not be ignored!
- We are told that the modified state of b.txt has been added to the index, but if we wanted to take that state back out of the index, so that it doesn’t become part of the next commit, we would say git restore --staged b.txt. (You recall that the index is sometimes called the stage?)
- We are told that a.txt has been modified in the working tree (you recall that the working tree is sometimes called the working directory?), and if we wanted to add it to the index, so that it is ready to become part of the next commit, we would say git add a.txt.
- We are also told that if we regret having edited a.txt, we can undo our edits by saying git restore a.txt. Be careful! That kind of undo action is not, itself, undoable. If you have not committed a certain state of a certain file, then that state of that file, if undone or overwritten, can never be recovered. So if you cared about the edits you have made in a.txt and you said git restore a.txt, you would be filled with regret when you lost those edits, permanently!
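Here is a sketch of the first and last of those suggestions in sequence, using a single file for brevity (the file contents and identity values are placeholders). Unstaging leaves the edit sitting in the working tree; the second restore then discards it for good.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'first version' > b.txt
git add .
git commit -q -m 'here you go, Git'
echo 'edited' > b.txt
git add b.txt

git restore --staged b.txt                 # take the edit back out of the index
staged_after=$(git diff --cached --name-only)
unstaged_after=$(git diff --name-only)     # the edit survives in the working tree

git restore b.txt                          # discard the edit itself: not undoable!
final=$(cat b.txt)
echo "staged: '$staged_after'  unstaged: '$unstaged_after'  b.txt now: '$final'"
```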
The parent chain
An intuition that every Git user probably has is that Git preserves history. That, after all, is what makes it possible to “undo” back to any earlier commit. And you can see a display of the history by saying git log.
However, history is not, of itself, a “thing” in Git. There are just commits. So how does Git turn a collection of commits into a history?
Well, a commit has a feature I haven’t told you about yet: it has a parent, which is just a pointer to another commit. The only commit that has no parent is the very first commit in the repository, which is called the root commit.
When you make a commit (other than the root commit), Git does a really clever, elegantly simple thing. Recall that HEAD is a name for the most recent commit. So when you make a new commit, Git configures the new commit so that its parent is the HEAD commit, and then it immediately changes HEAD itself to point at the new commit, which makes sense because it is now the most recent commit.
Let’s enact that little game in a diagram. Start all over, where we cd into a folder and say git init. We then say:
$ git add .
$ git commit -m 'here you go, Git'
and we end up with a parentless root commit, which I will call A:
A
|
HEAD
Notice that HEAD here is just a name for the most recent commit. Now we edit a file (it could be b.txt) and say
$ git add .
$ git commit -m 'another commit'
Git creates another commit, which I will call B, and sets its parent as HEAD, which is A:
A <- B
|
HEAD
But you will never see the repository in that state, because, in the very same move, Git immediately rips the name HEAD away from A and sets it to B instead, because B is now the most recent commit:
A <- B
     |
    HEAD
Do you see how the left-pointing arrow works in that diagram? Time marches to the right, as it were; B is newer than A. But the parentage points to the left; B is pointing to A as its parent, meaning, “This is the commit that came before me.”
Okay, but you see where this is going, don’t you? Just keep playing that same game: edit, add, commit; edit, add, commit; edit, add, commit. After making a whole bunch of commits, we’ve got something that looks like this:
A <- B <- C <- D <- E <- F <- G
                              |
                             HEAD
Now we can turn to Git and say: “Show me my history!” Typically, we will say git log. What does Git do? Git doesn’t actually know anything about any history. All it knows is where HEAD points to. But that’s all it needs to know, because every commit has a pointer to its parent! Thus, knowing that HEAD is G, Git knows that G has a parent F. And knowing that, Git knows that F has a parent E. And so on, as far back as you care to go, potentially all the way to the root commit.
Whenever you look at a machine-made diagram of the history of your repository, that’s all that Git is doing: it’s just walking back through the parent chain, one parent at a time. But this is a really simple and fast thing to do, so Git effectively does it instantly, making it look as if it magically “knows” the whole history.
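You can watch the parent chain directly. In this sketch I make three commits with the messages A, B, and C and then ask git log to print each commit’s subject next to its parent’s abbreviated identifier (the %s and %p format placeholders). The ^ suffix is standard Git notation for “parent of”; the identity values are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
for msg in A B C; do
  echo "$msg" > a.txt
  git add .
  git commit -q -m "$msg"
done

git log --format='%s  parent: %p'    # C and B each show a parent; A shows none

a=$(git rev-parse 'HEAD^^')                   # walk back two parents from HEAD
root=$(git rev-list --max-parents=0 HEAD)     # the commit with no parent
chain=$(git log --format=%s | xargs)          # subjects in the order Git walks them
echo "walk from HEAD: $chain"
```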
Commits can’t change
An important corollary of this architecture is that a commit, once created, can basically never be modified. That makes sense, because if it could, you’d be violating the history, and thus violating the whole purpose of Git.
There are a number of commands that may look as if they alter a commit. For example, you may be aware that you can amend the commit you just made, in order to change its commit message. But that’s an illusion! When you amend a commit, you are actually creating a new commit. The new commit may have the same contents as the old commit, but it is not, in fact, the old commit.
There is a direct way to prove that: look at the unique identifier number that is attached to every commit. Here, I’ll do an entire dance that demonstrates. Once again we start with a folder and we cd into that folder:
$ git init
$ git add .
$ git commit -m 'Here you go, Git'
# edit b.txt...
$ git add .
$ git commit -m 'another commit'
$ git log --oneline
Git says:
8fcb682 (HEAD -> master) another commit
101eb67 Here you go, Git
Okay, now I’ll amend the commit message of that last commit:
$ git commit --amend -m 'a lovely commit'
$ git log --oneline
Now Git says:
e676370 (HEAD -> master) a lovely commit
101eb67 Here you go, Git
You see? That second commit has a different identifier from the old second commit. That’s because e676370 is an altered copy of 8fcb682.
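A sketch of the same experiment that also checks two further things: that the old commit still exists for the moment, and that the old and new commits hold the same snapshot. The ^{tree} suffix, which names the snapshot a commit records, peeks slightly under the hood; it and the identity values are my additions for illustration.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'first version' > b.txt
git add .
git commit -q -m 'Here you go, Git'
echo 'second version' > b.txt
git add .
git commit -q -m 'another commit'

old=$(git rev-parse HEAD)
git commit -q --amend -m 'a lovely commit'
new=$(git rev-parse HEAD)

old_type=$(git cat-file -t "$old")            # is the old commit still around?
old_tree=$(git rev-parse "$old^{tree}")       # snapshot recorded by the old commit
new_tree=$(git rev-parse "$new^{tree}")       # snapshot recorded by the new commit
echo "old: $old"
echo "new: $new"
```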
Branches
In Git, branches are extremely important. What is a branch in Git? A branch is just a name for a commit, rather like HEAD.
When you say git init and then add-and-commit to create the root commit, that root commit gets the branch name master by default; you can change that default, and the Git folks themselves will probably change it in the near future, but let’s just assume that’s the name. So the situation actually looks like this:
A
|
master
|
HEAD
The word master here, I repeat, is just a name: a name with a pointer. And what it’s pointing at is A. But after you’ve made a lot of commits, the situation looks like this:
A <- B <- C <- D <- E <- F <- G
                              |
                            master
                              |
                             HEAD
Do you see what’s happened? After every add-and-commit, it isn’t just the HEAD pointer that gets repointed at the most recent commit; the name master, too, gets repointed at the most recent commit. That is what a branch really is: it’s a name for a commit, a name that can move automatically to point to the newest commit.
Some of the most pervasive misconceptions about Git have to do with what a branch is. There is a tendency, for example, to think of a branch as a sort of topological thing. In the diagram above, for instance, you might think that there’s a thing called “the master branch” consisting of all the commits from A to G.
But that’s really not the case! There is the name master, pointing at G; that’s the branch, and that’s all the branch is. So it is really not right to say, for example, that F is “on the master branch”, tempting as that way of thinking and speaking may be. What does exist is the parent chain as I described it in the previous section. The correct thing to say is that F is reachable from master, meaning that by working our way backward along the parent chain from the one master commit (which is G), we can get to F. And E is reachable from master too, and so is D, and so on.
And of course you can make more branches. For example, you can say:
$ git branch otherbranch
When you give that command, you are saying: “Create a new branch name otherbranch and point that name at this commit.” The commit designated by “this commit” is HEAD by default, but you can specify any existing commit as the one that otherbranch should point at.
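Both forms of the command can be sketched as follows. The branch name olderbranch is hypothetical, introduced only to show the second form, and HEAD~1 is standard Git notation for the parent of HEAD; the identity values are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'first version' > b.txt
git add .
git commit -q -m 'here you go, Git'
echo 'second version' > b.txt
git add .
git commit -q -m 'another commit'

git branch otherbranch             # new name, pointing at HEAD by default
git branch olderbranch HEAD~1      # new name, pointing at a commit we specify

head=$(git rev-parse HEAD)
parent=$(git rev-parse HEAD~1)
other=$(git rev-parse otherbranch)
older=$(git rev-parse olderbranch)
echo "otherbranch -> $other"
echo "olderbranch -> $older"
```

Notice that neither command creates a commit; each just attaches another name to a commit that already exists.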
If you have multiple branches and you make commits on both of them, you might end up with a topology like this:
                  otherbranch
                       |
             X <- Y <- Z
            /
A <- B <- C <- D <- E <- F <- G
                              |
                            master
                              |
                             HEAD
How did that happen? Back at C, we made a new branch called otherbranch; it originally pointed at C, but then we made three commits on that branch, moving the name otherbranch from C to X and then to Y and then to Z.
Again, there is a tendency to think that X, Y, and Z somehow constitute the branch otherbranch. But that is not true! In the diagram, C is the parent of X as well as being the parent of D, and so C is reachable from otherbranch, every bit as much as C is reachable from master. So what “branch” do you want to claim C is on? The question makes no sense, and you should not confuse yourself by asking it. Think in terms of names and commits and parent chains and reachability, and you’ll be fine.
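Reachability is something you can ask Git about directly. The sketch below builds a smaller version of the diagram (commits A, B, C, then D on the first branch and X on otherbranch) and uses git merge-base --is-ancestor, which succeeds when the first commit is reachable from the second. The helper function make_commit and the identity values are mine, for illustration only.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"

make_commit() {                    # hypothetical helper: one file per commit
  echo "$1" > "$1.txt"
  git add .
  git commit -q -m "$1"
}

make_commit A; make_commit B; make_commit C
c=$(git rev-parse HEAD)
first=$(git symbolic-ref --short HEAD)   # the default branch, whatever it is called
git branch otherbranch                   # points at C
make_commit D                            # the first branch moves on to D
git switch -q otherbranch
make_commit X                            # otherbranch moves on to X
x=$(git rev-parse HEAD)

reachable() {                      # hypothetical helper: prints yes or no
  if git merge-base --is-ancestor "$1" "$2"; then echo yes; else echo no; fi
}
c_from_other=$(reachable "$c" otherbranch)
c_from_first=$(reachable "$c" "$first")
x_from_first=$(reachable "$x" "$first")
echo "C from otherbranch: $c_from_other, C from $first: $c_from_first, X from $first: $x_from_first"
```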
The life (and death) of commits
Branches are useful in part just because a name is easy to remember and use. But they serve another important function: they keep commits alive. In the diagram above, every commit shown is reachable from either the branch name otherbranch or the branch name master (or both). That fact is what preserves all these commits!
To illustrate, if you were now to delete the branch otherbranch, what would happen? All you are doing is erasing the name otherbranch; Git will discourage you from doing that, but it will happily let you do it nonetheless.
But why does Git discourage you? It’s because once you’ve done that, Z is not reachable from any name, nor is Y, nor is X. (C, on the other hand, is still reachable, namely from master.) And therefore, unless you take measures to the contrary, X and Y and Z will eventually be destroyed, in a natural process called garbage collection.
Moreover, that is the only way in which commits in a repository are ever destroyed, in the normal course of things: a commit is slated for destruction when, and only when, it becomes unreachable from any name.
I’ll illustrate with an earlier example. Remember when I did a commit --amend to change a commit’s message? First I had this:
8fcb682 (HEAD -> master) another commit
101eb67 Here you go, Git
And then, after the amend, I had this:
e676370 (HEAD -> master) a lovely commit
101eb67 Here you go, Git
The question now is: what happened to 8fcb682? The answer is: Nothing, yet. But eventually it will be permitted to go out of existence, which makes sense, as it serves no purpose; it has been replaced by e676370. The commit 8fcb682 is now unreachable from any branch name, and therefore it is slated for eventual destruction.
A branch is not the only kind of name that keeps reachable commits alive, but it is far and away the most important kind. (Another is something called a tag.) So branches are crucial in keeping your history alive.
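Here is a sketch of deleting a branch name and seeing what becomes of its commit. I am assuming the usual pair of options: git branch -d, which refuses when the branch has commits not reachable from where you are, and git branch -D, which deletes regardless. The identity values are placeholders.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example"               # placeholder identity
git config user.email "example@example.com"
echo 'A' > a.txt
git add .
git commit -q -m 'A'
first=$(git symbolic-ref --short HEAD)   # the default branch, whatever it is called

git switch -q -c otherbranch             # create otherbranch and move onto it
echo 'X' > a.txt
git add .
git commit -q -m 'X'
x=$(git rev-parse HEAD)
git switch -q "$first"

# Git discourages the deletion...
if git branch -d otherbranch 2>/dev/null; then refused=no; else refused=yes; fi
# ...but lets you insist.
git branch -D otherbranch >/dev/null

names=$(git branch --contains "$x")      # branch names from which X is reachable
still_there=$(git cat-file -t "$x")      # X has not been destroyed yet
echo "refused: $refused, names reaching X: '$names', X is still a: $still_there"
```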
Working on a branch
I’ve said that you should not think of a commit in your history as being “on” any particular branch. But it does make sense to use the notion of being “on” a branch — not with regard to past commits, but with regard to you! Under normal circumstances, when you are working within a Git repository’s working tree, you, yourself, are always working on some branch.
Just what does that mean? Technically, it’s simply a matter of where HEAD is pointing. When we say that you are on a certain branch, we just mean that HEAD is pointing to that branch name at this moment.
Looking at the diagram I showed earlier of a repository that has two branches, let’s examine the difference between you being on one branch and you being on another branch. Here, you are on master:
                  otherbranch
                       |
             X <- Y <- Z
            /
A <- B <- C <- D <- E <- F <- G
                              |
                            master
                              |
                             HEAD
And here, you are on otherbranch
:
HEAD
|
otherbranch
|
X <- Y <- Z
/
A <- B <- C <- D <- E <- F <- G
|
master
From a topological perspective, they are exactly the same. The only difference is what HEAD
is pointing to.
From a practical perspective, however, it makes all the difference in the world what branch you are on. That’s why it’s so important to know what branch you are on — and that’s why it’s the first thing git status
tells you:
$ git status
On branch master
Technically, that means HEAD
is pointing to master
. Practically, it tells you (among other things) what will happen if you make a commit right now. If you do make a new commit right now, its parent will be the commit that is now master
, and master
will then point to the new commit.
So, earlier I said that a branch is “a name that can move automatically to point to the newest commit.” But I didn’t tell you what branch name is going to behave that way. The answer is: the branch you are on!
This, then, is a meaningful sense of the notion “on a branch.” You can say you are working “on branch master” to mean: master is the branch name that HEAD points to, and so master is the pointer that automatically gets moved every time I make a commit.
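Here is a small sketch of that behavior, assuming a throwaway repository with a made-up file a.txt and a demo identity. It asks Git what HEAD points to, makes a commit, and then checks that it was master that moved, not HEAD.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q .
git symbolic-ref HEAD refs/heads/master    # make sure the initial branch is called master
echo one > a.txt
git add a.txt
git commit -q -m "first"

git symbolic-ref HEAD                      # prints refs/heads/master: HEAD points at a branch name
before=$(git rev-parse master)             # the commit master names right now

echo two >> a.txt
git commit -q -a -m "second"
after=$(git rev-parse master)              # master has moved to the new commit

git symbolic-ref HEAD                      # still refs/heads/master; HEAD itself did not change
git rev-parse "$after^"                    # the new commit's parent is the commit master used to name
```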
Switching branches
Think now, please, about how we might get from the first situation in the above pair of diagrams to the second situation — from this:
otherbranch
|
X <- Y <- Z
/
A <- B <- C <- D <- E <- F <- G
|
master
|
HEAD
To this:
HEAD
|
otherbranch
|
X <- Y <- Z
/
A <- B <- C <- D <- E <- F <- G
|
master
As I said before, all that has happened technically to the topology is that HEAD
has changed what it points to. But how?
There are two main ways: you can use git checkout or git switch. The latter is preferable nowadays, as the former, git checkout, is an overloaded command (it also restores files in the working tree, among other jobs). However, I do give commands like git checkout master all the time, so I can understand if you do too. Either way, let’s say that you are switching branches.
What happens when you switch branches? How does Git respond when you tell it to switch branches? It’s very important to understand this, because switching branches causes your little universe to change quite dramatically.
So far, I have talked about commits as things you assemble and create. You say git add
and then git commit
and you have basically put copies of a bunch of files into a commit. Now, however, I am going to reverse that directionality and talk about getting copies of files out of a commit.
That is what happens when you switch branches. When you switch branches from master
to otherbranch
, Git whips the HEAD
pointer away from pointing at master
and points it at otherbranch
instead. But that is not all that happens! Git also takes aim at your entire working tree. Git effectively throws away the contents of your working tree, and replaces those contents with a copy of the contents of the commit currently pointed to by otherbranch
. (That’s commit Z
in the diagram.)
I will demonstrate. Before switching branches, let’s see what’s in each branch — what files are in the commits pointed at by the branch names. I’ve configured all this beforehand to make things particularly obvious:
$ git ls-tree --name-only master
a.txt
b.txt
$ git ls-tree --name-only otherbranch
c.txt
OK, so master
has a file a.txt and a file b.txt, while otherbranch
has a file c.txt. I am currently on branch master
:
$ git status
On branch master
And sure enough, my working tree contains a file a.txt and a file b.txt:
$ ls -1
a.txt
b.txt
Now I’ll switch branches and see what my working tree contains now:
$ git switch otherbranch
$ ls -1
c.txt
Oh, my golly! My files a.txt and b.txt are gone! They’ve been completely replaced by this other file c.txt. Run in circles, scream and shout!!
But no. There’s no need to panic. What Git is doing makes perfect sense, because we have just told it that we would like to be “working on” the branch otherbranch
. We want the working tree to look like otherbranch
, so that we can edit in the working tree and make new commits that make sense in the context of being on otherbranch
.
And after all, the files a.txt and b.txt are not actually “gone” at all. They are safely inside the master
commit, where they were all along. Remember, the working tree is just some files that Git lends you; the working tree is a representation so that you can work on a branch. The real files are tucked safely in the repository. Don’t worry, be happy.
And there’s more. When you switch branches from master to otherbranch, Git also removes the contents of the index and replaces them with the contents of the otherbranch commit too! And that makes sense as well. Right after you switch branches, we expect to be in a neutral state: generally speaking, git status should come up empty, ready for us to start working. So the working tree should look like the index and the index should look like HEAD, which is otherbranch.
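To try the whole round trip yourself, here is a sketch that builds a comparable repository from scratch. The way I construct otherbranch here (branching from master, then removing a.txt and b.txt) is just one convenient assumption; any branch whose commit contains different files would do.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q .
git symbolic-ref HEAD refs/heads/master
echo a > a.txt
echo b > b.txt
git add a.txt b.txt
git commit -q -m "a and b"

git switch -q -c otherbranch               # new branch, starting from master's commit
git rm -q a.txt b.txt
echo c > c.txt
git add c.txt
git commit -q -m "only c"

git switch -q master
ls -1                                      # a.txt and b.txt
git switch -q otherbranch
ls -1                                      # c.txt: the working tree was replaced
git ls-files                               # c.txt: so was the index
git status --porcelain                     # prints nothing: working tree, index, and HEAD agree
git switch -q master
ls -1                                      # a.txt and b.txt are back, copied out of master's commit
```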
Now, you may be asking (and I hope you are): Is switching branches dangerous? In particular: What if I have active work lying around uncommitted in my working tree and possibly my index, and I switch branches at that moment?
In most situations, it’s okay. In real life, for instance, it very often happens that you start to edit some files and then suddenly realize that you really should be making a new branch so you can commit this work on that branch. That is generally fine to do. You create a branch and switch to it, and nothing dramatic happens at all, because the new branch looks like the branch you were already on. The edited files are still there in your working tree, ready to be added and committed on the new branch.
However, in broader terms, if you have edited files and you then switch to another branch that already exists, there is in fact some danger that the version of a file in that branch will be different from the version of the file that you have just edited. Switching branches thus threatens to wipe out your edits, overwriting them with the version from the branch you are about to switch to.
Git is supposed to detect this situation and stop you from switching branches in that case; it generally does, but every once in a while one hears horror stories of how Git permitted the switch and work was lost. On the whole, it’s probably a pretty good idea to switch branches only when Git is in a neutral state, where all your edits have been added and committed and git status reports that the working tree is clean.
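Here is a sketch of the dangerous case, under the assumption of a throwaway repository where a.txt has different committed contents on master and on otherbranch, plus an uncommitted edit on top. In this setup Git refuses the switch and leaves the edit alone.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q .
git symbolic-ref HEAD refs/heads/master
echo "v1" > a.txt
git add a.txt
git commit -q -m "init"
git branch otherbranch                     # otherbranch keeps the "v1" version of a.txt
echo "master version" > a.txt
git commit -q -a -m "edit on master"       # master now has a different version

echo "uncommitted edit" > a.txt            # active work, not yet added or committed

# Switching would have to overwrite a.txt, so Git declines.
if git switch otherbranch 2>/dev/null; then
    echo "switched"
else
    echo "Git refused to switch; the edit is still in the working tree"
fi

git symbolic-ref --short HEAD              # master
cat a.txt                                  # uncommitted edit
```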
Merging
Merging is probably the most far-reaching and elaborate thing that Git knows how to do. And for that very reason, there are lots of misconceptions about it!
So what is merging? There are many variants on what can happen when you merge, but here’s what we might think of as the canonical case:
- You are on a branch, usually a primary branch of some sort; let’s say it’s master.
- You say git merge otherbranch (or whatever the name of some other branch is).
- Git now creates, out of whole cloth, a completely new commit combining the contributions from both branches. Moreover:
  - You are working on master, so this commit is on master; the master branch name pointer is advanced to point to this new commit.
  - The otherbranch pointer is not advanced.
  - This new commit has a remarkable feature: it has two parents, the master commit we were on before you said merge, and the commit pointed to by otherbranch, in that order.
So, for instance, suppose we are in this situation:
otherbranch
|
X <- Y <- Z
/
A <- B <- C <- D <- E <- F <- G
|
master
|
HEAD
If you now say git merge otherbranch
, you get this:
otherbranch
|
X <- Y <- Z <--------\
/ \
A <- B <- C <- D <- E <- F <- G <- M
|
master
|
HEAD
The newly minted commit, created entirely by Git, is M
. And M
is now master
. And it has two parents! The first parent is G
, which was master
previously. The second parent is Z
, which was otherbranch
previously (and still is).
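You can confirm the two parents, and their order, for yourself. The following is a sketch in a throwaway repository; the commit messages A, G, and Z merely echo the diagram, and the file names are made up.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q .
git symbolic-ref HEAD refs/heads/master
echo a > a.txt; git add a.txt; git commit -q -m "A"
git branch otherbranch
echo g > g.txt; git add g.txt; git commit -q -m "G"    # master moves ahead
git switch -q otherbranch
echo z > z.txt; git add z.txt; git commit -q -m "Z"    # otherbranch diverges
git switch -q master

g=$(git rev-parse master)                  # what master names before the merge
z=$(git rev-parse otherbranch)             # what otherbranch names

git merge -q --no-edit otherbranch         # Git creates the merge commit M

git log -1 --pretty=%P master              # two hashes: first parent G, then second parent Z
git rev-parse otherbranch                  # unchanged: otherbranch was not advanced
```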
(At this point, I could talk about variants of merges, such as fast-forwarding and squashing; about the logic of how Git creates a merge commit based on “the contributions from both branches”; and about what happens when that logic is insufficient and Git turns to you for assistance — unfortunately known as a “conflict”, a term that has instilled unnecessary fear and misunderstanding in far too many Git users. But those are subjects for another day.)
In many situations, the purpose of a secondary branch all along was (assuming things panned out successfully) to be merged eventually into the main branch. You created the extra branch in order to try to implement some feature or experiment with some line of development; you succeeded, and now you want your work to be contributed back onto the main branch. Therefore, after merging, you might as well delete the secondary branch. A branch is just a name, after all, so that’s all that gets deleted: the name! If you deleted otherbranch
right now, you’d get this:
X <- Y <- Z <--------\
/ \
A <- B <- C <- D <- E <- F <- G <- M
|
master
|
HEAD
The topology is unchanged; the name otherbranch
is gone, and that’s all. Deleting the name otherbranch
has, indeed, no important consequences, assuming (as I do assume here) that we were never going to use it for anything further. After all, the merge commit, with its two parents, preserves the history of what happened — the two parent chains that led up to the merge. All the commits in the diagram are reachable from master
. So the name otherbranch
is not needed in order to preserve anything.
At the same time, even with the name otherbranch
gone and forgotten, G
retains a kind of primacy over Z
as a parent of M
, because it is the first parent of M
, and Git knows that fact (because the parents are recorded together with their order in M
). That can make a difference to the way Git describes and displays the topology.
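As a sketch of both points, here is a throwaway repository (same made-up commits A, G, and Z as in the diagram) where I delete the branch name after merging, check that Z is still reachable from master, and then ask Git to walk first parents only.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q .
git symbolic-ref HEAD refs/heads/master
echo a > a.txt; git add a.txt; git commit -q -m "A"
git branch otherbranch
echo g > g.txt; git add g.txt; git commit -q -m "G"
git switch -q otherbranch
echo z > z.txt; git add z.txt; git commit -q -m "Z"
git switch -q master
git merge -q --no-edit otherbranch

z=$(git rev-parse otherbranch)             # remember Z before the name goes away
git branch -d otherbranch                  # -d only deletes a branch that is fully merged

# Z is still an ancestor of master, so nothing has become unreachable.
git merge-base --is-ancestor "$z" master && echo "Z is still reachable from master"

git log --oneline master                   # all four commits: M, Z, G, A
git log --oneline --first-parent master    # first parents only: M, G, A
```

The --first-parent walk is one place where the ordering of the parents shows through: it follows G and ignores the chain that came in through Z.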
Remotes and fetch
I have mentioned Git’s ability to synchronize between your local copy of a repository and an online copy (typically at some place like GitHub or Bitbucket). This mechanism, too, is fraught with opportunities for misconception. Let’s try to straighten out some of them.
First of all, how does your local repository know where the online repository is? It isn’t magic! In the first instance, Git knows because you tell it the online repository’s URL. It is perfectly legal to give commands like fetch
by explicitly specifying the URL where the online repository is to be found, like this:
$ git fetch git@github.com:mattneub/myCoolRepo.git
However, you can imagine that having to enter the entire URL of a remote repo, every time you want to synchronize with it, could get really old, really fast. So Git lets you give a URL a name. That name is called a remote. A remote is basically just a name — a name for a URL.
If you obtained your copy of this repo by cloning from an online copy to begin with, your copy comes with a remote already configured with the correct URL, with the default name origin
. (Of course that’s just a default name, and you can change it.)
If your local repository has no remote — perhaps because you created it by saying git init
— you can give it a remote “by hand”, by using the git remote add
command to provide a name and a URL. But then the remote repository at that URL had better exist already! Merely declaring a remote in your local repository doesn’t cause any remote repository to come into existence.
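Here is a sketch of adding a remote by hand. To keep it self-contained, a bare repository on the local disk, which I've called hosted.git, stands in for the online repository; in real life the second argument would be a URL like the GitHub one shown above.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory

git init -q --bare hosted.git              # stand-in for the online repository; it must already exist
git init -q local                          # a local repository with no remote
cd local

git remote add origin ../hosted.git        # "origin" is now a name for that location
git remote get-url origin                  # prints ../hosted.git
git remote -v                              # lists the remote with its fetch and push locations
```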
Once origin
is a name for the remote repository’s URL, instead of giving the URL in the fetch
command, you can say:
$ git fetch origin
Or even more briefly, because origin
is the default:
$ git fetch
Now let’s talk about what happens when you actually do say that. Under normal circumstances, git fetch
means: Contact the Git located at the origin
URL, and ask for copies of all commits reachable from all branches in that repository. That includes the remote repository’s branch names themselves. Thus, what you are saying is: Bring me up to date with the remote repository!
The actual transfer of commits over the network is as efficient as possible. Every commit has a unique identifier, so it’s easy to ascertain whether we already have a copy of a particular commit, meaning that that commit doesn’t have to be transferred. What’s transferred is solely what we don’t already have. (Moreover, the commits that are transferred are compressed to minimize bandwidth; but you don’t need to know about that.)
Remote-tracking branches
The mechanics of “me” in the notion “bring me up to date” are particularly interesting. After a git fetch
, nothing in the working tree changes, even though you may have just fetched new commits from the remote version of the same branch you are on. And if you switch branches after a git fetch
, you’ll find that nothing about any of those branches has changed either. So where did all the fetched commits go?
The answer involves special branches called remote-tracking branches. Remote-tracking branches are special in that they can’t be switched to or worked on directly. The job of a remote-tracking branch is to do the very thing we just said: to capture the result of synchronizing with the remote repository. Basically, a remote-tracking branch is a local copy of the parent chain of commits reachable from the corresponding branch in the remote repository.
Thanks to the remote-tracking branches in your repository, you can say git fetch
without fear, because your branches will be completely untouched. Plus, your local Git knows, at all times, quite a lot about the state of the remote copy of this repository, without having to talk to that remote copy over the network.
For example, you can ask how your master
branch compares with the remote master
branch. Your master
branch might point to a newer commit than the remote master
branch, because you’ve done some add-and-commit work on your master
branch. Or the remote master
branch might point to a newer commit than your master
branch, because someone created commits on the remote machine, or synchronized up to the remote machine. The point is that this question is very easy and efficient to answer, because it doesn’t involve doing any networking!
Indeed, git status
automatically answers that very question:
$ git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
What’s that origin/master
? It’s the remote-tracking branch that acts as a copy of the origin
remote’s master
branch. Its parent chain is a copy of the remote branch’s parent chain. So your local Git knows instantly what commits are reachable from the remote branch!
Of course, this remote-tracking branch might not itself have the most current information about what’s happened at the remote repository. And it won’t have the most current information until you say git fetch again! That’s the point. Git will never talk to the remote repository automatically; it talks to the remote repository only when you explicitly ask it to do so.
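The following sketch plays out the whole story on one machine. It assumes a bare repository (hosted.git) standing in for the online copy, and two clones of it that I've named mine and theirs; all of those names are invented for the demonstration.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
(cd seed && echo one > a.txt && git add a.txt && git commit -q -m "init")
git clone -q --bare seed hosted.git        # stand-in for the online repository
git clone -q hosted.git mine
git clone -q hosted.git theirs

# A collaborator commits and pushes while we are not looking.
(cd theirs && echo two >> a.txt && git commit -q -a -m "edited by someone else" \
    && git push -q origin master)

cd mine
local_before=$(git rev-parse master)
tracking_before=$(git rev-parse origin/master)

git fetch -q origin                        # the only step that talks to the remote

git rev-parse master                       # same as before: the local branch is untouched
git rev-parse origin/master                # changed: the remote-tracking branch caught up
git status -sb                             # reports master as behind origin/master, with no networking
```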
Updating local branches
So now we know that git fetch
does not cause any of your real branches to be brought up to date. It just brings your remote-tracking branches up to date. But what if you do want one of your real branches to be brought up to date? How can you get the up-to-dateness from a remote-tracking branch into a real branch?
Before I answer that question, I want to fix my terminology. The “real” branches I’m talking about are actually called local branches, as opposed to the remote-tracking branches. So I’m going to call them that. Keep in mind, though, that remote-tracking branches are actually local too — they and their reachable commits are on your computer, not on the remote.
So now let’s talk about updating a local branch from a remote-tracking branch. There are two possible cases to consider.
Case 1: You have a remote-tracking branch but no corresponding local branch, and you want a local branch so you can work on it. In this case, you can create a local branch based on the remote-tracking branch.
To do so, you basically ask to switch to the remote-tracking branch. A nice notation is:
$ git switch --track origin/somebranch
You cannot really switch to a remote-tracking branch, so Git interprets this as creating a local branch. It’s trivially easy for Git to do that; it just makes a local branch name and points it initially at the same commit as the remote-tracking branch. (You are not forced to name your new branch with the same name as the remote-tracking branch, but it is conventional to do so.)
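Here is a sketch of Case 1, again using a local bare repository (hosted.git, an invented name) as the stand-in remote. It uses the --track form of git switch, which creates the local branch, names it after the remote-tracking branch, and associates the two.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
(cd seed && echo one > a.txt && git add a.txt && git commit -q -m "init")
git -C seed branch somebranch
git clone -q --bare seed hosted.git        # stand-in remote with master and somebranch
git clone -q hosted.git mine
cd mine

git branch --list somebranch               # prints nothing: there is no local somebranch yet
git switch -q --track origin/somebranch    # create a local branch from the remote-tracking branch

git symbolic-ref --short HEAD                          # somebranch
git rev-parse --abbrev-ref 'somebranch@{upstream}'     # origin/somebranch
```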
Case 2: You already have this branch as a local branch, as well as the remote-tracking branch, and you want to bring your existing local branch up to date to match the remote-tracking branch. In this case, a very simple approach is just to merge
the remote-tracking branch into the local branch!
(It’s so common to fetch and then merge the remote-tracking branch into the local branch that Git provides a shortcut command: git pull. This, by default, means fetch and merge in one move. However, git pull can be quite tricky to use, because it has hidden configuration options, can have unforeseen consequences, and gives no opportunity for reflection and planning. Adept Git users prefer to fetch and then decide whether to merge.)
You’re probably wondering now how to tell whether you’ve got remote-tracking branches and what their names are. When you say simply git branch
, which asks for a list of branches, remote-tracking branches are not listed! To see the remote-tracking branches, a good way is to say:
$ git branch --all -vv
In the output, remote-tracking branches are listed like this:
remotes/origin/master 8d10113 some commit message
remotes/origin/otherbranch 3b097b7 some other commit message
That tells you the name of the remote-tracking branch, preceded by the prefix remotes/
, along with some information about the commit that that remote-tracking branch name points to. Local branches are listed like this:
* master 8d10113 [origin/master] some commit message
otherbranch 3b097b7 [origin/otherbranch] some other commit message
Note that each local branch is listed in company with the remote-tracking branch with which it is associated — if it has a remote-tracking branch associated with it.
(Confusingly, the association between a local branch and a remote-tracking branch is also called tracking. Using that terminology, the local master
is tracking the remote-tracking origin/master
, and the local otherbranch
is tracking the remote-tracking origin/otherbranch
. Sigh.)
Push and the “upstream” of a branch
The opposite of fetch
is push
. When you push
, you are asking Git to send to the remote repository some commits that you’ve got but the remote repository doesn’t, bringing the remote into sync with you.
Most commonly, you’ll push a single branch where you’ve done some add-and-commit cycles:
$ git push origin master
By default, Git assumes that the branch name up at the remote repository is the same as the branch name locally. So git push origin master
means, by default, to synchronize the local master
branch up to the origin
remote’s master
branch. (You can push a local branch to a remote branch that has a different name, but I’m not going to explain how.)
If a local branch is already associated with a remote-tracking branch, then pushing to the corresponding remote branch also updates the remote-tracking branch. That makes sense, because pushing involves networking, so clearly Git can take advantage of this opportunity to make sure the remote-tracking branch reflects the remote repository correctly after the push.
As a notational shortcut, if a local branch is already associated with a remote-tracking branch, and you are on that local branch, you can just say:
$ git push
The association means that Git can make some obvious assumptions about what branch of what remote you want to push to, and if those assumptions are correct, saying git push
will do the right thing.
But what if a local branch is not associated with a remote-tracking branch? To illustrate, I’ll make an entirely new branch, thirdbranch
; I’ll switch to it and do an add-and-commit of some edited material. Then I’ll say:
$ git push
Whoops! Git doesn’t know what to do, and replies:
fatal: The current branch thirdbranch has no upstream branch.
That “upstream” is another of the many unfortunately overloaded Git terms. Basically, Git just means here that thirdbranch
is not associated with any remote-tracking branch, so more information is needed as to what remote repository, and what branch of that repository, to push to. We can silence Git’s worries by being more explicit:
$ git push origin thirdbranch
That succeeds, and it also creates a remote-tracking branch, as we can discover by asking:
$ git branch --all -vv
master 8d10113 [origin/master] some commit message
otherbranch 3b097b7 [origin/otherbranch] some other commit message
* thirdbranch befe828 still another commit message
remotes/origin/master 8d10113 some commit message
remotes/origin/otherbranch 3b097b7 some other commit message
remotes/origin/thirdbranch befe828 still another commit message
But as you can see, even though we created a remote-tracking branch remotes/origin/thirdbranch
(listed in the last line of the output), our local thirdbranch
(listed in the third line of the output) is still not actually associated with it! So we still cannot subsequently say plain and simple git push
when we are working on thirdbranch
.
If we want to be able to do that, we need to set origin/thirdbranch
as the “upstream” of thirdbranch
. We can do so now, by saying (while still on thirdbranch
):
$ git branch --set-upstream-to origin/thirdbranch
Now the association has been formed (as you can see in the third line of the output):
$ git branch --all -vv
master 8d10113 [origin/master] some commit message
otherbranch 3b097b7 [origin/otherbranch] some other commit message
* thirdbranch befe828 [origin/thirdbranch] still another commit message
remotes/origin/master 8d10113 some commit message
remotes/origin/otherbranch 3b097b7 some other commit message
remotes/origin/thirdbranch befe828 still another commit message
More commonly, you’ll probably associate the remote-tracking branch with the local branch in the same command that creates the remote-tracking branch — that is, when you first push the local branch. That syntax is:
$ git push -u origin thirdbranch
The -u is short for --set-upstream.
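Here is a sketch of the difference between the two forms of push, with a local bare repository (hosted.git, an invented name) playing the remote. After the plain push the local branch has no upstream; after the push with -u it does.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
(cd seed && echo one > a.txt && git add a.txt && git commit -q -m "init")
git clone -q --bare seed hosted.git        # stand-in for the online repository
git clone -q hosted.git mine
cd mine

git switch -q -c thirdbranch
echo e > e.txt
git add e.txt
git commit -q -m "e"

git push -q origin thirdbranch             # creates origin/thirdbranch, but forms no association
git rev-parse --abbrev-ref 'thirdbranch@{upstream}' 2>/dev/null || echo "no upstream yet"

git push -q -u origin thirdbranch          # same push, and also record the upstream
git rev-parse --abbrev-ref 'thirdbranch@{upstream}'    # origin/thirdbranch
```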
Push is picky
It’s quite easy, especially (though not exclusively) when you are collaborating with others, to find yourself in a situation where you will ask to push, and Git will communicate with the remote repository and will come back to you and slap your hand:
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
What has happened here? Basically, on the remote repository, the branch name that we are trying to push to points to a commit that we don’t have locally.
To demonstrate, I have set up an artificial situation in order to make Git complain. My local thirdbranch
looks like this:
f89effe eee
db5c32e ee
befe828 e
683c844 init
But the remote repository’s thirdbranch
looks like this:
e4ce90b edited by someone else
db5c32e ee
befe828 e
683c844 init
As my commit message for e4ce90b
is meant to imply, someone else has done some editing behind our backs and has pushed or created a commit on the remote repository. The commit e4ce90b
, which is the remote repository’s thirdbranch
, is a commit that I don’t have locally.
How is such a situation to be resolved? Well, if you’ve been paying attention, you know one answer: git merge
! But here’s the thing: Git is not going to perform an automatic git merge
at the remote repository. The rule is that if any merging is to be done, you have to do it. In this situation, therefore, it is not possible to push. We need to do the merge locally. Then, and only then, will we be allowed to push.
So we need to fetch
and merge
before we can push
! Here we go (assuming I’m already on thirdbranch
):
$ git fetch origin thirdbranch
$ git merge origin/thirdbranch
The merged topology we just created looks like this:
$ git log --graph --oneline
* 2f2c5ad (HEAD -> thirdbranch) Merge remote-tracking branch 'origin/thirdbranch' into thirdbranch
|\
| * e4ce90b (origin/thirdbranch) edited by someone else
* | f89effe eee
|/
* db5c32e ee
* befe828 e
* 683c844 init
As you can see, we now have commit e4ce90b
, immediately followed by a new commit that the remote repository doesn’t have (a merge commit). And now we can say:
$ git push origin thirdbranch
From what I’ve said, you may be thinking that collaborating with someone else on the same branch is something of a race condition. And that’s exactly true. Just to give an example: suppose, while we were doing our merge locally, someone pushed another commit onto thirdbranch
at the remote repository. Then we wouldn’t have that commit, and we still wouldn’t be allowed to push! We’d have to fetch and merge again and try to push again.
But that’s just the price of collaboration. In real life, with decent communication between collaborators, it should not pose too much of a problem.
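The whole cycle can be rehearsed safely on one machine. This sketch assumes a bare repository (hosted.git) as the remote and two clones, mine and theirs, all invented names; for brevity it works on master rather than thirdbranch.

```shell
set -eu
cd "$(mktemp -d)"    # scratch directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
(cd seed && echo one > a.txt && git add a.txt && git commit -q -m "init")
git clone -q --bare seed hosted.git        # stand-in for the online repository
git clone -q hosted.git mine
git clone -q hosted.git theirs

# Someone else pushes first.
(cd theirs && echo theirs > t.txt && git add t.txt \
    && git commit -q -m "edited by someone else" && git push -q origin master)

cd mine
echo mine > m.txt
git add m.txt
git commit -q -m "eee"

# The remote's master names a commit we lack, so this push is rejected.
if git push -q origin master 2>/dev/null; then
    echo "pushed"
else
    echo "rejected: we must fetch and merge first"
fi

git fetch -q origin                        # bring origin/master up to date
git merge -q --no-edit origin/master       # do the merge locally
git push -q origin master                  # now the push is accepted

git log --graph --oneline                  # shows the merge commit on top of both lines of work
```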
Where to go from here
There is obviously far more to Git than I’ve outlined here. My purpose in this article, however, has not been to teach Git or even to introduce it, but to provide a mental orientation that makes Git comprehensible and usable.
I’ll end by listing some further topics you might be curious about at this point; this article’s general conception of Git should make them easier to understand:
- Reset. A branch, as I have been at pains to emphasize, is just a name pointing at a commit. As you work “on a branch”, the branch name moves automatically to point at the latest commit. But you also might have reason to want to grab that name and repoint it at some other specific commit yourself, manually. You can! That is what git reset is. By clever use of git reset, you can rewrite your chain of commits in various interesting ways; for example, you can reduce a chain of multiple commits to a single commit, to make the history read more cleanly.
- Diff. I’ve emphasized that Git does not store changes; it stores commits, which are snapshots. But you may be wanting to retort, “What about git diff? It shows changes!” Yes, of course. Clearly there are many situations where it is valuable to know what would have to be done (or what evidently was done) to the contents of one commit in order to end up with the contents of a different commit. git diff and other related commands do show that. But they are not showing something Git knows, but something it deduces. Differences between commits are a derived, secondary concept.
- Merge logic. Earlier, I described what a merge is, but I avoided any details of how Git creates a new merge commit based on the commits you have told it to merge. Git’s reasoning here is what I call merge logic. Merge logic is fundamental to many of the more advanced and interesting abilities of Git: not just merge, but also cherry pick, rebase, and revert. This is a big topic and I’m saving it for a subsequent article.
- Cherry pick and rebase. These are ways of making a copy of a commit that has a different parent from the original. Commits are uniquely identified and effectively immutable, so it follows that you can’t change the parentage of a commit itself! Cherry pick and rebase both manufacture new commits. Of course, there is plenty to know about the details of how they do that, and how you specify what you want Git to do.
- Pull requests. You might be using a remote repository hosting service, such as GitHub, that offers (through the browser) an ability to make and resolve pull requests (also called merge requests). This enables a topology where you do exactly what I said earlier you cannot do: you merge commits, possibly making a new merge commit, at the remote repository. This should be impossible, and in fact it is impossible! It looks possible only because of some clever trickery on the part of the hosting service. Pull requests are not a Git feature; they are a feature of the hosting service. How they work and how to use them is a completely separate can of worms.