Some Practical Thoughts on Git Merges vs. Rebase

Git rebases are not as complicated as you think. At some point in their careers, most engineers have been told that rebasing risks rewriting history, but that just isn’t strictly true. The Engineering team at ShopKeep has had this conversation a few times, so we thought we’d share our views on the issue with a wider audience.

We are going to try to answer this question: should we merge or rebase our changes?

We recommend you rebase, but before we get into the why, we should go over some of the basics of Git.

Git, at its core, is just an object database with tools to manage those objects. Git has two categories of commands. The first, dubbed ‘porcelain’, are the commands the Git core team expects the end user to use. The second, named ‘plumbing’, are the ‘background’ commands that the porcelain commands use to manipulate the actual objects. We are going to explore some of the plumbing commands so you can get familiar with what is happening under the covers.

Repositories, Blobs, Trees and Commits

The repository (or repo, as it’s commonly shortened) is the complete history of the project and all its objects. Git stores three core entities: blobs, trees, and commits. Objects are stored in the database and addressed by a SHA-1 hash computed over a short header plus the object’s content.

At its most basic level, Git tracks blobs of content. You can think of blobs as files, but always remember that blobs are immutable. Other systems might store changes to the repository as diffs from one version to the next; Git stores the entirety of the content for every changeset. [1]
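To make the addressing scheme concrete, here is a minimal Python sketch of how a blob’s ID is derived: SHA-1 over a header (the word `blob`, the byte length, and a NUL byte) followed by the content. The function name `git_blob_id` is ours, not part of Git, and the sketch assumes a repository using the default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git would assign to a blob holding `content`."""
    # The header is "<type> <size in bytes>" followed by a single NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always produce the same ID; changing a single byte
# produces a different object instead of modifying the existing one.
original = git_blob_id(b"hello world\n")
edited = git_blob_id(b"hello world!\n")
print(original)
print(original == edited)
```

For the same bytes, the result should match what `git hash-object` reports, which is one way to see why a blob can never change in place: a change in content is, by definition, a different key.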

Trees are Git’s directories. They are objects that point to blobs or to other trees. Because a tree exists only to reference blobs or other trees, you won’t find any empty directories in a Git repo.
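A tree can be sketched the same way as any other object: a list of entries (mode, name, and the ID of the blob or tree being referenced) hashed behind a `tree` header. The helper `git_tree_id` below is a hypothetical name, and sorting entries by plain name is a simplification; Git’s real ordering handles directory names slightly differently.

```python
import hashlib
from typing import List, Tuple


def git_tree_id(entries: List[Tuple[str, str, str]]) -> str:
    """Hash a tree built from (mode, name, object_id) entries."""
    body = b""
    # Simplified ordering: sort by entry name only.
    for mode, name, object_id in sorted(entries, key=lambda entry: entry[1]):
        # Each entry is "<mode> <name>", a NUL byte, then the raw 20-byte ID.
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(object_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Two file entries, each pointing at a blob by its ID.
entries = [
    ("100644", ".gitignore", "2f360c4798c2fbd15bf902f33a3cc049175c1ffa"),
    ("100644", "README.md", "21e3842dc466fabebf6d92e1ac4a773faa8965f4"),
]
tree_id = git_tree_id(entries)
print(tree_id)
```

Because the tree’s ID depends on the IDs of everything it references, changing one file changes its blob ID, which changes the tree ID, and so on up to the root.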

Git Tree Image - http://git-scm.com/figures/18333fig0901-tn.png

Before we talk about commits, let’s discuss the index. Stored on disk in your repository’s .git directory, the index tracks the state of the repo. It knows which blobs are referenced by the snapshot you are working from, and it knows what the difference will be when you actually make a new one.

In this way, the index essentially serves as a staging area for changes you are preparing to save. A commit takes those staged changes and makes a new object out of them. The staged tree is written to the database, and the commit saves that tree’s hash along with a reference to its parent [2].
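As a rough illustration, a commit can be modelled as a small piece of text naming a tree and a parent, hashed like any other object. The function `commit_id`, the fixed author identity, and the zero timestamp below are all assumptions made so the sketch is reproducible; real commits record the actual author, committer, and time.

```python
import hashlib
from typing import Optional

# ID of a tree with no entries, used here only as a stand-in snapshot.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(tree: str, parent: Optional[str], message: str) -> str:
    """Hash a simplified commit: one tree, an optional parent, and a message."""
    # Hypothetical fixed identity and timestamp, for reproducibility only.
    who = "Example Author <author@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


root = commit_id(EMPTY_TREE, None, "Initial commit")
child = commit_id(EMPTY_TREE, root, "Second commit")
print(root)
print(child)
```

The point to notice is that the parent’s ID is part of the text being hashed, so a commit’s own ID depends on which commit it sits on top of.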

So what does all this mean? How can we use these objects?

Let’s try out some of the plumbing commands and see if we can get some info. `git ls-tree HEAD` lists a tree’s entries along with their permissions and SHA-1 addresses. Go ahead and use it on a repo you have; you might see something like this:

```
100644 blob 2f360c4798c2fbd15bf902f33a3cc049175c1ffa .gitignore
100644 blob 21e3842dc466fabebf6d92e1ac4a773faa8965f4 README.md
```

We can now use `git cat-file blob 21e3842dc466fabebf6d92e1ac4a773faa8965f4` to see the contents of our readme. And that’s essentially all a commit is: a tree of objects that, at its core, just tracks the content as it was at the point you committed it.

Branching and its Benefits

We need one more piece before we can discuss merges and rebases, and that’s branching. A branch is a reference to a particular commit. You might be thinking that this is not that helpful, but remember that each commit stores a reference to its parent, and with that we can walk all the way back to the first ever commit in the repo. In this way a branch represents the tip of a history chain, all the way back to the beginning. [3]
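The idea that a branch is only a pointer can be sketched with two dictionaries. The commit names `c1` to `c3` and the `history` helper are made up for illustration; real commits are identified by their hashes.

```python
from typing import Dict, List, Optional

# Hypothetical history: each commit maps to its parent (None for the first commit).
parents: Dict[str, Optional[str]] = {"c1": None, "c2": "c1", "c3": "c2"}

# A branch is just a name that points at one commit.
branches: Dict[str, str] = {"main": "c3"}


def history(branch: str) -> List[str]:
    """Walk from the branch tip back to the first commit via parent references."""
    chain: List[str] = []
    commit: Optional[str] = branches[branch]
    while commit is not None:
        chain.append(commit)
        commit = parents[commit]
    return chain


print(history("main"))
```

Creating a new branch in this model is a single dictionary entry, which is roughly why branching in Git is so cheap.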

Git Figure 2 - http://git-scm.com/figures/18333fig0903-tn.png

Now that we’ve talked a little about what makes a commit and what a branch is, we can discuss strategies for getting work done on a branch back into the main history. When you branch and add a commit to that branch, Git moves your branch reference to the new commit. Now you have a reference to a commit that is not on the main branch, but whose parent still is. Visually, your commit history would look forked, and because work on the main branch might have continued, you can’t easily reconcile the two. [4]

The first way to reconcile these branches is with a merge.

Git Figure 3 - http://git-scm.com/figures/18333fig0328-tn.png

In the case of a merge, a new commit is written. In this commit it is up to you to resolve any conflicts, and a new tree is generated based on the new state of the repo. This commit stores a reference to both parents, so the fact that you split and rejoined is now in history forever. If you ever had to go back in history, you’d need to walk both sides of the split to get the full picture.
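Here is a small sketch of what “walking both sides” means. The graph below is hypothetical: `a` is the fork point, `b` is later work on the main branch, `f` is the feature commit, and `m` is the merge commit with two parents.

```python
from typing import Dict, List, Tuple

# Each commit maps to a tuple of parents; a merge commit has two.
parents: Dict[str, Tuple[str, ...]] = {
    "a": (),          # first commit
    "b": ("a",),      # work that continued on the main branch
    "f": ("a",),      # work done on the feature branch
    "m": ("b", "f"),  # merge commit joining both lines of history
}


def ancestors(tip: str) -> List[str]:
    """Collect every commit reachable from `tip`, following all parents."""
    seen: List[str] = []
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # the shared fork point is reached from both sides
        seen.append(commit)
        stack.extend(parents[commit])
    return seen


print(sorted(ancestors("m")))
```

With a linear history the walk is a simple loop over one parent per commit; with merges it becomes a graph traversal.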

To Rebase or Not to Rebase

Enter the rebase. It serves to take a commit and change its parent reference to a new commit. If you think back to how keys are generated, you’ll recall that a key is the hash of some header data combined with the content. The parent commit reference is part of that content, so if you change it, you change the commit’s key.

This is precisely what someone means when they say that you are changing history. The commits your branch used to point to all get different SHA-1 references. Because of that, Git actually creates new entries in its database and moves your branch pointer to reference them.

Git Figure 4 - http://git-scm.com/figures/18333fig0329-tn.png

As you can see, in this case you haven’t really done anything: the change the commit introduces is the same (ignoring merge conflicts for simplicity), and all you have done is point a single commit to a different parent. However, if you have a branch with multiple commits, each one needs to be rewritten, and that’s what Git does for you. It applies each commit one at a time, keeping the change it introduces and moving the parent reference. If at any point a particular commit has a conflict, Git will ask you to resolve the conflict, add the files, and continue the rebase.
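The replay can be sketched with a toy object store. This is a deliberately simplified model, not how Git implements rebase: commits here carry only a parent and a message (no tree, so no conflicts), and `store`, `write_commit`, and `rebase` are hypothetical names.

```python
import hashlib
from typing import Dict, List, NamedTuple, Optional


class Commit(NamedTuple):
    parent: Optional[str]
    message: str


store: Dict[str, Commit] = {}  # stands in for Git's object database


def write_commit(parent: Optional[str], message: str) -> str:
    """Store a commit and return its ID, which depends on the parent reference."""
    commit_id = hashlib.sha1(f"{parent}\n{message}".encode("utf-8")).hexdigest()
    store[commit_id] = Commit(parent, message)
    return commit_id


def rebase(commits: List[str], onto: str) -> str:
    """Replay `commits` (oldest first) on top of `onto` and return the new tip."""
    tip = onto
    for old in commits:
        # Same message, new parent: therefore a new ID and a new object.
        tip = write_commit(tip, store[old].message)
    return tip


base = write_commit(None, "base")
main = write_commit(base, "work on main")
f1 = write_commit(base, "feature 1")
f2 = write_commit(f1, "feature 2")

new_tip = rebase([f1, f2], onto=main)
print(new_tip != f2)   # the rewritten tip is a different object
print(f2 in store)     # the original commit is still in the store
```

Nothing is edited in place. The rebase writes new commits whose parents lead back to `main`, and the only thing that would actually move is the branch pointer.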

Each commit is applied, one at a time, on top of the history of whatever reference you told Git to rebase onto. This is where a lot of power can be wielded. Because you are stepping through history and applying each commit one at a time, you can change anything about each commit, as though you were doing the work at that parent commit.

You can change an author, message, or even the contents of a commit. Git will let you replay them in any order, squash them into a smaller number of commits, or even drop commits entirely. So you can see rebasing does in fact rewrite history.

At the most basic level, however, this isn’t an issue. A rebase simply updates the parent reference for each commit, one at a time. Once you understand the immutability of Git’s objects, it’s very easy to see the function and power of rebase over merge. With that in mind, any bad rebase can be undone by simply pointing your branch back at the old commit. The old commit hasn’t disappeared and can easily be restored.

There is one caveat worth mentioning. Because of the distributed nature of Git, it is possible that you have shared your branch with someone else. If you have, and you then change the branch in this manner, their local copy of the branch will be out of sync with your rewritten remote one. This leaves their version of the branch in a bad state, and as a result the general wisdom is that you ought not rebase any published branch. [5]

Because of the power and complexity of what a rebase is doing, we recommend everyone learn what it does and how to use it.

---

[1] This is not strictly true, because it would be super inefficient. Git often packs the content up into aptly named compressed pack files. (http://git-scm.com/book/en/Git-Internals-Packfiles)

[3] Git calls this reference to the top of a history chain, or ref, the head of that branch.

[4] If no work was done on the main branch, reconciling is easy: the main branch is just moved to point to your latest commit.

[5] We can consider any branch published if you have shared it with someone else (for example, by pushing it to GitHub).