Source Control HOWTO: Repositories


Cars and Clocks
In previous chapters I have mentioned the concept of a repository, but I haven't said much more about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood." I am doing this because an SCM tool is more like a car than a clock.

  • An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.
  • An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.

Rest assured that this book is still a "HOWTO." My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.

Repository = File System * Time

A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team.

But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.

How Do We Store All Those Old Versions of Everything?
As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?
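To make the "not terribly clever" approach concrete, here is a minimal sketch in Python. It is only an illustration of the idea of keeping a full copy of the tree for every change; the class name, method names, and file paths are hypothetical and do not describe how Vault or any other SCM tool is actually implemented.

```python
class NaiveRepository:
    """Hypothetical sketch: store a complete copy of the tree for every change."""

    def __init__(self):
        # snapshots[v] is the whole tree at version v, as {path: file contents}.
        # Version 0 is the empty tree.
        self.snapshots = [{}]

    def commit(self, changes):
        """Apply changes ({path: bytes, or None to delete}) and keep a full new copy."""
        tree = dict(self.snapshots[-1])  # start from a copy of the previous tree
        for path, data in changes.items():
            if data is None:
                tree.pop(path, None)     # delete the file if it exists
            else:
                tree[path] = data        # add or modify the file
        self.snapshots.append(tree)
        return len(self.snapshots) - 1   # the new version number

    def read(self, path, version):
        """Look up a file by both coordinates: where it is and when."""
        return self.snapshots[version][path]

    def stored_bytes(self):
        """Bytes needed if every snapshot were written out in full.

        (In memory, Python shares unchanged bytes objects between snapshots;
        this method counts them once per snapshot, as a full copy on disk would.)
        """
        return sum(len(data) for tree in self.snapshots for data in tree.values())


repo = NaiveRepository()
v1 = repo.commit({"src/main.c": b"int main(){}", "README": b"hello"})
v2 = repo.commit({"README": b"hello!"})

old_readme = repo.read("README", v1)   # the file as it was at version 1
new_readme = repo.read("README", v2)   # the file as it is at version 2
total = repo.stored_bytes()
```

Notice that the second commit changed only one small file, yet the unchanged src/main.c is counted again in full. That repetition is exactly the cost this approach asks us to pay.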

We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding." On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.

This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering.

However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large." Many of our Vault customers have trees which are a lot bigger.

As an example, consider the source tree for This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we would need around 12 TB of disk space. That's 12 terabytes.
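The arithmetic behind both estimates is simple enough to check in a few lines of Python. This is just a back-of-the-envelope sketch: the function name is my own, and I am assuming binary units (1 GB = 1024 MB, 1 TB = 1024 GB), which is the convention that reproduces the 183 GB figure quoted earlier.

```python
def full_copy_storage_mb(tree_mb, changes):
    """Disk space in MB if a full copy of the tree is kept for every change."""
    return tree_mb * changes


# The Vault repository: about 40 MB of source, modified 4,686 times.
vault_mb = full_copy_storage_mb(40, 4686)
vault_gb = vault_mb / 1024          # assumption: 1 GB = 1024 MB

# The larger tree: about 634 MB, with an estimated 20,000 checkins.
large_mb = full_copy_storage_mb(634, 20000)
large_tb = large_mb / 1024 ** 2     # assumption: 1 TB = 1024 * 1024 MB

print(f"Vault tree:  {vault_gb:.0f} GB")
print(f"Larger tree: {large_tb:.1f} TB")
```

With decimal units (1 GB = 1000 MB) the results come out a little higher, roughly 187 GB and 12.7 TB, but the conclusion does not change.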

At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this


