An Introduction to Version Control

Author
By bmerry
TopCoder Member

Introduction

During my undergraduate courses there were a number of group projects. The students in a group would typically mail files back and forth, lose them, try to arrange who was going to work on which file, work on an old version and accidentally lose changes, and generally waste a lot of time and effort on managing their collaboration. This is because they didn't know about version control, also known as source code management (SCM).

In this tutorial, I'll look at the concepts involved in version control, and illustrate them with two open-source systems: Subversion and git.

There are two primary goals of a version control system:

  1. The entire history of development should be preserved. At any time you can go back and find old versions of code, see what changed between versions and even who changed it. If you've ever ripped out some “unneeded” code, only to find you want it again later, you should immediately realize the benefit of this. A version control system can do this far better than just making hundreds of backup copies.
  2. It makes collaborative development possible and efficient. Changes are never lost just because somebody started working on the wrong version of a file, and it's possible for multiple people to work on the same codebase and have their changes merged together.

Version control is vital to any large project, but even small, short-term projects can benefit. During the TopCoder Open, I used version control for marathon match submissions. This meant that every submission I made, plus other ideas that I tried and discarded, was always there if I needed it, and I could see exactly what changes I made between versions. Version control can also be used for more than just source code; it's just as good for documents, Web pages or anything else, although it works best when there are a lot of small, textual files rather than a big binary blob like a video.

Repositories and working copies

Traditionally, the repository is a central store that holds all the files in a project, including every version ever committed and all the history. To avoid confusion, developers don't work directly on the central repository, but on a private copy called the working copy (or working directory). This might just hold the latest copy of all the files, but in some cases it can be a complete copy of the entire repository with all the history. Developers will make changes in the working copy, hopefully test them, and then commit them to the repository. Once this set of changes (called a revision or a changeset) has been committed, other developers can update their working copies to receive those changes. What makes collaborative development possible is that the downloaded changes are merged with any changes that the developer has made (even within the same source file), instead of just overwriting them.

Getting started with Subversion

Now that the introduction is done, let's get our hands dirty with Subversion. We need somewhere to put a repository. For distributed development this will need to be on a web-accessible server (for example, SourceForge provides Subversion hosting for all its projects), but for now let's just put it somewhere local. I have a directory called ~/svn in which I put my repositories. It's generally a good idea to use separate repositories for unrelated projects, so that they can be moved around, backed up, or even deleted independently. Let's create a new repository for a hello world project:

hactar:~/svn$ svnadmin create hello
hactar:~/svn$ svn mkdir file://$HOME/svn/hello/{trunk,branches,tags} -m "Create conventional directories"

Committed revision 1.

The first line is straightforward. The next line introduces quite a few new things. Firstly, there is the file:// syntax. This specifies a URL to a repository, in this case the one we just created (Subversion also supports HTTP-based and SSH-tunnelled protocols). We have created three subdirectories inside the repository called trunk, branches and tags. This is a convention specific to Subversion, and the reason for it will become clear later. For now, we will just do everything inside the trunk subdirectory. Finally, there is the "-m" switch. This provides an informational message which is attached to the revision, to tell anyone browsing the repository what it was for.

At this stage we don't have a working directory; the mkdir was run directly on the repository, which is why it immediately triggered a commit. Let's check out a working directory from the repository. Since we're going to do everything in trunk, we only need to check out that sub-tree:

hactar:~$ svn checkout file://$HOME/svn/hello/trunk hello
Checked out revision 1.

The second parameter is the directory name to use locally. We now have an empty directory, so let's write and compile hello.cpp. An important convention (across all SCM systems) is that source code is placed under control of the SCM, but object files and executables are not. This is because they can be regenerated at any time and would just waste space in the repository, as well as causing extra noise for anyone trying to follow what really changed over time. So we need to tell Subversion that we want it to manage hello.cpp. It will also keep reminding us about the compiled executable, hello, until we tell it not to:

hactar:~/hello$ svn add hello.cpp
A         hello.cpp
hactar:~/hello$ svn propedit svn:ignore .
Set new value for property 'svn:ignore' on '.'

Different version control systems have different means of storing metadata: information that should be stored in the repository (in this case, a list of wildcards to ignore), but which isn't a file. In Subversion the metadata takes the form of properties. When I issued the svn propedit command, Subversion opened an editor and I put hello in the temporary file it opened for me. The special svn:ignore property is now set on the directory itself. Let's see where that puts us:

hactar:~/hello$ svn status
 M     .
A      hello.cpp

This is a handy way to see what you've changed, relative to what is saved in the repository. Subversion is telling us that we've made a property alteration on . and added a new file hello.cpp. If you had run this before setting the property, you would have seen another line

?      hello

to warn you that you hadn't put hello under version control. That all looks good, so let's commit it to the repository:

hactar:~/hello$ svn commit -m "First version of hello world"
Sending        .
Adding         hello.cpp
Transmitting file data .
Committed revision 2.
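
As an aside, the svn:ignore property from earlier can also be set without going through an editor, using svn propset. The following is a minimal sketch that runs in a throwaway repository; the scratch directory is an assumption of mine rather than part of the session above, and an empty file stands in for the compiled executable:

```shell
# Sketch: set svn:ignore without an editor, in a throwaway repository.
set -e
scratch=$(mktemp -d)                 # hypothetical scratch location
svnadmin create "$scratch/repo"
svn checkout -q "file://$scratch/repo" "$scratch/wc"
cd "$scratch/wc"

touch hello                          # stands in for the compiled executable
svn status                           # reports hello as unversioned ("?")
svn propset svn:ignore hello .       # same effect as typing "hello" into svn propedit
svn status                           # hello is no longer reported
svn propget svn:ignore .             # shows the value that was stored
```

For more than one pattern, propedit is probably still the more comfortable route, since the property holds one wildcard per line.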

The file is now saved in the repository, and anyone else who checks out the repository will be able to see this version. Even if we overwrite it later, we can always retrieve this version. Now, let's edit the file and see what happens:

hactar:~/hello$ svn status
M      hello.cpp
hactar:~/hello$ svn diff
Index: hello.cpp
===================================================================
--- hello.cpp   (revision 2)
+++ hello.cpp   (working copy)
@@ -1,5 +1,7 @@
 #include <iostream>
+using namespace std;
 int main()
 {
-    std::cout << "Hello world\n";
+    cout << "Hello world\n";
+    return 0;
 }
 

Here we've introduced a new sub-command: svn diff. By default this shows what you've changed relative to the base version, which is the version that you checked out of the repository (which may not be the same as the latest, or head version, if somebody has since made changes that you haven't downloaded). Many of the Subversion commands, including diff, can also take extra arguments to refer to specific versions in the repository, so that you can compare any two revisions.

Now, what if somebody else was working on hello.cpp at the same time? Before we commit this change, I'm going to commit another change to this file from outside. In order to see this change here, we have to update our working copy from the repository. This is as simple as running

hactar:~/hello$ svn update
G    hello.cpp
Updated to revision 3.

The G here is Subversion's way of telling us that it merged an external change with one of our own. As long as changes do not overlap, version control systems will automatically merge them together, and they will also assist in merging conflicting changes (but note that only changes that overlap are considered to be conflicting; high-level changes like changing the name of a variable might not conflict, but still lead to compilation failures).
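
To watch a non-overlapping merge in isolation, here is a small sketch using git merge-file, which performs a three-way merge on plain files outside of any repository. The file names and contents are made up for illustration, and this is only meant to show the idea, not how Subversion does it internally:

```shell
# Sketch: merge two independent edits of the same base file.
set -e
scratch=$(mktemp -d)                 # hypothetical scratch location
cd "$scratch"

# The common ancestor that both people started from.
cat > base.cpp <<'EOF'
/* hello */
#include <iostream>
int main()
{
    std::cout << "Hello world\n";
}
EOF

# "mine": add a return statement near the end of the file.
cat > mine.cpp <<'EOF'
/* hello */
#include <iostream>
int main()
{
    std::cout << "Hello world\n";
    return 0;
}
EOF

# "theirs": reword the comment on the first line.
cat > theirs.cpp <<'EOF'
/* This is a hello world program */
#include <iostream>
int main()
{
    std::cout << "Hello world\n";
}
EOF

# -p prints the merged result instead of overwriting mine.cpp.
git merge-file -p mine.cpp base.cpp theirs.cpp > merged.cpp
cat merged.cpp
```

The merged file should contain both the reworded comment and the new return statement. If both sides had edited the same line instead, git merge-file would exit with a non-zero status and leave conflict markers in the output for a human to resolve.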

Working within files is relatively straightforward, but what about rearranging files, creating directories and so on? Subversion was created as a replacement for an older, messier system called CVS (Concurrent Versions System) which did a very poor job of this; Subversion handles it quite well. You just have to tell Subversion to do things, rather than using the shell:

hactar:~/hello$ svn mv hello.cpp helloworld.cpp
A         helloworld.cpp
D         hello.cpp

Subversion handles a move as a copy and a deletion. It also handles copies in a special way: internally a copy takes zero space, because it uses a pointer back to the original, and it also remembers that the file was copied and allows the version history to be traced back through the original file. For example:

hactar:~/hello$ svn commit -m "Rename hello.cpp to helloworld.cpp"
Deleting       hello.cpp
Adding         helloworld.cpp

Committed revision 5.
hactar:~/hello$ svn log helloworld.cpp
------------------------------------------------------------------------
r5 | bruce | 2007-06-12 18:36:45 +0200 (Tue, 12 Jun 2007) | 1 line

Rename hello.cpp to helloworld.cpp
------------------------------------------------------------------------
r4 | bruce | 2007-06-12 18:34:02 +0200 (Tue, 12 Jun 2007) | 1 line

Use standard namespace
------------------------------------------------------------------------
r3 | bruce | 2007-06-12 18:26:21 +0200 (Tue, 12 Jun 2007) | 1 line

Add a comment
------------------------------------------------------------------------
r2 | bruce | 2007-06-12 18:17:27 +0200 (Tue, 12 Jun 2007) | 1 line

First version of hello world
------------------------------------------------------------------------

The log sub-command browses the log messages that I've been attaching to each commit. As you can see here, it remembers the comments I made even when the file was called hello.cpp.

Alternative interfaces

I've been giving all the examples using the command line svn tool, because it's easy to copy-and-paste into this tutorial, and because I'm happiest on the command line and so this is how I usually interact with Subversion. However, there are third-party interfaces for those who think the shell belongs in the stone age. There are several standalone clients (for example, RapidSVN), a Windows shell extension (TortoiseSVN), and extensions for some IDEs, such as Subclipse for integration into Eclipse.

Advanced features

Armed with only the information you've seen so far, plus a Subversion manual (there is a very good online book (FIXME)), you should already be able to improve your productivity in group projects, and give yourself peace of mind in personal projects, knowing that you can always go back and recover previous versions of code.

However, for large-scale projects, there are even more things that a good version control system can do for you. For the basics, there isn't a lot of difference between the systems, but these advanced features are handled quite differently, and which of them you need will often determine which system is right for you.

Hooks
I've already mentioned that you need to specifically ask for changes to be sent to your working copy. That is generally a good thing, because it means that you can control when gobs of in-progress code will get dumped on your tree and force you to handle conflicts. However, you might want to be kept informed of changes that other people make, so that you know as soon as a change that you need has been committed. A post-commit hook is a script that is run on the server immediately after each commit; the most common use is to generate an automatic email to the developer mailing list with the commit message and a summary of the files that were changed. Other uses include kicking off automated test suites, taking backups, and extracting documentation for a website.

Another type of hook is a pre-commit hook. This is run when someone attempts to commit, and is usually some form of verifier. It might provide fine-grained access control, validate checked-in web-pages, or ensure that the commit message contains a bug number.

Subversion has a number of other types of hooks, and examples for all of them. The hooks live inside the directory structure of the repository. Let's add a simple hook that prevents code from being accidentally committed when it still has FIXME notes in it. Normally we would need some more powerful scripting to find all the source files, but to keep it simple we'll just hardcode it:

hactar:~/hello$ cat ~/svn/hello/hooks/pre-commit
#!/bin/sh
if /usr/bin/svnlook cat -t "$2" "$1" "trunk/helloworld.cpp" | grep -q FIXME; then
  echo "FIXME's found in helloworld.cpp" 1>&2
  exit 1
fi
hactar:~/hello$ svn commit -m "Test the pre-commit hook" helloworld.cpp
Sending        helloworld.cpp
Transmitting file data .svn: Commit failed (details follow):
svn: 'pre-commit' hook failed with error output:
FIXME's found in helloworld.cpp
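
A less hardcoded version might ask svnlook which files the transaction touches and check each of them. The sketch below is one way to do that, under some assumptions of mine: only added or updated .cpp files are checked, the function name check_fixme is made up for this example, and it relies on svnlook changed printing the path from column five onwards:

```shell
#!/bin/sh
# Sketch of a pre-commit hook that checks every changed .cpp file for FIXME.
# Assumptions: svnlook lives at /usr/bin/svnlook unless SVNLOOK is set, and
# check_fixme is a name invented for this example.
SVNLOOK=${SVNLOOK:-/usr/bin/svnlook}

check_fixme() {
    repos=$1
    txn=$2
    # "svnlook changed" prints a status code, then the path from column 5.
    # Keep files that were added (A) or updated (U) and end in .cpp.
    offenders=$("$SVNLOOK" changed -t "$txn" "$repos" |
        grep '^[AU]' | cut -c5- | grep '\.cpp$' |
        while IFS= read -r path; do
            if "$SVNLOOK" cat -t "$txn" "$repos" "$path" | grep -q FIXME; then
                echo "$path"
            fi
        done)
    if [ -n "$offenders" ]; then
        echo "FIXME's found in:" 1>&2
        echo "$offenders" 1>&2
        return 1
    fi
    return 0
}

# Subversion runs the hook as: pre-commit REPOS TXN
if [ "$#" -ge 2 ]; then
    check_fixme "$1" "$2" || exit 1
fi
```

As with the hardcoded hook, anything written to standard error is passed back to the person committing, and a non-zero exit status rejects the commit.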

Tagging, branching and merging
Remember that we started by creating subdirectories called trunk, branches and tags? The time for the explanation has come. However, git handles these concepts in a more integrated fashion than Subversion, so at this point I'm going to switch to git for examples.

git handles the basics along similar lines to other version control systems, but there is one fundamental difference in philosophy: there is no distinction between a repository and a working directory. When you check out a working directory from a public repository, you are in fact copying the entire repository. As a result, you have all the advantages of version control, even if you are offline, or have only read access to the original repository. This makes it attractive for highly decentralised projects, where a subgroup can collaborate in a satellite repository and only push changes back to the main repository when they have stabilised. git is best-known as the version control system used to develop the Linux kernel.
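
To make "copying the entire repository" concrete, here is a minimal sketch. The directory names, the file contents and the author identity are placeholders of mine; the point is only that the clone carries the full history with it:

```shell
# Sketch: a clone is a complete repository, history included.
set -e
scratch=$(mktemp -d)                          # hypothetical scratch location
cd "$scratch"

# Stand-in for a public repository with one commit in it.
git init -q upstream
cd upstream
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.com"
echo '/* This is a hello world program */' > helloworld.cpp
git add helloworld.cpp
git commit -q -m "First version of hello world"
cd ..

# Cloning copies every revision, not just the latest files.
git clone -q upstream hello-git
cd hello-git
git log --oneline                             # works without contacting upstream
```

From here on, commits made in hello-git are recorded locally and only reach the original repository when they are explicitly pushed.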

This is all very interesting, but what are tagging, branching, and merging? Tagging is the simplest: it allows one to assign a human-readable name to a specific revision in a repository. You may have noticed that Subversion allocates sequential numbers to revisions, but it's hard to remember that version 0.9.3 of the software corresponds to revision number 1346, and it becomes even worse with git, which identifies revisions by long hex strings. For example, let's suppose that in the past we made a tag v1.0.0 on the initial version. We've now made a lot of changes to the repository, and we're going to release 1.0.1. Sometime in the future, we may wish to see what exactly changed between the releases, for example to isolate a bug that was introduced in v1.0.1:

hactar:~/hello-git$ git tag -a -m "Tag the v1.0.1 public release" v1.0.1
Now we make lots more changes
hactar:~/hello-git$ git diff v1.0.0..v1.0.1
diff --git a/helloworld.cpp b/helloworld.cpp
index 2b20ba6..3b18807 100644
--- a/helloworld.cpp
+++ b/helloworld.cpp
@@ -1,6 +1,7 @@
 /* This is a hello world program */
-/* FIXME: Fix  the whitespace */
+
 #include <iostream>

+
 using namespace std;
 int main()
 {
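
For comparison, this is roughly where the tags directory we created at the start comes in. Subversion has no separate tag command; by convention a tag is just a cheap copy of trunk made with svn copy. A minimal sketch in a throwaway repository (the scratch path is an assumption of mine, and the tag name simply mirrors the git example):

```shell
# Sketch: tagging in Subversion is a copy of trunk into tags.
set -e
scratch=$(mktemp -d)                 # hypothetical scratch location
svnadmin create "$scratch/repo"
repo="file://$scratch/repo"

svn mkdir -q "$repo/trunk" "$repo/branches" "$repo/tags" \
    -m "Create conventional directories"

# Copying between two repository URLs commits immediately.
svn copy -q "$repo/trunk" "$repo/tags/v1.0.1" \
    -m "Tag the v1.0.1 public release"

svn list "$repo/tags"                # lists the new tag directory
```

Nothing stops anyone from committing to a tag afterwards; it is the convention (or a pre-commit hook) that keeps tags frozen. The branches directory is conventionally used in the same way, with a copy of trunk per branch.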

Tagging is fairly straightforward, although git has some extra features (like signing) that I won't elaborate on. Branching is more complex. Let's suppose that you've released version 1.0 of your software, and you're busy rewriting large pieces of it to get them ready for version 2.0. However, you still need to fix bugs in 1.0, because 2.0 isn't going to be ready for several years. At this point, your code development diverges, or branches in two separate directions: a 2.0 branch which is under heavy development, and a 1.0 maintenance branch which only receives bug-fixes. Or perhaps one module in the 2.0 development is going to take a few weeks to write, during which time it will cause problems for other developers working on other pieces of 2.0. In this case, you might work on that module in a side branch until it is ready for inclusion in the main 2.0 branch. At that point you will need to merge those changes into the main branch. While developing the side branch, you may also need to incorporate changes from the main branch, or even bug-fixes being made in the 1.0 branch.

That's a fairly extreme case. Let's try something simpler with our hello world program. We're going to use a side branch to convert it to use cstdio.

hactar:~/hello-git$ git status
# On branch master
nothing to commit (working directory clean)
hactar:~/hello-git$ git branch cstdio
hactar:~/hello-git$ git checkout cstdio
Switched to branch "cstdio"
hactar:~/hello-git$ vi helloworld.cpp 
hactar:~/hello-git$ git commit -a -m "Converted to cstdio"
Created commit 3d6a30bc37c106d598e6e1fd8a049367b1dac231
 1 files changed, 2 insertions(+), 2 deletions(-)

Creating a branch is that simple. Let's go back to the main branch (which is called master) and make a change there, then merge it over to the side branch:

hactar:~/hello-git$ git checkout master
Switched to branch "master"
hactar:~/hello-git$ vi helloworld.cpp 
hactar:~/hello-git$ git commit -a -m "Update the comment"
Created commit 2c7afa153383d275f27ce3ac04375cc78789e3b0
 1 files changed, 1 insertions(+), 1 deletions(-)
hactar:~/hello-git$ git checkout cstdio
Switched to branch "cstdio"
hactar:~/hello-git$ git merge --no-commit master
 100% (5/5) done
Auto-merged helloworld.cpp
Automatic merge went well; stopped before committing as requested
hactar:~/hello-git$ git status
# On branch cstdio
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#       modified:   helloworld.cpp
#
hactar:~/hello-git$ git commit -a -m "Merge the comment change from master"
Created commit 74bd3f2107fec7ab0d12d640e29bc901d75f1888
hactar:~/hello-git$ git merge --no-commit master
Already up-to-date.

The last command indicates that git has remembered which changes on master have already been merged, and it doesn't try to merge them again. This is in contrast to Subversion, which requires the user to keep track of which merges have been applied. On the other hand, git appears to have some trouble with renaming: I initially tried to do this example by renaming the file to helloworld.c in the side branch, but then the merge failed because it tried to apply the change to the .cpp file, which had been deleted.
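
One way to ask git directly what it has remembered is git branch --merged, which lists the branches whose changes are already contained in a given branch. The sketch below replays the session above in a throwaway repository and then asks that question. The write_hello helper, the file contents and the author identity are made up for this example:

```shell
# Sketch: replay the branch-and-merge session, then ask what has been merged.
set -e
scratch=$(mktemp -d)                          # hypothetical scratch location
cd "$scratch"
git init -q hello-git
cd hello-git
git symbolic-ref HEAD refs/heads/master       # call the first branch master, whatever the local default
git config user.name "Example Author"         # placeholder identity
git config user.email "author@example.com"

# Hypothetical helper: write helloworld.cpp from a comment, a header and a statement.
write_hello() {
    printf '%s\n' "$1" '' "#include <$2>" 'int main()' '{' "    $3" '    return 0;' '}' > helloworld.cpp
}

write_hello '/* This is a hello world program */' iostream 'std::cout << "Hello world\n";'
git add helloworld.cpp
git commit -q -m "First version of hello world"

git checkout -q -b cstdio                     # create the side branch and switch to it
write_hello '/* This is a hello world program */' cstdio 'std::printf("Hello world\n");'
git commit -q -a -m "Converted to cstdio"

git checkout -q master
write_hello '/* Prints a greeting */' iostream 'std::cout << "Hello world\n";'
git commit -q -a -m "Update the comment"

git checkout -q cstdio
git merge -q -m "Merge the comment change from master" master
git branch --merged cstdio                    # master appears in this list
git merge master                              # nothing left to merge
```

Note that the two edits are deliberately kept a line apart: git treats changes to directly adjacent lines as a conflict, even if they don't touch the same line.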

Conclusions

If you've never used a version control system, go out and try one for your next project, or even your next marathon match. Once you're comfortable working with it, it will give you incredible peace of mind to know that you can rip out old code and not worry about losing it forever, and you will have learned a vital skill for the workplace.