by Simon Tatham
When I'm not at work, I'm a free software developer. I maintain a variety of published projects, ranging from fairly major things like PuTTY to tiny little Unix utilities; and I have almost as wide a variety of unpublished projects as well, ranging from half-finished major programs to my personal .bashrc. Until November 2004, all these projects were stored in CVS, along with probably 90% of the other free software in the world.
Then I migrated to Subversion. This took a fair amount of thought and effort to do well, and shortly afterwards I was asked by a colleague if I could write something about my experiences. He was probably expecting something more like a couple of paragraphs, but I thought, hey, why not do the job right? :-)
This article is not a rant. In general, I have found Subversion to be clearly superior to CVS and I certainly don't regret migrating to it. The article is just an attempt to share my experiences: things to watch out for, how to get the most out of Subversion, that sort of thing.
I'm pretty finicky about security, because I maintain a security product and can't really afford to cut corners on the repository where its code is going to be stored.
CVS's security model is just awful. To allow a user to commit to a particular directory, you have to give them write permission on the whole directory, so there's nothing to stop them rewriting history: deleting or editing old revisions, altering log messages, etc. Now just occasionally you do actually want to do things like that (suppose you accidentally checked your password file into a repository, for example – you really would want to remove it from your CVS history, perhaps with a marker explaining what you'd done, rather than just checking in a subsequent change that deleted it); but it certainly isn't clear that allowing all users unrestricted access to do things like that is what you want to do. Added to that, there's the question of scripts in CVSROOT (run under each client's UID, so that they can't impose dependable restrictions because hacked clients can bypass them, and conversely if anyone manages to check in a malicious script then all clients will start running it under their own UIDs and propagating malware); and then there's the issue of needing write permission in a directory to read it (although this can be fixed by correct repository administration, i.e. having a separate locks tree).
(To be fair, version control in some sense inherently requires participants to trust one another, since each one checks in code which the others usually just compile and run. But at least you know when you're doing that; CVSROOT scripts seem more dangerous to me since not every CVS user even needs to know they're there...)
So I was rather hoping Subversion could do better. Initially, it didn't look promising. The HTTP-based access methods didn't suit me: I'm probably unfairly biased in favour of SSH, but I couldn't find any supported authentication mechanisms which were as good as ssh-agent. Sending passwords over HTTPS isn't too bad, but to obtain one-touch authentication you still need to cache your passwords on disk at the client end; ssh-agent is superior because it never saves all the required authentication information to disk. (Well, unless it gets swapped out; but I encrypt my swap partition!)
The alternative was svnserve, and tunnelling it over SSH. But this doesn't provide the fine-grained access control that the HTTP access method does; you just have users who can commit, and users who cannot.
svnserve does actually support fine-grained access control in Subversion 1.3 and above, by using the authz-db configuration file directive to specify an access control file, but in this particular scenario it isn't helpful: if a group of collaborating users all tunnel svnserve over SSH, then it runs under each user's UID, so all users need physical write permission to the repository. And once they've all got that, there's simply no point in imposing fiddly access control, since it would only be voluntary – anyone willing to put the effort into modifying an SVN client could just bypass it. This wasn't really sufficient; I have a variety of projects which I co-develop with different groups of people, and I didn't want to be forced to allow someone effective commit access to (say) PuTTY as a necessary consequence of collaborating with them on (say) mail handling software.
So, this wasn't looking particularly wonderful. HTTP provides fine-grained access control but poor network security; svnserve over SSH provides good network security but awful access control. Is there a better way?
There is, and it's based on userv. For those who don't already know about it, userv provides a convenient and decently secure means for one user on a Unix machine to supply a service to other users. For example, one user can set up a userv configuration which offers to supply a service (say ‘hello-world’) to selected other users; then those other users can invoke the service by typing ‘userv fred hello-world’. userv tells the service the username of the calling user, unforgeably; and the service program is run from a userv daemon, so it isn't tainted by environment variables or any hidden state passed to it from the (potentially malicious) caller. It doesn't do anything which a user couldn't in principle do by writing a setuid program very carefully indeed; but it makes it much easier to make such things secure.
So what I did was to ask my sysadmin to create me a second account on my server machine. That second account provides a userv service which invokes svnserve. So other users can simply run ‘userv simon-svn svnserve’, and expect it to behave just like svnserve itself; but the actual svnserve process is always running under a user ID controlled by me, so it can't be messed around with. The user can't run a modified server process, because then it would run under their own UID which has no direct access to the repository; so this mechanism ensures that the only operations anyone can perform on the repository are those permitted by svnserve, which means nobody can destroy or change history (except me, of course, the repository admin). Finally, this also means that the fine-grained access control provided by svnserve 1.3 and above is now secure against users attempting to bypass it.
Remote access, of course, is still via ssh, only instead of running ‘ssh remote-host svnserve’, you now have to run ‘ssh remote-host userv simon-svn svnserve’. But Subversion makes it easy to configure strange remote access methods (by adding entries in the [tunnels] section in the .subversion/config file), so that wasn't a problem.
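To make the tunnel mechanism concrete: for a URL of the form svn+NAME://host/path, the Subversion client looks up NAME in [tunnels] and runs the configured command with the hostname and ‘svnserve -t’ appended. The following is a minimal sketch of a wrapper script that such an entry could point at. It is hypothetical, not the setup described above: the account and service names are taken from the example, and it assumes the userv service starts svnserve in tunnel mode by itself, so the arguments Subversion appends are not forwarded.

```python
#!/usr/bin/env python3
"""Hypothetical tunnel wrapper for a [tunnels] entry in ~/.subversion/config.

Subversion invokes the tunnel command as: <command> <host> svnserve -t
"""
import os
import sys

# Assumed names, taken from the example in the text.
SERVICE_USER = "simon-svn"
SERVICE_NAME = "svnserve"


def build_command(argv):
    """Turn the arguments Subversion supplies into an ssh + userv command."""
    if not argv:
        raise ValueError("expected a hostname as the first argument")
    host = argv[0]
    # Assumption: the userv service starts svnserve in tunnel mode itself,
    # so the trailing "svnserve -t" from Subversion is not forwarded.
    return ["ssh", host, "userv", SERVICE_USER, SERVICE_NAME]


# Only act when Subversion has actually supplied arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    command = build_command(sys.argv[1:])
    # Replace this process with ssh so Subversion talks to it directly.
    os.execvp(command[0], command)
```

The [tunnels] entry would then name this script as the command for whichever scheme suffix is chosen.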
Another nice thing is that Subversion's access control file isn't version-controlled within the repository itself. This removes any possible risk that an incautiously written access list might allow a user to commit to the list itself and grant themself extra permissions!
(Prior to Subversion 1.3, I had to do commit control using Subversion's pre-commit hook mechanism. The pre-commit hook can examine a prospective commit in detail using svnlook, seeing which files are modified, the name of the committer, and even the data in the altered files if it wants to. So in particular, it can look up the changed files and the committing user against an access list; and if it returns a failure status, then the commit is aborted. However, 1.3's native mechanism is superior because it supports restriction of read access as well as commits.)
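A pre-commit check of this kind can be sketched roughly as follows. This is an illustration only, not the hook actually used: the access list, the user names and the prefix-matching rule are all made up for the example. It relies on ‘svnlook author’ and ‘svnlook changed’ with the -t option to inspect the pending transaction, and on Subversion passing the repository path and transaction name as the hook's two arguments.

```python
import subprocess
import sys

# Hypothetical access list: committer -> path prefixes they may change.
ACCESS = {
    "simon": [""],        # the empty prefix matches every path
    "alice": ["putty/"],
}


def parse_changed(output):
    """Extract paths from `svnlook changed` output.

    Each line holds status flags in the first columns and the path
    starting at the fifth character, e.g. "U   putty/ssh.c".
    """
    return [line[4:] for line in output.splitlines() if len(line) > 4]


def is_allowed(author, paths, access=ACCESS):
    """True only if every changed path is under one of the author's prefixes."""
    prefixes = access.get(author, [])
    return all(any(p.startswith(prefix) for prefix in prefixes) for p in paths)


def svnlook(subcommand, repos, txn):
    """Run svnlook against the in-flight transaction and return its output."""
    return subprocess.run(
        ["svnlook", subcommand, "-t", txn, repos],
        check=True, capture_output=True, text=True,
    ).stdout


def main(repos, txn):
    author = svnlook("author", repos, txn).strip()
    paths = parse_changed(svnlook("changed", repos, txn))
    if is_allowed(author, paths):
        return 0
    # A non-zero exit status makes Subversion abort the commit.
    sys.stderr.write("commit refused: %s may not modify these paths\n" % author)
    return 1


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Keeping the parsing and the policy in separate functions means the policy can be tested without a repository to hand.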
There was one small niggle in the userv mechanism: a bug in userv which meant that if the userv client process was killed by SIGKILL, the server process would not realise it needed to terminate, and would hang around for ever. Since Subversion's policy is to routinely SIGKILL its tunnel process whenever it finishes using it (the source comments suggest that some badly-behaved tunnel utility failed to respond sensibly to anything less!), this was a serious problem. Fortunately, I know the userv author personally and he sent me a fix promptly; I don't know when the fix will make it into a release, or even if it's done so already. An alternative workaround might have been to write a sort of ‘bodyguard’ wrapper which executed userv as a subprocess with a different process group ID, in the hope of taking the SIGKILL itself and having it miss userv, but I haven't actually tried this myself. If you're stuck with a buggy version of userv, it might be worth a try, though.
Having sorted out a repository access strategy, the next question was repository storage. As of Subversion 1.1, there are now two repository data formats: FSFS and BDB. BDB is supported by Subversion 1.0 as well, which (at the time of writing) was easy to get hold of and had Debian packages etc. FSFS is only supported by 1.1, which basically had to be downloaded and compiled specially.
FSFS proved to be the correct choice, for one sole and sufficient reason: a read-only client of an FSFS repository does not require physical write access to the repository files. Any client of a BDB repository, by contrast, requires physical write access.
This is an absolutely killer feature. It means I can be very confident in giving out anonymous SVN access to the entire world: I simply run a public svnserve under a dedicated user ID (not the same one as I'm using for the main repository). So even if a malicious client manages to find a security hole in svnserve and compromise the anonymous-access user ID, that still doesn't give them write access to the actual repository. (Unless they find a kernel security hole, of course.)
There are other things I like about FSFS:
But all of these were added bonuses. The real killer feature of FSFS, for me, is the read-only access.
(FSFS is supposedly a bit slower at the common operation of checking out the latest version. But at the time of writing, I've got a 5000-commit FSFS repository and I'm having no speed problems with it at all, so it's easily worth it for the benefits. Also, when I actually tried importing my entire version history into a BDB repository, the import process ran so slowly that I gave up waiting for it to finish. I'm inclined to feel that any speed disadvantages FSFS might possess are overwhelmingly outweighed by its speed benefits alone!)
Most of my CVS repositories had been exported via a public-access CVS pserver, and quite a few users of my projects had found this a useful facility.
For anonymous access using a Subversion client, the two options were HTTP and anonymous svnserve. The decision was simple: the server machine I was doing all this on still runs Apache 1.3, and the Subversion HTTP modules require 2.0. svnserve it was.
Web-based repository browsing was also a facility provided by my previous CVS setup, via ViewCVS. The latest trunk ViewCVS also supports Subversion as a back end, so that was easy enough to provide as well.
(Since then, someone has made me aware of WebSVN. I haven't yet looked closely into whether that's superior to ViewCVS or not.)
As mentioned in section 3, both of these anonymous access methods are hosted under a separate user ID to the one hosting the main repository, ensuring that a compromise of either one is still some distance away from acquiring commit access.
Migration from CVS to Subversion can be done reasonably easily using cvs2svn. However, I wasn't satisfied with the quality of the result, so I had to do quite a lot of work to produce a high-quality migration.
cvs2svn insists on creating your repository with three top-level directories trunk, branches and tags, each of which contains subdirectories corresponding to each top-level CVS module.
I didn't like this; it might make sense in a repository dedicated to a single project with many parts, where a branch typically wants to be across the entire project, but it wasn't well suited to my repository of many largely unrelated projects each of which branched separately.
(Actually, I also didn't like the alternative suggestion in the Subversion book, in which you swapped round the top two levels so you had the module name at the top level and then trunk, branches and tags subdirectories, because in all normal situations one is used to thinking of a source pathname as putty/ssh.c and it would have felt very odd to insert an administrative layer in the middle of that to get putty/trunk/ssh.c. Instead, I went for the completely ad-hoc approach, in which trunk source directories are direct children of the root, and branch and tag directories are alongside them with names indicating their status, such as putty-0.56. But that's not immediately relevant to the difficulty of migration.)
I wasn't able to hack cvs2svn to produce the layout I wanted; it was all a bit too interlinked to make a change like that easy. Instead, I found it was actually easier to write an auxiliary Python script which postprocessed the Subversion dump file output from cvs2svn, and altered the directory names there.
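The core of such a postprocessing step is a path-mapping function applied to the path-bearing headers of each node in the dump file. The sketch below is an assumption-laden illustration rather than the script described above: the input layout (module directories under trunk, and under a named directory inside branches and tags) and the module-name naming scheme are guesses made for the example. Node-path and Node-copyfrom-path are the real dump-file headers that carry paths; a complete tool would also have to use the Content-length headers to skip over file bodies, so that file data that happens to look like a header is never rewritten.

```python
def map_path(path):
    """Map a cvs2svn-style path to a flat layout; return None to drop the node.

    Assumed input layout (hypothetical, adjust to the real dump):
      trunk/<module>/...            -> <module>/...
      branches/<name>/<module>/...  -> <module>-<name>/...
      tags/<name>/<module>/...      -> <module>-<name>/...
    """
    parts = path.split("/")
    if parts[0] == "trunk":
        # The bare "trunk" directory has no place in the new layout.
        return "/".join(parts[1:]) or None
    if parts[0] in ("branches", "tags"):
        if len(parts) < 3:
            return None  # container directories are dropped as well
        name, module = parts[1], parts[2]
        return "/".join(["%s-%s" % (module, name)] + parts[3:])
    return path


PATH_HEADERS = ("Node-path: ", "Node-copyfrom-path: ")


def rewrite_header(line):
    """Rewrite one dump-file header line, leaving other headers untouched.

    Returns None when the node should be dropped altogether; the caller
    is then responsible for discarding the rest of that node record.
    """
    for header in PATH_HEADERS:
        if line.startswith(header):
            new_path = map_path(line[len(header):])
            return None if new_path is None else header + new_path
    return line
```

Under these assumptions, trunk/putty/ssh.c becomes putty/ssh.c and tags/0.56/putty/ssh.c becomes putty-0.56/ssh.c.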
At one point in the history of my repository, I wanted to make one CVS module a common subdirectory in two others (a library shared between two applications). You can do this by adding entries to the file CVSROOT/modules. Subversion's analogous mechanism is the svn:externals property, set on the directory in which you want to check out a separate subdirectory. So my migration script had to include CVSROOT in the input to cvs2svn; note the relevant checkin to CVSROOT/modules; and invent some svn:externals property changes on the two affected directories on that revision. Then I had to filter out the rest of CVSROOT from the dump file, because all the other changes had been made by other people maintaining projects alongside mine!
To make matters worse, I couldn't think of any easy way to parse CVSROOT/modules and do this job automatically. Instead, I had to hack around it manually by recognising the log message on the particular commit, which meant my migration script became specific to my particular repository so I can't usefully publish it for other people to use.
See section 7.5 for more problems I had with svn:externals.
A large motivation for me to migrate to Subversion was its sensible support for renaming directories and files. In CVS, you have a variety of nasty hacky things you can do to represent a file rename:
cvs add the new file and cvs remove the old one in the same commit. This ensures that any date-based checkout accurately reflects what the source tree looked like on that date, but means it's hard to track history between the two files.
Copy the RCS file (the ,v file) into the new location, and then cvs remove the old file. This has the advantage that cvs log on the new file shows its unbroken history, but means any historical checkout from before the rename will show two copies of the file.
Despite holding off a number of major renames for several years until I could migrate to Subversion, I still had to do some renames while I was a CVS user, and I did them in all three of the above ways in various different places! Another grubby chore for my migration script was to note all these renames (again, I had to point them out manually) and convert them into proper clean Subversion rename representations, so that previously broken halves of files' history became rejoined.
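For reference, Subversion itself records a rename as a delete of the old path plus an add of the new path that carries copy-from information, and that is what such a conversion has to emit in the dump file. The following minimal sketch builds the headers for that pair of nodes. The function names are made up for the example, and the revision passed in is assumed to be one in which the old path still exists.

```python
def rename_nodes(old_path, new_path, copyfrom_rev, kind="file"):
    """Return header dicts for the two dump-file nodes that express a rename."""
    delete_node = {"Node-path": old_path, "Node-action": "delete"}
    add_node = {
        "Node-path": new_path,
        "Node-kind": kind,
        "Node-action": "add",
        # The copy-from headers are what join the two halves of the history.
        "Node-copyfrom-rev": str(copyfrom_rev),
        "Node-copyfrom-path": old_path,
    }
    return [delete_node, add_node]


def format_node(headers):
    """Serialise one content-free node: header lines, then a blank line."""
    return "".join("%s: %s\n" % item for item in headers.items()) + "\n"
```

If the file's contents also changed in the same commit, the add node would additionally need content headers and a body, which this sketch leaves out.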
cvs2svn offers the option to provide a Subversion property called cvs2svn:cvs-rev on each file it converts, which tracks the CVS revision number. This seemed like a useful way to preserve information about my original CVS.
The only trouble is, if you do this, you end up with this cvs2svn:cvs-rev property on every file, and it stays there even after you start changing the file. That didn't seem like what I wanted; I only wanted a file to be tagged with a CVS revision property if the data in that file actually matched what had originally been in that CVS revision!
The idea I hit on was to let cvs2svn write the CVS revision properties, but then after doing the migration, go through and do a mammoth Subversion commit which removed the properties from all active branches. This is what I actually did in the end, but that had its own downside: if you use ViewCVS to get an overview of any particular directory, all files that haven't been changed since before the migration now show their last modification as this administrative commit, and you don't get to see the last time they really changed. Also, it shows up as an annoyance in svn log -v.
I think if I had this one to do over again, I'd just lose the CVS revision properties.
After the migration there were still some cleanup tasks to do:
Removing the cvs2svn:cvs-rev properties on all active branches and trunks.
Removing the .cvsignore files: cvs2svn had helpfully copied their contents into svn:ignore properties throughout, but I didn't want the files themselves cluttering the place up after the migration.
Perhaps you wouldn't care about most of this sort of thing if you weren't a perfectionist. I am, and I did :-)
Prior to migration, on the PuTTY web site, I provided a .tar.gz archive of the PuTTY CVS repository, so that anyone who wanted to do in-depth research into the history of our development could download the whole lot in one go and then do it offline, rather than having to refer back to our anonymous pserver all the time.
CVS's repository storage format precisely mirroring the actual source tree has plenty wrong with it (in particular, it's part of what makes renames difficult), but one thing it makes convenient is this sort of trick. For Subversion, a different approach was needed.
I could have just tarred up the entire Subversion repository in FSFS format (among FSFS's many virtues is that its format is architecture-independent, unlike BDB), but that didn't seem like a good idea since it contained all my other projects as well as PuTTY - it seemed unfriendly to foist a load of unwanted data on anyone using that download.
The obvious answer is to provide a Subversion dump file. One of the Subversion tools is a handy utility called svndumpfilter, which filters a dump file and keeps only a subset of the directories involved. I could use this to produce a dump file containing only PuTTY, which people could download and import into their own repository.
Unfortunately, svndumpfilter doesn't quite work. It fails in the case where you copy a file from one of the areas being filtered out into one of the areas being kept; svndumpfilter can't reconstruct the data that needed to go into that file. This is an unfortunate consequence of it being a simple filter tool that operates on nothing but a dump file.
In fact I don't currently have any incidents in my version control history when a file from outside PuTTY was copied into it, but I certainly didn't want to rule out the possibility that I might have such an incident in future. So I wrote a replacement for svndumpfilter (easy enough when my migration script already contained general routines to read and write Subversion dump files) which also knew the location of the repository from which the dump file came – so it could use svnlook cat to find the data for the copied revision of the file, and replace an add-as-copy operation with a straight add-as-data.
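The substitution of add-as-data for add-as-copy can be sketched as below. This is a simplified illustration under stated assumptions, not the replacement tool itself: it handles only a pure copy of a single file, ignores properties and checksums, and represents a node as a plain dict of its headers. The function names and the keep_prefixes parameter are invented for the example; ‘svnlook cat’ with -r is the real command used to fetch the data.

```python
import subprocess


def svnlook_cat(repos, rev, path):
    """Fetch a file's contents at a given revision using `svnlook cat`."""
    return subprocess.run(
        ["svnlook", "cat", "-r", str(rev), repos, path],
        check=True, capture_output=True,
    ).stdout


def inline_external_copy(headers, keep_prefixes, repos, fetch=svnlook_cat):
    """Turn a file copied from a filtered-out area into a plain add.

    `headers` is a dict of one node's dump-file headers. Returns a pair
    (new_headers, body); body is None when the node is left unchanged.
    """
    source = headers.get("Node-copyfrom-path")
    if source is None or any(source.startswith(p) for p in keep_prefixes):
        # Not a copy, or a copy from inside the kept area: nothing to do.
        return headers, None
    data = fetch(repos, headers["Node-copyfrom-rev"], source)
    new_headers = {k: v for k, v in headers.items()
                   if not k.startswith("Node-copyfrom-")}
    # Simplification: properties are ignored, so the text length is the
    # whole content length.
    new_headers["Text-content-length"] = str(len(data))
    new_headers["Content-length"] = str(len(data))
    return new_headers, data
```

Passing the fetch function as a parameter keeps the header manipulation testable without a real repository.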
The other really important thing here was the --deltas option to svnadmin dump, which reduced a 250MB dump file to more like 10MB! For this sort of purpose it's absolutely invaluable.
Having now done the migration, here are some things I've noticed about Subversion once you're actually using it.
(This section might contain a couple of small complaints, but nothing seriously ranty.)
The Subversion subcommand ls lets you list a directory in a repository without having to check everything out. Very nice.
As befits an ls command, it displays filenames only unless you ask for more information.
You ask for more information using svn ls -v. Not ls -l, as I'm sure everybody instinctively typed.
(This one annoyed me enough – and seemed easy enough to fix – that I considered submitting a patch. However, it's not as simple as it looked: it turns out that all command-line options supported by the Subversion client program are parsed centrally across all subcommands; so adding an alias -l which only applied to the ls command would be a nasty wart. Oh well.)
svn log acts on current revision
Unlike CVS, if you sit in a working copy and run svn log, it will display the log for the checked-out revision of the file or directory in question – so you don't see log entries for checkins that have happened since your last svn update. (If you want it to, you can always give an explicit revision range, such as svn log -r HEAD:1.)
It isn't entirely clear that this behaviour is actually wrong; it has the virtue of making the log output predictable, so that if you want to keep piping svn log into a processing script until you get the script right, the output won't unexpectedly change half way through because someone else did a commit. However, it's confusingly different from CVS and therefore worth watching out for.
(In particular, note that even svn commit doesn't update the revision number in a working copy! You have to do an update after committing, before svn log will show the commit you just made.)
The question of Subversion's branching model versus CVS's seems to be something of a holy war.
(As a very brief summary: CVS branches by creating a fork in the time dimension, so that the file always has the same pathname but changes to it are tree-structured rather than linear. Subversion has a single linear time dimension always, and it ‘branches’ simply by copying a directory to a different pathname. Hence the practice of having trunk and branches directories – you generate a branch by copying (say) trunk/myprogram to branches/mybranch/myprogram. Since copying is a magic operation and history is preserved across it, this means that history commands such as svn log can be applied to both the source and destination paths and both will show the pre-branch-point history.)
One unfortunate effect of Subversion's branching model is that it partially undoes the benefit of the single revision number. You want to be able to quote a single revision number in (say) a nightly snapshot build, and when a user quotes that number back to you you want it to give you all the information you need about what files went into the build in question. But once you have a branch, you suddenly also need to include the repository path in some way, which is less helpful.
On the other hand, things I like a lot about Subversion's branching model are:
Suppose you check out two source directories alongside one another:
hostname:~/src$ svn co svn+ssh+userv://host/repos/foo
hostname:~/src$ svn co svn+ssh+userv://host/repos/bar
Then you modify some files in each one, and you want to do a single atomic commit including both sets of changes. Since you can legitimately run either of ‘svn commit foo’ or ‘svn commit bar’, you would sort of hope to be able to run ‘svn commit foo bar’ and have it Just Work.
No such luck; if you try that, it complains.
hostname:~/src$ svn commit foo bar
svn: '/home/simon/src' is not a working copy
hostname:~/src$ svn commit foo/foo.txt bar/bar.txt
svn: '/home/simon/src' is not a working copy
Subversion appears to want the common parent directory of your two working copies to itself be a working copy. It seems to be only set up to do commits within a single working copy tree.
I have found a way around this, however. It's horrid, but it seems to work. Subversion won't complain too much provided the parent directory is a working copy of something in the same repository. So I make sure my repository contains an empty directory (usually called emptydir), and I check that out into the parent directory:
hostname:~/src$ svn co svn+ssh+userv://host/repos/emptydir .
Now I still can't commit using the command ‘svn commit foo bar’. However, if I also specify the precise file names of the files I've changed, it does work:
hostname:~/src$ svn commit foo/foo.txt bar/bar.txt
I don't fully understand why this works and none of the other options did. But it's certainly worth knowing!
Explicitly specifying the set of file names to commit can admittedly be fiddly if there are lots of them. In fact I ended up writing a Perl script called svn-commit for use in this sort of situation, which automatically processes the output of svn status to work out the file list and then passes that list to svn commit.
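The idea behind such a helper can be sketched in a few lines. This is a hypothetical reimplementation in Python, not the Perl script mentioned above, and it assumes that the status flags of svn status occupy the first seven columns of each line; the exact column layout differs a little between Subversion versions, so the path is taken from the eighth column onwards and stripped of surrounding spaces.

```python
def changed_paths(status_output):
    """Pick out the paths that `svn status` reports as locally changed."""
    paths = []
    for line in status_output.splitlines():
        if len(line) < 8:
            continue
        item_status, prop_status = line[0], line[1]
        # Content added, deleted, modified or replaced, or properties modified.
        if item_status in "ADMR" or prop_status == "M":
            # Assumption: the status flags occupy the first seven columns.
            paths.append(line[7:].strip())
    return paths


def commit_command(status_output, message):
    """Build the `svn commit` argument list naming each changed file."""
    return ["svn", "commit", "-m", message] + changed_paths(status_output)
```

The status text would come from running svn status on the working copies concerned (for instance foo and bar), and the resulting list can be handed to subprocess.run. Unversioned files, marked with ‘?’, are deliberately left out, and paths that begin or end with spaces are not handled.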
svn:externals isn't what I wanted
The svn:externals property looked like just what I wanted in order to arrange for one top-level directory in my repository to always be checked out as a subdirectory of another.
Unfortunately, it has to specify a fully qualified Subversion repository URL, which cannot be relative. So if I'm using a locally configured svn+userv URL scheme to get commit access to my repository, but anonymous users are using the normal svn scheme, then there are two possibilities:
I use svn+userv in the svn:externals property, and anonymous users can't check out at all.
I use svn in the svn:externals property, and now everyone can check out but I can't helpfully check in.
Shortly after migration I decided that this was simply too much hassle; so I deleted the svn:externals properties on the two affected directories, and modified their Makefiles so that they can cope with finding their library directory at the same level as themselves, instead of in a subdirectory.
I've described my experiences so far of migrating from CVS to Subversion, and using Subversion. In particular:
userv can be used to give Subversion a security model vastly superior to that of CVS.
Migration is easy if you accept the way cvs2svn sets things up for you, but gets a lot more difficult the more perfectionist you get. However, if you only do it once it's worth putting some effort in!
Although I've described some issues above which might or might not be considered bugs in Subversion, in general they're easy enough to work around, and in most respects I think Subversion amply fulfills its promise of being ‘CVS, only done properly’. I like it, and I don't regret migrating. I'd do it again.