No more “Hash Sum Mismatch” errors

The Debian repository format was designed a long time ago. The oldest versions of it were produced with the help of tools such as dpkg-scanpackages and consumed by dselect access methods such as dpkg-ftp. The access methods just fetched a Packages file (perhaps compressed) and used it as an index of which packages were available; each package had an MD5 checksum to defend against transport errors, but being from a more innocent age there was no repository signing or other protection against man-in-the-middle attacks.

An important and intentional feature of the early format was that, apart from the top-level Packages file, all other files were static in the sense that, once published, their content would never change without also changing the file name. This means that repositories can be efficiently copied around using rsync without having to tell it to re-checksum all files, and it avoids network races when fetching updates: the repository you’re updating from might change in the middle of your update, but as long as the repository maintenance software keeps superseded packages around for a suitable grace period, you’ll still be able to fetch them.

The repository format evolved rather organically over time as different needs arose, by what one might call distributed consensus among the maintainers of the various client tools that consumed it. Of course all sorts of fields were added to the index files themselves, which have an extensible format so that this kind of thing is usually easy to do. At some point a Sources index for source packages was added, which worked pretty much the same way as Packages except for having a different set of fields. But by far the most significant change to the repository structure was the “package pools” project.

The original repository layout put the packages themselves under the dists/ tree along with the index files. The dists/ tree is organised by “suite” (modern examples of which would be “stable”, “stable-updates”, “testing”, “unstable”, “xenial”, “xenial-updates”, and so on). This meant that making a release of Debian tended to involve copying lots of data around, and implementing the “testing” suite would have been very costly. Package pools solved this problem by moving individual package files out of dists/ and into a new pool/ tree, allowing those files to be shared between multiple suites with only a negligible cost in disk space and mirror bandwidth. From a database design perspective this is obviously much more sensible. As part of this project, the original Debian “dinstall” repository maintenance scripts were replaced by “da-katie” or “dak”, which among other things used a new apt-ftparchive program to build the index files; this replaced dpkg-scanpackages and dpkg-scansources, and included its own database cache which made a big difference to performance at the scale of a distribution.

A few months after the initial implementation of package pools, Release files were added. These formed a sort of meta-index for each suite, telling APT which index files were available (main/binary-i386/Packages, non-free/source/Sources, and so on) and what their checksums were. Detached signatures were added alongside that (Release.gpg) so that it was now possible to fetch packages securely given a public key for the repository, and client-side verification support for this eventually made its way into Debian and Ubuntu. The repository structure stayed more or less like this for several years.

At some point along the way, those of us by now involved in repository maintenance realised that an important property had been lost. I mentioned earlier that the original format allowed race-free updates, but this was no longer true with the introduction of the Release file. A client now had to fetch Release and then fetch whichever other index files such as Packages they wanted, typically in separate HTTP transactions. If a client was unlucky, these transactions would fall on either side of a mirror update and they’d get a “Hash Sum Mismatch” error from APT. Worse, if a mirror was unlucky and also didn’t go to special lengths to verify index integrity (most don’t), its own updates could span an update of its upstream mirror and then all its clients would see mismatches until the next mirror update. This was compounded by using detached signatures, so Release and Release.gpg were fetched separately and could be out of sync.

Fixing this has been a long road (the first time I remember talking about this was in late 2007!), and we’ve had to take care to maintain client/server compatibility along the way. The first step was to add inline-signed versions of the Release file, called InRelease, so that there would no longer be a race between fetching Release and fetching its signature. APT has had this for a while, Debian’s repository supports it as of stretch, and we finally implemented it for Ubuntu six months ago. Dealing with the other index files is more complicated, though; it isn’t sensible to inline them, as clients usually only need to fetch a small fraction of all the indexes available for a given suite.

The solution we’ve ended up with, thanks to Michael Vogt’s work implementing it in APT, is called by-hash and should be familiar in concept to people who’ve used git: with the exception of the top-level InRelease file, index files for suites that support the by-hash mechanism may now be fetched using a URL based on one of their hashes listed in InRelease. This means that clients can now operate like this:

  • Fetch dists/xenial/InRelease
  • Fetch dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740 (and so on)
  • Fetch individual package files

This is now enabled by default in Ubuntu. It’s only there as of xenial (16.04), since earlier versions of Ubuntu don’t have the necessary support in APT. With this, hash mismatches on updates should be a thing of the past.

There will still be some people who won’t yet benefit from this. debmirror doesn’t support by-hash yet; apt-cacher-ng only supports it as of xenial, although there’s an easy configuration workaround. Full archive mirrors must make sure that they put new by-hash files in place before new InRelease files (I just fixed our recommended two-stage sync script to do this; ubumirror still needs some work; Debian’s ftpsync is almost correct but needs a tweak for its handling of translation files, which I’ve sent to its maintainers). Other mirrors and proxies that have specific handling of the repository format may need similar changes.

Please let me know if you see strange things happening as a result of this change. It’s useful to check the output of apt -o Debug::Acquire::http=true update to see exactly what requests are being issued.