From: Colin Watson
Date: Fri, 8 Apr 2016 14:07:45 +0000 (+0100)
Subject: No more "Hash Sum Mismatch" errors
X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~cjwatson/git?a=commitdiff_plain;h=67f97a95fed0089d0d7a411451234b9cae624d4e;p=blog.git

No more "Hash Sum Mismatch" errors
---

diff --git a/content/no-more-hash-sum-mismatch-errors.md b/content/no-more-hash-sum-mismatch-errors.md
new file mode 100644
index 00000000..21606f6c
--- /dev/null
+++ b/content/no-more-hash-sum-mismatch-errors.md
@@ -0,0 +1,127 @@
+Title: No more "Hash Sum Mismatch" errors
+Slug: no-more-hash-sum-mismatch-errors
+Date: 2016-04-08 15:06:03 +01:00
+Category: launchpad
+Tags: launchpad, ubuntu, planet-debian, planet-ubuntu
+
+The Debian repository format was designed a long time ago. The oldest
+versions of it were produced with the help of tools such as
+`dpkg-scanpackages` and consumed by `dselect` access methods such as
+`dpkg-ftp`. The access methods just fetched a `Packages` file (perhaps
+compressed) and used it as an index of which packages were available; each
+package had an MD5 checksum to defend against transport errors, but being
+from a more innocent age there was no repository signing or other protection
+against man-in-the-middle attacks.
+
+An important and intentional feature of the early format was that, apart
+from the top-level `Packages` file, all other files were *static* in the
+sense that, once published, their content would never change without also
+changing the file name. This means that repositories can be efficiently
+copied around using `rsync` without having to tell it to re-checksum all
+files, and it avoids network races when fetching updates: the repository
+you're updating from might change in the middle of your update, but as long
+as the repository maintenance software keeps superseded packages around for
+a suitable grace period, you'll still be able to fetch them.
+
+The repository format evolved rather organically over time as different
+needs arose, by what one might call distributed consensus among the
+maintainers of the various client tools that consumed it. Of course all
+sorts of fields were added to the index files themselves, which have an
+extensible format so that this kind of thing is usually easy to do. At some
+point a `Sources` index for source packages was added, which worked pretty
+much the same way as `Packages` except for having a different set of fields.
+But by far the most significant change to the repository structure was the
+"package pools" project.
+
+The original repository layout put the packages themselves under the
+`dists/` tree along with the index files. The `dists/` tree is organised by
+"suite" (modern examples of which would be "stable", "stable-updates",
+"testing", "unstable", "xenial", "xenial-updates", and so on). This meant
+that making a release of Debian tended to involve copying lots of data
+around, and implementing the "testing" suite would have been very costly.
+Package pools solved this problem by moving individual package files out of
+`dists/` and into a new `pool/` tree, allowing those files to be shared
+between multiple suites with only a negligible cost in disk space and mirror
+bandwidth. From a database design perspective this is obviously much more
+sensible. As part of this project, the original Debian "dinstall"
+repository maintenance scripts were
+[replaced](https://lists.debian.org/debian-devel-announce/2000/10/msg00007.html)
+by "da-katie" or "dak", which among other things used a new `apt-ftparchive`
+program to build the index files; this replaced `dpkg-scanpackages` and
+`dpkg-scansources`, and included its own database cache which made a big
+difference to performance at the scale of a distribution.
+
+A few months after the initial implementation of package pools, `Release`
+files were added. These formed a sort of meta-index for each suite, telling
+APT which index files were available (`main/binary-i386/Packages`,
+`non-free/source/Sources`, and so on) and what their checksums were.
+Detached signatures were added alongside that (`Release.gpg`) so that it was
+now possible to fetch packages securely given a public key for the
+repository, and [client-side verification
+support](https://lists.debian.org/debian-devel/2003/12/msg01986.html) for
+this eventually made its way into Debian and Ubuntu. The repository
+structure stayed more or less like this for several years.
+
+At some point along the way, those of us by now involved in repository
+maintenance realised that an important property had been lost. I mentioned
+earlier that the original format allowed race-free updates, but this was no
+longer true with the introduction of the `Release` file. A client now had
+to fetch `Release` and then fetch whichever other index files such as
+`Packages` they wanted, typically in separate HTTP transactions. If a
+client was unlucky, these transactions would fall on either side of a mirror
+update and they'd get a "Hash Sum Mismatch" error from APT. Worse, if a
+*mirror* was unlucky and also didn't go to special lengths to verify index
+integrity (most don't), its own updates could span an update of its upstream
+mirror and then all its clients would see mismatches until the next mirror
+update. This was compounded by using detached signatures, so `Release` and
+`Release.gpg` were fetched separately and could be out of sync.
+
+Fixing this has been a long road (the first time I remember talking about
+this was in late 2007!), and we've had to take care to maintain
+client/server compatibility along the way. The first step was to add
+inline-signed versions of the `Release` file, called `InRelease`, so that
+there would no longer be a race between fetching `Release` and fetching its
+signature. APT has had this for a while, Debian's repository supports it as
+of `stretch`, and we finally [implemented it for
+Ubuntu](https://bugs.launchpad.net/launchpad/+bug/804252) six months ago.
+Dealing with the other index files is more complicated, though; it isn't
+sensible to inline them, as clients usually only need to fetch a small
+fraction of all the indexes available for a given suite.
+
+The solution we've ended up with is called
+[by-hash](https://wiki.debian.org/RepositoryFormat#indices_acquisition_via_hashsums_.28by-hash.29)
+and should be familiar in concept to people who've used `git`: with the
+exception of the top-level `InRelease` file, index files for suites that
+support the by-hash mechanism may now be fetched using a URL based on one of
+their hashes listed in `InRelease`. This means that clients can now operate
+like this:
+
+ * Fetch `dists/xenial/InRelease`
+ * Fetch
+   `dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740`
+   (and so on)
+ * Fetch individual package files
+
+This is now [enabled by default in
+Ubuntu](https://bugs.launchpad.net/launchpad/+bug/1430011). It's only there
+as of xenial (16.04), since earlier versions of Ubuntu don't have the
+necessary support in APT. With this, hash mismatches on updates should be a
+thing of the past.
+
+There will still be some people who won't yet benefit from this.
+`debmirror` doesn't support by-hash yet; `apt-cacher-ng` only supports it as
+of xenial, although there's an [easy configuration
+workaround](https://bugs.debian.org/819852). Full archive mirrors must make
+sure that they put new by-hash files in place before new `InRelease` files
+(I just fixed our [recommended two-stage sync
+script](https://wiki.ubuntu.com/Mirrors/Scripts) to do this;
+[ubumirror](https://launchpad.net/ubumirror) still needs some work; Debian's
+[ftpsync](https://www.debian.org/mirror/ftpmirror#how) is almost correct but
+needs a tweak for its handling of translation files, which I've sent to its
+maintainers). Other mirrors and proxies that have specific handling of the
+repository format may need similar changes.
+
+Please let me know if you see strange things happening as a result of this
+change. It's useful to check the output of `apt -o
+Debug::Acquire::http=true update` to see exactly what requests are being
+issued.
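
To make the by-hash lookup described in the post a little more concrete, here is a minimal Python sketch of how a client might turn an entry in the `SHA256:` section of `InRelease` into a by-hash path. It is only an illustration, not how APT implements it. The function names `parse_sha256_entries` and `by_hash_path` are hypothetical, the sample text is a hand-written fragment rather than a real `InRelease` file, and pairing the hash from the post with `main/binary-amd64/Packages.xz` and a size of `1234` is an assumption made purely for the example. Signature verification and stripping of the PGP armour are deliberately left out.

```python
import posixpath


def parse_sha256_entries(release_text):
    """Map index path -> SHA256 digest from a Release-style text.

    Entries in the "SHA256:" section are indented lines of the form
    " <digest> <size> <path>".
    """
    entries = {}
    in_section = False
    for line in release_text.splitlines():
        if line.startswith("SHA256:"):
            in_section = True
        elif in_section and line.startswith(" "):
            digest, _size, path = line.split()
            entries[path] = digest
        else:
            # Any other unindented field ends the section.
            in_section = False
    return entries


def by_hash_path(suite, index_path, digest):
    """Build the by-hash path for an index file within a suite."""
    directory = posixpath.dirname(index_path)
    return posixpath.join("dists", suite, directory, "by-hash", "SHA256", digest)


# Hand-written sample; the index name and size are placeholders (assumptions).
SAMPLE = """\
Suite: xenial
Acquire-By-Hash: yes
SHA256:
 46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740 1234 main/binary-amd64/Packages.xz
"""

entries = parse_sha256_entries(SAMPLE)
for path, digest in entries.items():
    print(by_hash_path("xenial", path, digest))
```

Because the path depends only on the digest that the client has already read from `InRelease`, a mirror update landing between the two fetches no longer changes what the second request returns, provided the mirror still holds the superseded by-hash file.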