--- /dev/null
+Title: No more "Hash Sum Mismatch" errors
+Slug: no-more-hash-sum-mismatch-errors
+Date: 2016-04-08 15:06:03 +01:00
+Category: launchpad
+Tags: launchpad, ubuntu, planet-debian, planet-ubuntu
+
+The Debian repository format was designed a long time ago. The oldest
+versions of it were produced with the help of tools such as
+`dpkg-scanpackages` and consumed by `dselect` access methods such as
+`dpkg-ftp`. The access methods just fetched a `Packages` file (perhaps
+compressed) and used it as an index of which packages were available; each
+package had an MD5 checksum to defend against transport errors, but this was
+a more innocent age, and there was no repository signing or other protection
+against man-in-the-middle attacks.
+
+An important and intentional feature of the early format was that, apart
+from the top-level `Packages` file, all other files were *static* in the
+sense that, once published, their content would never change without also
+changing the file name. This means that repositories can be efficiently
+copied around using `rsync` without having to tell it to re-checksum all
+files, and it avoids network races when fetching updates: the repository
+you're updating from might change in the middle of your update, but as long
+as the repository maintenance software keeps superseded packages around for
+a suitable grace period, you'll still be able to fetch them.
+
+The repository format evolved rather organically over time as different
+needs arose, by what one might call distributed consensus among the
+maintainers of the various client tools that consumed it. Of course all
+sorts of fields were added to the index files themselves, which have an
+extensible format so that this kind of thing is usually easy to do. At some
+point a `Sources` index for source packages was added, which worked pretty
+much the same way as `Packages` except for having a different set of fields.
+But by far the most significant change to the repository structure was the
+"package pools" project.
+
+The original repository layout put the packages themselves under the
+`dists/` tree along with the index files. The `dists/` tree is organised by
+"suite" (modern examples of which would be "stable", "stable-updates",
+"testing", "unstable", "xenial", "xenial-updates", and so on). This meant
+that making a release of Debian tended to involve copying lots of data
+around, and implementing the "testing" suite would have been very costly.
+Package pools solved this problem by moving individual package files out of
+`dists/` and into a new `pool/` tree, allowing those files to be shared
+between multiple suites with only a negligible cost in disk space and mirror
+bandwidth. From a database design perspective this is much more sensible:
+each package file is stored once and referenced from the index of every
+suite that contains it. As part of this project, the original Debian
+"dinstall"
+repository maintenance scripts were
+[replaced](https://lists.debian.org/debian-devel-announce/2000/10/msg00007.html)
+by "da-katie" or "dak", which among other things used a new `apt-ftparchive`
+program to build the index files; this replaced `dpkg-scanpackages` and
+`dpkg-scansources`, and included its own database cache which made a big
+difference to performance at the scale of a distribution.
+
+A few months after the initial implementation of package pools, `Release`
+files were added. These formed a sort of meta-index for each suite, telling
+APT which index files were available (`main/binary-i386/Packages`,
+`non-free/source/Sources`, and so on) and what their checksums were.
+Detached signatures were added alongside that (`Release.gpg`) so that it was
+now possible to fetch packages securely given a public key for the
+repository, and [client-side verification
+support](https://lists.debian.org/debian-devel/2003/12/msg01986.html) for
+this eventually made its way into Debian and Ubuntu. The repository
+structure stayed more or less like this for several years.
+
+At some point along the way, those of us by now involved in repository
+maintenance realised that an important property had been lost. I mentioned
+earlier that the original format allowed race-free updates, but this was no
+longer true with the introduction of the `Release` file. A client now had
+to fetch `Release` and then whichever other index files they wanted, such as
+`Packages`, typically in separate HTTP transactions. If a
+client was unlucky, these transactions would fall on either side of a mirror
+update and they'd get a "Hash Sum Mismatch" error from APT. Worse, if a
+*mirror* was unlucky and also didn't go to special lengths to verify index
+integrity (most don't), its own updates could span an update of its upstream
+mirror and then all its clients would see mismatches until the next mirror
+update. Detached signatures compounded the problem: `Release` and
+`Release.gpg` were fetched separately and could themselves be out of sync.
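+
+To make the race concrete, here's a toy simulation in Python. It's only a
+sketch: the `Mirror` class and the `update` function are made up for
+illustration, and a real `Release` file lists many index files rather than
+a single hash.
+
+```python
+import hashlib
+
+
+def sha256_hex(data: bytes) -> str:
+    return hashlib.sha256(data).hexdigest()
+
+
+class Mirror:
+    """Toy in-memory mirror: maps a path to the bytes currently served."""
+
+    def __init__(self):
+        self.files = {}
+
+    def publish(self, packages: bytes) -> None:
+        # Publish a Packages index plus a "Release" recording its hash.
+        self.files["Packages"] = packages
+        self.files["Release"] = sha256_hex(packages).encode()
+
+
+def update(mirror, between_fetches=None) -> bool:
+    """Fetch Release, then Packages; return True if the hash matches."""
+    expected = mirror.files["Release"].decode()
+    if between_fetches is not None:
+        between_fetches()  # the mirror changes in the middle of our update
+    return sha256_hex(mirror.files["Packages"]) == expected
+
+
+mirror = Mirror()
+mirror.publish(b"Package: hello\nVersion: 1.0\n")
+
+# No mirror update during the client's run: the hashes agree.
+quiet_ok = update(mirror)
+
+# The mirror publishes a new index between the two fetches.
+raced_ok = update(
+    mirror, lambda: mirror.publish(b"Package: hello\nVersion: 1.1\n")
+)
+
+print(quiet_ok, raced_ok)
+```
+
+The first update sees a consistent mirror and passes. The second straddles
+a publish, so the hash it read from `Release` no longer matches the
+`Packages` file it then fetches, which is what APT reports as a hash
+mismatch.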
+
+Fixing this has been a long road (the first time I remember talking about
+this was in late 2007!), and we've had to take care to maintain
+client/server compatibility along the way. The first step was to add
+inline-signed versions of the `Release` file, called `InRelease`, so that
+there would no longer be a race between fetching `Release` and fetching its
+signature. APT has had this for a while, Debian's repository supports it as
+of `stretch`, and we finally [implemented it for
+Ubuntu](https://bugs.launchpad.net/launchpad/+bug/804252) six months ago.
+Dealing with the other index files is more complicated, though; it isn't
+sensible to inline them, as clients usually only need to fetch a small
+fraction of all the indexes available for a given suite.
+
+The solution we've ended up with is called
+[by-hash](https://wiki.debian.org/RepositoryFormat#indices_acquisition_via_hashsums_.28by-hash.29)
+and should be familiar in concept to people who've used `git`: with the
+exception of the top-level `InRelease` file, index files for suites that
+support the by-hash mechanism may now be fetched using a URL based on one of
+their hashes listed in `InRelease`. This means that clients can now operate
+like this:
+
+ * Fetch `dists/xenial/InRelease`
+ * Fetch
+ `dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740`
+ (and so on)
+ * Fetch individual package files
+
+This is now [enabled by default in
+Ubuntu](https://bugs.launchpad.net/launchpad/+bug/1430011). It's only
+available as of xenial (16.04), since earlier versions of Ubuntu don't have
+the necessary support in APT. With this, hash mismatches on updates should
+be a thing of the past.
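+
+As a rough sketch of the client side, the following Python builds a by-hash
+URL from the `SHA256` section of a `Release`-style file. The
+`archive.example` host, the sample file contents, the placeholder digest,
+and the helper names are all invented for illustration. The sketch also
+skips signature verification entirely, which a real client must not do.
+
+```python
+import posixpath
+
+# Placeholder digest for illustration; not the hash of any real index.
+FAKE_DIGEST = "ab" * 32
+
+# Hypothetical, heavily trimmed Release-style content.
+SAMPLE_RELEASE = (
+    "Suite: xenial\n"
+    "Acquire-By-Hash: yes\n"
+    "SHA256:\n"
+    f" {FAKE_DIGEST} 1234 main/binary-amd64/Packages.xz\n"
+)
+
+
+def parse_sha256_section(release_text: str) -> dict:
+    """Map each index path listed under 'SHA256:' to its hex digest."""
+    hashes = {}
+    in_section = False
+    for line in release_text.splitlines():
+        if line.startswith("SHA256:"):
+            in_section = True
+        elif in_section and line.startswith(" "):
+            # Entries look like: " <digest> <size> <path>"
+            digest, _size, path = line.split()
+            hashes[path] = digest
+        else:
+            in_section = False
+    return hashes
+
+
+def by_hash_url(suite_url: str, index_path: str, digest: str) -> str:
+    """Swap the index file name for by-hash/SHA256/<digest>."""
+    directory = posixpath.dirname(index_path)
+    return posixpath.join(suite_url, directory, "by-hash", "SHA256", digest)
+
+
+hashes = parse_sha256_section(SAMPLE_RELEASE)
+url = by_hash_url(
+    "http://archive.example/ubuntu/dists/xenial",
+    "main/binary-amd64/Packages.xz",
+    hashes["main/binary-amd64/Packages.xz"],
+)
+print(url)
+```
+
+Because the URL is derived from the content hash, a later mirror update
+can't change what it points at. Either the client gets exactly the index
+that `InRelease` described, or the file is missing.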
+
+Some people won't benefit from this yet.
+`debmirror` doesn't support by-hash yet; `apt-cacher-ng` only supports it as
+of xenial, although there's an [easy configuration
+workaround](https://bugs.debian.org/819852). Full archive mirrors must make
+sure that they put new by-hash files in place before new `InRelease` files
+(I just fixed our [recommended two-stage sync
+script](https://wiki.ubuntu.com/Mirrors/Scripts) to do this;
+[ubumirror](https://launchpad.net/ubumirror) still needs some work; Debian's
+[ftpsync](https://www.debian.org/mirror/ftpmirror#how) is almost correct but
+needs a tweak for its handling of translation files, which I've sent to its
+maintainers). Other mirrors and proxies that have specific handling of the
+repository format may need similar changes.
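+
+The ordering constraint for mirrors can be sketched as a simple partition
+of the files to be synced. This is not how any of the scripts mentioned
+above are written. The `two_stage_order` helper and the sample paths are
+hypothetical, and putting everything else under `dists/` in the second
+stage is my assumption about a sensible split.
+
+```python
+def two_stage_order(paths):
+    """Split paths into (stage one, stage two) for a by-hash-safe sync.
+
+    Stage one holds files that are safe to publish early: package files
+    outside dists/ and by-hash index files, whose names change whenever
+    their content does. Stage two holds everything else under dists/,
+    including InRelease, which must only appear once the files it refers
+    to are in place.
+    """
+    stage_one, stage_two = [], []
+    for path in paths:
+        if "/by-hash/" in path or not path.startswith("dists/"):
+            stage_one.append(path)
+        else:
+            stage_two.append(path)
+    return stage_one, stage_two
+
+
+# Hypothetical file list; the digest is a placeholder.
+paths = [
+    "dists/xenial/InRelease",
+    "dists/xenial/main/binary-amd64/Packages.xz",
+    "dists/xenial/main/binary-amd64/by-hash/SHA256/" + "ab" * 32,
+    "pool/main/h/hello/hello_2.10-1_amd64.deb",
+]
+stage_one, stage_two = two_stage_order(paths)
+print(stage_one)
+print(stage_two)
+```
+
+A mirror that copies everything in stage one before anything in stage two
+never serves an `InRelease` that refers to by-hash files it doesn't have
+yet.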
+
+Please let me know if you see strange things happening as a result of this
+change. It's useful to check the output of `apt -o
+Debug::Acquire::http=true update` to see exactly what requests are being
+issued.