No more “Hash Sum Mismatch” errors
The Debian repository format was designed a long time ago. The oldest
versions of it were produced with the help of tools such as
dpkg-scanpackages and consumed by dselect access methods such as dpkg-ftp.
The access methods just fetched a Packages file (perhaps compressed) and
used it as an index of which packages were available; each package had an
MD5 checksum to defend against transport errors, but being from a more
innocent age there was no repository signing or other protection against
man-in-the-middle attacks.
An important and intentional feature of the early format was that, apart
from the top-level Packages file, all other files were static in the sense
that, once published, their content would never change without also
changing the file name. This means that repositories can be efficiently
copied around using rsync without having to tell it to re-checksum all
files, and it avoids network races when fetching updates: the repository
you’re updating from might change in the middle of your update, but as long
as the repository maintenance software keeps superseded packages around for
a suitable grace period, you’ll still be able to fetch them.
The repository format evolved rather organically over time as different
needs arose, by what one might call distributed consensus among the
maintainers of the various client tools that consumed it. Of course all
sorts of fields were added to the index files themselves, which have an
extensible format so that this kind of thing is usually easy to do. At some
point a Sources index for source packages was added, which worked pretty
much the same way as Packages except for having a different set of fields.
But by far the most significant change to the repository structure was the
“package pools” project.
The original repository layout put the packages themselves under the dists/
tree along with the index files. The dists/ tree is organised by “suite”
(modern examples of which would be “stable”, “stable-updates”, “testing”,
“unstable”, “xenial”, “xenial-updates”, and so on). This meant that making a
release of Debian tended to involve copying lots of data around, and
implementing the “testing” suite would have been very costly. Package pools
solved this problem by moving individual package files out of dists/ and
into a new pool/ tree, allowing those files to be shared between multiple
suites with only a negligible cost in disk space and mirror bandwidth. From
a database design perspective this is obviously much more sensible. As part
of this project, the original Debian “dinstall” repository maintenance
scripts were replaced by “da-katie” or “dak”, which among other things used
a new apt-ftparchive program to build the index files; this replaced
dpkg-scanpackages and dpkg-scansources, and included its own database cache
which made a big difference to performance at the scale of a distribution.
A few months after the initial implementation of package pools, Release
files were added. These formed a sort of meta-index for each suite, telling
APT which index files were available (main/binary-i386/Packages,
non-free/source/Sources, and so on) and what their checksums were. Detached
signatures were added alongside that (Release.gpg) so that it was now
possible to fetch packages securely given a public key for the repository,
and client-side verification support for this eventually made its way into
Debian and Ubuntu. The repository structure stayed more or less like this
for several years.
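
To make the meta-index idea concrete, here is a minimal sketch of how a client might read the SHA256 section of a Release file and check a downloaded index against it. It assumes the usual layout of that section (an indented line per file giving hash, size and path); the function names and the sample data are made up for illustration, and signature verification is left out entirely.

```python
import hashlib


def parse_sha256_section(release_text):
    """Return {path: (sha256_hex, size)} from the SHA256 section of a Release file."""
    entries = {}
    in_section = False
    for line in release_text.splitlines():
        if line.startswith("SHA256:"):
            in_section = True
            continue
        if in_section:
            if not line.startswith(" "):
                # A non-indented line starts the next field, so the section is over.
                in_section = False
                continue
            digest, size, path = line.split()
            entries[path] = (digest, int(size))
    return entries


def verify_index(entries, path, data):
    """Check downloaded index bytes against the size and hash listed in Release."""
    digest, size = entries[path]
    return len(data) == size and hashlib.sha256(data).hexdigest() == digest


# Made-up sample data standing in for a real suite.
packages = b"Package: hello\nVersion: 1.0\n"
release = (
    "Suite: example\n"
    "SHA256:\n"
    f" {hashlib.sha256(packages).hexdigest()} {len(packages)} main/binary-i386/Packages\n"
)

entries = parse_sha256_section(release)
print(verify_index(entries, "main/binary-i386/Packages", packages))
```

A real client would only trust these entries after checking the signature on the Release file itself.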
At some point along the way, those of us by now involved in repository
maintenance realised that an important property had been lost. I mentioned
earlier that the original format allowed race-free updates, but this was no
longer true with the introduction of the Release file. A client now had to
fetch Release and then fetch whichever other index files such as Packages
they wanted, typically in separate HTTP transactions. If a client was
unlucky, these transactions would fall on either side of a mirror update and
they’d get a “Hash Sum Mismatch” error from APT. Worse, if a mirror was
unlucky and also didn’t go to special lengths to verify index integrity
(most don’t), its own updates could span an update of its upstream mirror
and then all its clients would see mismatches until the next mirror update.
This was compounded by using detached signatures, so Release and Release.gpg
were fetched separately and could be out of sync.
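
The race is easy to reproduce in miniature. The following toy simulation stands in for a mirror with a plain dictionary (no real HTTP, and the "Release" entry is reduced to a single hash), so treat it as an illustration of the timing problem rather than of anything APT actually does internally.

```python
import hashlib


def make_suite(packages):
    """Build a toy mirror snapshot: a Packages file plus a 'Release' holding its hash."""
    return {
        "Release": hashlib.sha256(packages).hexdigest(),
        "Packages": packages,
    }


old = make_suite(b"Package: hello\nVersion: 1.0\n")
new = make_suite(b"Package: hello\nVersion: 1.1\n")

mirror = dict(old)
expected = mirror["Release"]   # first transaction: fetch Release
mirror.update(new)             # the mirror syncs in between
data = mirror["Packages"]      # second transaction: fetch Packages

mismatch = hashlib.sha256(data).hexdigest() != expected
print("Hash Sum Mismatch" if mismatch else "OK")
```

The client ends up holding the hash from the old snapshot and the Packages file from the new one, which is exactly the situation that triggers the error.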
Fixing this has been a long road (the first time I remember talking about
this was in late 2007!), and we’ve had to take care to maintain
client/server compatibility along the way. The first step was to add
inline-signed versions of the Release file, called InRelease, so that there
would no longer be a race between fetching Release and fetching its
signature. APT has had this for a while, Debian’s repository supports it as
of stretch, and we finally implemented it for Ubuntu six months ago.
Dealing with the other index files is more complicated, though; it isn’t
sensible to inline them, as clients usually only need to fetch a small
fraction of all the indexes available for a given suite.
The solution we’ve ended up with, thanks to Michael Vogt’s work implementing
it in APT, is called by-hash and should be familiar in concept to people
who’ve used git: with the exception of the top-level InRelease file, index
files for suites that support the by-hash mechanism may now be fetched using
a URL based on one of their hashes listed in InRelease. This means that
clients can now operate like this:
- Fetch dists/xenial/InRelease
- Fetch dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740 (and so on)
- Fetch individual package files
This is now enabled by default in Ubuntu. It’s only there as of xenial (16.04), since earlier versions of Ubuntu don’t have the necessary support in APT. With this, hash mismatches on updates should be a thing of the past.
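
As a rough sketch of the URL construction involved: given an index path and one of its hashes as listed in InRelease, the by-hash location sits in a by-hash/ directory next to where the index would normally live. The function name and the mirror base URL below are hypothetical, and the index name Packages.xz is only an assumed example; the hash is the one from the list above.

```python
import posixpath


def by_hash_url(base, suite, index_path, algo, digest):
    """Build the by-hash URL for an index listed in a suite's InRelease file."""
    # The by-hash directory is a sibling of the index file itself.
    directory = posixpath.dirname(index_path)
    return posixpath.join(base, "dists", suite, directory, "by-hash", algo, digest)


url = by_hash_url(
    "http://mirror.example/ubuntu",   # hypothetical mirror
    "xenial",
    "main/binary-amd64/Packages.xz",  # assumed index name
    "SHA256",
    "46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740",
)
print(url)
```

Because the file name is derived from the content, a file published under such a URL never changes, which restores the property the early format had for everything except the top-level index.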
There will still be some people who won’t yet benefit from this. debmirror
doesn’t support by-hash yet; apt-cacher-ng only supports it as of xenial,
although there’s an easy configuration workaround. Full archive mirrors must
make sure that they put new by-hash files in place before new InRelease
files (I just fixed our recommended two-stage sync script to do this;
ubumirror still needs some work; Debian’s ftpsync is almost correct but
needs a tweak for its handling of translation files, which I’ve sent to its
maintainers). Other mirrors and proxies that have specific handling of the
repository format may need similar changes.
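
The ordering constraint for mirrors can be sketched as a simple partition of the files to be copied. This is not how the sync scripts mentioned above are written; it only illustrates the rule, under the assumption that everything other than the top-level signed indexes (including by-hash and pool files) can safely go first. The names are hypothetical.

```python
# Top-level per-suite indexes that must only appear once everything they
# reference is already in place.
TOP_LEVEL_INDEXES = {"InRelease", "Release", "Release.gpg"}


def sync_stages(paths):
    """Split paths into stage one (by-hash, pool, other files) and stage two (top-level indexes)."""
    stage_one, stage_two = [], []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name in TOP_LEVEL_INDEXES:
            stage_two.append(path)
        else:
            stage_one.append(path)
    return stage_one, stage_two


stage_one, stage_two = sync_stages([
    "dists/xenial/InRelease",
    "dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740",
    "pool/main/h/hello/hello_1.0_amd64.deb",  # made-up package file
])
print(stage_one)
print(stage_two)
```

Copying stage one completely before starting stage two means a client never sees an InRelease file that points at by-hash files the mirror does not have yet.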
Please let me know if you see strange things happening as a result of this
change. It’s useful to check the output of
apt -o Debug::Acquire::http=true update to see exactly what requests are
being issued.