The biggest chunk of work here was fixing some extremely long-standing
issues with how the database is built. Despite being in the package name,
man-db’s database is much less important than it used to be: most uses of
man(1) haven’t required it in a long time, and both hardware and
software improvements mean that even some searches can be done by brute force without needing
prior indexing. However, the database is still needed for the
The database has a simple format - no relational structure here, it’s just a
simple key-value database using old-fashioned DBM-like interfaces and
composing a few fields to form values - but there are a number of subtleties
involved. The issues tend to amount to this: what does a manual page name
mean? At first glance it might seem simple, because you have file names
that look something like /usr/share/man/man1/ls.1.gz and that’s obviously
ls(1). Some pages are symlinks to other pages (which we track separately
because it makes it easier to figure out which entries to update when the
contents of the file system change), and sometimes multiple pages are even
hard links to the same file.
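As a rough illustration of the simple case, here is a minimal Python sketch that maps a file name to a page name and section. It is not man-db’s code (which is C); the function name `page_from_path` and the list of compression suffixes are assumptions made for the example.

```python
# Hypothetical helper: derive (name, section) from a manual page file name.
# The suffix list is an assumption; real installations may use others.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".lzma", ".zst", ".Z")


def page_from_path(path):
    """Return (name, section) for a path such as .../man1/ls.1.gz."""
    base = path.rsplit("/", 1)[-1]
    # Strip at most one compression suffix.
    for suffix in COMPRESSION_SUFFIXES:
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    # The section is whatever follows the last dot; the name may itself
    # contain dots, so split from the right.
    name, dot, section = base.rpartition(".")
    if not dot or not name or not section:
        raise ValueError(f"not a manual page file name: {path!r}")
    return name, section
```

This only covers the easy part: it says nothing about symlinks, hard links, or names that exist only in a page’s NAME section.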
The real complications come with “whatis references”. Pages can list a
bunch of names in their NAME section, and the historical expectation is
that it should be possible to use those names as arguments to man(1) even
if they don’t also appear in the file system (although Debian policy has
deprecated relying on this for some time). Not only does that mean that
man(1) sometimes needs to
consult the database, but it also means that the database is inherently more
complicated, since a page might list something in its NAME section that
conflicts with an actual file name in the file system, and now you need a
priority system to resolve ambiguities. There are some other possible
causes of ambiguity as well.
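To make “whatis references” a bit more concrete, here is a small Python sketch that splits a NAME line of the conventional form `name1, name2 \- description` and picks out the names that have no file of their own. The function names are hypothetical, and real NAME sections are messier than this (multiple lines, roff markup and so on), so treat it as an illustration rather than how man-db parses pages.

```python
import re

# Names, then whitespace, then "-" or the roff-escaped "\-", then the summary.
NAME_LINE = re.compile(r"^(.*?)\s+\\?-\s+(.*)$")


def parse_name_line(line):
    """Split 'a, b \\- summary' into (['a', 'b'], 'summary')."""
    match = NAME_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised NAME line: {line!r}")
    names = [n.strip() for n in match.group(1).split(",") if n.strip()]
    return names, match.group(2)


def whatis_only_names(names, names_on_disk):
    """Names listed in NAME that have no corresponding file of their own."""
    return [n for n in names if n not in names_on_disk]
```

For a page whose NAME line lists `ls, dir, vdir` but where only `ls.1.gz` exists on disk, `dir` and `vdir` would be the whatis references.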
The people working on reproducible builds in Debian branched out to the related challenge of reproducible installations some time ago: can you take a collection of packages, bootstrap a file system image from them, and reproduce that exact same image somewhere else? This is useful for the same sorts of reasons that reproducible builds are useful: it lets you verify that an image is built from the components it’s supposed to be built from, and doesn’t contain any other skulduggery by accident or design. One of the people working on this noticed that man-db’s database files were an obstacle to that: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn’t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs.
Having understood this, there then followed several late nights of very fiddly work on the details of how the database is maintained. None of this was conceptually difficult: it mainly amounted to ensuring that we maintain a consistent well-order for different entries that we might want to insert for a given database key, and that we consider the same names for insertion regardless of the order in which we encounter files. As usual, the tricky bit is making sure that we have the right data structures to support this. man-db is written in C, which is not very well supplied with built-in data structures, and originally much of the code was written in a style that tried to minimize memory allocations; this came at the cost of ownership and lifetime often being rather unclear, and it was often difficult to make changes without causing leaks or double-frees. Over the years I’ve been gradually introducing better encapsulation to make things easier to follow, and I had to do another round of that here. There were also some problems with caching being done at slightly the wrong layer: we need to make use of a “trace” of the chain of links followed to resolve a page to its ultimate source file, but we were incorrectly caching that trace and reusing it for any link to the same file, with incorrect results in many cases.
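The core idea of a consistent order can be sketched in a few lines of Python. This is a deliberately simplified model, not man-db’s actual data structures: the `Kind` ranking, the `Entry` fields, and the rule of keeping exactly one entry per key are all assumptions for the example. What matters is that the comparison is a total order over candidates, so the winner for a key does not depend on which candidate was seen first.

```python
from dataclasses import dataclass
from enum import IntEnum


class Kind(IntEnum):
    # Hypothetical ranking: lower values win.
    FILE = 0    # a real page file with this name
    LINK = 1    # a link to some other page
    WHATIS = 2  # a name that only appears in a NAME section


@dataclass(frozen=True, order=True)
class Entry:
    # Field order defines the comparison: kind, then section, then path.
    kind: Kind
    section: str
    path: str


def insert(db, key, entry):
    """Keep the smallest entry seen so far for this key."""
    current = db.get(key)
    if current is None or entry < current:
        db[key] = entry


def build_db(pairs):
    """Build a dict from (key, entry) pairs in whatever order they arrive."""
    db = {}
    for key, entry in pairs:
        insert(db, key, entry)
    return db
```

Because `insert` always keeps the minimum under a total order, feeding the same set of pairs to `build_db` in any order produces the same dictionary.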
Oh, and after doing all that I found that the on-disk representation of a GDBM database is insertion-order-dependent, so I ended up having to manually reorganize the database at the end by reading it all in and writing it all back out in sorted order, which feels really weird to me coming from spending most of my time with PostgreSQL these days. Fortunately the database is small so this takes negligible time.
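Here is a minimal sketch of that final “read it all in, write it all back out sorted” step. It uses Python’s `dbm.dumb` module as a stand-in for GDBM, purely because it is always available and also happens to have an insertion-order-dependent on-disk layout; man-db itself does this in C against its own database layer, and the helper names below are made up for the example.

```python
import dbm.dumb
import os
import tempfile


def build(path, items):
    """Create a fresh database at `path`, inserting items in the given order."""
    with dbm.dumb.open(path, "n") as db:
        for key, value in items:
            db[key] = value


def rewrite_sorted(src_path, dst_path):
    """Read every record from src, then write them to dst in sorted key order."""
    with dbm.dumb.open(src_path, "r") as src:
        records = sorted((key, src[key]) for key in src.keys())
    with dbm.dumb.open(dst_path, "n") as dst:
        for key, value in records:
            dst[key] = value


def on_disk(path):
    """Return the raw bytes of the data and index files that dbm.dumb writes."""
    contents = []
    for ext in (".dat", ".dir"):
        with open(path + ext, "rb") as f:
            contents.append(f.read())
    return tuple(contents)


def demo():
    """Build the same records in two orders and compare before/after sorting."""
    items = [(b"ls", b"list directory contents"),
             (b"cat", b"concatenate files"),
             (b"apropos", b"search manual page names")]
    with tempfile.TemporaryDirectory() as tmp:
        a, b = os.path.join(tmp, "a"), os.path.join(tmp, "b")
        build(a, items)
        build(b, list(reversed(items)))
        differ_before = on_disk(a) != on_disk(b)
        rewrite_sorted(a, a + "-sorted")
        rewrite_sorted(b, b + "-sorted")
        same_after = on_disk(a + "-sorted") == on_disk(b + "-sorted")
    return differ_before, same_after
```

Holding every record in memory at once is only reasonable because the database is small, which is the same reason the real rewrite takes negligible time.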
None of this is particularly glamorous work, but it paid off:
    # export SOURCE_DATE_EPOCH="$(date +%s)"
    # mkdir emptydir disorder
    # disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
    # export TMPDIR="$(pwd)/disorder"
    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
        unstable out1.tar
    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
        unstable out2.tar
    # cmp out1.tar out2.tar
    # echo $?
    0