From: Colin Watson
Date: Sun, 16 Oct 2022 15:55:05 +0000 (+0100)
Subject: Reproducible man-db databases
X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~cjwatson/git?a=commitdiff_plain;h=5fafd77753f96d2e42c76f4bc8a1fd3e31c326c5;p=blog.git

Reproducible man-db databases
---

diff --git a/content/reproducible-man-db-databases.md b/content/reproducible-man-db-databases.md
new file mode 100644
index 00000000..59ae1b59
--- /dev/null
+++ b/content/reproducible-man-db-databases.md
@@ -0,0 +1,102 @@
+Title: Reproducible man-db databases
+Slug: reproducible-man-db-databases
+Date: 2022-10-16 16:54:32 +01:00
+Category: man-db
+Tags: man-db, planet-debian, planet-ubuntu
+
+I've released man-db 2.11.0
+([announcement](https://lists.nongnu.org/archive/html/man-db-announce/2022-10/msg00000.html),
+[NEWS](https://gitlab.com/cjwatson/man-db/-/blob/2.11.0/NEWS.md)), and
+uploaded it to Debian unstable.
+
+The biggest chunk of work here was fixing some extremely long-standing
+issues with how the database is built. Despite being in the package name,
+man-db's database is much less important than it used to be: most uses of
+`man(1)` haven't required it in a long time, and both hardware and
+[software]({filename}/man-db-K.md)
+[improvements](https://lists.nongnu.org/archive/html/man-db-announce/2022-02/msg00000.html)
+mean that even some searches can be done by brute force without needing
+prior indexing. However, the database is still needed for the `whatis(1)`
+and `apropos(1)` commands.
+
+The database has a simple format - no relational structure here, it's just a
+simple key-value database using old-fashioned DBM-like interfaces and
+composing a few fields to form values - but there are a number of subtleties
+involved. The issues tend to amount to this: what does a manual page name
+mean? At first glance it might seem simple, because you have file names
+that look something like `/usr/share/man/man1/ls.1.gz` and that's obviously
+`ls(1)`. Some pages are symlinks to other pages (which we track separately
+because it makes it easier to figure out which entries to update when the
+contents of the file system change), and sometimes multiple pages are even
+hard links to the same file.
+
+The real complications come with "whatis references". Pages can list a
+bunch of names in their `NAME` section, and the historical expectation is
+that it should be possible to use those names as arguments to `man(1)` even
+if they don't also appear in the file system (although Debian policy has
+[deprecated relying on
+this](https://www.debian.org/doc/debian-policy/ch-docs.html#manual-pages)
+for some time). Not only does that mean that `man(1)` sometimes needs to
+consult the database, but it also means that the database is inherently more
+complicated, since a page might list something in its `NAME` section that
+conflicts with an actual file name in the file system, and now you need a
+priority system to resolve ambiguities. There are some other possible
+causes of ambiguity as well.
+
+The people working on reproducible builds in Debian branched out to the
+related challenge of reproducible installations some time ago: can you take
+a collection of packages, bootstrap a file system image from them, and
+reproduce that exact same image somewhere else? This is useful for the same
+sorts of reasons that reproducible builds are useful: it lets you verify
+that an image is built from the components it's supposed to be built from,
+and doesn't contain any other skulduggery by accident or design. One of the
+people working on this [noticed](https://bugs.debian.org/1010957) that
+man-db's database files were an obstacle to that: in particular, the exact
+contents of the database seemed to depend on the order in which files were
+scanned when building it. The reporter proposed solving this by processing
+files in sorted order, but I wasn't keen on that approach: firstly because
+it would mean we could no longer process files in an order that makes it
+more efficient to read them all from disk (still valuable on rotational
+disks), but mostly because the differences seemed to point to other bugs.
+
+Having understood this, there then followed several late nights of very
+fiddly work on the details of how the database is maintained. None of this
+was conceptually difficult: it mainly amounted to ensuring that we maintain
+a consistent [well-order](https://en.wikipedia.org/wiki/Well-order) for
+different entries that we might want to insert for a given database key, and
+that we consider the same names for insertion regardless of the order in
+which we encounter files. As usual, the tricky bit is making sure that we
+have the right data structures to support this. man-db is written in C
+which is not very well-supplied with built-in data structures, and
+originally much of the code was written in a style that tried to minimize
+memory allocations; this came at the cost of ownership and lifetime often
+being rather unclear, and it was often difficult to make changes without
+causing leaks or double-frees. Over the years I've been gradually
+introducing better encapsulation to make things easier to follow, and I had
+to do another round of that here. There were also some problems with
+caching being done at slightly the wrong layer: we need to make use of a
+"trace" of the chain of links followed to resolve a page to its ultimate
+source file, but we were incorrectly caching that trace and reusing it for
+any link to the same file, with incorrect results in many cases.
+
+Oh, and after doing all that I found that the on-disk representation of a
+GDBM database is insertion-order-dependent, so I ended up having to manually
+reorganize the database at the end by reading it all in and writing it all
+back out in sorted order, which feels really weird to me coming from
+spending most of my time with PostgreSQL these days. Fortunately the
+database is small so this takes negligible time.
+
+None of this is particularly glamorous work, but it paid off:
+
+    :::shell-session
+    # export SOURCE_DATE_EPOCH="$(date +%s)"
+    # mkdir emptydir disorder
+    # disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
+    # export TMPDIR="$(pwd)/disorder"
+    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
+        unstable out1.tar
+    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
+        unstable out2.tar
+    # cmp out1.tar out2.tar
+    # echo $?
+    0
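
To make the ordering idea from the post concrete, here is a minimal Python sketch of an order-independent build. It is not man-db's C code, and everything in it is an assumption made for illustration: the `Entry` type, the three entry kinds and their ranks in `RANK`, the tie-break on source path, and the made-up page paths. It shows two things the post describes in prose: picking the winner for each key with a total order, so the result does not depend on which candidate was seen first, and emitting records in sorted key order at the end.

```python
import random
from dataclasses import dataclass

# Hypothetical priority ranks (lower wins). Illustrative only; these are
# not man-db's real entry types.
RANK = {"file": 0, "symlink": 1, "whatis": 2}


@dataclass(frozen=True)
class Entry:
    name: str    # database key, e.g. a page name
    kind: str    # one of the keys of RANK
    source: str  # path of the page that produced this entry

    def sort_key(self):
        # Rank first, then source path as a tie-breaker, so any two
        # distinct candidates for the same name are strictly ordered.
        return (RANK[self.kind], self.source)


def build(entries):
    """Return (key, value) records in sorted key order."""
    best = {}
    for e in entries:
        cur = best.get(e.name)
        # Keep the least candidate under the total order, regardless of
        # the order in which candidates arrive.
        if cur is None or e.sort_key() < cur.sort_key():
            best[e.name] = e
    # Emit in sorted key order so the output does not reflect insertion order.
    return [(n, f"{best[n].kind}\t{best[n].source}") for n in sorted(best)]


# Made-up scan results: "foo" exists as a file and is also listed in the
# NAME section of bar.1; "baz" is only ever a whatis reference.
scanned = [
    Entry("foo", "file", "/man/man1/foo.1"),
    Entry("foo", "whatis", "/man/man1/bar.1"),
    Entry("bar", "file", "/man/man1/bar.1"),
    Entry("baz", "whatis", "/man/man1/foo.1"),
    Entry("baz", "whatis", "/man/man1/bar.1"),
]


def order_independent(entries, trials=50):
    """Check that shuffling the scan order never changes the result."""
    reference = build(entries)
    for seed in range(trials):
        shuffled = list(entries)
        random.Random(seed).shuffle(shuffled)
        if build(shuffled) != reference:
            return False
    return True
```

Calling `order_independent(scanned)` shuffles the scan order repeatedly and compares each result with the unshuffled build. If the comparison in `build` were replaced with "first one seen wins", the two `baz` candidates would make the result depend on scan order, which is the class of bug the post describes.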