Reproducible man-db databases

author Colin Watson <cjwatson@debian.org>

Sun, 16 Oct 2022 15:55:05 +0000 (16:55 +0100)

committer Colin Watson <cjwatson@debian.org>

Sun, 16 Oct 2022 15:55:05 +0000 (16:55 +0100)
diff --git a/content/reproducible-man-db-databases.md b/content/reproducible-man-db-databases.md
new file mode 100644
index 0000000..59ae1b5
--- /dev/null
+++ b/content/reproducible-man-db-databases.md
@@ -0,0 +1,102 @@
+Title: Reproducible man-db databases
+Slug: reproducible-man-db-databases
+Date: 2022-10-16 16:54:32 +01:00
+Category: man-db
+Tags: man-db, planet-debian, planet-ubuntu
+
+I've released man-db 2.11.0
+([announcement](https://lists.nongnu.org/archive/html/man-db-announce/2022-10/msg00000.html),
+[NEWS](https://gitlab.com/cjwatson/man-db/-/blob/2.11.0/NEWS.md)), and
+uploaded it to Debian unstable.
+
+The biggest chunk of work here was fixing some extremely long-standing
+issues with how the database is built.  Despite being in the package name,
+man-db's database is much less important than it used to be: most uses of
+`man(1)` haven't required it in a long time, and both hardware and
+[software]({filename}/man-db-K.md)
+[improvements](https://lists.nongnu.org/archive/html/man-db-announce/2022-02/msg00000.html)
+mean that even some searches can be done by brute force without needing
+prior indexing.  However, the database is still needed for the `whatis(1)`
+and `apropos(1)` commands.
+
+The database has a simple format - no relational structure here, it's just a
+simple key-value database using old-fashioned DBM-like interfaces and
+composing a few fields to form values - but there are a number of subtleties
+involved.  The issues tend to amount to this: what does a manual page name
+mean?  At first glance it might seem simple, because you have file names
+that look something like `/usr/share/man/man1/ls.1.gz` and that's obviously
+`ls(1)`.  Some pages are symlinks to other pages (which we track separately
+because it makes it easier to figure out which entries to update when the
+contents of the file system change), and sometimes multiple pages are even
+hard links to the same file.
+
+The real complications come with "whatis references".  Pages can list a
+bunch of names in their `NAME` section, and the historical expectation is
+that it should be possible to use those names as arguments to `man(1)` even
+if they don't also appear in the file system (although Debian policy has
+[deprecated relying on
+this](https://www.debian.org/doc/debian-policy/ch-docs.html#manual-pages)
+for some time).  Not only does that mean that `man(1)` sometimes needs to
+consult the database, but it also means that the database is inherently more
+complicated, since a page might list something in its `NAME` section that
+conflicts with an actual file name in the file system, and now you need a
+priority system to resolve ambiguities.  There are some other possible
+causes of ambiguity as well.
+
+The people working on reproducible builds in Debian branched out to the
+related challenge of reproducible installations some time ago: can you take
+a collection of packages, bootstrap a file system image from them, and
+reproduce that exact same image somewhere else?  This is useful for the same
+sorts of reasons that reproducible builds are useful: it lets you verify
+that an image is built from the components it's supposed to be built from,
+and doesn't contain any other skulduggery by accident or design.  One of the
+people working on this [noticed](https://bugs.debian.org/1010957) that
+man-db's database files were an obstacle to that: in particular, the exact
+contents of the database seemed to depend on the order in which files were
+scanned when building it.  The reporter proposed solving this by processing
+files in sorted order, but I wasn't keen on that approach: firstly because
+it would mean we could no longer process files in an order that makes it
+more efficient to read them all from disk (still valuable on rotational
+disks), but mostly because the differences seemed to point to other bugs.
+
+Having understood this, there then followed several late nights of very
+fiddly work on the details of how the database is maintained.  None of this
+was conceptually difficult: it mainly amounted to ensuring that we maintain
+a consistent [well-order](https://en.wikipedia.org/wiki/Well-order) for
+different entries that we might want to insert for a given database key, and
+that we consider the same names for insertion regardless of the order in
+which we encounter files.  As usual, the tricky bit is making sure that we
+have the right data structures to support this.  man-db is written in C,
+which is not very well-supplied with built-in data structures, and
+originally much of the code was written in a style that tried to minimize
+memory allocations; this came at the cost of ownership and lifetime often
+being rather unclear, and it was often difficult to make changes without
+causing leaks or double-frees.  Over the years I've been gradually
+introducing better encapsulation to make things easier to follow, and I had
+to do another round of that here.  There were also some problems with
+caching being done at slightly the wrong layer: we need to make use of a
+"trace" of the chain of links followed to resolve a page to its ultimate
+source file, but we were incorrectly caching that trace and reusing it for
+any link to the same file, with incorrect results in many cases.
+
+Oh, and after doing all that I found that the on-disk representation of a
+GDBM database is insertion-order-dependent, so I ended up having to manually
+reorganize the database at the end by reading it all in and writing it all
+back out in sorted order, which feels really weird to me coming from
+spending most of my time with PostgreSQL these days.  Fortunately the
+database is small so this takes negligible time.
+
+None of this is particularly glamorous work, but it paid off:
+
+    :::shell-session
+    # export SOURCE_DATE_EPOCH="$(date +%s)"
+    # mkdir emptydir disorder
+    # disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
+    # export TMPDIR="$(pwd)/disorder"
+    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
+          unstable out1.tar
+    # mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
+          unstable out2.tar
+    # cmp out1.tar out2.tar
+    # echo $?
+    0