Colin Watson's blog - planet-debian

Free software activity in April 2025

2025-05-04T16:38:13+01:00

About 90% of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

Request for OpenSSH debugging help

Following the OpenSSH work described below, I have an open report about the sshd server sometimes crashing when clients try to connect to it. I can’t reproduce this myself, and arm’s-length debugging is very difficult, but three different users have reported it. For the time being I can’t pass it upstream, as it’s entirely possible it’s due to a Debian patch.

Is there anyone reading this who can reproduce this bug and is capable of doing some independent debugging work, most likely involving bisecting changes to OpenSSH? I’d suggest first seeing whether a build of the unmodified upstream 10.0p2 release exhibits the same bug. If it does, then bisect between 9.9p2 and 10.0p2; if not, then bisect the list of Debian patches. This would be extremely helpful, since at the moment it’s a bit like trying to look for a needle in a haystack from the next field over by sending instructions to somebody with a magnifying glass.

OpenSSH

I upgraded the Debian packaging to OpenSSH 10.0p1 (now designated 10.0p2 by upstream due to a mistake in the release process, but they’re the same thing), fixing CVE-2025-32728. This also involved a diffoscope bug report due to the version number change.

I enabled the new --with-linux-memlock-onfault configure option to protect sshd against being swapped out, but this turned out to cause test failures on riscv64, so I disabled it again there. Debugging this took some time since I needed to do it under emulation, and in the process of setting up a testbed I added riscv64 support to vmdb2.

In coordination with the wtmpdb maintainer, I enabled the new Y2038-safe native wtmpdb support in OpenSSH, so wtmpdb last now reports the correct tty.

I fixed a couple of packaging bugs:

I reviewed and merged several packaging contributions from others:

ssh-agent: Improve systemd user service socket activation (Daniel Kahn Gillmor)
Switch from adduser to sysusers.d (Luca Boccassi)
Add sshd-keygen service (Luca Boccassi)

dput-ng

Since we added dput-ng integration to Debusine recently, I wanted to make sure that it was in good condition in trixie, so I fixed dput-ng: will FTBFS during trixie support period. Previously a similar bug had been fixed by just using different Ubuntu release names in tests; this time I made the tests independent of the current supported release data returned by distro_info, so this shouldn’t come up again.

We also ran into dput-ng: —override doesn’t override profile parameters, which needed somewhat more extensive changes since it turned out that that option had never worked. I fixed this after some discussion with Paul Tagliamonte to make sure I understood the background properly.

man-db

I released man-db 2.13.1. This just included various small fixes and a number of translation updates, but I wanted to get it into trixie in order to include a contribution to increase the MAX_NAME constant, since that was now causing problems for some pathological cases of manual pages in the wild that documented a very large number of terms.

debmirror

I fixed one security bug: debmirror prints credentials with —progress.

Python team

I upgraded these packages to new upstream versions:

celery
django-modeltranslation (maintained by Freexian)
django-phonenumber-field
djangorestframework
kombu
orderly-set
pox
pydantic-extra-types
python-cmarkgfm (fixing CVE-2022-39209, CVE-2023-22483, CVE-2023-22484, CVE-2023-22485, CVE-2023-22486, CVE-2023-24824, CVE-2023-26485, and CVE-2023-37463)
python-django-crispy-forms
python-django-extensions (fixing incompatibilities with Python 3.12: #1040091, #1040119)
python-django-pgtrigger
python-django-test-migrations
python-holidays
python-legacy-cgi
python-pydash
python-redis (4.3.4 to 5.2.1; needed some autopkgtest adjustments)
python-tblib
python-typing-extensions
trove-classifiers
xonsh

In bookworm-backports, I updated these packages:

python-django to 3:4.2.20-1 (issuing BSA-123)
python-django-pgtrigger to 4.13.3

I dropped a stale build-dependency from python-aiohttp-security that kept it out of testing (though unfortunately too late for the trixie freeze).

I fixed or helped to fix various other build/test failures:

I packaged python-typing-inspection, needed for a new upstream version of pydantic.

I documented the architecture field in debian/tests/autopkgtest-pkg-pybuild.conf files.

I fixed other odds and ends of bugs:

Science team

I fixed various build/test failures:

python-vispy (also fixing unhandled failures to build documentation can occur while I was there)
skimage

Free software activity in March 2025

2025-04-01T13:17:35+01:00

Most of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

OpenSSH

Changes in dropbear 2025.87 broke OpenSSH’s regression tests. I cherry-picked the fix.

I reviewed and merged patches from Luca Boccassi to send and accept the COLORTERM and NO_COLOR environment variables.

Python team

Following up on last month, I fixed some more uscan errors:

python-ewokscore
python-ewoksdask
python-ewoksdata
python-ewoksorange
python-ewoksutils
python-processview
python-rsyncmanager

I upgraded these packages to new upstream versions:

bitstruct
django-modeltranslation (maintained by Freexian)
django-yarnpkg
flit
isort
jinja2 (fixing CVE-2025-27516)
mkdocstrings-python-legacy
mysql-connector-python (fixing CVE-2025-21548)
psycopg3
pydantic-extra-types
pydantic-settings
pytest-httpx (fixing a build failure with httpx 0.28)
python-argcomplete
python-cymem
python-djvulibre
python-ecdsa
python-expandvars
python-holidays
python-json-log-formatter
python-keycloak (fixing a build failure with httpx 0.28)
python-limits
python-mastodon (in the course of which I found #1101140 in blurhash-python and proposed a small cleanup to slidge)
python-model-bakery
python-multidict
python-pip
python-rsyncmanager
python-service-identity
python-setproctitle
python-telethon
python-trio
python-typing-extensions
responses
setuptools-scm
trove-classifiers
zope.testrunner

In bookworm-backports, I updated python-django to 3:4.2.19-1.

Although Debian’s upgrade to python-click 8.2.0 was reverted for the time being, I fixed a number of related problems anyway since we’re going to have to deal with it eventually:

celery (contributed upstream)
magic-wormhole (closed in Debian without action, but contributed upstream)
python-flasgger (contributed upstream)
sqlfluff (closed in Debian without action, but contributed upstream)

dh-python dropped its dependency on python3-setuptools in 6.20250306, which was long overdue, but it had quite a bit of fallout; in most cases this was simply a question of adding build-dependencies on python3-setuptools, but in a few cases there was a missing build-dependency on python3-typing-extensions which had previously been pulled in as a dependency of python3-setuptools. I fixed these bugs resulting from this:

beangrow
beangulp
beanprice
beanquery
beautifulsoup4
django-choices-field
django-modeltranslation (maintained by Freexian)
flake8-class-newline
flake8-quotes
nodeenv
pygments-ansi-color
pytest-mypy-plugins
python-agilent
python-aiohttp-security
python-aiohttp-session
python-djvulibre
python-ewoksutils
python-jsonlines
python-mastodon
python-pydash
python-pytest-venv
python-redfish
python-ring-doorbell
python-sluurp
python-sqlite-migrate
python-trubar

We agreed to remove python-pytest-flake8. In support of this, I removed unnecessary build-dependencies from pytest-pylint, python-proton-core, python-pyzipper, python-tatsu, python-tatsu-lts, and python-tinycss, and filed #1101178 on eccodes-python and #1101179 on rpmlint.

There was a dnspython autopkgtest regression on s390x. I independently tracked that down to a pylsqpack bug and came up with a reduced test case before realizing that Pranav P had already been working on it; we then worked together on it and I uploaded their patch to Debian.

I fixed various other build/test failures:

aiomysql (closed with no action needed)
lazr.uri
m2crypto (thanks to Sebastian Andrzej Siewior)
mkdocstrings
mystic
poetry-plugin-export
poetry
pydantic-extra-types
pymilter
python-a2wsgi (contributed upstream)
python-anyio
python-cymem
python-django-timescaledb (only partially successful)
python-ewokscore
python-ewoksorange
python-httpx-sse
python-ml-collections
python-opt-einsum (contributed upstream)
python-passlib
python-pdoc
python-ppmd (with a follow-up)
python-processview
python-respx
python-rsyncmanager (contributed upstream: ccc5f66dc7, 51c15ca8d1)
python-sphobjinv
python-urllib3
pytrainer
tlv8-python

I enabled more tests in python-moto and contributed a supporting fix upstream.

I sponsored Maximilian Engelhardt to reintroduce zope.sqlalchemy.

I fixed various odds and ends of bugs:

I contributed a small documentation improvement to pybuild-autopkgtest(1).

Rust team

I upgraded rust-asn1 to 0.20.0.

Science team

I finally gave in and joined the Debian Science Team this month, since it often has a lot of overlap with the Python team, and Freexian maintains several packages under it.

I fixed a uscan error in hdf5-blosc (maintained by Freexian), and upgraded it to a new upstream version.

I fixed python-vispy: missing dependency on numpy abi.

Other bits and pieces

I fixed debconf should automatically be noninteractive if input is /dev/null.

I fixed a build failure with GCC 15 in yubihsm-shell (maintained by Freexian).

Prompted by a CI failure in debusine, I submitted a large batch of spelling fixes and some improved static analysis to incus (#1777, #1778) and distrobuilder.

After regaining access to the repository, I fixed telegnome: missing app icon in ‘About’ dialogue and made a new 0.3.7 release.

Free software activity in February 2025

2025-03-02T13:49:47+00:00

Most of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

OpenSSH

OpenSSH upstream released 9.9p2 with fixes for CVE-2025-26465 and CVE-2025-26466. I got a heads-up on this in advance from the Debian security team, and prepared updates for all of testing/unstable, bookworm (Debian 12), bullseye (Debian 11), buster (Debian 10, LTS), and stretch (Debian 9, ELTS). jessie (Debian 8) is also still in ELTS for a few more months, but wasn’t affected by either vulnerability.

Although I’m not particularly active in the Perl team, I fixed a libnet-ssleay-perl build failure because it was blocking openssl from migrating to testing, which in turn was blocking the above openssh fixes.

I also sent a minor sshd -T fix upstream, simplified a number of autopkgtests using the newish Restrictions: needs-sudo facility, and prepared for removing the obsolete slogin symlink.

PuTTY

I upgraded to the new upstream version 0.83.

GCC 15 build failures

I fixed build failures with GCC 15 in a few packages:

Python team

A lot of my Python team work is driven by its maintainer dashboard. Now that we’ve finished the transition to Python 3.13 as the default version, and inspired by a recent debian-devel thread started by Santiago, I thought it might be worth spending a bit of time on the “uscan error” section. uscan is typically scraping upstream web sites to figure out whether new versions are available, and so it’s easy for its configuration to become outdated or broken. Most of this work is pretty boring, but it can often reveal situations where we didn’t even realize that a Debian package was out of date. I fixed these packages:

cssutils (this in particular was very out of date due to a new and active upstream maintainer since 2021)
django-assets
django-celery-email
django-sass
django-yarnpkg
json-tricks
mercurial-extension-utils
pydbus
pydispatcher
pylint-celery
pyspread
pytest-pretty
python-apptools
python-django-libsass (contributed a packaging fix upstream in passing)
python-django-postgres-extra
python-django-waffle
python-ephemeral-port-reserve
python-ifaddr
python-log-symbols
python-msrest
python-msrestazure
python-netdisco
python-pathtools
python-user-agents
sinntp
wchartype

I upgraded these packages to new upstream versions:

cssutils (contributed a packaging tweak upstream)
django-iconify
django-sass
domdf-python-tools
extra-data (fixing a numpy 2.0 failure)
flufl.i18n
json-tricks
jsonpickle
mercurial-extension-utils
mod-wsgi
nbconvert
orderly-set
pydispatcher (contributed a Python 3.12 fix upstream)
pylint
pytest-rerunfailures
python-asyncssh
python-box (contributed a packaging fix upstream)
python-charset-normalizer
python-django-constance
python-django-guid
python-django-pgtrigger
python-django-waffle
python-djangorestframework-simplejwt
python-formencode
python-holidays (contributed a test fix upstream)
python-legacy-cgi
python-marshmallow-polyfield (fixing a test failure)
python-model-bakery
python-mrcz (fixing a numpy 2.0 failure)
python-netdisco
python-npe2
python-persistent
python-pkginfo (fixing a test failure)
python-proto-plus
python-requests-ntlm
python-roman
python-semantic-release
python-setproctitle
python-stdlib-list
python-trustme
python-typeguard (fixing a test failure)
python-tzlocal
pyzmq
setuptools-scm
sqlfluff
stravalib
tomopy
trove-classifiers
xhtml2pdf (fixing CVE-2024-25885)
xonsh
zodbpickle
zope.deprecation
zope.testrunner

In bookworm-backports, I updated python-django to 3:4.2.18-1 (issuing BSA-121) and added new backports of python-django-dynamic-fixture and python-django-pgtrigger, all of which are dependencies of debusine.

I went through all the build failures related to python-click 8.2.0 (which was confusingly tagged but not fully released upstream and posted an analysis.

I fixed or helped to fix various other build/test failures:

cython
dask
deepdish
hickle (contributed upstream)
mdp (contributed upstream)
mypy
pillow
pynput
python-fonticon-fontawesome6
python-persistent (contributed upstream)
python-srsly

I dropped support for the old setup.py ftest command from zope.testrunner upstream.

I fixed various odds and ends of bugs:

Installer team

Following up on last month, I merged and uploaded Helmut’s /usr-move fix.

Qalculate time hacks

2025-02-23T20:00:12+00:00

Anarcat recently wrote about Qalculate, and I think I’m a convert, even though I’ve only barely scratched the surface.

The thing I almost immediately started using it for is time calculations. When I started tracking my time, I quickly found that Timewarrior was good at keeping all the data I needed, but I often found myself extracting bits of it and reprocessing it in variously clumsy ways. For example, I often don’t finish a task in one sitting; maybe I take breaks, or I switch back and forth between a couple of different tasks. The raw output of timew summary is a bit clumsy for this, as it shows each chunk of time spent as a separate row:

$ timew summary 2025-02-18 Debian

Wk Date       Day Tags                            Start      End    Time   Total
W8 2025-02-18 Tue CVE-2025-26465, Debian,       9:41:44 10:24:17 0:42:33
                  next, openssh
                  Debian, FTBFS with GCC-15,   10:24:17 10:27:12 0:02:55
                  icoutils
                  Debian, FTBFS with GCC-15,   11:50:05 11:57:25 0:07:20
                  kali
                  Debian, Upgrade to 0.67,     11:58:21 12:12:41 0:14:20
                  python_holidays
                  Debian, FTBFS with GCC-15,   12:14:15 12:33:19 0:19:04
                  vigor
                  Debian, FTBFS with GCC-15,   12:39:02 12:39:38 0:00:36
                  python_setproctitle
                  Debian, Upgrade to 1.3.4,    12:39:39 12:46:05 0:06:26
                  python_setproctitle
                  Debian, FTBFS with GCC-15,   12:48:28 12:49:42 0:01:14
                  python_setproctitle
                  Debian, Upgrade to 3.4.1,    12:52:07 13:02:27 0:10:20 1:44:48
                  python_charset_normalizer

                                                                         1:44:48

So I wrote this Python program to help me:

#! /usr/bin/python3

"""
Summarize timewarrior data, grouped and sorted by time spent.
"""

import json
import subprocess
from argparse import ArgumentParser, RawDescriptionHelpFormatter
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from operator import itemgetter

from rich import box, print
from rich.table import Table


parser = ArgumentParser(
    description=__doc__, formatter_class=RawDescriptionHelpFormatter
)
parser.add_argument("-t", "--only-total", default=False, action="store_true")
parser.add_argument(
    "range",
    nargs="?",
    default=":today",
    help="Time range (usually a hint, e.g. :lastweek)",
)
parser.add_argument("tag", nargs="*", help="Tags to filter by")
args = parser.parse_args()

entries: defaultdict[str, timedelta] = defaultdict(timedelta)
now = datetime.now(timezone.utc)
for entry in json.loads(
    subprocess.run(
        ["timew", "export", args.range, *args.tag],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
):
    start = datetime.fromisoformat(entry["start"])
    if "end" in entry:
        end = datetime.fromisoformat(entry["end"])
    else:
        end = now
    entries[", ".join(entry["tags"])] += end - start

if not args.only_total:
    table = Table(box=box.SIMPLE, highlight=True)
    table.add_column("Tags")
    table.add_column("Time", justify="right")
    for tags, time in sorted(entries.items(), key=itemgetter(1), reverse=True):
        table.add_row(tags, str(time))
    print(table)

total = sum(entries.values(), start=timedelta())
hours, rest = divmod(total, timedelta(hours=1))
minutes, rest = divmod(rest, timedelta(minutes=1))
seconds = rest.seconds
print(f"Total time: {hours:02}:{minutes:02}:{seconds:02}")

$ summarize-time 2025-02-18 Debian

  Tags                                                     Time
 ───────────────────────────────────────────────────────────────
  CVE-2025-26465, Debian, next, openssh                 0:42:33
  Debian, FTBFS with GCC-15, vigor                      0:19:04
  Debian, Upgrade to 0.67, python_holidays              0:14:20
  Debian, Upgrade to 3.4.1, python_charset_normalizer   0:10:20
  Debian, FTBFS with GCC-15, kali                       0:07:20
  Debian, Upgrade to 1.3.4, python_setproctitle         0:06:26
  Debian, FTBFS with GCC-15, icoutils                   0:02:55
  Debian, FTBFS with GCC-15, python_setproctitle        0:01:50

Total time: 01:44:48

Much nicer. But that only helps with some of my reporting. At the end of a month, I have to work out how much time to bill Freexian for and fill out a timesheet, and for various reasons those queries don’t correspond to single timew tags: they sometimes correspond to the sum of all time spent on multiple tags, or to the time spent on one tag minus the time spent on another tag, or similar. As a result I quite often have to do basic arithmetic on time intervals; but that’s surprisingly annoying! I didn’t previously have good tools for that, and was reduced to doing things like str(timedelta(hours=..., minutes=..., seconds=...) + ...) in Python, which gets old fast.

Instead:

$ qalc '62:46:30 - 51:02:42 to time'
(225990 / 3600) − (183762 / 3600) = 11:43:48

I also often want to work out how much of my time I’ve spent on Debian work this month so far, since Freexian pays me for up to 20% of my work time on Debian; if I’m under that then I might want to prioritize more Debian projects, and if I’m over then I should be prioritizing more Freexian projects as otherwise I’m not going to get paid for that time.

$ summarize-time -t :month Freexian
Total time: 69:19:42
$ summarize-time -t :month Debian
Total time: 24:05:30
$ qalc '24:05:30 / (24:05:30 + 69:19:42) to %'
(86730 / 3600) / ((86730 / 3600) + (249582 / 3600)) ≈ 25.78855349%

I love it.

Free software activity in January 2025

2025-02-02T19:48:40+00:00

Most of my Debian contributions this month were sponsored by Freexian. If you appreciate this sort of work and are at a company that uses Debian, have a look to see whether you can pay for any of Freexian‘s services; as well as the direct benefits, that revenue stream helps to keep Debian development sustainable for me and several other lovely people.

You can also support my work directly via Liberapay.

Python team

We finally made Python 3.13 the default version in testing! I fixed various bugs that got in the way of this:

As with last month, I fixed a few more build regressions due to the removal of a deprecated intersphinx_mapping syntax in Sphinx 8.0:

I ported a few packages to Django 5.1:

django-sitetree (thanks to an Ubuntu patch)
python-django-parler (thanks to an Ubuntu patch

I ported python-pypump to IPython 8.0.

I fixed python-datamodel-code-generator to handle isort 6, and contributed that upstream.

I fixed some packages to tolerate future versions of dh-python that will drop their dependency on python3-setuptools:

I removed the old python-celery-common transitional package from celery, since nothing in Debian needs it any more.

I fixed or helped to fix various other build/test failures:

awesomeversion
django-graphiql-debug-toolbar
django-iconify
flask-sqlalchemy (contributed upstream)
freezegun
napari (somewhat inconclusive, but I contributed fixes upstream to napari and npe2)
pyglet
python-construct-classes
python-django-guid
python-pyfakefs
python-ring-doorbell
recommonmark
spectral-cube
sphinx (helped with an upstream fix)
sphinxcontrib-openapi (contributed upstream)
sqlfluff
terminator (contributed upstream)
trac-wikiprint
vcr.py (investigated and reported findings upstream, confirming that disabling these tests in the Debian packaging seems reasonable for now)

I upgraded these packages to new upstream versions:

buildbot
cloudpickle
dask (fixing a Python 3.13 failure and working around a build failure due to sphinx-book-theme
distributed (fixing a Python 3.13 failure; also contributed a fix upstream)
importlib-resources (fixing a test failure on s390x and a test failure on all architectures)
isort
nbconvert
psycopg3
pydantic
pydantic-settings
pydoctor
pypandoc
python-argcomplete
python-cai
python-colormap (fixing a build failure with poetry-core 2.0, for which I contributed a fix upstream)
python-django-guid
python-easydev (fixing a build failure with poetry-core 2.0)
python-holidays
python-launchpadlib
python-limits
python-model-bakery
python-openapi-schema-validator
python-pathvalidate
python-pyftpdlib
python-quart-trio
python-urllib3 (contributed a test fix upstream)
python-telethon
python-webob (fixing CVE-2024-42353)
responses
restrictedpython (fixing CVE-2024-47532 and CVE-2025-22153)
sqlfluff
vcr.py (fixing a build failure with python-urllib3 2.3.0
xonsh (fixing a Python 3.13 failure)

Rust team

I fixed rust-pyo3-ffi to avoid explicit Python version dependencies that were getting in the way of making Python 3.13 the default version.

Security tools packaging team

I uploaded libevt to fix a build failure on i386 and to tolerate future versions of dh-python that will drop their dependency on python3-setuptools.

Installer team

I helped with some testing of a debian-installer-utils patch as part of the /usr move. I need to get around to uploading this, since it looks OK now.

Other small things

Helmut Grohne reached out for help debugging a multi-arch coinstallability problem (you know it’s going to be complicated when even Helmut can’t figure it out on his own …) in binutils, and we had a call about that.

I reviewed and applied a new Romanian translation of debconf’s manual pages.

I did my twice-yearly refresh of debmirror’s mirror_size documentation, and applied a contribution to improve the example debmirror.conf.

I fixed an arguable preprocessor string handling bug in man-db, and applied a fix for out-of-tree builds.

Free software activity in December 2024

2025-01-02T00:16:14+00:00

Most of my Debian contributions this month were sponsored by Freexian, as well as one direct donation via Liberapay (thanks!).

OpenSSH

I issued a bookworm update with a number of fixes that had accumulated over the last year, especially fixing GSS-API key exchange which was quite broken in bookworm.

base-passwd

A few months ago, the adduser maintainer started a discussion with me (as the base-passwd maintainer) and the shadow maintainer about bringing all three source packages under one team, since they often need to cooperate on things like user and group names. I agreed, but hadn’t got round to doing anything about it until recently. I’ve now officially moved it under team maintenance.

debconf

Gioele Barabucci has been working on eliminating duplicated code between debconf and cdebconf, ultimately with the goal of migrating to cdebconf (which I’m not sure I’m convinced of as a goal, but if we can make improvements to both packages as part of working towards it then there’s no harm in that). I finally got round to reviewing and merging confmodule changes in each of debconf and cdebconf. This caused an installer regression due to a weirdness in cdebconf-udeb’s packaging, which I fixed - sorry about that!

I’ve also been dealing with a few patch submissions that had been in my queue for a long time, but more on that next month if all goes well.

CI issues

I noticed and fixed a problem with Restrictions: needs-sudo in autopkgtest.

I fixed broken aptly images in the Salsa CI pipeline.

Python team

Last month, I mentioned some progress on sorting out the multipart vs. python-multipart name conflict in Debian (#1085728), and said that I thought we’d be able to finish it soon. I was right! We got it all done this month:

The Python 3.13 transition continues, and last month we were able to add it to the supported Python versions in testing. (The next step will be to make it the default.) I fixed lots of problems in aid of this, including:

audioread
celery
cloud-sptheme
djangorestframework
dominate
fenics-basix (investigated and suggested a fix, although it hasn’t yet been uploaded)
ipykernel
ipython
jupyter-server (contributed upstream)
mdp (contributed upstream)
pastescript (contributed upstream)
pypandoc (contributed upstream)
python-aiosmtpd
python-cheroot
python-hjson
python-miio
python-pyramid
python-trustme (uploaded by Robie Basak)
rich (investigated and tested; uploaded by Sandro Tosi)
supervisor
tomopy

Sphinx 8.0 removed some old intersphinx_mapping syntax which turned out to still be in use by many packages in Debian. The fixes for this were individually trivial, but there were a lot of them:

I found that twisted 24.11.0 broke tests in buildbot and wokkel, and fixed those.

I packaged python-flatdict, needed for a new upstream version of python-semantic-release.

I tracked down a test failure in vdirsyncer (which I’ve been using for some years, but had never previously needed to modify) and contributed a fix upstream.

I fixed some packages to tolerate future versions of dh-python that will drop their dependency on python3-setuptools:

I fixed django-cte to remove a build-dependency on the obsolete python3-nose package.

I added Django 5.1 support to django-polymorphic. (There are a number of other packages that still need work here.)

I fixed various other build/test failures:

black
datalad-next: #1080969 and #1088038 (contributed upstream)
dipy (uploaded by Andreas Tille)
pyfribidi
pylibmc, fixing a build failure in cachelib
pympress
pytest-forked (contributed upstream)
python-mongomock: c03f91b51048 and c485fb8fef23
python-nox

I upgraded these packages to new upstream versions:

aioftp
alot
astroid
buildbot
cloudpickle (fixing a Python 3.13 failure)
django-countries
django-sass-processor
djoser (fixing CVE-2024-21543)
ipython
jsonpickle
lazr.delegates
loguru (fixing a Python 3.13 failure)
netmiko
pydantic
pydantic-core
pydantic-settings
pydoctor
pygresql
pylint (fixing Python 3.13 failures #1089758 and #1091029)
pypandoc (fixing a Python 3.12 warning)
python-aiohttp (fixing CVE-2024-52303 and CVE-2024-52304
python-aiohttp-security
python-argcomplete
python-asyncssh
python-click
python-cytoolz
python-jira (fixing a Python 3.13 failure)
python-limits
python-line-profiler
python-mkdocs
python-model-bakery
python-pgspecial
python-pyramid (fixing CVE-2023-40587)
python-pythonjsonlogger
python-semantic-release
python-utils
python-venusian
pyupgrade
pyzmq
quart
six
sqlparse
twisted
vcr.py
vulture
yoyo
zope.configuration
zope.testrunner

I updated the team’s library style guide to remove material related to Python 2 and early versions of Python 3, which is no longer relevant to any current Python packaging work.

Other Python upstream work

I happened to notice a Twisted upstream issue requesting the removal of the deprecated twisted.internet.defer.returnValue, realized it was still used in many places in Debian, and went on a PR-filing spree informed by codesearch to try to reduce the future impact of such a change on Debian:

Other small fixes

Santiago Vila has been building the archive with make --shuffle (also see its author’s explanation). I fixed associated bugs in cccc (contributed upstream), groff, and spectemu.

I backported an upstream patch to putty to fix undefined behaviour that affected use of the “small keypad”.

I removed groff’s Recommends: libpaper1 (#1091375, #1091376), since it isn’t currently all that useful and was getting in the way of a transition to libpaper2. I filed an upstream bug suggesting better integration in this area.

Free software activity in November 2024

2024-12-01T15:00:22+00:00

Most of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

Conferences

I attended MiniDebConf Toulouse 2024, and the MiniDebCamp before it. Most of my time was spent with the Freexian folks working on debusine; Stefano gave a talk about its current status with a live demo (frantically fixed up over the previous couple of days, as is traditional) and with me and others helping to answer questions at the end. I also caught up with some people I haven’t seen in ages, ate a variety of delicious cheeses, and generally had a good time. Many thanks to the organizers and sponsors!

After the conference, Freexian collaborators spent a day and a half doing some planning for next year, and then went for an afternoon visiting the Cité de l’espace.

Rust team

I upgraded these packages to new upstream versions, as part of upgrading pydantic and rpds-py:

rust-archery
rust-jiter (noticing an upstream test bug in the process)
rust-pyo3 (fixing CVE-2024-9979)
rust-pyo3-build-config
rust-pyo3-ffi
rust-pyo3-macros
rust-pyo3-macros-backend
rust-regex
rust-regex-automata
rust-regex
rust-serde
rust-serde-derive
rust-serde-json
rust-speedate
rust-triomphe

Python team

Last month, I mentioned that we still need to work out what to do about the multipart vs. python-multipart name conflict in Debian (#1085728). We eventually managed to come up with an agreed plan; Sandro has uploaded a renamed binary package to experimental, and I’ve begun work on converting reverse-dependencies (asgi-csrf, fastapi, python-curies, and starlette done so far). There’s a bit more still to do, but I expect we can finish it soon.

I fixed problems related to adding Python 3.13 support in:

coreapi
git-repo-updater
offlineimap3
ptyprocess
pytest-testinfra
python-formencode
python-iniparse
python-line-profiler
python-parameterized
python-sdjson (contributed upstream)
python-testfixtures (contributed upstream)
python-venusian
sphinx-a4doc (contributed upstream, and I also made a small antlr4 improvement)
webpy

I fixed some packaging problems that resulted in failures any time we add a new Python version to Debian:

I fixed other build/autopkgtest failures in:

I packaged python-quart-trio, needed for a new upstream version of python-urllib3, and contributed a small packaging tweak upstream.

I backported a twisted fix that caused problems in other packages, including breaking debusine‘s tests.

I disentangled some upstream version confusion in python-catalogue, and upgraded to the current upstream version.

I upgraded these packages to new upstream versions:

aioftp (fixing a Python 3.13 failure)
ansible-core
ansible
debugpy
jsonpickle
manuel
psycopg2
pydantic-core
pydantic
pydantic-settings
pymssql (fixing a Python 3.13 failure)
pyodbc (fixing a Python 3.13 failure)
python-argh (fixing a Python 3.13 failure)
python-boltons (fixing a Python 3.13 failure)
python-channels-redis
python-colorlog (fixing a Python 3.13 failure)
python-django-pgtrigger
python-line-profiler
python-pathvalidate (fixing a Python 3.13 failure)
python-plac (fixing a Python 3.13 failure)
python-precis-i18n
python-pure-eval (fixing a Python 3.13 failure)
python-pythonjsonlogger (contributing a small packaging fix upstream, as well as a test fix to jupyter-events)
python-rdata (fixing a Python 3.13 failure)
python-semantic-release
python-telethon (fixing a Python 3.13 failure, and contributing some test fixes upstream)
python-tornado (fixing CVE-2024-52804)
python-trio (fixing a Python 3.13 failure)
python-trustme
python-typeguard
python-urllib3 (fixing CVE-2024-37891 and a Python 3.13 failure, and requiring some shenanigans with its hypercorn test-dependency)
python-zipp
quart (fixing CVE-2024-49767)
rpds-py (fixing a build failure)
sen
sqlparse
stravalib
transaction
waitress
zope.interface

Other small fixes

I contributed Incus support to needrestart upstream.

In response to Helmut’s Cross building talk at MiniDebConf Toulouse, I fixed libfilter-perl to support cross-building (5b4c2e10, f9788c27).

I applied a patch to move aliased files from / to /usr in iprutils (#1087733).

I adjusted debconf to use the new /usr/lib/apt/apt-extracttemplates path (#1087523).

I upgraded putty to 0.82.

Free software activity in October 2024

2024-11-01T12:19:04+00:00

Almost all of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

Ansible

I noticed that Ansible had fallen out of Debian testing due to autopkgtest failures. This seemed like a problem worth fixing: in common with many other people, we use Ansible for configuration management at Freexian, and it probably wouldn’t make our sysadmins too happy if they upgraded to trixie after its release and found that Ansible was gone.

The problems here were really just slogging through test failures in both the ansible-core and ansible packages, but their test suites are large and take a while to run so this took some time. I was able to contribute a few small fixes to various upstreams in the process:

test: Make git archive prefix fit in 32-bit ssize_t (though upstream went for a different approach)
test: replace more deprecated assertEquals
Fix import error on Python 3.13
Make test_start_daemon_with_no_mock less flaky

This should now get back into testing tomorrow.

OpenSSH

Martin-Éric Racine reported that ssh-audit didn’t list the ext-info-s feature as being available in Debian’s OpenSSH 9.2 packaging in bookworm, contrary to what OpenSSH upstream said on their specifications page at the time. I spent some time looking into this and realized that upstream was mistakenly saying that implementations of ext-info-c and ext-info-s were added at the same time, while in fact ext-info-s was added rather later. ssh-audit now has clearer output, and the OpenSSH maintainers have corrected their specifications page.

I looked into a report of an ssh failure in certain cases when using GSS-API key exchange (which is a Debian patch). Once again, having integration tests was a huge win here: the affected scenario is quite a fiddly one, but I was able to set it up in the test, and thereby make sure it doesn’t regress in future. It still took me a couple of hours to get all the details right, but in the past this sort of thing took me much longer with a much lower degree of confidence that the fix was correct.

On upstream’s advice, I cherry-picked some key exchange fixes needed for big-endian architectures.

Python team

I packaged python-evalidate, needed for a new upstream version of buildbot.

The Python 3.13 transition rolls on. I fixed problems related to it in htmlmin, humanfriendly, postgresfixture (contributed upstream), pylint, python-asyncssh (contributed upstream), python-oauthlib, python3-simpletal, quodlibet, zope.exceptions, and zope.interface.

A trickier Python 3.13 issue involved the cgi module. Years ago I ported zope.publisher to the multipart module because cgi.FieldStorage was broken in some situations, and as a result I got a recommendation into Python’s “dead batteries” PEP 594. Unfortunately there turns out to be a name conflict between multipart and python-multipart on PyPI; python-multipart upstream has been working to disentangle this, though we still need to work out what to do in Debian. All the same, I needed to fix python-wadllib and multipart seemed like the best fit; I contributed a port upstream and temporarily copied multipart into Debian’s python-wadllib source package to allow its tests to pass. I’ll come back and fix this properly once we sort out the multipart vs. python-multipart packaging.

tzdata moved some timezone definitions to tzdata-legacy, which has broken a number of packages. I added tzdata-legacy build-dependencies to alembic and python-icalendar to deal with this in those packages, though there are still some other instances of this left.

I tracked down an nltk regression that caused build failures in many other packages.

I fixed Rust crate versioning issues in pydantic-core, python-bcrypt, and python-maturin (mostly fixed by Peter Michael Green and Jelmer Vernooĳ, but it needed a little extra work).

I fixed other build failures in entrypoints, mayavi2, python-pyvmomi (mostly fixed by Alexandre Detiste, but it needed a little extra work), and python-testing.postgresql (ditto).

I fixed python3-simpletal to tolerate future versions of dh-python that will drop their dependency on python3-setuptools.

I fixed broken symlinks in python-treq.

I removed (build-)depends on python3-pkg-resources from alembic, autopep8, buildbot, celery, flufl.enum, flufl.lock, python-public, python-wadllib (contributed upstream), pyvisa, routes, vulture, and zodbpickle (contributed upstream).

I upgraded astroid, asyncpg (fixing a Python 3.13 failure and a build failure), buildbot (noticing an upstream test bug in the process), dnsdiag, frozenlist, netmiko (fixing a Python 3.13 failure), psycopg3, pydantic-settings, pylint, python-asyncssh, python-bleach, python-btrees, python-cytoolz, python-django-pgtrigger, python-django-test-migrations, python-gssapi, python-icalendar, python-json-log-formatter, python-pgbouncer, python-pkginfo, python-plumbum, python-stdlib-list, python-tokenize-rt, python-treq (fixing a Python 3.13 failure), python-typeguard, python-webargs (fixing a build failure), pyupgrade, pyvisa, pyvisa-py (fixing a Python 3.13 failure), toolz, twisted, vulture, waitress (fixing CVE-2024-49768 and CVE-2024-49769), wtf-peewee, wtforms, zodbpickle, zope.exceptions, zope.interface, zope.proxy, zope.security, and zope.testrunner to new upstream versions.

I tried to fix a regression in python-scruffy, but I need testing feedback.

I requested removal of python-testing.mysqld.

Free software activity in September 2024

2024-10-01T14:19:21+01:00

Almost all of my Debian contributions this month were sponsored by Freexian.

You can also support my work directly via Liberapay.

Pydantic

My main Debian project for the month turned out to be getting Pydantic back into a good state in Debian testing. I’ve used Pydantic quite a bit in various projects, most recently in Debusine, so I have an interest in making sure it works well in Debian. However, it had been stalled on 1.10.17 for quite a while due to the complexities of getting 2.x packaged. This was partly making sure everything else could cope with the transition, but in practice mostly sorting out packaging of its new Rust dependencies. Several other people (notably Alexandre Detiste, Andreas Tille, Drew Parsons, and Timo Röhling) had made some good progress on this, but nobody had quite got it over the line and it seemed a bit stuck.

Learning Rust is on my to-do list, but merely not knowing a language hasn’t stopped me before. So I learned how the Debian Rust team’s packaging works, upgraded a few packages to new upstream versions (including rust-half and upstream rust-idna test fixes), and packaged rust-jiter. After a lot of waiting around for various things and chasing some failures in other packages I was eventually able to get current versions of both pydantic-core and pydantic into testing.

I’m looking forward to being able to drop our clunky v1 compatibility code once debusine can rely on running on trixie!

OpenSSH

I upgraded the Debian packaging to OpenSSH 9.9p1.

YubiHSM

I upgraded python-yubihsm, yubihsm-connector, and yubihsm-shell to new upstream versions.

I noticed that I could enable some tests in python-yubihsm and yubihsm-shell; I’d previously thought the whole test suite required a real YubiHSM device, but when I looked closer it turned out that this was only true for some tests.

I fixed yubihsm-shell build failures on some 32-bit architectures (upstream PRs #431, #432), and also made it build reproducibly.

Thanks to Helmut Grohne, I fixed yubihsm-connector to apply udev rules to existing devices when the package is installed.

As usual, bookworm-backports is up to date with all these changes.

Python team

setuptools 72.0.0 removed the venerable setup.py test command. This caused some fallout in Debian, some of which was quite non-obvious as packaging helpers sometimes fell back to different ways of running test suites that didn’t quite work. I fixed django-guardian, manuel, python-autopage, python-flask-seeder, python-pgpdump, python-potr, python-precis-i18n, python-stopit, serpent, straight.plugin, supervisor, and zope.i18nmessageid.

As usual for new language versions, the addition of Python 3.13 caused some problems. I fixed psycopg2, python-time-machine, and python-traits.

I fixed build/autopkgtest failures in keymapper, python-django-test-migrations, python-rosettasciio, routes, transmissionrpc, and twisted.

buildbot was in a bit of a mess due to being incompatible with SQLAlchemy 2.0. Fortunately by the time I got to it upstream had committed a workable set of patches, and the main difficulty was figuring out what to cherry-pick since they haven’t made a new upstream release with all of that yet. I figured this out and got us up to 4.0.3.

Adrian Bunk asked whether python-zipp should be removed from trixie. I spent some time investigating this and concluded that the answer was no, but looking into it was an interesting exercise anyway.

On the other hand, I looked into flask-appbuilder, concluded that it should be removed, and filed a removal request.

I upgraded some embedded CSS files in nbconvert.

I upgraded importlib-resources, ipywidgets, jsonpickle, pydantic-settings, pylint (fixing a test failure), python-aiohttp-session, python-apptools, python-asyncssh, python-django-celery-beat, python-django-rules, python-limits, python-multidict, python-persistent, python-pkginfo, python-rt, python-spur, python-zipp, stravalib, transmissionrpc, vulture, zodbpickle, zope.exceptions (adopting it), zope.i18nmessageid, zope.proxy, and zope.security to new upstream versions.

debmirror

The experimental and *-proposed-updates suites used to not have Contents-* files, and a long time ago debmirror was changed to just skip those files in those suites. They were added to the Debian archive some time ago, but debmirror carried on skipping them anyway. Once I realized what was going on, I removed these unnecessary special cases (#819925, #1080168).

Free software activity in August 2024

2024-09-01T14:29:32+01:00

All but about four hours of my Debian contributions this month were sponsored by Freexian. (I ended up going a bit over my 20% billing limit this month.)

You can also support my work directly via Liberapay.

man-db and friends

I released libpipeline 1.5.8 and man-db 2.13.0.

Since autopkgtests are great for making sure we spot regressions caused by changes in dependencies, I added one to man-db that runs the upstream tests against the installed package. This required some preparatory work upstream, but otherwise was surprisingly easy to do.

OpenSSH

I fixed the various 9.8 regressions I mentioned last month: socket activation, libssh2, and Twisted. There were a few other regressions reported too: TCP wrappers support, openssh-server-udeb, and xinetd were all broken by changes related to the listener/per-session binary split, and I fixed all of those.

Once all that had made it through to testing, I finally uploaded the first stage of my plan to split out GSS-API support: there are now openssh-client-gssapi and openssh-server-gssapi packages in unstable, and if you use either GSS-API authentication or key exchange then you should install the corresponding package in order for upgrades to trixie+1 to work correctly. I’ll write a release note once this has reached testing.

Multiple identical results from getaddrinfo

I expect this is really a bug in a chroot creation script somewhere, but I haven’t been able to track down what’s causing it yet. My sbuild chroots, and apparently Lucas Nussbaum’s as well, have an /etc/hosts that looks like this:

$ cat /var/lib/schroot/chroots/sid-amd64/etc/hosts
127.0.0.1       localhost
127.0.1.1       [...]
127.0.0.1       localhost ip6-localhost ip6-loopback

The last line clearly ought to be ::1 rather than 127.0.0.1; but things mostly work anyway, since most code doesn’t really care which protocol it uses to talk to localhost. However, a few things try to set up test listeners by calling getaddrinfo("localhost", ...) and binding a socket for each result. This goes wrong if there are duplicates in the resulting list, and the test output is typically very confusing: it looks just like what you’d see if a test isn’t tearing down its resources correctly, which is a much more common thing for a test suite to get wrong, so it took me a while to spot the problem.

I ran into this in both python-asyncssh (#1052788, upstream PR) and Ruby (ruby3.1/#1069399, ruby3.2/#1064685, ruby3.3/#1077462, upstream PR). The latter took a while since Ruby isn’t one of my languages, but hey, I’ve tackled much harder side quests. I NMUed ruby3.1 for this since it was showing up as a blocker for openssl testing migration, but haven’t done the other active versions (yet, anyway).

OpenSSL vs. cryptography

I tend to care about openssl migrating to testing promptly, since openssh uploads have a habit of getting stuck on it otherwise.

Debian’s OpenSSL packaging recently split out some legacy code (cryptography that’s no longer considered a good idea to use, but that’s sometimes needed for compatibility) to an openssl-legacy-provider package, and added a Recommends on it. Most users install Recommends, but package build processes don’t; and the Python cryptography package requires this code unless you set the CRYPTOGRAPHY_OPENSSL_NO_LEGACY=1 environment variable, which caused a bunch of packages that build-depend on it to fail to build.

After playing whack-a-mole setting that environment variable in a few packages’ build process, I decided I didn’t want to be caught in the middle here and filed an upstream issue to see if I could get Debian’s OpenSSL team and cryptography’s upstream talking to each other directly. There was some moderately spirited discussion and the issue remains open, but for the time being the OpenSSL team has effectively reverted the change so it’s no longer a pressing problem.

GCC 14 regressions

Continuing from last month, I fixed build failures in pccts (NMU) and trn4.

Python team

I upgraded alembic, automat, gunicorn, incremental, referencing, pympler (fixing compatibility with Python >= 3.10), python-aiohttp, python-asyncssh (fixing CVE-2023-46445, CVE-2023-46446, and CVE-2023-48795), python-avro, python-multidict (fixing a build failure with GCC 14), python-tokenize-rt, python-zipp, pyupgrade, twisted (fixing CVE-2024-41671 and CVE-2024-41810), zope.exceptions, zope.interface, zope.proxy, zope.security, and zope.testrunner to new upstream versions. In the process, I added myself to Uploaders for zope.interface; I’m reasonably comfortable with the Zope Toolkit and I seem to be gradually picking up much of its maintenance in Debian.

A few of these required their own bits of yak-shaving:

python-aiohttp 3.10.0 needed fixes in blinkpy (#1077981, upstream PR) and python-yalexs (#1077985, upstream PR).
twisted 24.7.0 needed fixes in pytest-twisted (cherry-picked existing upstream commit), python-daphne (cherry-picked existing upstream PR), and python-tornado (#1078411, upstream PR).

I improved some Multi-Arch: foreign tagging (python-importlib-metadata, python-typing-extensions, python-zipp).

I fixed build failures in pipenv, python-stdlib-list, psycopg3, and sen, and fixed autopkgtest failures in autoimport (upstream PR), python-semantic-release and rstcheck.

Upstream for zope.file (not in Debian) filed an issue about a test failure with Python 3.12, which I tracked down to a Python 3.12 compatibility PR in zope.security.

I made python-nacl build reproducibly (upstream PR).

I moved aliased files from / to /usr in timekpr-next (#1073722).

Installer team

I applied a patch from Ubuntu to make os-prober support building with the noudeb profile (#983325).

Free software activity in July 2024

2024-08-02T13:27:07+01:00

My Debian contributions this month were all sponsored by Freexian.

You can also support my work directly via Liberapay.

OpenSSH

At the start of the month, I uploaded a quick fix (via Salvatore Bonaccorso) for a regression from CVE-2006-5051, found by Qualys; this was because I expected it to take me a bit longer to merge OpenSSH 9.8, which had the full fix.

This turned out to be a good guess: it took me until the last day of the month to get the merge done. OpenSSH 9.8 included some substantial changes to split the server into a listener binary and a per-session binary, which required some corresponding changes in the GSS-API key exchange patch. At this point I was very grateful for the GSS-API integration test contributed by Andreas Hasenack a little while ago, because otherwise I might very easily not have noticed my mistake: this patch adds some entries to the key exchange algorithm proposal, and on the server side I’d accidentally moved that to after the point where the proposal is sent to the client, which of course meant it didn’t work at all. Even with a failing test, it took me quite a while to spot the problem, involving a lot of staring at strace output and comparing debug logs between versions.

There are still some regressions to sort out, including a problem with socket activation, and problems in libssh2 and Twisted due to DSA now being disabled at compile-time.

Speaking of DSA, I wrote a release note for this change, which is now merged.

GCC 14 regressions

I fixed a number of build failures with GCC 14, mostly in my older packages: grub (legacy), imaptool, kali, knews, and vigor.

autopkgtest

I contributed a change to allow maintaining Incus container and VM images in parallel. I use both of these regularly (containers are faster, but some tests need full machine isolation), and the build tools previously didn’t handle that very well.

I now have a script that just does this regularly to keep my images up to date (although for now I’m running this with PATH pointing to autopkgtest from git, since my change hasn’t been released yet):

RELEASE=sid autopkgtest-build-incus images:debian/trixie
RELEASE=sid autopkgtest-build-incus --vm images:debian/trixie

Python team

I fixed dnsdiag’s uninstallability in unstable, and contributed the fix upstream.

I reverted python-tenacity to an earlier version due to regressions in a number of OpenStack packages, including octavia and ironic. (This seems to be due to #486 upstream.)

I fixed a build failure in python3-simpletal due to Python 3.12 removing the old imp module.

I added non-superficial autopkgtests to a number of packages, including httmock, py-macaroon-bakery, python-libnacl, six, and storm.

I switched a number of packages to build using PEP 517 rather than calling setup.py directly, including alembic, constantly, hyperlink, isort, khard, python-cpuinfo, and python3-onelogin-saml2. (Much of this was by working through the missing-prerequisite-for-pyproject-backend Lintian tag, but there’s still lots to do.)

I upgraded frozenlist, ipykernel, isort, langtable, python-exceptiongroup, python-launchpadlib, python-typeguard, pyupgrade, sqlparse, storm, and uncertainties to new upstream versions. In the process, I added myself to Uploaders for isort, since the previous primary uploader has retired.

Other odds and ends

I applied a suggestion by Chris Hofstaedtler to create /etc/subuid and /etc/subgid in base-passwd, since the login package is no longer essential.

I fixed a wireless-tools regression due to iproute2 dropping its (/usr)/sbin/ip compatibility symlink.

I applied a suggestion by Petter Reinholdtsen to add AppStream metainfo to pcmciautils.

Free software activity in June 2024

2024-07-02T13:02:27+01:00

My Debian contributions this month were all sponsored by Freexian.

I switched man-db and putty to Rules-Requires-Root: no, thanks to a suggestion from Niels Thykier.
I moved some files in pcmciautils as part of the /usr move.
I upgraded libfido2 to 1.15.0.
I made an upstream release of multipart 0.2.5.
I reviewed some security-update patches to putty.
I packaged yubihsm-connector, yubihsm-shell, and python-yubihsm.
openssh:
- I did a bit more planning for the GSS-API package split, though decided not to land it quite yet to avoid blocking other changes on NEW queue review.
- I removed the user_readenv option from PAM configuration (#1018260), and prepared a release note.
Python team:
- I packaged zope.deferredimport, needed for a new upstream version of python-persistent.
- I fixed some incompatibilities with pytest 8: ipykernel and ipywidgets.
- I fixed a couple of RC or soon-to-be-RC bugs in khard (#1065887 and #1069838), since I use it for my address book and wanted to get it back into testing.
- I fixed an RC bug in python-repoze.sphinx.autointerface (#1057599).
- I sponsored uploads of python-channels-redis (Dale Richards) and twisted (Florent ‘Skia’ Jacquet).
- I upgraded babelfish, django-favicon-plus-reloaded, dnsdiag, flake8-builtins, flufl.lock, ipywidgets, jsonpickle, langtable, nbconvert, requests, responses, partd, pytest-mock, python-aiohttp (fixing CVE-2024-23829, CVE-2024-23334, CVE-2024-30251, and CVE-2024-27306), python-amply, python-argcomplete, python-btrees, python-cups, python-django-health-check, python-fluent-logger, python-persistent, python-plumbum, python-rpaths, python-rt, python-sniffio, python-tenacity, python-tokenize-rt, python-typing-extensions, pyupgrade, sphinx-copybutton, sphinxcontrib-autoprogram, uncertainties, zodbpickle, zope.configuration, zope.proxy, and zope.security to new upstream versions.

You can support my work directly via Liberapay.

Free software activity in May 2024

2024-06-02T11:53:01+01:00

My Debian contributions this month were all sponsored by Freexian.

The bulk of my Debian time this month went towards trying to haul more Python packages up to current versions, but I got a few other bits and pieces done as well.

I did a little work on improving debbugs’ autopkgtest status.
openssh:
- I fixed an OpenSSL version mismatch error in openssh-ssh1.
- I finally tracked down a baffling CI issue in openssh, unblocking several contributed merge requests that I’d been sitting on until I could get CI to pass for them. (Special thanks to Andreas Hasenack; GSS-API integration tests will make my life much easier.)
- I removed the user_readenv=1 option from openssh’s PAM configuration, and did some work on release notes to document this change for affected users.
- I started work on the first stage of my plan to split out GSS-API key exchange support to separate packages.
Python team:
- I upgraded bitstruct, flufl.enum, flufl.testing, gunicorn, langtable, psycopg3, pygresql, pylint-flask, python-click-didyoumean, python-gssapi, python-httplib2, python-json-log-formatter, python-persistent, python-pgspecial, python-pyld, python-repoze.tm2, python-serializable, python-tenacity, python-typing-extensions, python-unidiff, responses, shortuuid (including an upstream packaging tweak), sqlparse, vulture, zc.lockfile, and zope.interface to new upstream versions.
- I cherry-picked an upstream PR to fix a pytest 8 incompatibility in ipywidgets.
I decided that fixing my old troffcvt package to support groff 1.23.0 wasn’t worth the time investment, and filed a removal request instead.
I NMUed bidentd and linuxtv-dvb-apps to declare Architecture: linux-any (and in the latter case also to fix a build failure due to 64-bit time), and worked with the buildd team to remove several of the other remaining entries from Packages-arch-specific.

You can support my work directly via Liberapay.

Playing with rich

2024-05-03T16:09:53+01:00

One of the things I do as a side project for Freexian is to work on various bits of business automation: accounting tools, programs to help contributors report their hours, invoicing, that kind of thing. While it’s not quite my usual beat, this makes quite a good side project as the tools involved are mostly rather sensible and easy to deal with (Python, git, ledger, that sort of thing) and it’s the kind of thing where I can dip into it for a day or so a week and feel like I’m making useful contributions. The logic can be quite complex, but there’s very little friction in the tools themselves.

A recent case where I did run into some friction in the tools was with some commands that need to present small amounts of tabular data on the terminal, using OSC 8 hyperlinks if the terminal supports them: think customer-related information with some links to issues. One of my colleagues had previously done this using a hack on top of texttable, which was perfectly fine as far as it went. However, now I wanted to be able to add multiple links in a single table cell in some cases, and that was really going to stretch the limits of that approach: working out the width of the displayed text in the cell was going to take an annoying amount of bookkeeping.

I started looking around to see whether any other approaches might be easier, without too much effort (remember that “a day or so a week” bit above). ansiwrap looked somewhat promising, but it isn’t currently packaged in Debian, and it would have still left me with the problem of figuring out how to integrate it into texttable, which looked like it would be quite complicated. Then I remembered that I’d heard good things about rich, and thought I’d take a look.

rich turned out to be exactly what I wanted. Instead of something like this based on the texttable hack above:

import shutil
from pyxian.texttable import UrlTable

termsize = shutil.get_terminal_size((80, 25))
table = UrlTable(max_width=termsize.columns)
table.set_deco(UrlTable.HEADER)
table.set_cols_align(["l"])
table.set_cols_dtype(["u"])
table.add_row(["Issue"])
table.add_row([(issue_url, f"#{issue_id}")]
print(table.draw())

… now I can do this instead:

import rich
from rich import box
from rich.table import Table

table = Table(box=box.SIMPLE)
table.add_column("Issue")
table.add_row(f"[link={issue_url}]#{issue_id}[/link]")
rich.print(table)

While this is a little shorter, the real bonus is that I can now just put multiple [link] tags in a single string, and it all just works. No ceremony. In fact, once the relevant bits of code passed type-checking (since the real code is a bit more complex than the samples above), it worked first time. It’s a pleasure to work with a library like that.

It looks like I’ve only barely scratched the surface of rich, but I expect I’ll reach for it more often now.

Free software activity in April 2024

2024-05-01T12:34:15+01:00

My Debian contributions this month were all sponsored by Freexian.

I’m trying to get back into bugs.debian.org administration, so I spent some time catching up on my owner@bugs.debian.org mailbox and answering a number of support requests there.
I fixed a regression I’d introduced last year where groff’s PDF output had invalid date headers, both upstream and in Debian.
I released man-db 2.12.1.
openssh:
- I did a little more testing of Luca Boccassi’s modifications to upstream’s inline systemd notification patch.
- I did an extensive review of some of the choices in Debian’s OpenSSH packaging, in light of last month’s xz-utils backdoor.
- I fixed a build failure on ppc64el, forwarded upstream.
- I proposed reducing shared library linkage in tcp-wrappers; its maintainer accepted this by disabling NIS support.
- I applied a suggestion to improve ordering of systemd services in relation to nss-user-lookup.target.
I updated putty to 0.81.
Python team:
- I fixed build/autopkgtest failures in cytoolz (upstream PR), nbconvert (due to sphinx and lxml-html-clean changes), python-argcomplete, python-exceptiongroup, readability, toolz (upstream PR), and topplot.
- I made a merge request to fix a build failure in pyferret.
- I fixed a mistake in a Debian patch to python-ecdsa, noticed while updating jsonpickle.
- I updated cachelib, dnsdiag, feedparser, jsonpickle, pywavelets (fixing a distutils dependency), python-aiohttp-session, python-avro, python-rstr, vine (including an upstream packaging tweak, and wtforms to new upstream versions.
I did some inconclusive investigation of flaky tests in gcr4. More work is needed there.
I proposed a patch for a build failure in gyoto, both upstream and in Debian.

You can support my work directly via Liberapay.

Free software activity in March 2024

2024-04-01T14:10:41+01:00

My Debian contributions this month were all sponsored by Freexian.

Python team:
- I updated zope.testrunner to 6.4.
- I fixed a build failure in celery-haystack-ng, which included an upstream change to stop using d2to1.
- I backported an upstream change to fix a build failure in python-json-log-formatter.
- I updated python-typing-extensions to 4.10.0 to fix a build failure.
- I updated wcwidth to 0.2.13 to fix a build failure, which included rewriting the Debian patches to update-tables.
- I updated jsonpickle to 3.0.3 to fix a build failure. As part of this I also needed to fix its tests for new timezone name policies.
man-db:
- I updated man(1) with updated information about groff.
- I documented italic support in man(1).
- I started on a 2.12.1 release (currently waiting for translations).
openssh:
- I added an sshd@.service template to help newer versions of systemd make containers and VMs SSH-accessible over AF_VSOCK sockets.
- I upgraded openssh to 9.7p1. As part of this I noticed a small bug in the PuTTY version detection in its regression tests.
- I helped with the 64-bit time_t transition by applying a patch from Simon McVittie to temporarily disable ssh-askpass-gnome on the affected architectures.
- I re-enabled interoperability tests with Twisted Conch.
- Following the xz-utils backdoor, I spent some time testing OpenSSH upstream’s proposed inline systemd notification patch and suggested a small adjustment.
debconf:
- I fixed dpkg-preconfigure to not produce errors when run from a non-existing directory.
parted:
- I applied a patch from Helmut Grohne to move files to /usr.
I fixed an off-by-one error in neovim.
I fixed a number of -Werror=implicit-function-declaration failures (db1-compat, kali, knews, openssh-ssh1, trn4, vigor).
I fixed a build failure in libsdl-perl, including a reference-counting fix contributed upstream.
I suggested an rrsync change to make it easier to use it with rsbackup.

apt install everything?

2024-03-19T07:05:27+00:00

On Mastodon, the question came up of how Ubuntu would deal with something like the npm install everything situation. I replied:

Ubuntu is curated, so it probably wouldn’t get this far. If it did, then the worst case is that it would get in the way of CI allowing other packages to be removed (again from a curated system, so people are used to removal not being self-service); but the release team would have no hesitation in removing a package like this to fix that, and it certainly wouldn’t cause this amount of angst.

If you did this in a PPA, then I can’t think of any particular negative effects.

OK, if you added lots of build-dependencies (as well as run-time dependencies) then you might be able to take out a builder. But Launchpad builders already run arbitrary user-submitted code by design and are therefore very carefully sandboxed and treated as ephemeral, so this is hardly novel.

There’s a lot to be said for the arrangement of having a curated system for the stuff people actually care about plus an ecosystem of add-on repositories. PPAs cover a wide range of levels of developer activity, from throwaway experiments to quasi-official distribution methods; there are certainly problems that arise from it being difficult to tell the difference between those extremes and from there being no systematic confinement, but for this particular kind of problem they’re very nearly ideal. (Canonical has tried various other approaches to software distribution, and while they address some of the problems, they aren’t obviously better at helping people make reliable social judgements about code they don’t know.)

For a hypothetical package with a huge number of dependencies, to even try to upload it directly to Ubuntu you’d need to be an Ubuntu developer with upload rights (or to go via Debian, where you’d have to clear a similar hurdle). If you have those, then the first upload has to pass manual review by an archive administrator. If your package passes that, then it still has to build and get through proposed-migration CI before it reaches anything that humans typically care about.

On the other hand, if you were inclined to try this sort of experiment, you’d almost certainly try it in a PPA, and that would trouble nobody but yourself.

Free software activity in January/February 2024

2024-03-04T10:39:50+00:00

Two months into my new gig and it’s going great! Tracking my time has taken a bit of getting used to, but having something that amounts to a queryable database of everything I’ve done has also allowed some helpful introspection.

Freexian sponsors up to 20% of my time on Debian tasks of my choice. In fact I’ve been spending the bulk of my time on debusine which is itself intended to accelerate work on Debian, but more details on that later. While I contribute to Freexian’s summaries now, I’ve also decided to start writing monthly posts about my free software activity as many others do, to get into some more detail.

January 2024

I added Incus support to autopkgtest. Incus is a system container and virtual machine manager, forked from Canonical’s LXD. I switched my laptop over to it and then quickly found that it was inconvenient not to be able to run Debian package test suites using autopkgtest, so I tweaked autopkgtest’s existing LXD integration to support using either LXD or Incus.
I discovered Perl::Critic and used it to tidy up some poor practices in several of my packages, including debconf. Perl used to be my language of choice but I’ve been mostly using Python for over a decade now, so I’m not as fluent as I used to be and some mechanical assistance with spotting common errors is helpful; besides, I’m generally a big fan of applying static analysis to everything possible in the hope of reducing bug density. Of course, this did result in a couple of regressions (1, 2), but at least we caught them fairly quickly.
I did some overdue debconf maintenance, mainly around tidying up error message handling in several places (1, 2, 3).
I did some routine maintenance to move several of my upstream projects to a new Gnulib stable branch.
debmirror includes a useful summary of how big a Debian mirror is, but it hadn’t been updated since 2010 and the script to do so had bitrotted quite badly. I fixed that and added a recurring task for myself to refresh this every six months.

February 2024

Some time back I added AppArmor and seccomp confinement to man-db. This was mainly motivated by a desire to support manual pages in snaps (which is still open several years later …), but since reading manual pages involves a non-trivial text processing toolchain mostly written in C++, I thought it was reasonable to assume that some day it might have a vulnerability even though its track record has been good; so man now restricts the system calls that groff can execute and the parts of the file system that it can access. I stand by this, but it did cause some problems that have needed a succession of small fixes over the years. This month I issued DLA-3731-1, backporting some of those fixes to buster.
I spent some time chasing a console-setup build failure following the removal of kFreeBSD support, which was uploaded by mistake. I suggested a set of fixes for this, but the author of the change to remove kFreeBSD support decided to take a different approach (fair enough), so I’ve abandoned this.
I updated the Debian zope.testrunner package to 6.3.1.
openssh:
- A Freexian collaborator had a problem with automating installations involving changes to /etc/ssh/sshd_config. This turned out to be resolvable without any changes, but in the process of investigating I noticed that my dodgy arrangements to avoid ucf prompts in certain cases had bitrotted slightly, which meant that some people might be prompted unnecessarily. I fixed this and arranged for it not to happen again.
- Following a recent debian-devel discussion, I realized that some particularly awkward code in the OpenSSH packaging was now obsolete, and removed it.
I backported a python-channels-redis fix to bookworm. I wasn’t the first person to run into this, but I rediscovered it while working on debusine and it was confusing enough that it seemed worth fixing in stable.
I fixed a simple build failure in storm.
I dug into a very confusing cluster of celery build failures (1, 2, 3), and tracked the hardest bit down to a Python 3.12 regression, now fixed in unstable thanks to Stefano Rivera. Getting celery back into testing is blocked on the 64-bit time_t transition for now, but once that’s out of the way it should flow smoothly again.

Task management

2024-01-17T13:28:19+00:00

Now that I’m freelancing, I need to actually track my time, which is something I’ve had the luxury of not having to do before. That meant something of a rethink of the way I’ve been keeping track of my to-do list. Up to now that was a combination of things like the bug lists for the projects I’m working on at the moment, whatever task tracking system Canonical was using at the moment (Jira when I left), and a giant flat text file in which I recorded logbook-style notes of what I’d done each day plus a few extra notes at the bottom to remind myself of particularly urgent tasks. I could have started manually adding times to each logbook entry, but ugh, let’s not.

In general, I had the following goals (which were a bit reminiscent of my address book):

free software throughout
storage under my control
ability to annotate tasks with URLs (especially bugs and merge requests)
lightweight time tracking (I’m OK with having to explicitly tell it when I start and stop tasks)
ability to drive everything from the command line
decent filtering so I don’t have to look at my entire to-do list all the time
ability to easily generate billing information for multiple clients
optionally, integration with Android (mainly so I can tick off personal tasks like “change bedroom lightbulb” or whatever that don’t involve being near a computer)

I didn’t do an elaborate evaluation of multiple options, because I’m not trying to come up with the best possible solution for a client here. Also, there are a bazillion to-do list trackers out there and if I tried to evaluate them all I’d never do anything else. I just wanted something that works well enough for me.

Since it came up on Mastodon: a bunch of people swear by Org mode, which I know can do at least some of this sort of thing. However, I don’t use Emacs and don’t plan to use Emacs. nvim-orgmode does have some support for time tracking, but when I’ve tried vim-based versions of Org mode in the past I’ve found they haven’t really fitted my brain very well.

Taskwarrior and Timewarrior

One of the other Freexian collaborators mentioned Taskwarrior and Timewarrior, so I had a look at those.

The basic idea of Taskwarrior is that you have a task command that tracks each task as a blob of JSON and provides subcommands to let you add, modify, and remove tasks with a minimum of friction. task add adds a task, and you can add metadata like project:Personal (I always make sure every task has a project, for ease of filtering). Just running task shows you a task list sorted by Taskwarrior’s idea of urgency, with an ID for each task, and there are various other reports with different filtering and verbosity. task <id> annotate lets you attach more information to a task. task <id> done marks it as done. So far so good, so a redacted version of my to-do list looks like this:

$ task ls

ID A Project     Tags                 Description
17   Freexian                         Add Incus support to autopkgtest [2]
 7   Columbiform                      Figure out Lloyds online banking [1]
 2   Debian                           Fix troffcvt for groff 1.23.0 [1]
11   Personal                         Replace living room curtain rail

Once I got comfortable with it, this was already a big improvement. I haven’t bothered to learn all the filtering gadgets yet, but it was easy enough to see that I could do something like task all project:Personal and it’d show me both pending and completed tasks in that project, and that all the data was stored in ~/.task - though I have to say that there are enough reporting bells and whistles that I haven’t needed to poke around manually. In combination with the regular backups that I do anyway (you do too, right?), this gave me enough confidence to abandon my previous text-file logbook approach.

Next was time tracking. Timewarrior integrates with Taskwarrior, albeit in an only semi-packaged way, and it was easy enough to set that up. Now I can do:

$ task 25 start
Starting task 00a9516f 'Write blog post about task tracking'.
Started 1 task.
Note: '"Write blog post about task tracking"' is a new tag.
Tracking Columbiform "Write blog post about task tracking"
  Started 2024-01-10T11:28:38
  Current                  38
  Total               0:00:00
You have more urgent tasks.
Project 'Columbiform' is 25% complete (3 of 4 tasks remaining).

When I stop work on something, I do task active to find the ID, then task <id> stop. Timewarrior does the tedious stopwatch business for me, and I can manually enter times if I forget to start/stop a task. Then the really useful bit: I can do something like timew summary :month <name-of-client> and it tells me how much to bill that client for this month. Perfect.

I also started using VIT to simplify the day-to-day flow a little, which means I’m normally just using one or two keystrokes rather than typing longer commands. That isn’t really necessary from my point of view, but it does save some time.

Android integration

I left Android integration for a bit later since it wasn’t essential. When I got round to it, I have to say that it felt a bit clumsy, but it did eventually work.

The first step was to set up a taskserver. Most of the setup procedure was OK, but I wanted to use Let’s Encrypt to minimize the amount of messing around with CAs I had to do. Getting this to work involved hitting things with sticks a bit, and there’s still a local CA involved for client certificates. What I ended up with was a certbot setup with the webroot authenticator and a custom deploy hook as follows (with cert_name replaced by a DNS name in my house domain):

#! /bin/sh
set -eu

cert_name=taskd.example.org

found=false
for domain in $RENEWED_DOMAINS; do
    case "$domain" in
        $cert_name)
            found=:
            ;;
    esac
done
$found || exit 0

install -m 644 "/etc/letsencrypt/live/$cert_name/fullchain.pem" \
    /var/lib/taskd/pki/fullchain.pem
install -m 640 -g Debian-taskd "/etc/letsencrypt/live/$cert_name/privkey.pem" \
    /var/lib/taskd/pki/privkey.pem

systemctl restart taskd.service

I could then set this in /etc/taskd/config (server.crl.pem and ca.cert.pem were generated using the documented taskserver setup procedure):

server.key=/var/lib/taskd/pki/privkey.pem
server.cert=/var/lib/taskd/pki/fullchain.pem
server.crl=/var/lib/taskd/pki/server.crl.pem
ca.cert=/var/lib/taskd/pki/ca.cert.pem

Then I could set taskd.ca on my laptop to /usr/share/ca-certificates/mozilla/ISRG_Root_X1.crt and otherwise follow the client setup instructions, run task sync init to get things started, and then task sync every so often to sync changes between my laptop and the taskserver.

I used TaskWarrior Mobile as the client. I have to say I wouldn’t want to use that client as my primary task tracking interface: the setup procedure is clunky even beyond the necessity of copying a client certificate around, it expects you to give it a .taskrc rather than having a proper settings interface for that, and it only seems to let you add a task if you specify a due date for it. It also lacks Timewarrior integration, so I can only really use it when I don’t care about time tracking, e.g. personal tasks. But that’s really all I need, so it meets my minimum requirements.

Next?

Considering this is literally the first thing I tried, I have to say I’m pretty happy with it. There are a bunch of optional extras I haven’t tried yet, but in general it kind of has the vim nature for me: if I need something it’s very likely to exist or easy enough to build, but the features I don’t use don’t get in my way.

I wouldn’t recommend any of this to somebody who didn’t already spend most of their time in a terminal - but I do. I’m glad people have gone to all the effort to build this so I didn’t have to.

OpenUK New Year’s Honours

2024-01-15T16:15:49+00:00

Apparently I got an honour from OpenUK.

There are a bunch of people I know on that list. Chris Lamb and Mark Brown are familiar names from Debian. Colin King and Jonathan Riddell are people I know from past work in Ubuntu. I’ve admired David MacIver’s work on Hypothesis and Richard Hughes’ work on firmware updates from afar. And there are a bunch of other excellent projects represented there: OpenStreetMap, Textualize, and my alma mater of Cambridge to name but a few.

My friend Stuart Langridge wrote about being on a similar list a few years ago, and I can’t do much better than to echo it: in particular he wrote about the way the open source development community is often at best unwelcoming to people who don’t look like Stuart and I do. I can’t tell a whole lot about demographic distribution just by looking at a list of names, but while these honours still seem to be skewed somewhat male, I’m fairly sure they’re doing a lot better in terms of gender balance than my “home” project of Debian is, for one. I hope this is a sign of improvement for the future, and I’ll do what I can to pay it forward.

Going freelance

2024-01-10T09:50:23+00:00

I’ve mentioned this in a couple of other places, but I realized I never got round to posting about it on my own blog rather than on other people’s services. How remiss of me.

Anyway: after much soul-searching, I decided a few months ago that it was time for me to move on from Canonical and the Launchpad team there. Nearly 20 years is a long time to spend at any company, and although there are a bunch of people I’ll miss, Launchpad is in a reasonable state where I can let other people have a turn.

I’m now in business for myself as a freelance developer! My new company is Columbiform, and I’m focusing on Debian packaging and custom Python development. My services page has some self-promotion on the sorts of things I can do.

My first gig, and the one that made it viable to make this jump, is at Freexian where I’m helping with an exciting infrastructure project that we hope will start making Debian developers’ lives easier in the near future. This is likely to take up most of my time at least through to the end of 2024, but I may have some spare cycles. Drop me a line if you have something where you think I could be a good fit, and we can have a talk about it.

Reproducible man-db databases

2022-10-16T16:54:32+01:00

I’ve released man-db 2.11.0 (announcement, NEWS), and uploaded it to Debian unstable.

The biggest chunk of work here was fixing some extremely long-standing issues with how the database is built. Despite being in the package name, man-db’s database is much less important than it used to be: most uses of man(1) haven’t required it in a long time, and both hardware and software improvements mean that even some searches can be done by brute force without needing prior indexing. However, the database is still needed for the whatis(1) and apropos(1) commands.

The database has a simple format - no relational structure here, it’s just a simple key-value database using old-fashioned DBM-like interfaces and composing a few fields to form values - but there are a number of subtleties involved. The issues tend to amount to this: what does a manual page name mean? At first glance it might seem simple, because you have file names that look something like /usr/share/man/man1/ls.1.gz and that’s obviously ls(1). Some pages are symlinks to other pages (which we track separately because it makes it easier to figure out which entries to update when the contents of the file system change), and sometimes multiple pages are even hard links to the same file.

The real complications come with “whatis references”. Pages can list a bunch of names in their NAME section, and the historical expectation is that it should be possible to use those names as arguments to man(1) even if they don’t also appear in the file system (although Debian policy has deprecated relying on this for some time). Not only does that mean that man(1) sometimes needs to consult the database, but it also means that the database is inherently more complicated, since a page might list something in its NAME section that conflicts with an actual file name in the file system, and now you need a priority system to resolve ambiguities. There are some other possible causes of ambiguity as well.

The people working on reproducible builds in Debian branched out to the related challenge of reproducible installations some time ago: can you take a collection of packages, bootstrap a file system image from them, and reproduce that exact same image somewhere else? This is useful for the same sorts of reasons that reproducible builds are useful: it lets you verify that an image is built from the components it’s supposed to be built from, and doesn’t contain any other skulduggery by accident or design. One of the people working on this noticed that man-db’s database files were an obstacle to that: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn’t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs.

Having understood this, there then followed several late nights of very fiddly work on the details of how the database is maintained. None of this was conceptually difficult: it mainly amounted to ensuring that we maintain a consistent well-order for different entries that we might want to insert for a given database key, and that we consider the same names for insertion regardless of the order in which we encounter files. As usual, the tricky bit is making sure that we have the right data structures to support this. man-db is written in C which is not very well-supplied with built-in data structures, and originally much of the code was written in a style that tried to minimize memory allocations; this came at the cost of ownership and lifetime often being rather unclear, and it was often difficult to make changes without causing leaks or double-frees. Over the years I’ve been gradually introducing better encapsulation to make things easier to follow, and I had to do another round of that here. There were also some problems with caching being done at slightly the wrong layer: we need to make use of a “trace” of the chain of links followed to resolve a page to its ultimate source file, but we were incorrectly caching that trace and reusing it for any link to the same file, with incorrect results in many cases.

Oh, and after doing all that I found that the on-disk representation of a GDBM database is insertion-order-dependent, so I ended up having to manually reorganize the database at the end by reading it all in and writing it all back out in sorted order, which feels really weird to me coming from spending most of my time with PostgreSQL these days. Fortunately the database is small so this takes negligible time.

None of this is particularly glamorous work, but it paid off:

# export SOURCE_DATE_EPOCH="$(date +%s)"
# mkdir emptydir disorder
# disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
# export TMPDIR="$(pwd)/disorder"
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out1.tar
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out2.tar
# cmp out1.tar out2.tar
# echo $?
0

Launchpad now supports SSH Ed25519 keys and RSA SHA-2 signatures

2022-02-18T13:49:59+00:00

As of 2022-02-16, Launchpad supports a couple of features on its SSH endpoints (git.launchpad.net, bazaar.launchpad.net, ppa.launchpad.net, and upload.ubuntu.com) that it previously didn’t: Ed25519 public keys (a well-regarded format, supported by OpenSSH since 6.5 in 2014) and signatures with existing RSA public keys using SHA-2 rather than SHA-1 (supported by OpenSSH since 7.2 in 2016).

I’m hesitant to call these features “new”, since they’ve been around for a long time elsewhere, and people might quite reasonably ask why it’s taken us so long. The problem has always been that Launchpad can’t really use a normal SSH server such as OpenSSH because it needs features that aren’t practical to implement that way, such as virtual filesystems and dynamic user key authorization against the Launchpad database. Instead, we use Twisted Conch, which is a very extensible Python SSH implementation that has generally served us well. The downside is that, because it’s an independent implementation and one that occupies a relatively small niche, it often lags behind in terms of newer protocol features.

Catching up to this point has been something we’ve been working on for around five years, although it’s taken a painfully long time for a variety of reasons which I thought some people might find interesting to go into, at least people who have the patience for details of the SSH protocol. Many of the delays were my own responsibility, although realistically we probably couldn’t have added Ed25519 support before OpenSSL/cryptography work that landed in 2019.

In 2015, we did some similar work on SHA-2 key exchange and MAC algorithms.
In 2016, various other contributors were working on ECDSA and Ed25519 support (e.g. #533 and #644). At the time, it seemed best to keep an eye on this but mainly leave them to it. I’m very glad that some people worked on this before me - studying their PRs helped a lot, even parts that didn’t end up being merged directly.
In 2017, it became clear that this was likely to need some more attention, but before we could do anything else we had to revamp Launchpad’s build system to use pip rather than buildout, since without that we couldn’t upgrade to any newer versions of Twisted. That proved to be a substantial piece of yak-shaving: first we had to upgrade Launchpad off Ubuntu 12.04, and then the actual build system rewrite was a complicated project of its own.
In 2018, I fixed an authentication hang that happened if a client even tried to offer ECDSA or Ed25519 public keys to Launchpad, and we got ECDSA support fully working in Launchpad. We also discovered as a result of automated interoperability tests run as part of the Debian OpenSSH packaging that Twisted needed to gain support for the new openssh-key-v1 private key format, which became a prerequisite for Ed25519 support since OpenSSH only ever writes those keys in the new format, and so I fixed that.
In 2019, Python’s cryptography package gained support for X25519 (the Diffie-Hellman key exchange function based on Curve25519) and Ed25519, and it became somewhat practical to add support to Twisted on top of that. However, it required OpenSSL 1.1.1b, and it seemed unlikely that we would be in a position to upgrade all the relevant bits of Launchpad’s infrastructure to use that in the near term. I at least managed to add curve25519-sha256 key exchange support to Twisted based on some previous work by another contributor, and I prepared support for Ed25519 keys in Twisted even though I knew we weren’t going to be able to use it yet.
2020 was … well, everyone knows what 2020 was like, plus we had a new baby. I did some experimentation in spare moments, but I didn’t really have the focus to be able to move this sort of complex problem forward.
In 2021, I bit the bullet and started seriously working on fallback mechanisms to allow us to use Ed25519 even on systems lacking a sufficient version of OpenSSL, though found myself blocked on figuring out type-checking issues following a code review. It then became clear on the release of OpenSSH 8.8 that we were going to have to deal with RSA SHA-2 signatures as well, since otherwise OpenSSH in Ubuntu soon wouldn’t be able to authenticate to Launchpad by default (which also caused me to delay uploading 8.8 to Debian unstable for a while). To deal with that, I first had to add SSH extension negotiation to Twisted.
Finally, in 2022, I added RSA SHA-2 signature support to Twisted, finally unblocked myself on the type-checking issue with the Ed25519 fallback mechanism, quickly put together a similar fallback mechanism for X25519, backported the whole mess to Twisted 20.3.0 since we currently can’t use anything newer due to the somewhat old version of Python 3 that we’re running, promptly ran into and fixed a regression that affected SFTP uploads to ppa.launchpad.net and upload.ubuntu.com, and finally added Ed25519 as a permissible key type in Launchpad’s authserver.

Phew! Thanks to everyone who works on Twisted, cryptography, and OpenSSL - it’s been really useful to be able to build on solid lower-level cryptographic primitives - and to those who helped with code review.

Launchpad now runs on Python 3!

2021-08-02T11:34:29+01:00

After a very long porting journey, Launchpad is finally running on Python 3 across all of our systems.

I wanted to take a bit of time to reflect on why my emotional responses to this port differ so much from those of some others who’ve done large ports, such as the Mercurial maintainers. It’s hard to deny that we’ve had to burn a lot of time on this, which I’m sure has had an opportunity cost, and from one point of view it’s essentially running to stand still: there is no single compelling feature that we get solely by porting to Python 3, although it’s clearly a prerequisite for tidying up old compatibility code and being able to use modern language facilities in the future. And yet, on the whole, I found this a rewarding project and enjoyed doing it.

Some of this may be because by inclination I’m a maintenance programmer and actually enjoy this sort of thing. My default view tends to be that software version upgrades may be a pain but it’s much better to get that pain over with as soon as you can rather than trying to hold back the tide; you can certainly get involved and try to shape where things end up, but rightly or wrongly I can’t think of many cases when a righteously indignant user base managed to arrange for the old version to be maintained in perpetuity so that they never had to deal with the new thing (OK, maybe Perl 5 counts here).

I think a more compelling difference between Launchpad and Mercurial, though, may be that very few other people really had a vested interest in what Python version Launchpad happened to be running, because it’s all server-side code (aside from some client libraries such as launchpadlib, which were ported years ago). As such, we weren’t trying to do this with the internet having Strong Opinions at us. We were doing this because it was obviously the only long-term-maintainable path forward, and in more recent times because some of our library dependencies were starting to drop support for Python 2 and so it was obviously going to become a practical problem for us sooner or later; but if we’d just stayed on Python 2 forever then fundamentally hardly anyone else would really have cared directly, only maybe about some indirect consequences of that. I don’t follow Mercurial development so I may be entirely off-base, but if other people were yelling at me about how late my project was to finish its port, that in itself would make me feel more negatively about the project even if I thought it was a good idea. Having most of the pressure come from ourselves rather than from outside meant that wasn’t an issue for us.

I’m somewhat inclined to think of the process as an extreme version of paying down technical debt. Moving from Python 2.7 to 3.5, as we just did, means skipping over multiple language versions in one go, and if similar changes had been made more gradually it would probably have felt a lot more like the typical dependency update treadmill. I appreciate why not everyone might want to think of it this way: maybe this is just my own rationalization.

Reflections on porting to Python 3

I’m not going to defend the Python 3 migration process; it was pretty rough in a lot of ways. Nor am I going to spend much effort relitigating it here, as it’s already been done to death elsewhere, and as I understand it the core Python developers have got the message loud and clear by now. At a bare minimum, a lot of valuable time was lost early in Python 3’s lifetime hanging on to flag-day-type porting strategies that were impractical for large projects, when it should have been providing for “bilingual” strategies (code that runs in both Python 2 and 3 for a transitional period) which is where most libraries and most large migrations ended up in practice. For instance, the early advice to library maintainers to maintain two parallel versions or perhaps translate dynamically with 2to3 was entirely impractical in most non-trivial cases and wasn’t what most people ended up doing, and yet the idea that 2to3 is all you need still floats around Stack Overflow and the like as a result. (These days, I would probably point people towards something more like Eevee’s porting FAQ as somewhere to start.)

There are various fairly straightforward things that people often suggest could have been done to smooth the path, and I largely agree: not removing the u'' string prefix only to put it back in 3.3, fewer gratuitous compatibility breaks in the name of tidiness, and so on. But if I had a time machine, the number one thing I would ask to have been done differently would be introducing type annotations in Python 2 before Python 3 branched off. It’s true that it’s technically possible to do type annotations in Python 2, but the fact that it’s a different syntax that would have to be fixed later is offputting, and in practice it wasn’t widely used in Python 2 code. To make a significant difference to the ease of porting, annotations would need to have been introduced early enough that lots of Python 2 library code used them so that porting code didn’t have to be quite so much of an exercise of manually figuring out the exact nature of string types from context.

Launchpad is a complex piece of software that interacts with multiple domains: for example, it deals with a database, HTTP, web page rendering, Debian-format archive publishing, and multiple revision control systems, and there’s often overlap between domains. Each of these tends to imply different kinds of string handling. Web page rendering is normally done mainly in Unicode, converting to bytes as late as possible; revision control systems normally want to spend most of their time working with bytes, although the exact details vary; HTTP is of course bytes on the wire, but Python’s WSGI interface has some string type subtleties. In practice I found myself thinking about at least four string-like “types” (that is, things that in a language with a stricter type system I might well want to define as distinct types and restrict conversion between them): bytes, text, “ordinary” native strings (str in either language, encoded to UTF-8 in Python 2), and native strings with WSGI’s encoding rules. Some of these are emergent properties of writing in the intersection of Python 2 and 3, which is effectively a specialized language of its own without coherent official documentation whose users must intuit its behaviour by comparing multiple sources of information, or by referring to unofficial porting guides: not a very satisfactory situation. Fortunately much of the complexity collapses once it becomes possible to write solely in Python 3.

Some of the difficulties we ran into are not ones that are typically thought of as Python 2-to-3 porting issues, because they were changed later in Python 3’s development process. For instance, the email module was substantially improved in around the 3.2/3.3 timeframe to handle Python 3’s bytes/text model more correctly, and since Launchpad sends quite a few different kinds of email messages and has some quite picky tests for exactly what it emits, this entailed a lot of work in our email sending code and in our test suite to account for that. (It took me a while to work out whether we should be treating raw email messages as bytes or as text; bytes turned out to work best.) 3.4 made some tweaks to the implementation of quoted-printable encoding that broke a number of our tests in ways that took some effort to fix, because the tests needed to work on both 2.7 and 3.5. The list goes on. I got quite proficient at digging through Python’s git history to figure out when and why some particular bit of behaviour had changed.

One of the thorniest problems was parsing HTTP form data. We mainly rely on zope.publisher for this, which in turn relied on cgi.FieldStorage; but cgi.FieldStorage is badly broken in some situations on Python 3. Even if that bug were fixed in a more recent version of Python, we can’t easily use anything newer than 3.5 for the first stage of our port due to the version of the base OS we’re currently running, so it wouldn’t help much. In the end I fixed some minor issues in the multipart module (and was kindly given co-maintenance of it) and converted zope.publisher to use it. Although this took a while to sort out, it seems to have gone very well.

A couple of other interesting late-arriving issues were around pickle. For most things we normally prefer safer formats such as JSON, but there are a few cases where we use pickle, particularly for our session databases. One of my colleagues pointed out that I needed to remember to tell pickle to stick to protocol 2, so that we’d be able to switch back and forward between Python 2 and 3 for a while; quite right, and we later ran into a similar problem with marshal too. A more surprising problem was that datetime.datetime objects pickled on Python 2 require special care when unpickling on Python 3; rather than the approach that ended up being implemented and documented for Python 3.6, though, I preferred a custom unpickler, both so that things would work on Python 3.5 and so that I wouldn’t have to risk affecting the decoding of other pickled strings in the session database.

General lessons

Writing this over a year after Python 2’s end-of-life date, and certainly nowhere near the leading edge of Python 3 porting work, it’s perhaps more useful to look at this in terms of the lessons it has for other large technical debt projects.

I mentioned in my previous article that I used the approach of an enormous and frequently-rebased git branch as a working area for the port, committing often and sometimes combining and extracting commits for review once they seemed to be ready. A port of this scale would have been entirely intractable without a tool of similar power to git rebase, so I’m very glad that we finished migrating to git in 2019. I relied on this right up to the end of the port, and it also allowed for quick assessments of how much more there was to land. git worktree was also helpful, in that I could easily maintain working trees built for each of Python 2 and 3 for comparison.

As is usual for most multi-developer projects, all changes to Launchpad need to go through code review, although we sometimes make exceptions for very simple and obvious changes that can be self-reviewed. Since I knew from the outset that this was going to generate a lot of changes for review, I therefore structured my work from the outset to try to make it as easy as possible for my colleagues to review it. This generally involved keeping most changes to a somewhat manageable size of 800 lines or less (although this wasn’t always possible), and arranging commits mainly according to the kind of change they made rather than their location. For example, when I needed to fix issues with / in Python 3 being true division rather than floor division, I did so in one commit across the various places where it mattered and took care not to mix it with other unrelated changes. This is good practice for nearly any kind of development, but it was especially important here since it allowed reviewers to consider a clear explanation of what I was doing in the commit message and then skim-read the rest of it much more quickly.

It was vital to keep the codebase in a working state at all times, and deploy to production reasonably often: this way if something went wrong the amount of code we had to debug to figure out what had happened was always tractable. (Although I can’t seem to find it now to link to it, I saw an account a while back of a company that had taken a flag-day approach instead with a large codebase. It seemed to work for them, but I’m certain we couldn’t have made it work for Launchpad.)

I can’t speak too highly of Launchpad’s test suite, much of which originated before my time. Without a great deal of extensive coverage of all sorts of interesting edge cases at both the unit and functional level, and a corresponding culture of maintaining that test suite well when making new changes, it would have been impossible to be anything like as confident of the port as we were.

As part of the porting work, we split out a couple of substantial chunks of the Launchpad codebase that could easily be decoupled from the core: its Mailman integration and its code import worker. Both of these had substantial dependencies with complex requirements for porting to Python 3, and arranging to be able to do these separately on their own schedule was absolutely worth it. Like disentangling balls of wool, any opportunity you can take to make things less tightly-coupled is probably going to make it easier to disentangle the rest. (I can see a tractable way forward to porting the code import worker, so we may well get that done soon. Our Mailman integration will need to be rewritten, though, since it currently depends on the Python-2-only Mailman 2, and Mailman 3 has a different architecture.)

Python lessons

Our database layer was already in pretty good shape for a port, since at least the modern bits of its table modelling interface were already strict about using Unicode for text columns. If you have any kind of pervasive low-level framework like this, then making it be pedantic at you in advance of a Python 3 port will probably incur much less swearing in the long run, as you won’t be trying to deal with quite so many bytes/text issues at the same time as everything else.

Early in our port, we established a standard set of __future__ imports and started incrementally converting files over to them, mainly because we weren’t yet sure what else to do and it seemed likely to be helpful. absolute_import was definitely reasonable (and not often a problem in our code), and print_function was annoying but necessary. In hindsight I’m not sure about unicode_literals, though. For files that only deal with bytes and text it was reasonable enough, but as I mentioned above there were also a number of cases where we needed literals of the language’s native str type, i.e. bytes in Python 2 and text in Python 3: this was particularly noticeable in WSGI contexts, but also cropped up in some other surprising places. We generally either omitted unicode_literals or used six.ensure_str in such cases, but it was definitely a bit awkward and maybe I should have listened more to people telling me it might be a bad idea.

A lot of Launchpad’s early tests used doctest, mainly in the style where you have text files that interleave narrative commentary with examples. The development team later reached consensus that this was best avoided in most cases, but by then there were far too many doctests to conveniently rewrite in some other form. Porting doctests to Python 3 is really annoying. You run into all the little changes in how objects are represented as text (particularly u'...' versus '...', but plenty of other cases as well); you have next to no tools to do anything useful like skipping individual bits of a doctest that don’t apply; using __future__ imports requires the rather obscure approach of adding the relevant names to the doctest’s globals in the relevant DocFileSuite or DocTestSuite; dealing with many exception tracebacks requires something like zope.testing.renormalizing; and whatever code refactoring tools you’re using probably don’t work properly. Basically, don’t have done that. It did all turn out to be tractable for us in the end, and I managed to avoid using much in the way of fragile doctest extensions aside from the aforementioned zope.testing.renormalizing, but it was not an enjoyable experience.

Regressions

I know of nine regressions that reached Launchpad’s production systems as a result of this porting work; of course there were various other regressions caught by CI or in manual testing. (Considering the size of this project, I count it as a resounding success that there were only nine production issues, and that for the most part we were able to fix them quickly.)

Equality testing of removed database objects

One of the things we had to do while porting to Python 3 was to implement the __eq__, __ne__, and __hash__ special methods for all our database objects. This was quite conceptually fiddly, because doing this requires knowing each object’s primary key, and that may not yet be available if we’ve created an object in Python but not yet flushed the actual INSERT statement to the database (most of our primary keys are auto-incrementing sequences). We thus had to take care to flush pending SQL statements in such cases in order to ensure that we know the primary keys.

However, it’s possible to have a problem at the other end of the object lifecycle: that is, a Python object might still be reachable in memory even though the underlying row has been DELETEd from the database. In most cases we don’t keep removed objects around for obvious reasons, but it can happen in caching code, and buildd-manager crashed as a result (in fact while it was still running on Python 2). We had to take extra care to avoid this problem.

Debian imports crashed on non-UTF-8 filenames

Python 2 has some unfortunate behaviour around passing bytes or Unicode strings (depending on the platform) to shutil.rmtree, and the combination of some porting work and a particular source package in Debian that contained a non-UTF-8 file name caused us to run into this. The fix was to ensure that the argument passed to shutil.rmtree is a str regardless of Python version.

We’d actually run into something similar before: it’s a subtle porting gotcha, since it’s quite easy to end up passing Unicode strings to shutil.rmtree if you’re in the process of porting your code to Python 3, and you might easily not notice if the file names in your tests are all encoded using UTF-8.

lazr.restful ETags

We eventually got far enough along that we could switch one of our four appserver machines (we have quite a number of other machines too, but the appservers handle web and API requests) to Python 3 and see what happened. By this point our extensive test suite had shaken out the vast majority of the things that could go wrong, but there was always going to be room for some interesting edge cases.

One of the Ubuntu kernel team reported that they were seeing an increase in 412 Precondition Failed errors in some of their scripts that use our webservice API. These can happen when you’re trying to modify an existing resource: the underlying protocol involves sending an If-Match header with the ETag that the client thinks the resource has, and if this doesn’t match the ETag that the server calculates for the resource then the client has to refresh its copy of the resource and try again. We initially thought that this might be legitimate since it can happen in normal operation if you collide with another client making changes to the same resource, but it soon became clear that something stranger was going on: we were getting inconsistent ETags for the same object even when it was unchanged. Since we’d recently switched a quarter of our appservers to Python 3, that was a natural suspect.

Our lazr.restful package provides the framework for our webservice API, and roughly speaking it generates ETags by serializing objects into some kind of canonical form and hashing the result. Unfortunately the serialization was dependent on the Python version in a few ways, and in particular it serialized lists of strings such as lists of bug tags differently: Python 2 used [u'foo', u'bar', u'baz'] where Python 3 used ['foo', 'bar', 'baz']. In lazr.restful 1.0.3 we switched to using JSON for this, removing the Python version dependency and ensuring consistent behaviour between appservers.

Memory leaks

This problem took the longest to solve. We noticed fairly quickly from our graphs that the appserver machine we’d switched to Python 3 had a serious memory leak. Our appservers had always been a bit leaky, but now it wasn’t so much “a small hole that we can bail occasionally” as “the boat is sinking rapidly”:

(Yes, this got in the way of working out what was going on with ETags for a while.)

I spent ages messing around with various attempts to fix this. Since only a quarter of our appservers were affected, and we could get by on 75% capacity for a while, it wasn’t urgent but it was definitely annoying. After spending some quality time with objgraph, for some time I thought traceback reference cycles might be at fault, and I sent a number of fixes to various upstream projects for those (e.g. zope.pagetemplate). Those didn’t help the leaks much though, and after a while it became clear to me that this couldn’t be the sole problem: Python has a cyclic garbage collector that will eventually collect reference cycles as long as there are no strong references to any objects in them, although it might not happen very quickly. Something else must be going on.

Debugging reference leaks in any non-trivial and long-running Python program is extremely arduous, especially with ORMs that naturally tend to end up with lots of cycles and caches. After a while I formed a hypothesis that zope.server might be keeping a strong reference to something, although I never managed to nail it down more firmly than that. This was an attractive theory as we were already in the process of migrating to Gunicorn for other reasons anyway, and Gunicorn also has a convenient max_requests setting that’s good at mitigating memory leaks. Getting this all in place took some time, but once we did we found that everything was much more stable:

This isn’t completely satisfying as we never quite got to the bottom of the leak itself, and it’s entirely possible that we’ve only papered over it using max_requests: I expect we’ll gradually back off on how frequently we restart workers over time to try to track this down. However, pragmatically, it’s no longer an operational concern.

Mirror prober HTTPS proxy handling

After we switched our script servers to Python 3, we had several reports of mirror probing failures. (Launchpad keeps lists of Ubuntu archive and image mirrors, and probes them every so often to check that they’re reasonably complete and up to date.) This only affected HTTPS mirrors when probed via a proxy server, support for which is a relatively recent feature in Launchpad and involved some code that we never managed to unit-test properly: of course this is exactly the code that went wrong. Sadly I wasn’t able to sort out that gap, but at least the fix was simple.

Non-MIME-encoded email headers

As I mentioned above, there were substantial changes in the email package between Python 2 and 3, and indeed between minor versions of Python 3. Our test coverage here is pretty good, but it’s an area where it’s very easy to have gaps. We noticed that a script that processes incoming email was crashing on messages with headers that were non-ASCII but not MIME-encoded (and indeed then crashing again when it tried to send a notification of the crash!). The only examples of these I looked at were spam, but we still didn’t want to crash on them.

The fix involved being somewhat more careful about both the handling of headers returned by Python’s email parser and the building of outgoing email notifications. This seems to be working well so far, although I wouldn’t be surprised to find the odd other incorrect detail in this sort of area.

Failure to handle non-ISO-8859-1 URL-encoded form input

Remember how I said that parsing HTTP form data was thorny? After we finished upgrading all our appservers to Python 3, people started reporting that they couldn’t post Unicode comments to bugs, which turned out to be only if the attempt was made using JavaScript, and was because I hadn’t quite managed to get URL-encoded form data working properly with zope.publisher and multipart. The current standard describes the URL-encoded format for form data as “in many ways an aberrant monstrosity”, so this was no great surprise.

Part of the problem was some very strange choices in zope.publisher dating back to 2004 or earlier, which I attempted to clean up and simplify. The rest was that Python 2’s urlparse.parse_qs unconditionally decodes percent-encoded sequences as ISO-8859-1 if they’re passed in as part of a Unicode string, so multipart needs to work around this on Python 2.

I’m still not completely confident that this is correct in all situations, but at least now that we’re on Python 3 everywhere the matrix of cases we need to care about is smaller.

Inconsistent marshalling of Loggerhead’s disk cache

We use Loggerhead for providing web browsing of Bazaar branches. When we upgraded one of its two servers to Python 3, we immediately noticed that the one still on Python 2 was failing to read back its revision information cache, which it stores in a database on disk. (We noticed this because it caused a deployment to fail: when we tried to roll out new code to the instance still on Python 2, Nagios checks had already caused an incompatible cache to be written for one branch from the Python 3 instance.)

This turned out to be a similar problem to the pickle issue mentioned above, except this one was with marshal, which I didn’t think to look for because it’s a relatively obscure module mostly used for internal purposes by Python itself; I’m not sure that Loggerhead should really be using it in the first place. The fix was relatively straightforward, complicated mainly by now needing to cope with throwing away unreadable cache data.

Ironically, if we’d just gone ahead and taken the nominally riskier path of upgrading both servers at the same time, we might never have had a problem here.

Intermittent bzr failures

Finally, after we upgraded one of our two Bazaar codehosting servers to Python 3, we had a report of intermittent bzr branch hangs. After some digging I found this in our logs:

Traceback (most recent call last):
  ...
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/conch/ssh/channel.py", line 136, in addWindowBytes
    self.startWriting()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/lazr/sshserver/session.py", line 88, in startWriting
    resumeProducing()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/internet/process.py", line 894, in resumeProducing
    for p in self.pipes.itervalues():
builtins.AttributeError: 'dict' object has no attribute 'itervalues'

I’d seen this before in our git hosting service: it was a bug in Twisted’s Python 3 port, fixed after 20.3.0 but unfortunately after the last release that supported Python 2, so we had to backport that patch. Using the same backport dealt with this.

Onwards!

SSH quoting

2021-06-11T11:22:21+01:00

A while back there was a thread on one of our company mailing lists about SSH quoting, and I posted a long answer to it. Since then a few people have asked me questions that caused me to reach for it, so I thought it might be helpful if I were to anonymize the original question and post my answer here.

The question was why a sequence of commands involving ssh and fiddly quoting produced the output they did. The first example was this:

$ ssh user@machine.local bash -lc "cd /tmp;pwd"
/home/user

Oh hi, my dubious life choices have been such that this is my specialist subject!

This is because SSH command-line parsing is not quite what you expect.

First, recall that your local shell will apply its usual parsing, and the actual OS-level execution of ssh will be like this:

[0]: ssh
[1]: user@machine.local
[2]: bash
[3]: -lc
[4]: cd /tmp;pwd

Now, the SSH wire protocol only takes a single string as the command, with the expectation that it should be passed to a shell by the remote end. The OpenSSH client deals with this by taking all its arguments after things like options and the target, which in this case are:

[0]: bash
[1]: -lc
[2]: cd /tmp;pwd

It then joins them with a single space:

bash -lc cd /tmp;pwd

This is passed as a string to the server, which then passes that entire string to a shell for evaluation, so as if you’d typed this directly on the server:

sh -c 'bash -lc cd /tmp;pwd'

The shell then parses this as two commands:

bash -lc cd /tmp
pwd

The directory change thus happens in a subshell (actually it doesn’t quite even do that, because bash -lc cd /tmp in fact ends up just calling cd because of the way bash -c parses multiple arguments), and then that subshell exits, then pwd is called in the outer shell which still has the original working directory.

The second example was this:

$ ssh user@machine.local bash -lc "pwd;cd /tmp;pwd"
/home/user
/tmp

Following the logic above, this ends up as if you’d run this on the server:

sh -c 'bash -lc pwd; cd /tmp; pwd'

The third example was this:

$ ssh user@machine.local bash -lc "cd /tmp;cd /tmp;pwd"
/tmp

And this is as if you’d run:

sh -c 'bash -lc cd /tmp; cd /tmp; pwd'

Now, I wouldn’t have implemented the SSH client this way, because I agree that it’s confusing. But /usr/bin/ssh is used as a transport for other things so much that changing its behaviour now would be enormously disruptive, so it’s probably impossible to fix. (I have occasionally agitated on openssh-unix-dev@ for at least documenting this better, but haven’t made much headway yet; I need to get round to preparing a documentation patch.) Once you know about it you can use the proper quoting, though. In this case that would simply be:

ssh user@machine.local 'cd /tmp;pwd'

Or if you do need to specifically invoke bash -l there for some reason (I’m assuming that the original example was reduced from something more complicated), then you can minimise your confusion by passing the whole thing as a single string in the form you want the remote sh -c to see, in a way that ensures that the quotes are preserved and sent to the server rather than being removed by your local shell:

ssh user@machine.local 'bash -lc "cd /tmp;pwd"'

Shell parsing is hard.

Porting Launchpad to Python 3: progress report

2020-09-25T12:01:40+01:00

Launchpad still requires Python 2, which in 2020 is a bit of a problem. Unlike a lot of the rest of 2020, though, there’s good reason to be optimistic about progress.

I’ve been porting Python 2 code to Python 3 on and off for a long time, from back when I was on the Ubuntu Foundations team and maintaining things like the Ubiquity installer. When I moved to Launchpad in 2015 it was certainly on my mind that this was a large body of code still stuck on Python 2. One option would have been to just accept that and leave it as it is, maybe doing more backporting work over time as support for Python 2 fades away. I’ve long been of the opinion that this would doom Launchpad to being unmaintainable in the long run, and since I genuinely love working on Launchpad - I find it an incredibly rewarding project - this wasn’t something I was willing to accept. We’re already seeing some of our important dependencies dropping support for Python 2, which is perfectly reasonable on their terms but which is starting to become a genuine obstacle to delivering important features when we need new features from newer versions of those dependencies. It also looks as though it may be difficult for us to run on Ubuntu 20.04 LTS (we’re currently on 16.04, with an upgrade to 18.04 in progress) as long as we still require Python 2, since we have some system dependencies that 20.04 no longer provides. And then there are exciting new features like type hints and async/await that we’d like to be able to use.

However, until last year there were so many blockers that even considering a port was barely conceivable. What changed in 2019 was sorting out a trifecta of core dependencies. We ported our database layer, Storm. We upgraded to modern versions of our Zope Toolkit dependencies (after contributing various fixes upstream, including some substantial changes to Zope’s test runner that we’d carried as local patches for some years). And we ported our Bazaar code hosting infrastructure to Breezy. With all that in place, a port seemed more of a realistic possibility.

Still, even with this, it was never going to be a matter of just following some standard porting advice and calling it good. Launchpad has almost a million lines of Python code in its main git tree, and around 250 dependencies of which a number are quite Launchpad-specific. In a project that size, not only is following standard porting advice an extremely time-consuming task in its own right, but just about every strange corner case is going to show up somewhere. (Did you know that StringIO.StringIO(None) and io.StringIO(None) do different things even after you account for the native string vs. Unicode text difference? How about the behaviour of .union() on a subclass of frozenset?) Launchpad’s test suite is fortunately extremely thorough, but even just starting up the test suite involves importing most of the data model code, so before you can start taking advantage of it you have to make a large fraction of the codebase be at least syntactically-correct Python 3 code and use only modules that exist in Python 3 while still working in Python 2; in a project this size that turns out to be a large effort on its own, and can be quite risky in places.

Canonical’s product engineering teams work on a six-month cycle, but it just isn’t possible to cram this sort of thing into six months unless you do literally nothing else, and “please can we put all feature development on hold while we run to stand still” is a pretty tough sell to even the most understanding management. Fortunately, we’ve been able to grow the Launchpad team in the last year or so, and so it’s been possible to put “Python 3” on our roadmap on the understanding that we aren’t going to get all the way there in one cycle, while still being able to do other substantial feature development work as well.

So, with all that preamble, what have we done this cycle? We’ve taken a two-pronged approach. From one end, we identified 147 classes that needed to be ported away from some compatibility code in our database layer that was substantially less friendly to Python 3: we’ve ported 38 of those, so there’s clearly a fair bit more to do, but we were able to distribute this work out among the team quite effectively. From the other end, it was clear that it would be very inefficient to do general porting work when any attempt to even run the test suite would run straight into the same crashes in the same order, so I set myself a target of getting the test suite to start up, and started hacking on an enormous git branch that I never expected to try to land directly: instead, I felt free to commit just about anything that looked reasonable and moved things forward even if it was very rough, and every so often went back to tidy things up and cherry-pick individual commits into a form that included some kind of explanation and passed existing tests so that I could propose them for review.

This strategy has been dramatically more successful than anything I’ve tried before at this scale. So far this cycle, considering only Launchpad’s main git tree, we’ve landed 137 Python-3-relevant merge proposals for a total of 39552 lines of git diff output, keeping our existing tests passing along the way and deploying incrementally to production. We have about 27000 more lines of patch at varying degrees of quality to tidy up and merge. Our main development branch is only perhaps 10 or 20 more patches away from the test suite being able to start up, at which point we’ll be able to get a buildbot running so that multiple developers can work on this much more easily and see the effect of their work. With the full unlanded patch stack, about 75% of the test suite passes on Python 3! This still leaves a long tail of several thousand tests to figure out and fix, but it’s a much more incrementally-tractable kind of problem than where we started.

Finally: the funniest (to me) bug I’ve encountered in this effort was the one I encountered in the test runner and fixed in zopefoundation/zope.testrunner#106: IDs of failing tests were written to a pipe, so if you have a test suite that’s large enough and broken enough then eventually that pipe would reach its capacity and your test runner would just give up and hang. Pretty annoying when it meant an overnight test run didn’t give useful results, but also eloquent commentary of sorts.

Porting Storm to Python 3

2019-09-22T08:56:42+01:00

We released Storm 0.21 on Friday (the release announcement seems to be stuck in moderation, but you can look at the NEWS file directly). For me, the biggest part of this release was adding Python 3 support.

Storm is a really nice and lightweight ORM (object-relational mapper) for Python, developed by Canonical. We use it for some major products (Launchpad and Landscape are the ones I know of), and it’s also free software and used by some other folks as well. Other popular ORMs for Python include SQLObject, SQLAlchemy and the Django ORM; we use those in various places too depending on the context, but personally I’ve always preferred Storm for the readability of code that uses it and for how easy it is to debug and extend it.

It’s been a problem for a while that Storm only worked with Python 2. It’s one of a handful of major blockers to getting Launchpad running on Python 3, which we definitely want to do; stoq ended up with a local fork of Storm to cope with this; and it was recently removed from Debian for this and other reasons. None of that was great. So, with significant assistance from a large patch contributed by Thiago Bellini, and with patient code review from Simon Poirier and some of my other colleagues, we finally managed to get that sorted out in this release.

In many ways, Storm was in fairly good shape already for a project that hadn’t yet been ported to Python 3: while its internal idea of which strings were bytes and which text required quite a bit of untangling in the way that Python 2 code usually does, its normal class used for text database columns was already Unicode which only accepted text input (unicode in Python 2), so it could have been a lot worse; this also means that applications that use Storm tend to get at least this part right even in Python 2. Aside from the bytes/text thing, many of the required changes were just the usual largely-mechanical ones that anyone who’s done 2-to-3 porting will be familiar with. But there were some areas that required non-trivial thought, and I’d like to talk about some of those here.

Exception types

Concrete database implementations such as psycopg2 raise implementation-specific exception types. The inheritance hierarchy for these is defined by the Python Database API (DB-API), but the actual exception classes aren’t in a common place; rather, you might get an instance of psycopg2.errors.IntegrityError when using PostgreSQL but an instance of sqlite3.IntegrityError when using SQLite. To make things easier for applications that don’t have a strict requirement for a particular database backend, Storm arranged to inject its own virtual exception types as additional base classes of these concrete exceptions by patching their __bases__ attribute, so for example, you could import IntegrityError from storm.exceptions and catch that rather than having to catch each backend-specific possibility.

Although this was always a bit of a cheat, it worked well in practice for a while, but the first sign of trouble even before porting to Python 3 was with psycopg2 2.5. This release started implementing its DB-API exception types in a C extension, which meant that it was no longer possible to patch __bases__. To get around that, a few years ago I landed a patch to Storm to use abc.ABCMeta.register instead to register the DB-API exceptions as virtual subclasses of Storm’s exceptions, which solved the problem for Python 2. However, even at the time I landed that, I knew that it would be a porting obstacle due to Python issue 12029; Django ran into that as well.

In the end, I opted to refactor how Storm handles exceptions: it now wraps cursor and connection objects in such a way as to catch DB-API exceptions raised by their methods and properties and re-raise them using wrapper exception types that inherit from both the appropriate subclass of StormError and the original DB-API exception type, and with some care I even managed to avoid this being painfully repetitive. Out-of-tree database backends will need to make some minor adjustments (removing install_exceptions, adding an _exception_module property to their Database subclass, adjusting the raw_connect method of their Database subclass to do exception wrapping, and possibly implementing _make_combined_exception_type and/or _wrap_exception if they need to add extra attributes to the wrapper exceptions). Applications that follow the usual Storm idiom of catching StormError or any of its subclasses should continue to work without needing any changes.

SQLObject compatibility

Storm includes some API compatibility with SQLObject; this was from before my time, but I believe it was mainly because Launchpad and possibly Landscape previously used SQLObject and this made the port to Storm very much easier. It still works fine for the parts of Launchpad that haven’t been ported to Storm, but I wouldn’t be surprised if there were newer features of SQLObject that it doesn’t support.

The main question here was what to do with StringCol and its associated AutoUnicodeVariable. I opted to make these explicitly only accept text on Python 3, since the main reason for them to accept bytes was to allow using them with Python 2 native strings (i.e. str), and on Python 3 str is already text so there’s much less need for the porting affordance in that case.

Since releasing 0.21 I realised that the StringCol implementation in SQLObject itself in fact accepts both bytes and text even on Python 3, so it’s possible that we’ll need to change this in the future, although we haven’t yet found any real code using Storm’s SQLObject compatibility layer that might rely on this. Still, it’s much easier for Storm to start out on the stricter side and perhaps become more lenient than it is to go the other way round.

inspect.getargspec

Storm had some fairly complicated use of inspect.getargspec on Python 2 as part of its test mocking arrangements. This didn’t work in Python 3 due to some subtleties relating to bound methods. I switched to the modern inspect.signature API in Python 3 to fix this, which in any case is rather simpler with the exception of a wrinkle in how method descriptors work.

(It’s possible that these mocking arrangements could be simplified nowadays by using some more off-the-shelf mocking library; I haven’t looked into that in any detail.)

What’s next?

I’m working on getting Storm back into Debian now, which will be with Python 3 support only since Debian is in the process of gradually removing Python 2 module support. Other than that I don’t really have any particular plans for Storm at the moment (although of course I’m not the only person with an interest in it), aside from ideally avoiding leaving six years between releases again. I expect we can go back into bug-fixing mode there for a while.

From the Launchpad side, I’ve recently made progress on one of the other major Python 3 blockers (porting Bazaar code hosting to Breezy, coming soon). There are still some other significant blockers, the largest being migrating to Mailman 3, subvertpy fixes so that we can port code importing to Breezy as well, and porting the lazr.restful stack; but we may soon be able to reach the point where it’s possible to start running interesting subsets of the test suite using Python 3 and categorising the failures, at which point we’ll be able to get a much better idea of how far we still have to go. Porting a project with the best part of a million lines of code and around three hundred dependencies is always going to take a while, but I’m happy to be making progress there, both due to Python 2’s impending end of upstream support and so that eventually we can start using new language facilities.

man-db 2.8.7

2019-08-27T06:55:25+01:00

I’ve released man-db 2.8.7 (announcement, NEWS), and uploaded it to Debian unstable.

There are a few things of note that I wanted to talk about here. Firstly, I made some further improvements to the seccomp sandbox originally introduced in 2.8.0. I do still think it’s correct to try to confine subprocesses this way as a defence against malicious documents, but it’s also been a pretty rough ride for some users, especially those who use various kinds of VPNs or antivirus programs that install themselves using /etc/ld.so.preload and cause other programs to perform additional system calls. As well as a few specific tweaks, a recent discussion on LWN reminded me that it would be better to make seccomp return EPERM rather than raising SIGSYS, since that’s easier to handle gracefully: in particular, it fixes an odd corner case related to glibc’s nscd handling.

Secondly, there was a build failure on macOS that took a while to figure out, not least because I don’t have a macOS test system myself. In 2.8.6 I tried to make life easier for people on this platform with a CFLAGS tweak, but I made it a bit too general and accidentally took away configure’s ability to detect undefined symbols properly, which caused very confusing failures. More importantly, I hadn’t really thought through why this change was necessary and whether it was a good idea. man-db uses private shared libraries to keep its executable size down, and it passes -no-undefined to libtool to declare that those shared libraries have no undefined symbols after linking, which is necessary to build shared libraries on some platforms. But the CFLAGS tweak above directly contradicts this! So, instead of playing core wars with my own build system, I did some refactoring so that the assertion that man-db’s shared libraries have no undefined symbols after linking is actually true: this involved moving decompression code out of libman, and arranging for the code in libmandb to take the database path as a parameter rather than as a global variable (something I’ve meant to fix for ages anyway; 252d7cbc23, 036aa910ea, a97d977b0b). Lesson: don’t make build system changes you don’t quite understand.

Buster upgrade

2019-05-05T01:10:48+01:00

I upgraded my home server from Debian stretch to buster recently, which is something I normally do once we’re frozen: this is a system that was first installed in 1999 and has a lot of complicated stuff on it, and while I try to keep it as cleanly-maintained as I can it still often runs into some interesting problems. Things went largely OK this time round, although there were a few snags of various degrees of severity, some of which weren’t Debian’s fault.

As ever, etckeeper made it much more comfortable to make non-trivial configuration file changes without fearing that I was going to lose information.

The first apt full-upgrade failed part-way through with “dependency problems prevent processing triggers for desktop-file-utils” for what didn’t seem like a particularly good reason; dpkg --configure -a sorted it out and I was able to resume the upgrade from there. I think I’ve seen a report of this somewhere recently as it rang a bell, though I haven’t yet found it.
I had a number of truly annoying configuration file resolutions to perform. There’s not much to be done about that except try to gradually move things to .d directories where available, and other such strategies to minimise the local differences I’m maintaining.
I had an old backup disk that had failed some time ago but was still plugged in and occasionally generating ATA errors. These made some parts of the upgrade excruciatingly slow, so as soon as I got to a point where I had to reboot anyway I took the opportunity to open up the case and unplug it.
I hit #919621 “lvm2: Update unexpectedly activates system ID check, bypassing impossible”. Fortunately I noticed the problem before rebooting due to warning messages from various things, and I adjusted my LVM configuration to set a system ID matching the one in my volume group. Unfortunately I forgot to run update-initramfs -u after doing so, and so I ended up having to use break=premount on the kernel command line and fix things up in the same way in the initramfs until I could update it properly. I’m not sure what the right fix for this is, although it probably only affects some rather old VGs; I created mine in 2004.
I ran into #924881 “postgresql: buster upgrade breaks older postgresql (9.6) and newer postgresql (11) is also inoperative” (in fact a bug in ssl-cert). It was correct to reject the snakeoil certificate, but the upgrade failure mode was pretty graceless and it would have been helpful for something to notice the situation and prompt me to regenerate the certificate.
My networking wasn’t happy after the upgrade; I ended up with some missing addresses, which I’m prepared to believe was the fault of my very old and badly-organised /etc/network/interfaces file, so I rearranged it to follow what seems to be the modern best practice of handling multiple addresses on an interface by just having one iface stanza per address using the same interface name, rather than pre-up ip addr add lines or alias interfaces or anything like that. After that, the interface sometimes refused to come up at all with “ADDRCONF(NETDEV_UP): eth0: link is not ready” messages. Some web-searching and grepping of the kernel source led me to the idea that listing inet6 stanzas before inet stanzas for a given interface name was likely to be helpful, and so it proved: I now have an /etc/network/interfaces that both works and is much easier to read.
I had to do some manual steps to get Icinga Web 2 authentication working again: I followed the upstream directions to upgrade the database schema, and I had to run a2enmod php7.3 manually since the previous enablement of php7.0 wasn’t carried over. (I’m not completely sure if the first step was required, but the second certainly was.)

Other than that, everything seems to be working well now.

binfmt-support 2.2.0

2019-01-25T11:21:57+00:00

I’ve released binfmt-support 2.2.0. These are the major changes since 2.1.8:

Remove support for the old procfs interface, which has been unused since Linux 2.4.13 and which caused trouble in environments where we can’t use modprobe. Thanks to Bastian Blank.
Sort formats by name in the output of update-binfmts --display.
Building binfmt-support now requires Autoconf >= 2.63.
Add a new --unimport action, which is the inverse of --import.
Don’t enable formats on import or disable them on unimport unless /proc/sys/fs/binfmt_misc is already mounted. This avoids causing cleanup problems in chroots.
--fix-binary yes is incompatible with detectors. Warn the user if they try to use both at once. Thanks to Stefan Agner.

In the corresponding Debian upload (2.2.0-1), I’ve changed README.Debian to recommend using update-binfmts --unimport <name> in the prerm rather than a more complicated update-binfmts --package <package> --remove <name> <path> command. I don’t intend to push for existing packages to switch over to this before buster, though, since the stricter package relationships needed to arrange for a new enough version of binfmt-support to be present when the prerm runs would make the upgrade path more complicated, and it isn’t an urgent change.

Deploying Swift

2018-12-04T01:37:11+00:00

Sometimes I want to deploy Swift, the OpenStack object storage system.

Well, no, that’s not true. I basically never actually want to deploy Swift as such. What I generally want to do is to debug some bit of production service deployment machinery that relies on Swift for getting build artifacts into the right place, or maybe the parts of the Launchpad librarian (our blob storage service) that use Swift. I could find an existing private or public cloud that offers the right API and test with that, but sometimes I need to test with particular versions, and in any case I have a terribly slow internet connection and shuffling large build artifacts back and forward over the relevant bit of wet string makes it painfully slow to test things.

For a while I’ve had an Ubuntu 12.04 VM lying around with an Icehouse-based Swift deployment that I put together by hand. It works, but I didn’t keep good notes and have no real idea how to reproduce it, not that I really want to keep limping along with manually-constructed VMs for this kind of thing anyway; and I don’t want to be dependent on obsolete releases forever. For the sorts of things I’m doing I need to make sure that authentication works broadly the same way as it does in a real production deployment, so I want to have Keystone too. At the same time, I definitely don’t want to do anything close to a full OpenStack deployment of my own: it’s much too big a sledgehammer for this particular nut, and I don’t really have the hardware for it.

Here’s my solution to this, which is compact enough that I can run it on my laptop, and while it isn’t completely automatic it’s close enough that I can spin it up for a test and discard it when I’m finished (so I haven’t worried very much about producing something that runs efficiently). It relies on Juju and LXD. I’ve only tested it on Ubuntu 18.04, using Queens; for anything else you’re on your own. In general, I probably can’t help you if you run into trouble with the directions here: this is provided “as is”, without warranty of any kind, and all that kind of thing.

First, install Juju and LXD if necessary, following the instructions provided by those projects, and also install the python-openstackclient package as you’ll need it later. You’ll want to set Juju up to use LXD, and you should probably make sure that the shells you’re working in don’t have http_proxy set as it’s quite likely to confuse things unless you’ve arranged for your proxy to be able to cope with your local LXD containers. Then add a model:

juju add-model swift

At this point there’s a bit of complexity that you normally don’t have to worry about with Juju. The swift-storage charm wants to mount something to use for storage, which with the LXD provider in practice ends up being some kind of loopback mount. Unfortunately, being able to perform loopback mounts exposes too much kernel attack surface, so LXD doesn’t allow unprivileged containers to do it. (Ideally the swift-storage charm would just let you use directory storage instead.) To make the containers we’re about to create privileged enough for this to work, run:

lxc profile set juju-swift security.privileged true
lxc profile device add juju-swift loop-control unix-char \
    major=10 minor=237 path=/dev/loop-control
for i in $(seq 0 255); do
    lxc profile device add juju-swift loop$i unix-block \
        major=7 minor=$i path=/dev/loop$i
done

Now we can start deploying things! Save this to a file, e.g. swift.bundle:

series: bionic
description: "Swift in a box"
applications:
  mysql:
    charm: "cs:mysql-62"
    channel: candidate
    num_units: 1
    options:
      dataset-size: 512M
  keystone:
    charm: "cs:keystone"
    num_units: 1
  swift-storage:
    charm: "cs:swift-storage"
    num_units: 1
    to: [keystone]
    options:
      block-device: "/etc/swift/storage.img|5G"
  swift-proxy:
    charm: "cs:swift-proxy"
    num_units: 1
    to: [mysql]
    options:
      zone-assignment: auto
      replicas: 1
relations:
  - ["keystone:shared-db", "mysql:shared-db"]
  - ["swift-proxy:swift-storage", "swift-storage:swift-storage"]
  - ["swift-proxy:identity-service", "keystone:identity-service"]

And run:

juju deploy ./swift.bundle

This will take a while. You can run juju status to see how it’s going in general terms, or juju debug-log for detailed logs from the individual containers as they’re putting themselves together. When it’s all done, it should look something like this:

Model  Controller  Cloud/Region     Version  SLA
swift  lxd         localhost        2.3.1    unsupported

App            Version  Status  Scale  Charm          Store       Rev  OS      Notes
keystone       13.0.1   active      1  keystone       jujucharms  290  ubuntu
mysql          5.7.24   active      1  mysql          jujucharms   62  ubuntu
swift-proxy    2.17.0   active      1  swift-proxy    jujucharms   75  ubuntu
swift-storage  2.17.0   active      1  swift-storage  jujucharms  250  ubuntu

Unit              Workload  Agent  Machine  Public address  Ports     Message
keystone/0*       active    idle   0        10.36.63.133    5000/tcp  Unit is ready
mysql/0*          active    idle   1        10.36.63.44     3306/tcp  Ready
swift-proxy/0*    active    idle   1        10.36.63.44     8080/tcp  Unit is ready
swift-storage/0*  active    idle   0        10.36.63.133              Unit is ready

Machine  State    DNS           Inst id        Series  AZ  Message
0        started  10.36.63.133  juju-d3e703-0  bionic      Running
1        started  10.36.63.44   juju-d3e703-1  bionic      Running

At this point you have what should be a working installation, but with only administrative privileges set up. Normally you want to create at least one normal user. To do this, start by creating a configuration file granting administrator privileges (this one comes verbatim from the openstack-base bundle, though with one change that isn’t yet in the charm store version at the time of writing):

_OS_PARAMS=$(env | awk 'BEGIN {FS="="} /^OS_/ {print $1;}' | paste -sd ' ')
for param in $_OS_PARAMS; do
    if [ "$param" = "OS_AUTH_PROTOCOL" ]; then continue; fi
    if [ "$param" = "OS_CACERT" ]; then continue; fi
    unset $param
done
unset _OS_PARAMS

_keystone_unit=$(juju status keystone --format yaml | \
    awk '/units:$/ {getline; gsub(/:$/, ""); print $1; exit}')
_keystone_ip=$(juju run --unit ${_keystone_unit} 'unit-get private-address')
_password=$(juju run --unit ${_keystone_unit} 'leader-get admin_passwd')

export OS_AUTH_URL=${OS_AUTH_PROTOCOL:-http}://${_keystone_ip}:5000/v3
export OS_USERNAME=admin
export OS_PASSWORD=${_password}
export OS_USER_DOMAIN_NAME=admin_domain
export OS_PROJECT_DOMAIN_NAME=admin_domain
export OS_PROJECT_NAME=admin
export OS_REGION_NAME=RegionOne
export OS_IDENTITY_API_VERSION=3
# Swift needs this:
export OS_AUTH_VERSION=3
# Gnocchi needs this
export OS_AUTH_TYPE=password

Source this into a shell: for instance, if you saved this to ~/.swiftrc.juju-admin, then run:

. ~/.swiftrc.juju-admin

You should now be able to run openstack endpoint list and see a table for the various services exposed by your deployment. Then you can create a dummy project and a user with enough privileges to use Swift:

USERNAME=your-username
PASSWORD=your-password
openstack domain create SwiftDomain
openstack project create --domain SwiftDomain --description Swift \
    SwiftProject
openstack user create --domain SwiftDomain --project-domain SwiftDomain \
    --project SwiftProject --password "$PASSWORD" "$USERNAME"
openstack role add --project SwiftProject --user-domain SwiftDomain \
    --user "$USERNAME" Member

(This is intended for testing rather than for doing anything particularly sensitive. If you cared about keeping the password secret then you’d use the --password-prompt option to openstack user create instead of supplying the password on the command line.)

Now create a configuration file granting privileges for the user you just created. I felt like automating this to at least some degree:

touch ~/.swiftrc.juju
chmod 600 ~/.swiftrc.juju
sed '/^_password=/d;
     s/\( OS_PROJECT_DOMAIN_NAME=\).*/\1SwiftDomain/;
     s/\( OS_PROJECT_NAME=\).*/\1SwiftProject/;
     s/\( OS_USER_DOMAIN_NAME=\).*/\1SwiftDomain/;
     s/\( OS_USERNAME=\).*/\1'"$USERNAME"'/;
     s/\( OS_PASSWORD=\).*/\1'"$PASSWORD"'/' \
     <~/.swiftrc.juju-admin >~/.swiftrc.juju

Source this into a shell. For example:

. ~/.swiftrc.juju

You should now find that swift list works. Success! Now you can swift upload files, or just start testing whatever it was that you were actually trying to test in the first place.

This is not a setup I expect to leave running for a long time, so to tear it down again:

juju destroy-model swift

This will probably get stuck trying to remove the swift-storage unit, since nothing deals with detaching the loop device. If that happens, find the relevant device in losetup -a from another window and use losetup -d to detach it; juju destroy-model should then be able to proceed.

Credit to the Juju and LXD teams and to the maintainers of the various charms used here, as well as of course to the OpenStack folks: their work made it very much easier to put this together.

2019-01-18: Edited to deploy to two containers rather than four, and to incorporate a ~/.swiftrc.juju-admin change to cope with that.

An odd test failure

2017-12-19T13:52:52+00:00

Weird test failures are great at teaching you things that you didn’t realise you might need to know.

As previously mentioned, I’ve been working on converting Launchpad from Buildout to virtualenv and pip, and I finally landed that change on our development branch today. The final landing was mostly quite smooth, except for one test failure on our buildbot that I hadn’t seen before:

ERROR: lp.codehosting.codeimport.tests.test_worker.TestBzrSvnImport.test_stacked
worker ID: unknown worker (bug in our subunit output?)
----------------------------------------------------------------------
Traceback (most recent call last):
_StringException: log: {{{
36.384  creating repository in file:///tmp/testbzr-6CwSLV.tmp/lp.codehosting.codeimport.tests.test_worker.TestBzrSvnImport.test_stacked/work/stacked-on/.bzr/.
36.388  creating branch <bzrlib.branch.BzrBranchFormat7 object at 0xeb85b36c> in file:///tmp/testbzr-6CwSLV.tmp/lp.codehosting.codeimport.tests.test_worker.TestBzrSvnImport.test_stacked/work/stacked-on/
}}}

Traceback (most recent call last):
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/lib/lp/codehosting/codeimport/tests/test_worker.py", line 1108, in test_stacked
    stacked_on.fetch(Branch.open(source_details.url))
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/branch.py", line 186, in open
    possible_transports=possible_transports, _unsupported=_unsupported)
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/controldir.py", line 689, in open
    _unsupported=_unsupported)
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/controldir.py", line 718, in open_from_transport
    find_format, transport, redirected)
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/transport/__init__.py", line 1719, in do_catching_redirections
    return action(transport)
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/controldir.py", line 706, in find_format
    probers=probers)
  File "/srv/buildbot/lpbuildbot/lp-devel-xenial/build/env/local/lib/python2.7/site-packages/bzrlib/controldir.py", line 1155, in find_format
    raise errors.NotBranchError(path=transport.base)
NotBranchError: Not a branch: "/tmp/tmpdwqrc6/trunk/".

When I investigated this locally, I found that I could reproduce it if I ran just that test on its own, but not if I ran it together with the other tests in the same class. That’s certainly my favourite way round for test isolation failures to present themselves (it’s more usual to find state from one test leaking out and causing another one to fail, which can make for a very time-consuming exercise of trying to find the critical combination), but it’s still pretty odd.

I stepped through the Branch.open call in each case in the hope of some enlightenment. The interesting difference was that the custom probers installed by the bzr-svn plugin weren’t installed when I ran that one test on its own, so it was trying to open a branch as a Bazaar branch rather than using the foreign-branch logic for Subversion, and this presumably depended on some configuration that only some tests put in place. I was on the verge of just explicitly setting up that plugin in the test suite’s setUp method, but I was still curious about exactly what was breaking this.

Launchpad installs several Bazaar plugins, and lib/lp/codehosting/__init__.py is responsible for putting most of these in place: anything in Launchpad itself that uses Bazaar is generally supposed to do something like import lp.codehosting to set everything up. I therefore put a breakpoint at the top of lp.codehosting and stepped through it to see whether anything was going wrong in the initial setup. Sure enough, I found that bzrlib.plugins.svn was failing to import due to an exception raised by bzrlib.i18n.load_plugin_translations, which was being swallowed silently but meant that its custom probers weren’t being installed. Here’s what that function looks like:

def load_plugin_translations(domain):
    """Load the translations for a specific plugin.

    :param domain: Gettext domain name (usually 'bzr-PLUGINNAME')
    """
    locale_base = os.path.dirname(
        unicode(__file__, sys.getfilesystemencoding()))
    translation = install_translations(domain=domain,
        locale_base=locale_base)
    add_fallback(translation)
    return translation

In this case, sys.getfilesystemencoding was returning None, which isn’t a valid encoding argument to unicode. But why would that be? It gave me a sensible result when I ran it from a Python shell in this environment. A bit of head-scratching later and it occurred to me to look at a backtrace:

(Pdb) bt
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/env/lib/python2.7/site.py(703)<module>()
-> main()
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/env/lib/python2.7/site.py(694)main()
-> execsitecustomize()
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/env/lib/python2.7/site.py(548)execsitecustomize()
-> import sitecustomize
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/env/lib/python2.7/sitecustomize.py(7)<module>()
-> lp_sitecustomize.main()
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/lib/lp_sitecustomize.py(193)main()
-> dont_wrap_bzr_branch_classes()
  /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/lib/lp_sitecustomize.py(139)dont_wrap_bzr_branch_classes()
-> import lp.codehosting
> /home/cjwatson/src/canonical/launchpad/lp-branches/testfix/lib/lp/codehosting/__init__.py(54)<module>()
-> load_plugins([_get_bzr_plugins_path()])

I wonder if there’s something interesting about being imported from a sitecustomize hook? Sure enough, when I went to look at Python for where sys.getfilesystemencoding is set up, I found this in Py_InitializeEx:

    if (!Py_NoSiteFlag)
        initsite(); /* Module site */
    ...
#if defined(Py_USING_UNICODE) && defined(HAVE_LANGINFO_H) && defined(CODESET)
    /* On Unix, set the file system encoding according to the
       user's preference, if the CODESET names a well-known
       Python codec, and Py_FileSystemDefaultEncoding isn't
       initialized by other means. Also set the encoding of
       stdin and stdout if these are terminals, unless overridden.  */

    if (!overridden || !Py_FileSystemDefaultEncoding) {
        ...
    }

I moved this out of sitecustomize, and it’s working better now. But did you know that a sitecustomize hook can’t safely use anything that depends on sys.getfilesystemencoding? I certainly didn’t, until it bit me.

Kitten Block equivalent for Firefox 57

2017-11-16T19:15:48+00:00

I’ve been using Kitten Block for years, since I don’t really need the blood pressure spike caused by accidentally following links to certain UK newspapers. Unfortunately it hasn’t been ported to Firefox 57. I tried emailing the author a couple of months ago, but my email bounced.

However, if your primary goal is just to block the websites in question rather than seeing kitten pictures as such (let’s face it, the internet is not short of alternative sources of kitten pictures), then it’s easy to do with uBlock Origin. After installing the extension if necessary, go to Tools → Add-ons → Extensions → uBlock Origin → Preferences → My filters, and add www.dailymail.co.uk and www.express.co.uk, each on its own line. (Of course you can easily add more if you like.) Voilà: instant tranquility.

Incidentally, this also works fine on Android. The fact that it was easy to install a good ad blocker without having to mess about with a rooted device or strange proxy settings was the main reason I switched to Firefox on my phone.

A mysterious bug with Twisted plugins

2017-09-26T11:20:14-04:00

I fixed a bug in Launchpad recently that led me deeper than I expected.

Launchpad uses Buildout as its build system for Python packages, and it’s served us well for many years. However, we’re using 1.7.1, which doesn’t support ensuring that packages required using setuptools’ setup_requires keyword only ever come from the local index URL when one is specified; that’s an essential constraint we need to be able to impose so that our build system isn’t immediately sensitive to downtime or changes in PyPI. There are various issues/PRs about this in Buildout (e.g. #238), but even if those are fixed it’ll almost certainly only be in Buildout v2, and upgrading to that is its own kettle of fish for other reasons. All this is a serious problem for us because newer versions of many of our vital dependencies (Twisted and testtools, to name but two) use setup_requires to pull in pbr, and so we’ve been stuck on old versions for some time; this is part of why Launchpad doesn’t yet support newer SSH key types, for instance. This situation obviously isn’t sustainable.

To deal with this, I’ve been working for some time on switching to virtualenv and pip. This is harder than you might think: Launchpad is a long-lived and complicated project, and it had quite a number of explicit and implicit dependencies on Buildout’s configuration and behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has helped a lot (12.04’s baseline virtualenv and pip have some deficiencies that would have required a more complicated bootstrapping procedure). I’ve dealt with most of these: for example, I had to reorganise a lot of our helper scripts (1, 2, 3), but there are still a few more things to go.

One remaining problem was that our Buildout configuration relied on building several different environments with different Python paths for various things. While this would technically be possible by way of building multiple virtualenvs, this would inflate our build time even further (we’re already going to have to cope with some slowdown as a result of using virtualenv, because the build system now has to do a lot more than constructing a glorified link farm to a bunch of cached eggs), and it seems like unnecessary complexity. The obvious thing to do seemed to be to collapse these into a single environment, since there was no obvious reason why it should actually matter if txpkgupload and txlongpoll were carefully kept off the path when running most of Launchpad: so I did that.

Then our build system got very sad.

Hmm, I thought. To keep our test times somewhat manageable, we run them in parallel across 20 containers, and we randomise the order in which they run to try to shake out test isolation bugs. It’s not completely unknown for there to be some oddities resulting from that. So I ran it again. Nope, but slightly differently sad this time. Furthermore, I couldn’t reproduce these failures locally no matter how hard I tried. Oh dear. This was obviously not going to be a good day.

In fact I spent a while on various different guesswork-based approaches. I found bug 571334 in Ampoule, an AMP-based process pool implementation that we use for some job runners, and proposed a fix for that, but cherry-picking that fix into Launchpad didn’t help matters. I tried backing out subsets of my changes and determined that if both txlongpoll and txpkgupload were absent from the Python module path in the context of the tests in question then everything was fine. I tried running strace locally and staring at the output for some time in the hope of enlightenment: that reminded me that the two packages in question install modules under twisted.plugins, which did at least establish a reason they might affect the environment that was more plausible than magic, but nothing much more specific than that.

On Friday I was fiddling about with this again and trying to insert some more debugging when I noticed some interesting behaviour around plugin caching. If I caused the txpkgupload plugin to raise an exception when loaded, the Twisted plugin system would remove its dropin.cache (because it was stale) and not create a new one (because there was now no content to put in it). After that, running the relevant tests would fail as I’d seen in our buildbot. Aha! This meant that I could also reproduce it by doing an even cleaner build than I’d previously tried to do, by removing the cached txpkgupload and txlongpoll eggs and allowing the build system to recreate them. When they were recreated, they didn’t contain dropin.cache, instead allowing that to be created on first use.

Based on this clue I was able to get to the answer relatively quickly. Ampoule has a specialised bootstrapping sequence for its worker processes that starts by doing this:

from twisted.application import reactors
reactors.installReactor(reactor)

Now, twisted.application.reactors.installReactor calls twisted.plugin.getPlugins, so the very start of this bootstrapping sequence is going to involve loading all plugins found on the module path (I assume it’s possible to write a plugin that adds an alternative reactor implementation). If dropin.cache is up to date, then it will just get the information it needs from that; but if it isn’t, it will go ahead and import the plugin. If the plugin happens (as Twisted code often does) to run from twisted.internet import reactor at some point while being imported, then that will install the platform’s default reactor, and then twisted.application.reactors.installReactor will raise ReactorAlreadyInstalledError. Since Ampoule turns this into an info-level log message for some reason, and the tests in question only passed through error-level messages or higher, this meant that all we could see was that a worker process had exited non-zero but not why.

The Twisted documentation recommends generating the plugin cache at build time for other reasons, but we weren’t doing that. Fixing that makes everything work again.

There are still a few more things needed to get us onto pip, but we’re now pretty close. After that we can finally start bringing our dependencies up to date.

env —chdir

2017-08-30T00:54:42+01:00

I was recently asked to sort things out so that snap builds on Launchpad could themselves install snaps as build-dependencies. To make this work we need to start doing builds in LXD containers rather than in chroots. As a result I’ve been doing some quite extensive refactoring of launchpad-buildd: it previously had the assumption that it was going to use a chroot for everything baked into lots of untested helper shell scripts, and I’ve been rewriting those in Python with unit tests and with a single Backend abstraction that isolates the high-level logic from the details of where each build is being performed.

This is all interesting work in its own right, but it’s not what I want to talk about here. While I was doing all this refactoring, I ran across a couple of methods I wrote a while back which looked something like this:

def chroot(self, args, echo=False):
    """Run a command in the chroot.

    :param args: the command and arguments to run.
    """
    args = set_personality(
        args, self.options.arch, series=self.options.series)
    if echo:
        print("Running in chroot: %s" %
              ' '.join("'%s'" % arg for arg in args))
        sys.stdout.flush()
    subprocess.check_call([
        "/usr/bin/sudo", "/usr/sbin/chroot", self.chroot_path] + args)

def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.

    This is unpleasant because we need to run it in /build under sudo
    chroot, and there's no way to do this without either a helper
    program in the chroot or unpleasant quoting.  We go for the
    unpleasant quoting.

    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    """
    args = [shell_escape(arg) for arg in args]
    if env:
        args = ["env"] + [
            "%s=%s" % (key, shell_escape(value))
            for key, value in env.items()] + args
    command = "cd /build && %s" % " ".join(args)
    self.chroot(["/bin/sh", "-c", command], echo=echo)

(I’ve already replaced the chroot method with a call to Backend.run, but it’s easier to see what I’m talking about in the original form.)

One thing to notice about this code is that it uses several adverbial commands: that is, commands that run another command in a different way. For example, sudo runs another command as another user, while chroot runs another command with a different root directory, and env runs another command with different environment variables set. These commands chain neatly, and they also have the useful property that they take the subsidiary command and its arguments as a list of arguments. coreutils has several other commands that behave this way, and adverbio is another useful example.

By contrast, su -c is something you might call a “quasi-adverbial” command: it does modify the behaviour of another command, but it takes it as a single argument which it then passes to sh -c. Every time you have something that’s passed to a shell like this, you need a corresponding layer of shell quoting to escape any shell metacharacters that should be interpreted literally. This is often cumbersome and is easy to get wrong. My Python implementation is as follows, and I wouldn’t be totally surprised to discover that it contained a bug:

import re

non_meta_re = re.compile(r'^[a-zA-Z0-9+,./:=@_-]+$')

def shell_escape(arg):
    if non_meta_re.match(arg):
        return arg
    else:
        return "'%s'" % arg.replace("'", "'\\''")

Python >= 3.3 has shlex.quote, which is an improvement and we should probably use that instead, but it’s still another thing to forget to call. This is why process-spawning libraries such as Python’s subprocess, Perl’s system and open, and my own libpipeline for C encourage programmers to use a list syntax and to avoid involving the shell entirely wherever possible.

One thing that the standard Unix tools don’t let you do in an adverbial way is to change your working directory, and I’ve run into this annoying limitation several times. This means that it’s difficult to chain that operation together with other adverbs, for example to run a command in a particular working directory inside a chroot. The workaround I used above was to invoke a shell that runs cd /build && ..., but that’s another command that’s only quasi-adverbial, since the extra shell means an extra layer of shell quoting.

(Ian Jackson rightly observes that you can in fact write the necessary adverb as something like sh -ec 'cd "$1"; shift; exec "$@"' chdir. I think that’s a bit uglier than I ideally want to use in production code, but you might reasonably think that it’s worth it to avoid the extra layer of shell quoting.)

I therefore decided that this was a feature that belonged in coreutils, and after a bit of mailing list discussion we felt it was best implemented as a new option to env(1). I sent a patch for this which has been accepted. This means that we have a new composable adverb, env --chdir=NEWDIR, which will allow the run_build_command method above to be rewritten as something like this:

def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.

    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    """
    env_args = ["env", "--chdir=/build"]
    if env:
        for key, value in env.items():
            env_args.append("%s=%s" % (key, value))
    self.chroot(env_args + args, echo=echo)

The env --chdir option will be in coreutils 8.28. We won’t be able to use it in launchpad-buildd until that’s available in all Ubuntu series we might want to build for, so in this particular application that’s going to take a few years; but other applications may well be able to make use of it sooner.

New address book

2017-06-27T12:57:27+01:00

I’ve had a kludgy mess of electronic address books for most of two decades, and have got rather fed up with it. My stack consisted of:

~/.mutt/aliases, a flat text file consisting of mutt alias commands
lbdb configuration to query ~/.mutt/aliases, Debian’s LDAP database, and Canonical’s LDAP database, so that I can search by name with Ctrl-t in mutt when composing a new message
Google Contacts, which I used from Android and was completely separate from all of the above

The biggest practical problem with this was that I had the address book that was most convenient for me to add things to (Google Contacts) and the one I used when sending email, and no sensible way to merge them or move things between them. I also wasn’t especially comfortable with having all my contact information in a proprietary web service.

My goals for a replacement address book system were:

free software throughout
storage under my control
single common database
minimal manual transcription when consolidating existing databases
integration with Android such that I can continue using the same contacts, messaging, etc. apps
integration with mutt such that I can continue using the same query interface
not having to write my own software, because honestly

I think I have all this now!

New stack

The obvious basic technology to use is CardDAV: it’s fairly complex, admittedly, but lots of software supports it and one of my goals was not having to write my own thing. This meant I needed a CardDAV server, some way to sync the database to and from both Android and the system where I run mutt, and whatever query glue was necessary to get mutt to understand vCards.

There are lots of different alternatives here, and if anything the problem was an embarrassment of choice. In the end I just decided to go for things that looked roughly the right shape for me and tried not to spend too much time in analysis paralysis.

CardDAV server

I went with Xandikos for the server, largely because I know Jelmer and have generally had pretty good experiences with their software, but also because using Git for history of the backend storage seems like something my future self will thank me for.

It isn’t packaged in stretch, but it’s in Debian unstable, so I installed it from there.

Rather than the standalone mode suggested on the web page, I decided to set it up in what felt like a more robust way using WSGI. I installed gunicorn and python3-gunicorn, created the following file in /etc/systemd/system/xandikos.socket:

[Unit]
Description=Xandikos socket

[Socket]
ListenStream=/run/xandikos.socket

[Install]
WantedBy=sockets.target

… and the following file in /etc/systemd/system/xandikos.service:

[Unit]
Description=Xandikos CalDAV/CardDAV server
Documentation=man:xandikos(1)
Requires=xandikos.socket

[Service]
User=xandikos
Group=xandikos
Restart=on-failure
ExecStart=/usr/bin/python3 /usr/bin/gunicorn --bind=unix:/run/xandikos.socket xandikos.wsgi:app
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
Environment=XANDIKOSPATH=/srv/xandikos/collections
ProtectSystem=strict
ProtectKernelTunables=yes
ProtectControlGroups=yes
PrivateDevices=yes
PrivateTmp=yes
ReadWritePaths=/run/xandikos.socket /srv/xandikos

The path (/srv/xandikos/collections) was arbitrary. You need to create the xandikos user and group first (adduser --system --group --no-create-home --disabled-login xandikos). I created /srv/xandikos owned by xandikos:xandikos and mode 0700. You should also run sudo -u xandikos xandikos -d /srv/xandikos/collections --autocreate and then Ctrl-c it after a short time (I think it would be nicer if there were a way to ask the WSGI wrapper to do this). If you aren’t using systemd then you can of course write equivalent init scripts instead.

For Apache setup, I kept it reasonably simple: I ran a2enmod proxy_http, used htpasswd to create /etc/apache2/xandikos.passwd with a username and password for myself, added a virtual host in /etc/apache2/sites-available/xandikos.conf, and enabled it with a2ensite xandikos:

<VirtualHost *:443>
        ServerName xandikos.example.org
        ServerAdmin me@example.org

        ErrorLog /var/log/apache2/xandikos-error.log
        TransferLog /var/log/apache2/xandikos-access.log

        <Location />
                ProxyPass "unix:/run/xandikos.socket|http://xandikos.riva.dynamic.greenend.org.uk/"
                AuthType Basic
                AuthName "Xandikos"
                AuthBasicProvider file
                AuthUserFile "/etc/apache2/xandikos.passwd"
                Require valid-user
        </Location>
</VirtualHost>

You should of course adjust the ProxyPass line to match your own deployment.

Then service apache2 reload, set the new virtual host up with Let’s Encrypt, reloaded again, and off we go.

Android integration

I installed DAVx⁵ from the Play Store: it cost a few pounds, but I was OK with that since it’s GPLv3 and I’m happy to help fund free software. I created two accounts, one for my existing Google Contacts database (and in fact calendaring as well, although I don’t intend to switch over to self-hosting that just yet), and one for the new Xandikos instance. The Google setup was a bit fiddly because I have two-step verification turned on so I had to create an app-specific password. The Xandikos setup was straightforward: base URL, username, password, and done.

Since I didn’t completely trust the new setup yet, I followed what seemed like the most robust option from the DAVx⁵ contacts syncing documentation, and used the stock contacts app to export my Google Contacts account to a .vcf file and then import that into the appropriate DAVx⁵ account (which showed up automatically). This seemed straightforward and everything got pushed to Xandikos. There are some weird delays in syncing contacts that I don’t entirely understand, but it all seems to get there in the end.

2019-06-13: Followed rename of DAVdroid to DAVx⁵. At the moment Google Contacts support seems to be flaky at best; see the DAVx⁵ forums for tips.

mutt integration

First off I needed to sync the contacts. (In fact I happen to run mutt on the same system where I run Xandikos at the moment, but I don’t want to rely on that, and going through the CardDAV server means that I don’t have to poke holes for myself using filesystem permissions.) I used vdirsyncer for this. In ~/.vdirsyncer/config:

[general]
status_path = "~/.vdirsyncer/status/"

[pair contacts]
a = "contacts_local"
b = "contacts_remote"
collections = ["from a", "from b"]

[storage contacts_local]
type = "filesystem"
path = "~/.contacts/"
fileext = ".vcf"

[storage contacts_remote]
type = "carddav"
url = "<Xandikos base URL>"
username = "<my username>"
password = "<my password>"

Running vdirsyncer discover and vdirsyncer sync then synced everything into ~/.contacts/. I added an hourly crontab entry to run vdirsyncer -v WARNING sync.

Next, I needed a command-line address book tool based on this. khard looked about right and is in stretch, so I installed that. In ~/.config/khard/khard.conf (this is mostly just the example configuration, but I preferred to sort by first name since not all my contacts have neat first/last names):

[addressbooks]
[[contacts]]
path = ~/.contacts/<UUID of my contacts collection>/

[general]
debug = no
default_action = list
editor = vim
merge_editor = vimdiff

[contact table]
# display names by first or last name: first_name / last_name
display = first_name
# group by address book: yes / no
group_by_addressbook = no
# reverse table ordering: yes / no
reverse = no
# append nicknames to name column: yes / no
show_nicknames = no
# show uid table column: yes / no
show_uids = yes
# sort by first or last name: first_name / last_name
sort = first_name

[vcard]
# extend contacts with your own private objects
# these objects are stored with a leading "X-" before the object name in the vcard files
# every object label may only contain letters, digits and the - character
# example:
#   private_objects = Jabber, Skype, Twitter
private_objects = Jabber, Skype, Twitter
# preferred vcard version: 3.0 / 4.0
preferred_version = 3.0
# Look into source vcf files to speed up search queries: yes / no
search_in_source_files = no
# skip unparsable vcard files: yes / no
skip_unparsable = no

Now khard list shows all my contacts. So far so good. Apparently there are some awkward vCard compatibility issues with creating or modifying contacts from the khard end. I’ve tried adding one address from ~/.mutt/aliases using khard and it seems to at least minimally work for me, but I haven’t explored this very much yet.

I had to install python3-vobject 0.9.4.1-1 from experimental to fix eventable/vobject#39 saving certain vCard files.

Finally, mutt integration. I already had set query_command="lbdbq '%s'" in ~/.muttrc, and I wanted to keep that in place since I still wanted to use LDAP querying as well. I had to write a very small amount of code for this (perhaps I should contribute this to lbdb upstream?), in ~/.lbdb/modules/m_khard:

#! /bin/sh

m_khard_query () {
    khard email --parsable --remove-first-line --search-in-source-files "$1"
}

My full ~/.lbdb/rc now reads as follows (you probably won’t want the LDAP stuff, but I’ve included it here for completeness):

MODULES_PATH="$MODULES_PATH $HOME/.lbdb/modules"
METHODS='m_muttalias m_khard m_ldap'
LDAP_NICKS='debian canonical'

Next steps

I’ve deleted one account from Google Contacts just to make sure that everything still works (e.g. I can still search for it when composing a new message), but I haven’t yet deleted everything. I won’t be adding anything new there though.

I need to push everything from ~/.mutt/aliases into the new system. This is only about 30 contacts so shouldn’t take too long.

Overall this feels like a big improvement! It wasn’t a trivial amount of setup for just me, but it means I have both better usability for myself and more independence from proprietary services, and I think I can add extra users with much less effort if I need to.

Postscript

A day later and I’ve consolidated all my accounts from Google Contacts and ~/.mutt/aliases into the new system, with the exception of one group that I had defined as a mutt alias and need to work out what to do with. This all went smoothly.

I’ve filed the new lbdb module as #866178, and the python3-vobject bug as #866181.

The sad tale of CVE-2015-1336

2016-12-11T23:42:55+00:00

Today I released man-db 2.7.6 (announcement, NEWS, git log), and uploaded it to Debian unstable. The major change in this release was a set of fixes for two security vulnerabilities, one of which affected all man-db installations since 2.3.12 (or 2.3.10-66 in Debian), and the other of which was specific to Debian and its derivatives.

It’s probably obvious from the dates here that this has not been my finest hour in terms of responding to security issues in a timely fashion, and I apologise for that. Some of this is just the usual life reasons, which I shan’t bore you by reciting, but some of it has been that fixing this properly in man-db was genuinely rather complicated and delicate. Since I’ve previously advocated man-db over some of its competitors on the basis of a better security posture, I think it behooves me to write up a longer description.

I took over maintaining man-db over fifteen years ago in slightly unexpected circumstances (I got annoyed with its bug list and made a couple of non-maintainer uploads, and then the previous maintainer died, so I ended up taking over both in Debian and upstream). I was a fairly new developer at the time, and there weren’t a lot of people I could ask questions of, but I did my best to recover as much of the history as I could and learn from it. One thing that became clear very quickly, both from my own inspection and from the bug list, was that most of the code had been written in a rather more innocent time. It was absolutely riddled with dangerous uses of the shell, poor temporary file handling, buffer overruns, and various common-or-garden deficiencies of that kind. I spent several years reworking large swathes of the codebase to be more robust against those kinds of bugs by design, and for example libpipeline came out of that effort.

The most subtle and risky set of problems came from the fact that the man and mandb programs were installed set-user-id to the man user. Part of this was so that man could maintain preformatted “cat pages”, and part of it was so that users could run mandb if the system databases were out of date (this is now much less useful since most package managers, including dpkg, support some kind of trigger mechanism that can run mandb whenever new system-level manual pages are installed). One of the first things I did was to make this optional, and this has been a disabled-by-default debconf option in Debian for a long time now. But it’s still a supported option and is enabled by default upstream, and when running setuid man and mandb need to take care to drop privileges when dealing with user-controlled data and to write files with the appropriate ownership and permissions.

My predecessor had problems related to this such as Debian #26002, and one of the ways they dealt with them was to make /var/cache/man/ set-group-id root, in order that files written to that directory would have consistent group ownership. This always struck me as rather strange and I meant to do something about it at some point, but until the first vulnerability report above I regarded it as mainly a curiosity, since nothing in there was group-writeable anyway. As a result, with the more immediate aim of making the system behave consistently and dealing with bug reports, various bits of code had accreted that assumed that /var/cache/man/ would be man:root 2755, and not all of it was immediately obvious.

This interacted with the second vulnerability report in two ways. Firstly, at some level it caused it because I was dealing with the day-to-day problems rather than thinking at a higher level: a series of bugs led me down the path of whacking problems over the head with a recursive chown of /var/cache/man/ from cron, rather than working out why things got that way in the first place. Secondly, once I’d done that, I couldn’t remove the chown without a much more extensive excursion into all the code that dealt with cache files, for fear of reintroducing those bugs. So although the fix for the second vulnerability is very simple in itself, I couldn’t get there without dealing with the first vulnerability.

In some ways, of course, cat pages are a bit of an anachronism. Most modern systems can format pages quickly enough that it’s not much of an issue. However, I’m loath to drop the feature entirely: I’m generally wary of assuming that just because I have a fast system that everyone does. So, instead, I did what I should have done years ago: make man and mandb set-group-id man as well as set-user-id man, at which point we can simply make all the cache files and directories be owned by man:man and drop the setgid bit on cache directories. This should be simpler and less prone to difficult-to-understand problems.

I expect that my next substantial upstream release will switch to --disable-setuid by default to reduce exposure, though, and distributions can start thinking about whether they want to follow that (Fedora already does, for example). If this becomes widely disabled without complaints then that would be good evidence that it’s reasonable to drop the feature entirely. I’m not in a rush, but if you do need cat pages then now is a good time to write to me and tell me why.

This is the fiddliest set of vulnerabilities I’ve dealt with in man-db for quite some time, so I hope that if there are more then I can get back to my previous quick response time.

No more “Hash Sum Mismatch” errors

2016-04-08T15:06:03+01:00

The Debian repository format was designed a long time ago. The oldest versions of it were produced with the help of tools such as dpkg-scanpackages and consumed by dselect access methods such as dpkg-ftp. The access methods just fetched a Packages file (perhaps compressed) and used it as an index of which packages were available; each package had an MD5 checksum to defend against transport errors, but being from a more innocent age there was no repository signing or other protection against man-in-the-middle attacks.

An important and intentional feature of the early format was that, apart from the top-level Packages file, all other files were static in the sense that, once published, their content would never change without also changing the file name. This means that repositories can be efficiently copied around using rsync without having to tell it to re-checksum all files, and it avoids network races when fetching updates: the repository you’re updating from might change in the middle of your update, but as long as the repository maintenance software keeps superseded packages around for a suitable grace period, you’ll still be able to fetch them.

The repository format evolved rather organically over time as different needs arose, by what one might call distributed consensus among the maintainers of the various client tools that consumed it. Of course all sorts of fields were added to the index files themselves, which have an extensible format so that this kind of thing is usually easy to do. At some point a Sources index for source packages was added, which worked pretty much the same way as Packages except for having a different set of fields. But by far the most significant change to the repository structure was the “package pools” project.

The original repository layout put the packages themselves under the dists/ tree along with the index files. The dists/ tree is organised by “suite” (modern examples of which would be “stable”, “stable-updates”, “testing”, “unstable”, “xenial”, “xenial-updates”, and so on). This meant that making a release of Debian tended to involve copying lots of data around, and implementing the “testing” suite would have been very costly. Package pools solved this problem by moving individual package files out of dists/ and into a new pool/ tree, allowing those files to be shared between multiple suites with only a negligible cost in disk space and mirror bandwidth. From a database design perspective this is obviously much more sensible. As part of this project, the original Debian “dinstall” repository maintenance scripts were replaced by “da-katie” or “dak”, which among other things used a new apt-ftparchive program to build the index files; this replaced dpkg-scanpackages and dpkg-scansources, and included its own database cache which made a big difference to performance at the scale of a distribution.

A few months after the initial implementation of package pools, Release files were added. These formed a sort of meta-index for each suite, telling APT which index files were available (main/binary-i386/Packages, non-free/source/Sources, and so on) and what their checksums were. Detached signatures were added alongside that (Release.gpg) so that it was now possible to fetch packages securely given a public key for the repository, and client-side verification support for this eventually made its way into Debian and Ubuntu. The repository structure stayed more or less like this for several years.

At some point along the way, those of us by now involved in repository maintenance realised that an important property had been lost. I mentioned earlier that the original format allowed race-free updates, but this was no longer true with the introduction of the Release file. A client now had to fetch Release and then fetch whichever other index files such as Packages they wanted, typically in separate HTTP transactions. If a client was unlucky, these transactions would fall on either side of a mirror update and they’d get a “Hash Sum Mismatch” error from APT. Worse, if a mirror was unlucky and also didn’t go to special lengths to verify index integrity (most don’t), its own updates could span an update of its upstream mirror and then all its clients would see mismatches until the next mirror update. This was compounded by using detached signatures, so Release and Release.gpg were fetched separately and could be out of sync.

Fixing this has been a long road (the first time I remember talking about this was in late 2007!), and we’ve had to take care to maintain client/server compatibility along the way. The first step was to add inline-signed versions of the Release file, called InRelease, so that there would no longer be a race between fetching Release and fetching its signature. APT has had this for a while, Debian’s repository supports it as of stretch, and we finally implemented it for Ubuntu six months ago. Dealing with the other index files is more complicated, though; it isn’t sensible to inline them, as clients usually only need to fetch a small fraction of all the indexes available for a given suite.

The solution we’ve ended up with, thanks to Michael Vogt’s work implementing it in APT, is called by-hash and should be familiar in concept to people who’ve used git: with the exception of the top-level InRelease file, index files for suites that support the by-hash mechanism may now be fetched using a URL based on one of their hashes listed in InRelease. This means that clients can now operate like this:

Fetch dists/xenial/InRelease
Fetch dists/xenial/main/binary-amd64/by-hash/SHA256/46316a202cdae76a73b555414741b11d08c66620b76c470a1623cedcc8a14740 (and so on)
Fetch individual package files

This is now enabled by default in Ubuntu. It’s only there as of xenial (16.04), since earlier versions of Ubuntu don’t have the necessary support in APT. With this, hash mismatches on updates should be a thing of the past.

There will still be some people who won’t yet benefit from this. debmirror doesn’t support by-hash yet; apt-cacher-ng only supports it as of xenial, although there’s an easy configuration workaround. Full archive mirrors must make sure that they put new by-hash files in place before new InRelease files (I just fixed our recommended two-stage sync script to do this; ubumirror still needs some work; Debian’s ftpsync is almost correct but needs a tweak for its handling of translation files, which I’ve sent to its maintainers). Other mirrors and proxies that have specific handling of the repository format may need similar changes.

Please let me know if you see strange things happening as a result of this change. It’s useful to check the output of apt -o Debug::Acquire::http=true update to see exactly what requests are being issued.

Re-signing PPAs

2016-03-30T10:20:32+01:00

Julian has written about their efforts to strengthen security in APT, and shortly before that notified us that Launchpad’s signatures on PPAs use weak SHA-1 digests. Unfortunately we hadn’t noticed that before; GnuPG’s defaults tend to result in weak digests unless carefully tweaked, which is a shame.

I started on the necessary fixes for this immediately we heard of the problem, but it’s taken a little while to get everything in place, and I thought I’d explain why since some of the problems uncovered are interesting in their own right.

Firstly, there was the relatively trivial matter of using SHA-512 digests on new signatures. This was mostly a matter of adjusting our configuration, although writing the test was a bit tricky since PyGPGME isn’t as helpful as it could be. (Simpler repository implementations that call gpg from the command line should probably just add the --digest-algo SHA512 option instead of imitating this.)

After getting that in place, any change to a suite in a PPA will result in it being re-signed with SHA-512, which is good as far as it goes, but we also want to re-sign PPAs that haven’t been modified. Launchpad hosts more than 50000 active PPAs, though, a significant percentage of which include packages for sufficiently recent Ubuntu releases that we’d want to re-sign them for this. We can’t expect everyone to push new uploads, and we need to run this through at least some part of our usual publication machinery rather than just writing a hacky shell script to do the job (which would have no idea which keys to sign with, to start with); but forcing full reprocessing of all those PPAs would take a prohibitively long time, and at the moment we need to interrupt normal PPA publication to do this kind of work. I therefore had to spend some quality time working out how to make things go fast enough.

The first couple of changes (1, 2) were to add options to our publisher script to let us run just the one step we need in “careful” mode: that is, forcibly re-run the Release file processing step even if it thinks nothing has changed, and entirely disable the other steps such as generating Packages and Sources files. Then last week I finally got around to timing things on one of our staging systems so that we could estimate how long a full run would take. It was taking a little over two seconds per archive, which meant that if we were to re-sign all published PPAs then that would take more than 33 hours! Obviously this wasn’t viable; even just re-signing xenial would be prohibitively slow.

The next question was where all that time was going. I thought perhaps that the actual signing might be slow for some reason, but it was taking about half a second per archive: not great, but not enough to account for most of the slowness. The main part of the delay was in fact when we committed the database transaction after processing each archive, but not in the actual PostgreSQL commit, rather in the ORM invalidate method called to prepare for a commit.

Launchpad uses the excellent Storm for all of its database interactions. One property of this ORM (and possibly of others; I’ll cheerfully admit to not having spent much time with other ORMs) is that it uses a WeakValueDictionary to keep track of the objects it’s populated with database results. Before it commits a transaction, it iterates over all those “alive” objects to note that if they’re used in future then information needs to be reloaded from the database first. Usually this is a very good thing: it saves us from having to think too hard about data consistency at the application layer. But in this case, one of the things we did at the start of the publisher script was:

def getPPAs(self, distribution):
    """Find private package archives for the selected distribution."""
    if (self.isCareful(self.options.careful_publishing) or
            self.options.include_non_pending):
        return distribution.getAllPPAs()
    else:
        return distribution.getPendingPublicationPPAs()

def getTargetArchives(self, distribution):
    """Find the archive(s) selected by the script's options."""
    if self.options.partner:
        return [distribution.getArchiveByComponent('partner')]
    elif self.options.ppa:
        return filter(is_ppa_public, self.getPPAs(distribution))
    elif self.options.private_ppa:
        return filter(is_ppa_private, self.getPPAs(distribution))
    elif self.options.copy_archive:
        return self.getCopyArchives(distribution)
    else:
        return [distribution.main_archive]

That innocuous-looking filter means that we do all the public/private filtering of PPAs up-front and return a list of all the PPAs we intend to operate on. This means that all those objects are alive as far as Storm is concerned and need to be considered for invalidation on every commit, and the time required for that stacks up when many thousands of objects are involved: this is essentially accidentally quadratic behaviour, because all archives are considered when committing changes to each archive in turn. Normally this isn’t too bad because only a few hundred PPAs need to be processed in any given run; but if we’re running in a mode where we’re processing all PPAs rather than just ones that are pending publication, then suddenly this balloons to the point where it takes a couple of seconds. The fix is very simple, using an iterator instead so that we don’t need to keep all the objects alive:

from itertools import ifilter

def getTargetArchives(self, distribution):
    """Find the archive(s) selected by the script's options."""
    if self.options.partner:
        return [distribution.getArchiveByComponent('partner')]
    elif self.options.ppa:
        return ifilter(is_ppa_public, self.getPPAs(distribution))
    elif self.options.private_ppa:
        return ifilter(is_ppa_private, self.getPPAs(distribution))
    elif self.options.copy_archive:
        return self.getCopyArchives(distribution)
    else:
        return [distribution.main_archive]

After that, I turned to that half a second for signing. A good chunk of that was accounted for by the signContent method taking a fingerprint rather than a key, despite the fact that we normally already had the key in hand; this caused us to have to ask GPGME to reload the key, which requires two subprocess calls. Converting this to take a key rather than a fingerprint gets the per-archive time down to about a quarter of a second on our staging system, about eight times faster than where we started.

Using this, we’ve now re-signed all xenial Release files in PPAs using SHA-512 digests. On production, this took about 80 minutes to iterate over around 70000 archives, of which 1761 were modified. Most of the time appears to have been spent skipping over unmodified archives; even a few hundredths of a second per archive adds up quickly there. The remaining time comes out to around 0.4 seconds per modified archive. There’s certainly still room for speeding this up a bit.

We wouldn’t want to do this procedure every day, but it’s acceptable for occasional tasks like this. I expect that we’ll similarly re-sign wily, vivid, and trusty Release files soon in the same way.

SSH SHA-2 support in Twisted

2015-12-02T20:42:25+00:00

Launchpad operates a few SSH endpoints: bazaar.launchpad.net and git.launchpad.net for code hosting, and upload.ubuntu.com and ppa.launchpad.net for uploading packages. None of these are straightforward OpenSSH servers, because they don’t give ordinary shell access and they authenticate against users’ SSH keys recorded in Launchpad; both of these are much easier to do with SSH server code that we can use in library form as part of another service. We use Twisted for several other tasks where we need event-based networking code, and its conch package is a good fit for this.

Of course, this means that it’s important that conch keeps up to date with the cryptographic state of the art in other SSH implementations, and this hasn’t always been the case. OpenSSH 7.0 dropped support for some old algorithms, including disabling the 1024-bit diffie-hellman-group1-sha1 key exchange method at run-time. Unfortunately, this also happened to be the only key exchange method that Launchpad’s SSH endpoints supported (conch supported the slightly better diffie-hellman-group-exchange-sha1 method as well, but that was disabled in Launchpad due to a missing piece of configuration). SHA-2 support was clearly called for, and the fact that we had to get this sorted out in conch first meant that everything took a bit longer than we’d hoped.

In Twisted 15.5, we contributed support for several conch improvements:

diffie-hellman-group14-sha1 key exchange (mostly by Ian Moore, finished off by me)
diffie-hellman-group-exchange-sha256 key exchange
hmac-sha2-256 and hmac-sha2-512 MACs

Between them and with some adjustments to the lazr.sshserver package we use to glue all this together to add support for DH group exchange, these are enough to allow us not to rely on SHA-1 at all, and these improvements have now been rolled out to all four endpoints listed above. I’ve thus also uploaded OpenSSH 7.1 packages to Debian unstable.

If you also run a Twisted-based SSH server, upgrade it now! Otherwise it will be harder for users of recent OpenSSH client versions to use your server, and for good reason.

Moving on, but not too far

2014-10-26T18:54:34-04:00

The Ubuntu Code of Conduct says:

Step down considerately: When somebody leaves or disengages from the project, we ask that they do so in a way that minimises disruption to the project. They should tell people they are leaving and take the proper steps to ensure that others can pick up where they left off.

I’ve been working on Ubuntu for over ten years now, almost right from the very start; I’m Canonical’s employee #17 due to working out a notice period in my previous job, but I was one of the founding group of developers. I occasionally tell the story that Mark originally hired me mainly to work on what later became Launchpad Bugs due to my experience maintaining the Debian bug tracking system, but then not long afterwards Jeff Waugh got in touch and said “hey Colin, would you mind just sorting out some installable CD images for us?”. This is where you imagine one of those movie time-lapse clocks … At some point it became fairly clear that I was working on Ubuntu, and the bug system work fell to other people. Then, when Matt Zimmerman could no longer manage the entire Ubuntu team in Canonical by himself, Scott James Remnant and I stepped up to help him out. I did that for a couple of years, starting the Foundations team in the process. As the team grew I found that my interests really lay in hands-on development rather than in management, so I switched over to being the technical lead for Foundations, and have made my home there ever since. Over the years this has given me the opportunity to do all sorts of things, particularly working on our installers and on the GRUB boot loader, leading the development work on many of our archive maintenance tools, instituting the +1 maintenance effort and proposed-migration, and developing the Click package manager, and I’ve had the great pleasure of working with many exceptionally talented people.

However. In recent months I’ve been feeling a general sense of malaise and what I’ve come to recognise with hindsight as the symptoms of approaching burnout. I’ve been working long hours for a long time, and while I can draw on a lot of experience by now, it’s been getting harder to summon the enthusiasm and creativity to go with that. I have a wonderful wife, amazing children, and lovely friends, and I want to be able to spend a bit more time with them. After ten years doing the same kinds of things, I’ve accreted history with and responsibility for a lot of projects. One of the things I always loved about Foundations was that it’s a broad church, covering a wide range of software and with a correspondingly wide range of opportunities; but, over time, this has made it difficult for me to focus on things that are important because there are so many areas where I might be called upon to help. I thought about simply stepping down from the technical lead position and remaining in the same team, but I decided that that wouldn’t make enough of a difference to what matters to me. I need a clean break and an opportunity to reset my habits before I burn out for real.

One of the things that has consistently held my interest through all of this has been making sure that the infrastructure for Ubuntu keeps running reliably and that other developers can work efficiently. As part of this, I’ve been able to do a lot of work over the years on Launchpad where it was a good fit with my remit: this has included significant performance improvements to archive publishing, moving most archive administration operations from excessively-privileged command-line operations to the webservice, making build cancellation reliable across the board, and moving live filesystem building from an unscalable ad-hoc collection of machines into the Launchpad build farm. The Launchpad development team has generally welcomed help with open arms, and in fact I joined the ~launchpad team last year.

So, the logical next step for me is to make this informal involvement permanent. As such, at the end of this year I will be moving from Ubuntu Foundations to the Launchpad engineering team.

This doesn’t mean me leaving Ubuntu. Within Canonical, Launchpad development is currently organised under the Continuous Integration team, which is part of Ubuntu Engineering. I’ll still be around in more or less the usual places and available for people to ask me questions. But I will in general be trying to reduce my involvement in Ubuntu proper to things that are closely related to the operation of Launchpad, and a small number of low-effort things that I’m interested enough in to find free time for them. I still need to sort out a lot of details, but it’ll very likely involve me handing over project leadership of Click, drastically reducing my involvement in the installer, and looking for at least some help with boot loader work, among others. I don’t expect my Debian involvement to change, and I may well find myself more motivated there now that it won’t be so closely linked with my day job, although it’s possible that I will pare some things back that I was mostly doing on Ubuntu’s behalf. If you ask me for help with something over the next few months, expect me to be more likely to direct you to other people or suggest ways you can help yourself out, so that I can start disentangling myself from my current web of projects.

Please contact me sooner or later if you’re interested in helping out with any of the things I’m visible in right now, and we can see what makes sense. I’m looking forward to this!

Porting GHC: A Tale of Two Architectures

2014-04-15T02:36:01+01:00

We had some requests to get GHC (the Glasgow Haskell Compiler) up and running on two new Ubuntu architectures: arm64, added in 13.10, and ppc64el, added in 14.04. This has been something of a saga, and has involved rather more late-night hacking than is probably good for me.

Book the First: Recalled to a life of strange build systems

You might not know it from the sheer bulk of uploads I do sometimes, but I actually don’t speak a word of Haskell and it’s not very high up my list of things to learn. But I am a pretty experienced build engineer, and I enjoy porting things to new architectures: I’m firmly of the belief that breadth of architecture support is a good way to shake out certain categories of issues in code, that it’s worth doing aggressively across an entire distribution, and that, even if you don’t think you need something now, new requirements have a habit of coming along when you least expect them and you might as well be prepared in advance. Furthermore, it annoys me when we have excessive noise in our build failure and proposed-migration output and I often put bits and pieces of spare time into gardening miscellaneous problems there, and at one point there was a lot of Haskell stuff on the list and it got a bit annoying to have to keep sending patches rather than just fixing things myself, and … well, I ended up as probably the only non-Haskell-programmer on the Debian Haskell team and found myself fixing problems there in my free time. Life is a bit weird sometimes.

Bootstrapping packages on a new architecture is a bit of a black art that only a fairly small number of relatively bitter and twisted people know very much about. Doing it in Ubuntu is specifically painful because we’ve always forbidden direct binary uploads: all binaries have to come from a build daemon. Compilers in particular often tend to be written in the language they compile, and it’s not uncommon for them to build-depend on themselves: that is, you need a previous version of the compiler to build the compiler, stretching back to the dawn of time where somebody put things together with a big magnet or something. So how do you get started on a new architecture? Well, what we do in this case is we construct a binary somehow (usually involving cross-compilation) and insert it as a build-dependency for a proper build in Launchpad. The ability to do this is restricted to a small group of Canonical employees, partly because it’s very easy to make mistakes and partly because things like the classic “Reflections on Trusting Trust” are in the backs of our minds somewhere. We have an iron rule for our own sanity that the injected build-dependencies must themselves have been built from the unmodified source package in Ubuntu, although there can be source modifications further back in the chain. Fortunately, we don’t need to do this very often, but it does mean that as somebody who can do it I feel an obligation to try and unblock other people where I can.

As far as constructing those build-dependencies goes, sometimes we look for binaries built by other distributions (particularly Debian), and that’s pretty straightforward. In this case, though, these two architectures are pretty new and the Debian ports are only just getting going, and as far as I can tell none of the other distributions with active arm64 or ppc64el ports (or trivial name variants) has got as far as porting GHC yet. Well, OK. This was somewhere around the Christmas holidays and I had some time. Muggins here cracks his knuckles and decides to have a go at bootstrapping it from scratch. It can’t be that hard, right? Not to mention that it was a blocker for over 600 entries on that build failure list I mentioned, which is definitely enough to make me sit up and take notice; we’d even had the odd customer request for it.

Several attempts later and I was starting to doubt my sanity, not least for trying in the first place. We ship GHC 7.6, and upgrading to 7.8 is not a project I’d like to tackle until the much more experienced Haskell folks in Debian have switched to it in unstable. The porting documentation for 7.6 has bitrotted more or less beyond usability, and the corresponding documentation for 7.8 really isn’t backportable to 7.6. I tried building 7.8 for ppc64el anyway, picking that on the basis that we had quicker hardware for it and didn’t seem likely to be particularly more arduous than arm64 (ho ho), and I even got to the point of having a cross-built stage2 compiler (stage1, in the cross-building case, is a GHC binary that runs on your starting architecture and generates code for your target architecture) that I could copy over to a ppc64el box and try to use as the base for a fully-native build, but it segfaulted incomprehensibly just after spawning any child process. Compilers tend to do rather a lot, especially when they’re built to use GCC to generate object code, so this was a pretty serious problem, and it resisted analysis. I poked at it for a while but didn’t get anywhere, and I had other things to do so declared it a write-off and gave up.

Book the Second: The golden thread of progress

In March, another mailing list conversation prodded me into finding a blog entry by Karel Gardas on building GHC for arm64. This was enough to be worth another look, and indeed it turned out that (with some help from Karel in private mail) I was able to cross-build a compiler that actually worked and could be used to run a fully-native build that also worked. Of course this was 7.8, since as I mentioned cross-building 7.6 is unrealistically difficult unless you’re considerably more of an expert on GHC’s labyrinthine build system than I am. OK, no problem, right? Getting a GHC at all is the hard bit, and 7.8 must be at least as capable as 7.6, so it should be able to build 7.6 easily enough …

Not so much. What I’d missed here was that compiler engineers generally only care very much about building the compiler with older versions of itself, and if the language in question has any kind of deprecation cycle then the compiler itself is likely to be behind on various things compared to more typical code since it has to be buildable with older versions. This means that the removal of some deprecated interfaces from 7.8 posed a problem, as did some changes in certain primops that had gained an associated compatibility layer in 7.8 but nobody had gone back to put the corresponding compatibility layer into 7.6. GHC supports running Haskell code through the C preprocessor, and there’s a __GLASGOW_HASKELL__ definition with the compiler’s version number, so this was just a slog tracking down changes in git and adding #ifdef-guarded code that coped with the newer compiler (remembering that stage1 will be built with 7.8 and stage2 with stage1, i.e. 7.6, from the same source tree). More inscrutably, GHC has its own packaging system called Cabal which is also used by the compiler build process to determine which subpackages to build and how to link them against each other, and some crucial subpackages weren’t being built: it looked like it was stuck on picking versions from “stage0” (i.e. the initial compiler used as an input to the whole process) when it should have been building its own. Eventually I figured out that this was because GHC’s use of its packaging system hadn’t anticipated this case, and was selecting the higher version of the ghc package itself from stage0 rather than the version it was about to build for itself, and thus never actually tried to build most of the compiler. Editing ghc_stage1_DEPS in ghc/stage1/package-data.mk after its initial generation sorted this out. One late night building round and round in circles for a while until I had something stable, and a Debian source upload to add basic support for the architecture name (and other changes which were a bit over the top in retrospect: I didn’t need to touch the embedded copy of libffi, as we build with the system one), and I was able to feed this all into Launchpad and watch the builders munch away very satisfyingly at the Haskell library stack for a while.

This was all interesting, and finally all that work was actually paying off in terms of getting to watch a slew of several hundred build failures vanish from arm64 (the final count was something like 640, I think). The fly in the ointment was that ppc64el was still blocked, as the problem there wasn’t building 7.6, it was getting a working 7.8. But now I really did have other much more urgent things to do, so I figured I just wouldn’t get to this by release time and stuck it on the figurative shelf.

Book the Third: The track of a bug

Then, last Friday, I cleared out my urgent pile and thought I’d have another quick look. (I get a bit obsessive about things like this that smell of “interesting intellectual puzzle”.) slyfox on the #ghc IRC channel gave me some general debugging advice and, particularly usefully, a reduced example program that I could use to debug just the process-spawning problem without having to wade through noise from running the rest of the compiler. I reproduced the same problem there, and then found that the program crashed earlier (in stg_ap_0_fast, part of the run-time system) if I compiled it with +RTS -Da -RTS. I nailed it down to a small enough region of assembly that I could see all of the assembly, the source code, and an intermediate representation or two from the compiler, and then started meditating on what makes ppc64el special.

You see, the vast majority of porting bugs come down to what I might call gross properties of the architecture. You have things like whether it’s 32-bit or 64-bit, big-endian or little-endian, whether char is signed or unsigned, that sort of thing. There’s a big table on the Debian wiki that handily summarises most of the important ones. Sometimes you have to deal with distribution-specific things like whether GL or GLES is used; often, especially for new variants of existing architectures, you have to cope with foolish configure scripts that think they can guess certain things from the architecture name and get it wrong (assuming that powerpc* means big-endian, for instance). We often have to update config.guess and config.sub, and on ppc64el we have the additional hassle of updating libtool macros too. But I’ve done a lot of this stuff and I’d accounted for everything I could think of. ppc64el is actually a lot like amd64 in terms of many of these porting-relevant properties, and not even that far off arm64 which I’d just successfully ported GHC to, so I couldn’t be dealing with anything particularly obvious. There was some hand-written assembly which certainly could have been problematic, but I’d carefully checked that this wasn’t being used by the “unregisterised” (no specialised machine dependencies, so relatively easy to port but not well-optimised) build I was using. A problem around spawning processes suggested a problem with SIGCHLD handling, but I ruled that out by slowing down the first child process that it spawned and using strace to confirm that SIGSEGV was the first signal received. What on earth was the problem?

From some painstaking gdb work, one thing I eventually noticed was that stg_ap_0_fast‘s local stack appeared to be being corrupted by a function call, specifically a call to the colourfully-named debugBelch. Now, when IBM’s toolchain engineers were putting together ppc64el based on ppc64, they took the opportunity to fix a number of problems with their ABI: there’s an OpenJDK bug with a handy list of references. One of the things I noticed there was that there were some stack allocation optimisations in the new ABI, which affected functions that don’t call any vararg functions and don’t call any functions that take enough parameters that some of them have to be passed on the stack rather than in registers. debugBelch takes varargs: hmm. Now, the calling code isn’t quite in C as such, but in a related dialect called “Cmm”, a variant of C— (yes, minus), that GHC uses to help bridge the gap between the functional world and its code generation, and which is compiled down to C by GHC. When importing C functions into Cmm, GHC generates prototypes for them, but it doesn’t do enough parsing to work out the true prototype; instead, they all just get something like extern StgFunPtr f(void);. In most architectures you can get away with this, because the arguments get passed in the usual calling convention anyway and it all works out, but on ppc64el this means that the caller doesn’t generate enough stack space and then the callee tries to save its varargs onto the stack in an area that in fact belongs to the caller, and suddenly everything goes south. Things were starting to make sense.

Now, debugBelch is only used in optional debugging code; but runInteractiveProcess (the function associated with the initial round of failures) takes no fewer than twelve arguments, plenty to force some of them onto the stack. I poked around the GCC patch for this ABI change a bit and determined that it only optimised away the stack allocation if it had a full prototype for all the callees, so I guessed that changing those prototypes to extern StgFunPtr f(); might work: it’s still technically wrong, not least because omitting the parameter list is an obsolescent feature in C11, but it’s at least just omitting information about the parameter list rather than actively lying about it. I tweaked that and ran the cross-build from scratch again. Lo and behold, suddenly I had a working compiler, and I could go through the same build-7.6-using-7.8 procedure as with arm64, much more quickly this time now that I knew what I was doing. One upstream bug, one Debian upload, and several bootstrapping builds later, and GHC was up and running on another architecture in Launchpad. Success!

Epilogue

There’s still more to do. I gather there may be a Google Summer of Code project in Linaro to write proper native code generation for GHC on arm64: this would make things a good deal faster, but also enable GHCi (the interpreter) and Template Haskell, and thus clear quite a few more build failures. Since there’s already native code generation for ppc64 in GHC, getting it going for ppc64el would probably only be a couple of days’ work at this point. But these are niceties by comparison, and I’m more than happy with what I got working for 14.04.

The upshot of all of this is that I may be the first non-Haskell-programmer to ever port GHC to two entirely new architectures. I’m not sure if I gain much from that personally aside from a lot of lost sleep and being considered extremely strange. It has, however, been by far the most challenging set of packages I’ve ported, and a fascinating trip through some odd corners of build systems and undefined behaviour that I don’t normally need to touch.

Testing wanted: GRUB 2.02~beta2 Debian/Ubuntu packages

2014-01-18T01:46:55+00:00

This is mostly a repost of my ubuntu-devel mail for a wider audience, but see below for some additions.

I’d like to upgrade to GRUB 2.02 for Ubuntu 14.04; it’s currently in beta. This represents a year and a half of upstream development, and contains many new features, which you can see in the NEWS file.

Obviously I want to be very careful with substantial upgrades to the default boot loader. So, I’ve put this in trusty-proposed, and filed a blocking bug to ensure that it doesn’t reach trusty proper until it’s had a reasonable amount of manual testing. If you are already using trusty and have some time to try this out, it would be very helpful to me. I suggest that you only attempt this if you’re comfortable driving apt-get directly and recovering from errors at that level, and if you’re willing to spend time working with me on narrowing down any problems that arise.

Please ensure that you have rescue media to hand before starting testing. The simplest way to upgrade is to enable trusty-proposed, upgrade ONLY packages whose names start with “grub” (e.g. use apt-get dist-upgrade to show the full list, say no to the upgrade, and then pass all the relevant package names to apt-get install), and then (very important!) disable trusty-proposed again. Provided that there were no errors in this process, you should be safe to reboot. If there were errors, you should be able to downgrade back to 2.00-22 (or 1.27+2.00-22 in the case of grub-efi-amd64-signed).

Please report your experiences (positive and negative) with this upgrade in the tracking bug. I’m particularly interested in systems that are complex in any way: UEFI Secure Boot, non-trivial disk setups, manual configuration, that kind of thing. If any of the problems you see are also ones you saw with earlier versions of GRUB, please identify those clearly, as I want to prioritise handling regressions over anything else. I’ve assigned myself to that bug to ensure that messages to it are filtered directly into my inbox.

I’ll add a couple of things that weren’t in my ubuntu-devel mail. Firstly, this is all in Debian experimental as well (I do all the work in Debian and sync it across, so the grub2 source package in Ubuntu is a verbatim copy of the one in Debian these days). There are some configuration differences applied at build time, but a large fraction of test cases will apply equally well to both. I don’t have a definite schedule for pushing this into jessie yet - I only just finished getting 2.00 in place there, and the release schedule gives me a bit more time - but I certainly want to ship jessie with 2.02 or newer, and any test feedback would be welcome. It’s probably best to just e-mail feedback to me directly for now, or to the pkg-grub-devel list.

Secondly, a couple of news sites have picked this up and run it as “Canonical intends to ship Ubuntu 14.04 LTS with a beta version of GRUB”. This isn’t in fact my intent at all. I’m doing this now because I think GRUB 2.02 will be ready in non-beta form in time for Ubuntu 14.04, and indeed that putting it in our development release will help to stabilise it; I’m an upstream GRUB developer too and I find the exposure of widely-used packages very helpful in that context. It will certainly be much easier to upgrade to a beta now and a final release later than it would be to try to jump from 2.00 to 2.02 in a month or two’s time.

Even if there’s some unforeseen delay and 2.02 isn’t released in time, though, I think nearly three months of stabilisation is still plenty to yield a boot loader that I’m comfortable with shipping in an LTS. I’ve been backporting a lot of changes to 2.00 and even 1.99, and, as ever for an actively-developed codebase, it gets harder and harder over time (in particular, I’ve spent longer than I’d like hunting down and backporting fixes for non-512-byte sector disks). While I can still manage it, I don’t want to be supporting 2.00 for five more years after upstream has moved on; I don’t think that would be in anyone’s best interests. And I definitely want some of the new features which aren’t sensibly backportable, such as several of the new platforms (ARM, ARM64, Xen) and various networking improvements; I can imagine a number of our users being interested in things like optional signature verification of files GRUB reads from disk, improved Mac support, and the TrueCrypt ISO loader, just to name a few. This should be a much stronger base for five-year support.

Automatic installability checking

2012-10-26T10:18:26+01:00

I’ve just finished deploying automatic installability checking for Ubuntu’s development release, which is more or less equivalent to the way that uploads are promoted from Debian unstable to testing. See my ubuntu-devel post and my ubuntu-devel-announce post for details. This now means that we’ll be opening the archive for general development once glibc 2.16 packages are ready.

I’m very excited about this because it’s something I’ve wanted to do for a long, long time. In fact, back in 2004 when I had my very first telephone conversation with a certain spaceman about this crazy Debian-based project he wanted me to work on, I remember talking about Debian’s testing migration system and some ways I thought it could be improved. I don’t remember the details of that conversation any more and what I just deployed may well bear very little resemblance to it, but it should transform the extent to which our development release is continuously usable.

The next step is to hook in autopkgtest results. This will allow us to do a degree of automatic testing of reverse-dependencies when we upgrade low-level libraries.

OpenSSH 6.0p1

2012-05-27T20:12:12+01:00

OpenSSH 6.0p1 was released a little while back; this weekend I belatedly got round to uploading packages of it to Debian unstable and Ubuntu quantal.

I was a bit delayed by needing to put together an improvement to privsep sandbox selection that particularly matters in the context of distributions. One of the experts on seccomp_filter has commented favourably on it, but I haven’t yet had a comment from upstream themselves, so I may need to refine this depending on what they say.

(This is a good example of how it matters that software is often not built on the system that it’s going to run on, and in particular that the kernel version is rather likely to be different. Where possible it’s always best to detect kernel capabilities at run-time rather than at build-time.)

I didn’t make it very clear in the changelog, but using the new seccomp_filter sandbox currently requires UsePrivilegeSeparation sandbox in sshd_config as well as a capable kernel. I won’t change the default here in advance of upstream, who still consider privsep sandboxing experimental.

libpipeline 1.2.1 released

2012-03-02T21:49:10+00:00

I’ve released libpipeline 1.2.1, and uploaded it to Debian unstable. This is a bug-fix release:

Retry reads and writes on EINTR.
Fix opening of output files requested by pipeline_want_outfile; these are now created if they do not already exist, and truncated if they do.
<pipeline.h> is now wrapped in extern "C" when used in a C++ compilation unit.

APT resolver bugs

2012-01-30T10:54:25+00:00

I’ve managed to go for eleven years working on Debian and nearly eight on Ubuntu without ever needing to teach myself how APT’s resolver works. I get the impression that there’s a certain mystique about it in general (alternatively, I’m just the last person to figure this out). Recently, though, I had a couple of Ubuntu upgrade bugs to fix that turned out to be bugs in the resolver, and I thought it might be interesting to walk through the process of fixing them based on the Debug::pkgProblemResolver=true log files.

Breakage with Breaks

The first was Ubuntu bug #922485 (apt.log). To understand the log, you first need to know that APT makes up to ten passes of the resolver to attempt to fix broken dependencies by upgrading, removing, or holding back packages; if there are still broken packages after this point, it’s generally because it’s got itself stuck in some kind of loop, and it bails out rather than carrying on forever. The current pass number is shown in each “Investigating” log entry, so they start with “Investigating (0)” and carry on up to at most “Investigating (9)”. Any packages that you see still being investigated on the tenth pass are probably something to do with whatever’s going wrong.

In this case, most packages have been resolved by the end of the fourth pass, but xserver-xorg-core is causing some trouble. (Not a particular surprise, as it’s an important package with lots of relationships.) We can see that each breakage is:

Broken xserver-xorg-core:i386 Breaks on xserver-xorg-video-6 [ i386 ] < none > ( none )

This is a Breaks (a relatively new package relationship type introduced a few years ago as a sort of weaker form of Conflicts) on a virtual package, which means that in order to unpack xserver-xorg-core each package that provides xserver-xorg-video-6 must be deconfigured. Much like Conflicts, APT responds to this by upgrading providing packages to versions that don’t provide the offending virtual package if it can, and otherwise removing them. We can see it doing just that in the log (some lines omitted):

Investigating (0) xserver-xorg-core [ i386 ] < 2:1.7.6-2ubuntu7.10 -> 2:1.11.3-0ubuntu8 > ( x11 )
  Fixing xserver-xorg-core:i386 via remove of xserver-xorg-video-tseng:i386
Investigating (1) xserver-xorg-core [ i386 ] < 2:1.7.6-2ubuntu7.10 -> 2:1.11.3-0ubuntu8 > ( x11 )
  Fixing xserver-xorg-core:i386 via remove of xserver-xorg-video-i740:i386
Investigating (2) xserver-xorg-core [ i386 ] < 2:1.7.6-2ubuntu7.10 -> 2:1.11.3-0ubuntu8 > ( x11 )
  Fixing xserver-xorg-core:i386 via remove of xserver-xorg-video-nv:i386

OK, so that makes sense - presumably upgrading those packages didn’t help at the time. But look at the pass numbers. Rather than just fixing all the packages that provide xserver-xorg-video-6 in a single pass, which it would be perfectly able to do, it only fixes one per pass. This means that if a package Breaks a virtual package which is provided by more than ten installed packages, the resolver will fail to handle that situation. On inspection of the code, this was being handled correctly for Conflicts by carrying on through the list of possible targets for the dependency relation in that case, but apparently when Breaks support was implemented in APT this case was overlooked. The fix is to carry on through the list of possible targets for any “negative” dependency relation, not just Conflicts, and I’ve filed a patch as Debian bug #657695.

My cup overfloweth

The second bug I looked at was Ubuntu bug #917173 (apt.log). Just as in the previous case, we can see the resolver “running out of time” by reaching the end of the tenth pass with some dependencies still broken. This one is a lot less obvious, though. The last few entries clearly indicate that the resolver is stuck in a loop:

Investigating (8) dpkg [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( admin )
Broken dpkg:i386 Breaks on dpkg-dev [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( utils ) (< 1.15.8)
  Considering dpkg-dev:i386 29 as a solution to dpkg:i386 7205
  Upgrading dpkg-dev:i386 due to Breaks field in dpkg:i386
Investigating (8) dpkg-dev [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( utils )
Broken dpkg-dev:i386 Depends on libdpkg-perl [ i386 ] < none -> 1.16.1.2ubuntu5 > ( perl ) (= 1.16.1.2ubuntu5)
  Considering libdpkg-perl:i386 12 as a solution to dpkg-dev:i386 29
  Holding Back dpkg-dev:i386 rather than change libdpkg-perl:i386
Investigating (9) dpkg [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( admin )
Broken dpkg:i386 Breaks on dpkg-dev [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( utils ) (< 1.15.8)
  Considering dpkg-dev:i386 29 as a solution to dpkg:i386 7205
  Upgrading dpkg-dev:i386 due to Breaks field in dpkg:i386
Investigating (9) dpkg-dev [ i386 ] < 1.15.5.6ubuntu4.5 -> 1.16.1.2ubuntu5 > ( utils )
Broken dpkg-dev:i386 Depends on libdpkg-perl [ i386 ] < none -> 1.16.1.2ubuntu5 > ( perl ) (= 1.16.1.2ubuntu5)
  Considering libdpkg-perl:i386 12 as a solution to dpkg-dev:i386 29
  Holding Back dpkg-dev:i386 rather than change libdpkg-perl:i386

The new version of dpkg requires upgrading dpkg-dev, but it can’t because of something wrong with libdpkg-perl. Following the breadcrumb trail back through the log, we find:

Investigating (1) libdpkg-perl [ i386 ] < none -> 1.16.1.2ubuntu5 > ( perl )
Broken libdpkg-perl:i386 Depends on perl [ i386 ] < 5.10.1-8ubuntu2.1 -> 5.14.2-6ubuntu1 > ( perl )
  Considering perl:i386 1472 as a solution to libdpkg-perl:i386 12
  Holding Back libdpkg-perl:i386 rather than change perl:i386

Investigating (1) perl [ i386 ] < 5.10.1-8ubuntu2.1 -> 5.14.2-6ubuntu1 > ( perl )
Broken perl:i386 Depends on perl-base [ i386 ] < 5.10.1-8ubuntu2.1 -> 5.14.2-6ubuntu1 > ( perl ) (= 5.14.2-6ubuntu1)
  Considering perl-base:i386 5806 as a solution to perl:i386 1472
  Removing perl:i386 rather than change perl-base:i386

Investigating (1) perl-base [ i386 ] < 5.10.1-8ubuntu2.1 -> 5.14.2-6ubuntu1 > ( perl )
Broken perl-base:i386 PreDepends on libc6 [ i386 ] < 2.11.1-0ubuntu7.8 -> 2.13-24ubuntu2 > ( libs ) (>= 2.11)
  Considering libc6:i386 -17473 as a solution to perl-base:i386 5806
  Added libc6:i386 to the remove list

Investigating (0) libc6 [ i386 ] < 2.11.1-0ubuntu7.8 -> 2.13-24ubuntu2 > ( libs )
Broken libc6:i386 Depends on libc-bin [ i386 ] < 2.11.1-0ubuntu7.8 -> 2.13-24ubuntu2 > ( libs ) (= 2.11.1-0ubuntu7.8)
  Considering libc-bin:i386 10358 as a solution to libc6:i386 -17473
  Removing libc6:i386 rather than change libc-bin:i386

So ultimately the problem is something to do with libc6; but what? As Steve Langasek said in the bug, libc6’s dependencies have been very carefully structured, and surely we would have seen some hint of it elsewhere if they were wrong. At this point ideally I wanted to break out GDB or at the very least experiment a bit with apt-get, but due to some tedious local problems I hadn’t been able to restore the apt-clone state file for this bug onto my system so that I could attack it directly. So I fell back on the last refuge of the frustrated debugger and sat and thought about it for a bit.

Eventually I noticed something. The numbers after the package names in the third line of each of these log entries are “scores”: roughly, the more important a package is, the higher its score should be. The function that calculates these is pkgProblemResolver::MakeScores() in apt-pkg/algorithms.cc. Reading this, I noticed that the various values added up to make each score are almost all provably positive, for example:

Scores[I->ID] += abs(OldScores[D.ParentPkg()->ID]);

The only exceptions are an initial -1 or -2 points for Priority: optional or Priority: extra packages respectively, or some values that could theoretically be configured to be negative but weren’t in this case. OK. So how come libc6 has such a huge negative score of -17473, when one would normally expect it to be an extremely powerful package with a large positive score?

Oh. This is computer programming, not mathematics … and each score is stored in a signed short, so in a sufficiently large upgrade all those bonus points add up to something larger than 32767 and everything goes haywire. Bingo. Make it an int instead - the number of installed packages is going to be on the order of tens of thousands at most, so it’s not as though it’ll make a substantial difference to the amount of memory used - and chances are everything will be fine. I’ve filed a patch as Debian bug #657732.

I’d expected this to be a pretty challenging pair of bugs. While I certainly haven’t lost any respect for the APT maintainers for dealing with this stuff regularly, it wasn’t as bad as I thought. I’d expected to have to figure out how to retune some slightly out-of-balance heuristics and not really know whether I’d broken anything else in the process; but in the end both patches were very straightforward.

Quality in Ubuntu 12.04 LTS

2011-10-24T14:57:41+01:00

As is natural for an LTS cycle, lots of people are thinking and talking about work focused on quality rather than features. With Canonical extending LTS support to five years on the desktop for 12.04, much of this is quite rightly focused on the desktop. I’m really not a desktop hacker in any way, shape, or form, though. I spent my first few years in Ubuntu working mainly on the installer - I still do, although I do some other things now too - and I used to say only half-jokingly that my job was done once X started. Of course there are plenty of bugs I can fix, but I wanted to see if I could do something with a bit more structure, so I got to thinking about projects we could work on at the foundations level that would make a big difference.

Image build pipeline

One difficulty we have is that quite a few of our bugs - especially installer bugs, although this goes for some other things too - are only really caught when people are doing coordinated image testing just before a milestone release. Now, it takes a while to do all the builds and then it takes a while to test them. The excellent work of the QA team has meant that testing is much quicker now than it used to be, and a certain amount of smoke-testing is automated (particularly for server images). On the other hand, the build phase has only got longer as we’ve added more flavours and architectures, particularly as some parts of the process are still serialised per architecture or subarchitecture so ARM builds in particular take a very long time indeed. Exact timings are a bit difficult to get for various reasons, but I think the minimum time between a developer uploading a fix and us having a full set of candidate images on all architectures including that fix is currently somewhere north of eight hours, and that’s with people cutting corners and pulling strings which is a suboptimal thing to have to do around release time. This obviously makes us reluctant to respin for anything short of showstopper bugs. If we could get things down to something closer to two hours, respins would be a much less horrible proposition and so we might be able to fix a few bugs that are serious but not showstoppers, not to mention that the release team would feel less burned out.

We discussed this problem at the release sprint, and came up with a laundry list of improvements; I’ve scheduled this for discussion at UDS in case we can think of any more. Please come along if you’re interested!

One thing in particular that I’m working on is refactoring Germinate, a tool which dates right back to our first meeting before Ubuntu was even called Ubuntu and whose job is to expand dependencies starting from our lists of “seed” packages; we use this, among other things, to generate Task fields in the archive and to decide which packages to copy into our images. This was acceptably quick in 2004, but now that we run it forty times (eight flavours multiplied by five architectures) at the end of every publisher run it’s actually become rather a serious performance problem: cron.germinate takes about ten minutes, which is over a third of the typical publisher runtime. It parses Packages files eight times as often as it needs to, Sources files forty times as often as it needs to, and recalculates the dependency tree of the base system five times as often as it needs to. I am confident that we can significantly reduce the runtime here, and I think there’s some hope that we might be able to move the publisher back to a 30-minute cycle, which would increase the velocity of Ubuntu development in general.

Maintaining the development release

Our release cycle always starts with syncing and merging packages from Debian unstable (or testing in the case of LTS cycles). The vast majority of packages in Ubuntu arrive this way, and generally speaking if we didn’t do this we would fall behind in ways that would be difficult to recover from later. However, this does mean that we get a “big bang” of changes at the start of the cycle, and it takes a while for the archive to be usable again. Furthermore, even once we’ve taken care of this, we have a long-established rhythm where the first part of the cycle is mainly about feature development and the second part of the cycle is mainly about stabilisation. As a result, we’ve got used to the archive being fairly broken for the first few months, and we even tell people that they shouldn’t expect things to work reliably until somewhere approaching beta.

This makes some kind of sense from the inside. But how are you supposed to do feature development that relies on other things in the development release?

In the first few years of Ubuntu, this question didn’t matter very much. Nearly all the people doing serious feature development were themselves serious Ubuntu developers; they were capable of fixing problems in the development release as they went along, and while it got in their way a little bit it wasn’t all that big a deal. Now, though, we have people focusing on things like Unity development, and we shouldn’t assume that just because somebody is (say) an OpenGL expert or a window management expert that they should be able to recover from arbitrary failures in development release upgrades. One of the best things we could do to help the 12.04 desktop be more stable is to have the entire system be less unstable as we go along, so that developers further up the stack don’t have to be distracted by things wobbling underneath them. Plus, it’s just good software engineering to keep the basics working as you go along: it should always build, it should always install, it should always upgrade. Ubuntu is too big to do something like having everyone stop any time the build breaks, the way you might do in a smaller project, but we shouldn’t let things slide for months either.

I’ve been talking to Rick Spencer and the other Ubuntu engineering leads at Canonical about this. Canonical has a system of “rotations”, where you can go off to another team for a while if you’re in need of a change or want to branch out a bit; so I proposed that we allow our engineers to spend a month or two at a time on what I’m calling the +1 Maintenance Team, whose job is simply to keep the development release buildable, installable, and upgradeable at all times. Rick has been very receptive to this, and we’re going to be running this as a trial throughout the 12.04 cycle, with probably about three people at a time. As well as being professional archive gardeners, these people will also work on developing infrastructure to help us keep better track of what we need to do. For instance, we could deploy better tools from Debian QA to help us track uninstallable packages, or we could enhance some of our many existing reports to have bug links and/or comment facilities, or we could spruce up the weather report; there are lots of things we could do to make our own lives easier.

By 12.04, I would like, in no particular order:

Precise to have been more or less continuously usable from Alpha 1 onward for people with reasonable general technical ability
Canonical engineering teams outside Ubuntu (DX, Ubuntu One, Launchpad, etc.) to be comfortable with running the development release on at least one system from Alpha 2 onward
Installability problems in daily image builds to be dealt with within one working day, or preferably before they even make it to daily builds
The archive to be close to consistent as we start milestone preparation, rather than the release team having to scramble to make it so
A very significant reduction in our long-term backlog of automatically-detected problems

Of course, this overlaps to a certain extent with the kinds of things that the MOTU team have been doing for years, not to mention with what all developers should be doing to keep their own houses in reasonable order, and I’d like us to work together on this; we’re trying to provide some extra hands here to make Ubuntu better for everyone, not take over! I would love this to be an opportunity to re-energise MOTU and bring some new people on board.

I’ve registered a couple of blueprints (priorities, infrastructure) for discussion at UDS. These are deliberately open-ended skeleton sessions, and I’ll try to make sure they’re scheduled fairly early in the week, so that we have time for break-out sessions later on. If you’re interested, please come along and give your feedback!

Top ideas on Ubuntu Brainstorm (August 2011)

2011-10-06T16:58:51+01:00

The Ubuntu Technical Board conducts a regular review of the most popular Ubuntu Brainstorm ideas (previous reviews conducted by Matt Zimmerman and Martin Pitt). This time it was my turn. Apologies for the late arrival of this review.

Contact lens in the Unity Dash (#27584)

Unity supports Lenses, which provide a consistent way for users to quickly search for information via the Dash. Current lenses include Applications, Files, and Music, but a number of people have asked for contacts to be accessible using the same interface.

While Canonical’s DX team isn’t currently working on this for Ubuntu 11.10 or 12.04, we’d love somebody who’s interested in this to get involved. Allison Randal explains how to get started, including some skeleton example code and several useful links.

Displaying Ubuntu version information (#27460)

Several people have asked for it to be more obvious what Ubuntu version they’re running, as well as other general information about their system.

John Lea, user experience architect on the Unity team, responds that in Ubuntu 11.10 the new LightDM greeter shows the Ubuntu version number, making that basic information very easily visible. For more detail, System Settings -> System Info provides a simple summary.

Volume adjustments for headphone use (#27275)

People often find that they need to adjust their sound volume when plugging in or removing headphones. It seems as though the computer ought to be able to remember this kind of thing and do it automatically; after all, a major goal of Ubuntu is to make the desktop Just Work.

David Henningson, a member of Canonical’s OEM Services group and an Ubuntu audio developer, responds on his blog with a summary of how PulseAudio jack detection has improved matters in Ubuntu 11.10, and what’s left to do:

The good news: in the upcoming Ubuntu Oneiric (11.10), this is actually working. The bad news: it isn’t working for everyone.

Making it easier to find software to handle a file (#28148)

Ubuntu is not always as helpful as it could be when you don’t have the right software installed to handle a particular file.

Michael Vogt, one of the developers of the Ubuntu Software Center, responded to this. It seems that most of the pieces to make this work nicely are in place, but there are a few more bits of glue required:

Thanks a lot for this suggestion. I like the idea and it’s something that software-center itself supports now. In the coming version 5.0 we will offer to “sort by top-rated” (based on the ratings&reviews data). It’s also possible to search for an application based on its mime data. To search for a mime-type, you can enter “mime:text/html” or “mime:audio/ogg” into the search field. What is needed however is better integration into the file manager nautilus. I will make sure this gets attention at the next developer meeting and filed bug #860536 about it.

In nautilus, there is now a button called “Find applications online” available as an option when opening an unknown file or when the user selects “open with…other application” in the context menu. But that will not use the data from software-center.

Show pop-up alert on low battery (#28037)

Some users have reported on Brainstorm that they are not alerted frequently enough when their laptop’s battery is low, as they clearly ought to be.

This is an odd one, because there are already several power alert levels and this has been working well for us for some time. Nevertheless, enough people have voted for this idea that there must be something behind it, perhaps a bug that only affects certain systems. Martin Pitt, technical lead of the Ubuntu desktop team, has responded directly to the Brainstorm idea with a description of the current system and how to file a bug when it does not work as intended.

man-db 2.6.0

2011-04-09T20:45:17+01:00

I’ve released man-db 2.6.0 (announcement, NEWS, ChangeLog), and uploaded it to Debian unstable. Ubuntu is rapidly approaching beta freeze so I’m not going to try to cram this into 11.04; it’ll be in 11.10.

Wubi bug 693671

2011-03-14T12:56:57+00:00

I spent most of last week working on Ubuntu bug 693671 (“wubi install will not boot - phase 2 stops with: Try (hd0,0): NTFS5”), which was quite a challenge to debug since it involved digging into parts of the Wubi boot process I’d never really touched before. Since I don’t think much of this is very well-documented, I’d like to spend a bit of time explaining what was involved, in the hope that it will help other developers in the future.

Wubi is a system for installing Ubuntu into a file in a Windows filesystem, so that it doesn’t require separate partitions and can be uninstalled like any other Windows application. The purpose of this is to make it easy for Windows users to try out Ubuntu without the need to worry about repartitioning, before they commit to a full installation. Wubi started out as an external project, and initially patched the installer on the fly to do all the rather unconventional things it needed to do; we integrated it into Ubuntu 8.04 LTS, which involved turning these patches into proper installer facilities that could be accessed using preseeding, so that Wubi only needs to handle the Windows user interface and other Windows-specific tasks.

Anyone familiar with a GNU/Linux system’s boot process will immediately see that this isn’t as simple as it sounds. Of course, ntfs-3g is a pretty solid piece of software so we can handle the Windows filesystem without too much trouble, and loopback mounts are well-understood so we can just have the initramfs loop-mount the root filesystem. Where are you going to get the kernel and initramfs from, though? Well, we used to copy them out to the NTFS filesystem so that GRUB could read them, but this was overly complicated and error-prone. When we switched to GRUB 2, we could instead use its built-in loopback facilities, and we were able to simplify this. So all was more or less well, except for the elephant in the room. How are you going to load GRUB?

In a Wubi installation, NTLDR (or BOOTMGR in Windows Vista and newer) still owns the boot process. Ubuntu is added as a boot menu option using BCDEdit. You might then think that you can just have the Windows boot loader chain-load GRUB. Unfortunately, NTLDR only loads 16 sectors - 8192 bytes - from disk. GRUB won’t fit in that: the smallest core.img you can generate at the moment is over 18 kilobytes. Thus, you need something that is small enough to be loaded by NTLDR, but that is intelligent enough to understand NTFS to the point where it can find a particular file in the root directory of a filesystem, load boot loader code from it, and jump to that. The answer for this was GRUB4DOS. Most of GRUB4DOS is based on GRUB Legacy, which is not of much interest to us any more, but it includes an assembly-language program called GRLDR that supports doing this very thing for FAT, NTFS, and ext2. In Wubi, we build GRLDR as wubildr.mbr, and build a specially-configured GRUB core image as wubildr.

Now, the messages shown in the bug report suggested a failure either within GRLDR or very early in GRUB. The first thing I did was to remember that GRLDR has been integrated into the grub-extras ntldr-img module suitable for use with GRUB 2, so I tried building wubildr.mbr from that; no change, but this gave me a modern baseline to work on. OK; now to try QEMU (you can use tricks like qemu -hda /dev/sda if you’re very careful not to do anything that might involve writing to the host filesystem from within the guest, such as recursively booting your host OS … [update: Tollef Fog Heen and Zygmunt Krynicki both point out that you can use the -snapshot option to make this safer]). No go; it hung somewhere in the middle of NTLDR. Still, I could at least insert debug statements, copy the built wubildr.mbr over to my test machine, and reboot for each test, although it would be slow and tedious. Couldn’t I?

Well, yes, I mostly could, but that 8192-byte limit came back to bite me, along with an internal 2048-byte limit that GRLDR allocates for its NTFS bootstrap code. There were only a few spare bytes. Something like this would more or less fit, to print a single mark character at various points so that I could see how far it was getting:

        pushal
        xorw    %bx, %bx        /* video page 0 */
        movw    $0x0e4d, %ax    /* print 'M' */
        int     $0x10
        popal

In a few places, if I removed some code I didn’t need on my test machine (say, CHS compatibility), I could even fit in cheap and nasty code to print a single register in hex (as long as you didn’t mind ‘A’ to ‘F’ actually being ‘:’ to ‘?’ in ASCII; and note that this is real-mode code, so the loop counter is %cx not %ecx):

        /* print %edx in dumbed-down hex */
        pushal
        xorw    %bx, %bx
        movb    $0xe, %ah
        movw    $8, %cx
1:
        roll    $4, %edx
        movb    %dl, %al
        andb    $0xf, %al
        int     $0x10
        loop    1b
        popal

After a considerable amount of work tracking down problems by bisection like this, I also observed that GRLDR’s NTFS code bears quite a bit of resemblance in its logical flow to GRUB 2’s NTFS module, and indeed the same person wrote much of both. Since I knew that the latter worked, I could use it to relieve my brain of trying to understand assembly code logic directly, and could compare the two to look for discrepancies. I did find a few of these, and corrected a simple one. Testing at this point suggested that the boot process was getting as far as GRUB but still wasn’t printing anything. I removed some Ubuntu patches which quieten down GRUB’s startup: still nothing - so I switched my attentions to grub-core/kern/i386/pc/startup.S, which contains the first code executed from GRUB’s core image. Code before the first call to real_to_prot (which switches the processor into protected mode) succeeded, while code after that point failed. Even more mysteriously, code added to real_to_prot before the actual switch to protected mode failed too. Now I was clearly getting somewhere interesting, but what was going on? What I really wanted was to be able to single-step, or at least see what was at the memory location it was supposed to be jumping to.

Around this point I was venting on IRC, and somebody asked if it was reproducible in QEMU. Although I’d tried that already, I went back and tried again. Ubuntu’s qemu is actually built from qemu-kvm, and if I used qemu -no-kvm then it worked much better. Excellent! Now I could use GDB:

(gdb) target remote | qemu -gdb stdio -no-kvm -hda /dev/sda

This let me run until the point when NTLDR was about to hand over control, then interrupt and set a breakpoint at 0x8200 (the entry point of startup.S). This revealed that the address that should have been real_to_prot was in fact garbage. I set a breakpoint at 0x7c00 (GRLDR’s entry point) and stepped all the way through to ensure it was doing the right thing. In the process it was helpful to know that GDB and QEMU don’t handle real mode very well between them. Useful tricks here were:

Use set architecture i8086 before disassembling real-mode code (and set architecture i386 to switch back).
GDB prints addresses relative to the current segment base, but if you want to enter an address then you need to calculate a linear address yourself. For example, breakpoints must be set at (CS << 4) + IP, rather than just at IP.

Single-stepping showed that GRLDR was loading the entirety of wubildr correctly and jumping to it. The first instruction it jumped to wasn’t in startup.S, though, and then I remembered that we prefix the core image with grub-core/boot/i386/pc/lnxboot.S. Stepping through this required a clear head since it copies itself around and changes segment registers a few times. The interesting part was at real_code_2, where it copies a sector of the kernel to the target load address, and then checks a known offset to find out whether the “kernel” is in fact GRUB rather than a Linux kernel. I checked that offset by hand, and there was the smoking gun. GRUB recently acquired Reed-Solomon error correction on its core image, to allow it to recover from other software writing over sectors in the boot track. This moved the magic number lnxboot.S was checking somewhat further into the core image, after the first sector. lnxboot.S couldn’t find it because it hadn’t copied it yet! A bit of adjustment and all was well again.

The lesson for me from all of this has been to try hard to get an interactive debugger working. Really hard. It’s worth quite a bit of up-front effort if it saves you from killing neurons stepping through pages of code by hand. I think the real-mode debugging tricks I picked up should be useful for working on GRUB in the future.

libpipeline 1.1.0 released

2010-12-11T15:47:39+00:00

I’ve released libpipeline 1.1.0, and uploaded it to Debian unstable. The changes are mostly just to add a few occasionally useful interfaces:

Add pipecmd_exec to execute a single command, replacing the current process; this is analogous to execvp.
Add pipecmd_clearenv to clear a command’s environment; this is analogous to clearenv.
Add pipecmd_get_nargs to get the number of arguments to a command.

The shared library actually ends up being a few kilobytes smaller on Debian than 1.0.0, probably because I tweaked the set of Gnulib modules I’m using.

NTP synchronisation problems

2010-12-06T12:58:29+00:00

The Ubuntu Technical Board is currently conducting a review of the top ten Brainstorm issues users have raised about Ubuntu, and Matt asked me to investigate and respond to Idea #25301: Keeping the time accurate over the Internet by default.

My first reaction was “hey, that’s odd - I thought we already did that?”. We install the ntpdate package by default (although it’s deprecated upstream in favour of other tools, but that shouldn’t be important here). ntpdate is run from /etc/network/if-up.d/ntpdate, in other words every time you connect to a network, which should be acceptably frequent for most people, so it really ought to Just Work by default. But this is one of the top ten problems where users have gone to the trouble of proposing solutions on Brainstorm, so it couldn’t be that simple. What was going on?

I brought up a clean virtual machine with a development version of Natty (the current Ubuntu development version, which will eventually become 11.04), and had a look in its logs: it was indeed synchronising its time from ntp.ubuntu.com, and I didn’t think anything in that area had changed recently. On the other hand, I had occasionally noticed that my own laptop wasn’t always synchronising its time quite right, but I’d put it down to local weirdness as my network isn’t always very stable. Maybe this wasn’t so local after all?

So, I started tracing through the scripts to figure out what was going on. It turned out that I had an empty /etc/ntp.conf file on my laptop. The /usr/sbin/ntpdate-debian script assumed that that meant I had a full NTP server installed (I don’t), and fetched the list of servers from it; since the file was empty, it ended up synchronising time from no servers, that is, not synchronising at all. I removed the file and all was well.

That left the question of where that file came from. It didn’t seem to be owned by any package; I was pretty sure I hadn’t created it by hand either. I had a look through some bug reports, and soon found ntpdate 1:4.2.2.p4+dfsg-1ubuntu2 has a flawed configuration file. It turns out that time-admin (System -> Administration -> Time and Date) creates an empty /etc/ntp.conf file if you press the reload button (tooltip: “Synchronise now”), as part of an attempt to update NTP configuration. Aha!

Once I knew where the problems were, it was easy to fix them. I’ve uploaded the following changes, which will be in the 11.04 release:

Disregard empty ntp.conf files in ntpdate-debian.
Remove an empty /etc/ntp.conf file on fresh installation of the ntp package, so that it doesn’t interfere with creating the normal configuration file.
Don’t create the NTP configuration file in the time-admin backend if it doesn’t exist already.

I’ve also sent these changes to Debian and GNOME as appropriate.

There are still a few problems. The “Synchronise now” button doesn’t work quite right in general (bug #90524), and if your network doesn’t allow time synchronisation from ntp.ubuntu.com then you’ll have to change the value of NTPSERVERS in /etc/default/ntpdate. Furthermore, the time-admin interface is confusing and makes it seem as though the default is not to synchronise the time automatically; this interface is being redesigned at the moment, which should be a good opportunity to make it less confusing, and I will contact the designers to mention this problem. On the whole, though, I think that many fewer people should have this kind of problem in Ubuntu 11.04.

It’s always possible that I missed some other problem that breaks automatic time synchronisation for people. Please do file a bug report if it still doesn’t work for you in 11.04, or contact me directly (cjwatson at ubuntu.com).

man-db on Fedora

2010-12-02T14:06:58+00:00

I just found out by chance that Fedora 14 switched from their old man package to man-db. This is great news: it should now be the beginning of the end of the divergence of man implementations that happened way back in the mid-1990s, when two different people took John W. Eaton’s man package and developed it in different directions without being aware of each other’s existence. For a while it looked as though man-db was stuck on just the Debian family and openSUSE, but a number of distributions have switched over in the last few years. As of now, the only remaining major distribution not using man-db is Gentoo, and they have a bug for switching which I think should be unblocked fairly soon.

In some ways man-db’s package name didn’t help it; people thought that the main difference was that man-db had a database backend stuck around apropos. These days, the database is one of the least important parts of man-db as far as I’m concerned. Other ways in which it’s very significantly superior to anything man could do without years of equivalent effort include correct encoding support, robust child process handling, and use of more modern development facilities (dear catgets: you belong to a previous millennium, so please go away). I’m glad that Fedora has recognised this.

libpipeline 1.0.0 released

2010-10-29T21:23:26+01:00

In my previous post, I described the pipeline library from man-db and asked whether people were interested in a standalone release of it. Several people expressed interest, and so I’ve now released libpipeline version 1.0.0. It’s in the Debian NEW queue, and my PPA contains packages of it for Ubuntu lucid and maverick.

I gave a lightning talk on this at UDS in Orlando, and my slides are available. I hope there’ll be a video at some point which I can link to.

Thanks to Scott James Remnant for code review (some time back), Ian Jackson for an extensive design review, and Kees Cook and Matthias Klose for helpful conversations.

Pipeline library

2010-10-03T22:59:11+01:00

When I took over man-db in 2001, one of the major problems that became evident after maintaining it for a while was the way it handled subprocesses. The nature of man and friends means that it spends a lot of time calling sequences of programs such as zsoelim < input-file | tbl | nroff -mandoc -Tutf8. Back then, it was using C library facilities such as system and popen for all this, and I had to deal with several bugs where those functions were being called with untrusted input as arguments without properly escaping metacharacters. Of course it was possible to chase around every such call inserting appropriate escaping functions, but this was always bound to be error-prone and one of the tasks that rapidly became important to me was arranging to start subprocesses in a way that was fundamentally immune to this kind of bug.

In higher-level languages, there are usually standard constructs which are safer than just passing a command line to the shell. For example, in Perl you can use system([$command, $arg1, $arg2, ...]) to invoke a program with arguments without the interference of the shell, and perlipc(1) describes various facilities for connecting them together. In Python, the subprocess module allows you to create pipelines easily and safely (as long as you remember the SIGPIPE gotcha). C has the fork and execve primitives, but assembling these to construct full-blown pipelines correctly is difficult and error-prone, so many programmers don’t bother and use the simple but unsafe library facilities instead.

I wrote a couple of thousand lines of library code in man-db to address this problem, loosely and now quite distantly based on code in groff. In the following examples, function names starting with command_, pipeline_, or decompress_ are real functions in the library, while any other function names are pseudocode.

Constructing the simplified example pipeline from my first paragraph using this library looks like this:

pipeline *p;
int status;

p = pipeline_new ();
p->want_infile = "input-file";
pipeline_command_args (p, "zsoelim", NULL);
pipeline_command_args (p, "tbl", NULL);
pipeline_command_args (p, "nroff", "-mandoc", "-Tutf8", NULL);
pipeline_start (p);
status = pipeline_wait (p);
pipeline_free (p);

You might want to construct a command more dynamically:

command *manconv = command_new_args ("manconv", "-f", from_code,
                                     "-t", "UTF-8", NULL);
if (quiet)
        command_arg (manconv, "-q");
pipeline_command (p, manconv);

Perhaps you want an environment variable set only while running a certain command:

command *less = command_new ("less");
command_setenv (less, "LESSCHARSET", lesscharset);

You might find yourself needing to pass the output of one pipeline to several other pipelines, in a “tee” arrangement:

pipeline *source, *sink1, *sink2;

source = make_source ();
sink1 = make_sink1 ();
sink2 = make_sink2 ();
pipeline_connect (source, sink1, sink2, NULL);
/* Pump data among these pipelines until there's nothing left. */
pipeline_pump (source, sink1, sink2, NULL);
pipeline_free (sink2);
pipeline_free (sink1);
pipeline_free (source);

Maybe one of your commands is actually an in-process function, rather than an external program:

command *inproc = command_new_function ("in-process", &func, NULL, NULL);
pipeline_command (p, inproc);

Sometimes your program needs to consume the output of a pipeline, rather than sending it all to some other subprocess:

pipeline *p = make_pipeline ();
const char *line;

line = pipeline_peekline (p);
if (!strstr (line, "coding: UTF-8"))
        printf ("Unicode text follows:\n");
while (line = pipeline_readline (p))
        printf ("  %s", line);
pipeline_free (p);

man-db deals with compressed files a lot, so I wrote an add-on library for opening compressed files (which is somewhat man-db-specific, but the implementation wasn’t difficult given the underlying library):

pipeline *decomp_file = decompress_open (compressed_filename);
pipeline *decomp_stdin = decompress_fdopen (fileno (stdin));

This library has been in production in man-db for over five years now. The very careful signal handling code has been reviewed independently and the whole thing has been run through multiple static analysis tools, although I would always welcome more review; in particular I have no idea what it would take to make it safe for use in threaded programs since I generally avoid threading wherever possible. There have been a handful of bugs, which I’ve fixed promptly, and I’ve added various new features to support particular requirements of man-db (though in as general a way as possible). Every so often I see somebody asking about subprocess handling in C, and I wonder if I should split this library out into a standalone package so that it can be used elsewhere. Web searches for things like “pipeline library” and “libpipeline” don’t reveal anything that’s a particularly close match for what I have. The licensing would be GPLv2 or later; this isn’t likely to be negotiable since some of the original code wasn’t mine and in any case I don’t feel particularly bad about giving an advantage to GPLed programs. For more details on the interface, the header file is well-commented.

Is there enough interest in this to make the effort of producing a separate library package worthwhile? As well as the general effort of creating a new package, I’d need to do some work to disentangle it from a few bits and pieces specific to man-db. If you maintain a specific package that could use this and you’re interested, please contact me with details, mentioning any extensions you think you’d need. I intentionally haven’t enabled comments on my blog for various reasons, but you can e-mail me at cjwatson at debian.org or man-db-devel at nongnu.org.

Windows applications making GRUB 2 unbootable

2010-08-28T00:47:21+01:00

If you find that running Windows makes a GRUB 2-based system unbootable (Debian bug, Ubuntu bug), then I’d like to hear from you. This is a bug in which some proprietary Windows-based software overwrites particular sectors in the gap between the master boot record and the first partition, sometimes called the “embedding area”. GRUB Legacy and GRUB 2 both normally use this part of the disk to store one of their key components: GRUB Legacy calls this component Stage 1.5, while GRUB 2 calls it the core image (comparison). However, Stage 1.5 is less useful than the core image (for example, the latter provides a rescue shell which can be used to recover from some problems), and is therefore rather smaller: somewhere around 10KB vs. 24KB for the common case of ext[234] on plain block devices. It seems that the Windows-based software writes to a sector which is after the end of Stage 1.5, but before the end of the core image. This is why the problem appears to be new with GRUB 2.

At least some occurrences of this are with software which writes a signature to the embedding area which hangs around even after uninstallation (even with one of those tools that tracks everything the installation process did and reverses it, I gather), so that you cannot uninstall and reinstall the application to defeat a trial period. This seems like a fine example of an antifeature, especially given its destructive consequences for free software, and is in general a poor piece of engineering; what happens if multiple such programs want to use the same sector, I wonder? They clearly aren’t doing much checking that the sector is unused, not that that’s really possible anyway. While I do not normally think that GRUB should go to any great lengths to accommodate proprietary software, this is a case where we need to defend ourselves against the predatory practices of some companies making us look bad: a relatively small number of people do enough detective work to realise that it’s the fault of a particular Windows application, but many more simply blame our operating system because it won’t start any more.

I believe that it may be possible to assemble a collection of signatures of such software, and arrange to avoid the disk sectors they have stolen. Indeed, I have a first draft of the necessary code. This is not a particularly pleasant solution, but it seems to be the most practical way around the problem; I’m hoping that several of the programs at fault are using common “licence manager” code or something like that, so that we can address most of the problems with a relatively small number of signatures. In order to do this, I need to hear from as many people as possible who are affected by this problem.

If you suffer from this problem, then please do the following:

Save the output of fdisk -lu to a file. In this output, take note of the start sector of the first partition (usually 63, but might also be 2048 on recent installations, or occasionally something else). If this is something other than 63, then replace 63 in the following items with your number.
Save the contents of the embedding area to a file (replace /dev/sda with your disk device if it’s something else): dd if=/dev/sda of=sda.1 count=63
Do whatever you do to make GRUB unbootable (presumably starting Windows), then boot into a recovery environment. Before you reinstall GRUB, save the new contents of the embedding area to a different file: dd if=/dev/sda of=sda.2 count=63
Follow up to either the Debian or the Ubuntu bug with these three files (the output of fdisk -lu, and the embedding area before and after making GRUB unbootable.

I hope that this will help me to assemble enough information to fix this bug at least for most people, and of course if you provide this information then I can make sure to fix your particular version of this problem. Thanks in advance!

debhelper statistics, redux

2010-07-10T23:40:20+01:00

Apropos of my previous post, I see that dh has now overtaken CDBS as the most popular rules helper system of its kind in Debian unstable, and shows no particular sign of slowing its rate of uptake any time soon. The resolution of the graph is such that you can’t see it yet, but dh drew dead level with CDBS on Thursday, and today 3836 packages are using dh as opposed to 3823 using CDBS.

GRUB 2: With luck …

2010-07-02T22:27:35+01:00

… this version, or something not too far away from it, might actually stand a chance of getting into testing.

I’ve just uploaded grub2 1.98+20100702-1. The most significant set of changes in this release is that it switches /boot/grub/device.map and the grub-pc/install_devices debconf question over to stable device names under /dev/disk/by-id (on Linux kernels). The code implementing this is reasonably careful, and it should make it quite difficult for people to accidentally fail to upgrade their installed GRUB core image; I explained the problems that tends to cause in the previous post in this series. There will probably be a few small glitches I need to clear up, but I’ve given this much more extensive testing than usual so I hope I won’t break too many people’s computers (again).

I did this work first in Ubuntu as one of my major goals for 10.04 LTS, which exposed a few problems that I wanted to fix before inflicting it on Debian as well (fixes for those are now under testing for 10.04.1). Most significantly, I felt it was necessary to start offering partitions in the select list for grub-pc/install_devices, but I went a bit overboard and offered all partitions in a giant list. This seemed like a good idea at the time, but it tended to confuse people into just selecting everything in the list, which in particular tended to make Windows unbootable! So I dialled that back a bit, and in the version I just merged it will only offer the partitions mounted on /, /boot, and /boot/grub (de-duplicating if necessary). This seems like a reasonable compromise between confusing people too much and forcing them to install only to MBRs.

My next priority will be making whatever fixes are necessary to get this version into testing, since the problems with /dev/mapper symlinks in testing aren’t getting any less urgent, and this is finally a version that shouldn’t break for most people due to the kernel’s switch to libata. I expect that I’ll try to get mdadm 1.x metadata sorted out immediately after that.

Other improvements since my last entry have included:

Further documentation work. Thanks to Vladimir Serbinenko (and to Jordan Uggla for hosting it temporarily), there’s now an HTML version of the GRUB manual from trunk online, which includes new sections on embedded configuration files, the various GRUB image files, device.map, and (shortly) a summary of changes from GRUB Legacy.
Video improvements: among other things, UEFI systems whose firmware uses the Graphics Output Protocol should now work rather better, and GRUB now includes specific support for some cards often used with minimal firmware support under emulation.
A fix to handle large memory maps exposed by some UEFI firmware.
Automatic configuration support for Fedora 13. You may need os-prober 1.39 from unstable as well.
Automatic configuration support for Linux on Xen.
Skip LVM snapshots rather than failing when they’re present.

GRUB 2 boot problems

2010-06-21T11:52:10+01:00

(This is partly a repost of material I’ve posted to bug reports and to debian-release, put together with some more detail for a wider audience.)

You could be forgiven for looking at the RC bug activity on grub2 over the last couple of days and thinking that it’s all gone to hell in a handbasket with recent uploads. In fact, aside from an interesting case which turned out to be due to botched handling of the GRUB Legacy to GRUB 2 chainloading setup (which prompted me to fix three other RC bugs along the way), all the recent problems people have been having have been duplicates of one of these bugs which have existed essentially forever:

When GRUB boots, its boot sector first loads its “core image”, which is usually embedded in the gap between the boot sector and the first partition on the same disk as the boot sector. This core image then figures out where to find /boot/grub, and loads grub.cfg from it as well as more GRUB modules.

The thing that tends to go wrong here is that the core image must be from the same version of GRUB as any modules it loads. /boot/grub/*.mod are updated only by grub-install, so this normally works OK. However, for various reasons (deliberate or accidental) some people install GRUB to multiple disks. In this case, grub-install might update /boot/grub/*.mod along with the core image on one disk, but your BIOS might actually be booting from a different disk. The effect of this will be that you’ll have an old core image and new modules, which will probably blow up in any number of possible ways. Quite often, this problem lies dormant for a while because GRUB happens not to change in a way that causes incompatibility between the core image and modules, but then we get massive spikes of bug reports any time the interface does change. Since these bugs sometimes bite people upgrading from testing to unstable, they get interpreted as regressions from the version in testing even though that isn’t strictly true (but it tends not to be very productive to argue this line; after all, people’s computers suddenly don’t boot!). Any problem that causes the core image to be installed to a disk other than the one actually being booted from, or not to be installed at all, will show up this way sooner or later.

On 2010-06-10, there was a substantial upstream change to the handling of list iterators (to reduce core image size and make code clearer and faster) which introduced an incompatibility between old core images and newer modules. This caused a bunch of dormant problems to flare up again, and so there was a flood of reports of booting problems with 1.98+20100614-1 and newer, often described as “the unaligned pointer bug” due to how it happened to manifest this time round. In previous cases, GRUB reported undefined symbols on boot, but it’s all essentially the same problem even though there are different symptoms.

The confusing bit when handling bug reports is that not only are there different symptoms with the same cause, but there are also multiple causes for the same symptom! This takes a certain amount of untangling, especially when lots of people have thought “ooh, that bug looks a bit like mine” and jumped in with their own comments. Working through this was a worthwhile exercise, as it came up with an entirely new cause for a problem I thought was fairly well-understood (thanks to debugging assistance from Sedat Dilek). If you had set up GRUB 2 to be automatically chainloaded from GRUB Legacy (which happens automatically on upgrade from the latter to the former), never got round to running upgrade-from-grub-legacy once you confirmed it worked, and then later ran grub-install by hand for one reason or another, then the core image you installed by hand would never be updated and would eventually fall over the next time the core/modules interface changed. Fixing future cases of this was easy enough, but fixing existing cases involved figuring out how to detect whether an installed GRUB boot sector came from GRUB Legacy or GRUB 2, which isn’t as easy as you might think. Fortunately, it turns out that there are a limited number of jump offsets that have ever been used in the second byte of the boot sector, and none of the GRUB 2 values clash with the only value ever used in GRUB Legacy; so, if you still have /boot/grub/stage2 et al on upgrade, we scan all disks for a GRUB 2 boot sector, and if we find one then we offer to complete the upgrade to GRUB 2.

Unless anything new shows up, that just leaves the problems that were already understood. Today, I posted a patch to generate stable device names in device.map by default. If this is accepted, then we can do something or other to fix up device.map on upgrade, switch over to /dev/disk/by-id names in grub-pc/install_devices at the same time, and that should take care of the vast majority of this kind of upgrade bug. I think at that point it should be feasible to get a new version into testing, and we should be down from 18 RC bugs towards the end of last month to around 6. We can then start attacking things like the lack of support for mdadm 1.x metadata.

Since my last blog entry on GRUB 2, improvements have included:

Substantial work on info grub, with, among other things, new sections on /etc/default/grub and on configuring authentication.
A workaround for GRUB’s inability to probe dm-crypt devices, thanks to Marc Haber.
Several build fixes for architectures I wasn’t testing, and a fix for broken nested partition handling on Debian GNU/kFreeBSD. I’m now testing GNU/kFreeBSD locally.
Rather less cruft in fs.lst, partmap.lst, and video.lst, which should speed up booting a bit by e.g. avoiding unnecessary filesystem probing.
upgrade-from-grub-legacy actually now installs GRUB 2 to the boot sector (!).
Ask for confirmation if grub-pc/install_devices is left empty.

The next upstream snapshot will bring several improvements to EFI video support, mainly thanks to Vladimir Serbinenko. I’ve been working on making grub-install actually work on UEFI systems as one of my goals for the next Ubuntu release, and I hope to get this landed in the not-too-distant future.

Hacking on grub2

2010-06-04T22:57:07+01:00

Various people observed in a long thread on debian-devel that the grub2 package was in a bit of a mess in terms of its release-critical bug count, and Jordi and Stefano both got in touch with me directly to gently point out that I probably ought to be doing something about it as one of the co-maintainers.

Actually, I don’t think grub2 was in quite as bad a state as its 18 RC bugs suggested. Of course every boot loader failure is critical to the person affected by it, not to mention that GRUB 2 offers more complex functionality than any other boot loader (e.g. LVM and RAID), and so it tends to accumulate RC bugs at rather a high rate. That said, we’d been neglecting its bug list for some time; Robert and Felix have both been taking some time off, Jordi mostly only cared about PowerPC and can’t do that any more due to hardware failure, and I hadn’t been able to pick up the slack.

Most of my projects at work for the next while involve GRUB in one way or another, so I decided it was a perfectly reasonable use of work time to do something about this; I was going to need fully up-to-date snapshots anyway, and practically all the Debian grub2 bugs affect Ubuntu too. Thus, with the exception of some other little things like releasing the first Maverick alpha, I’ve spent pretty much the last week and a half solidly trying to get the grub2 package back into shape, with four uploads so far.

The RC issues that remain are:

upgrade-from-grub-legacy problems (#547944, #550477):

I think this has just been traditionally undertested. I’m setting up a KVM image now with GRUB Legacy which I can snapshot just before and after running upgrade-from-grub-legacy, and I should be able to unpick the bugs this way.
LVM snapshots break GRUB’s LVM module (#574863):

Sean has been working on this and seems to be nearly there. Yay.
RAID metadata version 1.x not supported (#492897):

This became rather more of an issue recently since mdadm switched its default from the old 0.90 format which GRUB understood. Felix put together a branch implementing the hard parts of this a while back, and I’ve been trying to finish it off. The hard bit is dealing with device naming, especially as the new-format and rather more useful names under /dev/md/ don’t show up during d-i after creating RAID volumes; I think this is because we always create them as /dev/md0 etc. It’s looking tractable, though.
Another odd problem probing RAID (#548648):

Not sure about this one, and I’ll need to work with Josip on it as soon as I get a chance.
Stable device naming #554790) and consequential problems due to grub-install not being properly run (#557425 and many other sub-RC bugs):

Ubuntu’s been carrying a patch to rearrange device presentation in the postinst, which Robert OKed in principle ages ago and so I’ve been intending to merge it for a while, but there are a few known problems with it that I need to fix first. One known unfixable problem is that it will have to ask some people which devices they want GRUB to be installed on, even if they’d answered that question before: this will be one-time, and it’s because it recorded the answer using unstable device names and so has in some sense forgotten. Simple cases (e.g. single-disk) can be handled without needing to ask again, though.
Alignment errors on SPARC (#560823):

I have no idea what’s going on here, I’m afraid. I’ll try to trace it, but may have to downgrade it at some point since after all we don’t install GRUB by default on SPARC yet.
Fonts not shown in gfxmenu (#564844):

Apparently fixed upstream, but I couldn’t find the responsible commit so I want to make sure I can get gfxmenu working before closing this.
Sensitivity to out-of-date device.map files (#575076 and other sub-RC bugs):

We’re trying to get rid of device.map in general. It was fine in the 1990s but it’s hopeless now. Unfortunately there are still a small number of problems with running entirely without one, and one of my patches to help is controversial upstream, so we probably won’t get to that for squeeze. In the meantime we’ll probably just need some extra sanity-checking and robustness in the event that there’s an incorrect or out-of-date device.map lying around, which we may just be able to do in the maintainer scripts or something if necessary.
Seriously weird failures to load initramfs (#582342):

If anyone can produce a reproduction recipe for this, that would really help me out. There are too many reports to discount as user error, but I haven’t seen this myself yet.
Build failure on sparc (unfiled):

We’ve been discussing this upstream, but for the time being I’m just going to stop building grub-emu on sparc as a workaround.

If we can fix that lot, or even just the ones that are reasonably well-understood, I think we’ll be in reasonable shape. I’d also like to make grub-mkconfig a bit more robust in the event that the root filesystem isn’t one that GRUB understands (#561855, #562672), and I’d quite like to write some more documentation.

On the upside, progress has been good. We have multiple terminal support thanks to a new upstream snapshot (#506707), update-grub runs much faster (#508834, #574088), we have DM-RAID support with a following wind (#579919), the new scheme with symlinks under /dev/mapper/ works (#550704), we have basic support for btrfs / as long as you have something GRUB understands properly on /boot (#540786), we have full info documentation covering all the user-adjustable settings in /etc/default/grub, and a host of other smaller fixes. I’m hoping we can keep this up.

If you’d like to help, contact me, especially if there’s something particular that isn’t being handled that you think you could work on. GRUB 2 is actually quite a pleasant codebase to work on once you get used to its layout; it’s certainly much easier to fix bugs in than GRUB Legacy ever was, as far as I’m concerned. Thanks to tools like grub-probe and grub-fstest, it’s very often possible to fix problems without needing to reboot for anything other than a final sanity check (although KVM certainly helps), and you can often debug very substantial bits of the boot loader - the bits that actually go wrong - using standard tools such as strace and gdb. Upstream is helpful and I’ve been able to get many of the problems above fixed directly there. If you have a sound knowledge of C and a decent level of understanding of the environment a boot loader needs to operate in - or for that matter specialist knowledge of interesting device types - then you should be able to find something to do.

OpenSSH 5.5p1 for Lucid

2010-05-10T10:29:51+02:00

For various reasons, I chose to leave Ubuntu 10.04 LTS using OpenSSH 5.3p1. The new features in 5.4p1 such as certificate authentication, the new smartcard handling, netcat mode, and tab-completion in sftp are great, but unfortunately it was available just a little bit too late for me to be able to land it for 10.04 LTS. I realise that many Lucid users want to make use of these features for one reason or another, though, so as a compromise here’s a PPA containing OpenSSH 5.5p1 for Lucid.

I intend to keep this up to date for as long as I reasonably can, and I’m happy to accept bug reports on it in the usual place.

Thoughts on 3.0 (quilt) format

2010-03-25T23:45:28+00:00

Note: I wrote most of this before Neil Williams’ recent comments on the 3.0 family of formats, so despite the timing this isn’t really a reaction to that although I do have a couple of responses. On the whole I think I agree that the Lintian message is a bit heavy-handed and I’m not sure I’m thrilled about the idea of the default source format being changed (though I can see why the dpkg maintainers are interested in that). That said, as far as I personally am concerned, there is a vast cognitive benefit to me in having as much as possible be common to all my packages. Once I have more than a couple of packages that require patching and benefit from the 3.0 (quilt) format as a result, I find it in my interest to use it for all my non-native packages even if they’re patchless right now, so that for instance if they need patches in the future I can handle them the same way. It’s not unheard of for me to apply temporary patches even to packages I actively maintain upstream, so I don’t discount those either. I haven’t decided what to do with my native packages yet; unless they’re big enough for bzip2 compression to be worthwhile, there doesn’t seem to be much immediate advantage to 3.0 (native).

Anyway, on to the main body of this post:

I’ve been one of the holdouts resisting use of patch systems for a long time, on the basis that I felt strongly that dpkg-source -x ought to give you the source that’s actually built, rather than having to mess around with debian/rules targets in order to see it. Now that the 3.0 (quilt) format is available to fix this bug, I felt that I ought to revisit my resistance and start trying to use it. Migrating to it from monolithic diffs is of course a bit more work than migrating to it from other patch systems, so it’s taken me a little while to get round to it. I’d been thinking about holding off until there was better integration with revision control (e.g. bzr looms), as I feel that patch files really ought to be an export format, but I eventually decided that I shouldn’t let the perfect be the enemy of the good. I have enough experience with co-maintaining packages that use build-time patch systems to be able to compare my reactions.

After experimenting with a couple of small packages, I moved over to the deep end and converted openssh a few weekends ago, since quite a few people have requested over the years that the Debian changes to openssh be easier to audit. This was a substantial job - over 6000 lines of upstream patches - but not actually as much work as I expected. I took a fairly simplistic approach: first, I unapplied all the upstream patches from my tree; then I ran bzr di | interdiff -q /dev/stdin /dev/null >x, reduced it to a single logically-discrete patch, applied it to a new quilt patch using quilt fold, and repeated until x was empty. This was maybe an hour or two of work, and then I went through and tagged all the patches according to DEP-3, which took another few hours. After the first pass, I ended up with 38 patches and a much clearer idea of what has been forwarded upstream and what hasn’t; I currently have 5 patches to forward or eliminate, down from 18.

Good things:

I don’t lose any of my history. Since all the patches remain applied to the tree in revision control (this is what dpkg-source -x gives you, so it’s the natural representation in revision control too), bzr blame works just as you’d expect and displays both upstream and Debian changes at once. I rely on tools like blame a lot, and I really hate the way build-time patch systems make it hard to use revision control when the tree is in a built state, so this was a hard requirement for me.
I’ve used patch tagging before, so I was expecting some benefits, but viscerally I feel much more in control. It’s so much less laborious now to see what I need to do by way of forwarding. I don’t regret waiting for 3.0 (quilt) to become available, but I hadn’t realised quite how much I was being held back beforehand.
Adding new patches is pretty natural, much more so than with build-time patch systems. You can create and apply the patch, test-build, and commit when it works. I much prefer this over having to clean the tree before committing (or commit just part of the tree, which is error-prone). The more that committing to a Debian package feels like committing to an upstream project, the better.
There’s definitely something to be said for patch-tracker being more useful. It deals with DEP-3 to the extent of linkifying URLs, although it might be nice if patch descriptions were displayed on the overview page for each version.

Bad things:

It’s a bit awkward to set things up when checking out from revision control; I didn’t really want to check in the .pc directory, and the tree checks out in the patched state (as it should), so I needed some way for developers to get quilt working easily after a checkout. This is sort of the reverse of the previous problem, where users had to do something special after dpkg-source -x, and I consider it less serious so I’m willing to put up with it. I ended up with a rune in debian/rules that ought to live somewhere more common.
Everything ends up represented twice in revision control: the patch files, plus the changes to the patched files themselves. I’m OK with this although it is a little inelegant.
Although I haven’t had to do it yet, I expect that merging new upstream releases will be a bit harder. bzr will deal with resolving conflicts in the patched files themselves, and that’s why I use a revision control system after all, but then I’ll have to go and refresh all the patches and will probably end up doing some of the same conflict resolution a second time. I think the best answer right now is to quilt pop -a, force a merge despite the modified working tree, and then quilt push && quilt refresh -pab until I get back to the top of the stack, modulo slight fiddliness when a patch disappears entirely; thus effectively using quilt’s conflict resolution rather than bzr’s. I suppose this will serve as additional incentive to reduce my patch count. I know that people have been working on making this work nicely with topgit, although I’m certainly not going to put up with the rest of git due to that; I’m happy to wait for looms to become usable and integrated. :-)
It would be nice if there were some standard DEP-3 way to note that a patch has been accepted or rejected upstream, beyond just putting it in the description. In particular, it seems to me that listing patches accepted upstream could be used to speed up the process of merging new upstream releases.

On the whole I’m satisfied with this, and the benefits definitely outweigh the costs. Thanks to the dpkg team for all their work on this!

parted 2.2 transition

2010-03-22T01:12:48+00:00

I’ve started the transition of parted 2.2 to unstable. This is a major update needed for sensible support of newer hard disks with alignment requirements different from the archaic cylinder alignment tradition. I posted to debian-boot with a summary of the partman changes involved.

debhelper statistics

2010-03-03T00:22:19+00:00

I don’t know if anyone else has been tracking this recently, but a while back I got curious about the relative proportions of dh(1) and CDBS in the archive, and started running some daily analysis on the Lintian lab. Apologies for my poor graphing abilities, but the graph is here (occasionally updated):

Although dh is still a bit behind CDBS, the steady upward trend is quite striking - it looks set to break 20% soon, up from under 13% in September - compared with CDBS which has been sitting within half a percentage point of 25% the whole time.

Incidentally, was that an ftpmaster trying to sign his name in the graph over Christmas or something? :-)

Catching up

2010-02-21T20:04:55+00:00

I did a bit of catching up on my Debian backlog over the last week or so. Among the things I got round to:

I released man-db 2.5.7. This was mostly an “I’ve been meaning to do this for ages” kind of thing to reduce the bug list a bit, closing ten Debian bugs, but there were a few interesting things in there as well, such as always saving cat pages in UTF-8 and recoding to the user’s locale at display time (long overdue), adjusting the search order for localised manual pages by request of quite a few non-native English speakers to prefer a page in the right section over a page in the right language, and a cute gimmick to make things like man /usr/bin/time display the appropriate manual page rather than the text of the executable. See the NEWS file for more details.
binfmt-support now installs cleanly on non-Linux systems, even if it doesn’t do anything useful yet.
I fixed a couple of shell bugs in groff.
halibut now complies with the Debian Vim policy, even though I can’t say I entirely agree with it in this case.
I fixed a really odd build failure in troffcvt. Yay imake, or something.
All Debian patches to putty are now upstream, or will be once I upload a new snapshot. Thanks to Simon Tatham and Jacob Nevins.
I did a few bits and pieces of packaging cleanup with an eye on my DDPO list, and added some watch files where they were missing.
Responded to an offer to take over icoutils maintenance.

So nothing really earth-shaking, and as ever openssh could use some attention, but I feel a bit better about my backlog now. I do still have a critical bug in makepasswd to fix, and a sponsored upload of parrot; those are the next two things on my to-do list.

Tissue of lies

2009-11-13T17:37:36+00:00

In case it isn’t obvious, in “Ubuntu 9.10 SP1 coming in spring 2010”, “Ubuman” is blatantly lying in attributing a number of statements to me. None of the text there was written by me, and if you thought any of it was true then you should probably make sure your troll radar is working properly. Nice joke, but try harder next time - it doesn’t even look like my writing style.

(I wouldn’t normally bother to respond, since I’m probably just giving it more publicity, but apparently one or two people may already have been taken in by it. One person was sensible enough to write to me and check the facts.)

Keysigning bits

2009-07-31T11:31:44+00:00

If you’re generating one of these shiny new RSA keys, do please remember to generate an encryption subkey too if you expect people to sign it - at least your more obscure UIDs. I’m not going to mail unencrypted signatures around unless I have some out-of-band knowledge that the e-mail address actually belongs to the person I met.

I generated a new 4096-bit RSA key myself at DebConf (baa!), and have just published a key transition document. Please consider signing my new key if you signed my old one.

man-db: ‘man -K’

2009-07-14T15:36:45+00:00

I recently implemented man -K (full-text search over all manual pages) in man-db. This was inspired by a similar feature in Federico Lucifredi’s man package (formerly maintained by Andries Brouwer). I think I did a much better job of it, though. The man package just forks grep for every manual page; man-db takes advantage of the pipeline library I wrote for it a while back and does it entirely in-process (decompression requires a fork but no exec, while the man package has to exec gunzip as well).

The upshot is that, with a hot cache, man-db takes around 40 seconds to search all manual pages on my laptop; the man package (also with a hot cache) takes around five minutes, and interactive performance goes down the drain while it’s doing it since it’s spawning subprocesses like crazy. If I limit to a single section, the disparity is closer to 3x than 10x, but it’s still very noticeable. It’s interesting how much good libraries can do to help guide efficient approaches to problems.

Of course, a proper full-text search engine would be much better still, but that’s a project for some other time …

Python SIGPIPE handling

2009-07-02T08:14:26+00:00

Enrico writes about creating pipelines with Python’s subprocess module, and notes that you need to take care to close stdout in non-final subprocesses so that subprocesses get SIGPIPE correctly. This is correct as far as it goes (and true in any language, although there’s a Python bug report requesting that subprocess be able to do this itself, but there’s an additional gotcha with Python that you missed.

Python ignores SIGPIPE on startup, because it prefers to check every write and raise an IOError exception rather than taking the signal. This is all well and good for Python itself, but most Unix subprocesses don’t expect to work this way. Thus, when you are creating subprocesses from Python, it is very important to set SIGPIPE back to the default action. Before I realised this was necessary, I wrote code that caused serious data loss due to a child process carrying on out of control after its parent process died!

import signal
import subprocess

def subprocess_setup():
    # Python installs a SIGPIPE handler by default. This is usually not what
    # non-Python subprocesses expect.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

subprocess.Popen(command, preexec_fn=subprocess_setup)

I filed a patch a while back to add a restore_sigpipe option to subprocess.Popen, which would take care of this. As I say in that bug report, in a future release I think this ought to be made the default, as it’s very easy to get things dangerously wrong right now.

code_swarm video of Ubuntu uploads

2009-05-28T20:29:55+00:00

Joey Hess posted a draft of a code_swarm video for d-i a couple of weeks ago, which reminded me that I’ve been meaning to do something similar for Ubuntu for a while now as it’s just about our archive’s fifth birthday. I have a more or less complete archive of all our -changes mailing lists locally (I think I’m missing some of the very early ones, before the end of July 2004; let me know if you were one of the very early Canonical employees and have a record of these), and with the aid of launchpadlib it’s fairly easy to map all the e-mail addresses into Launchpad user names, massage out some of the more obvious duplicates, and then treat the stream of uploads as if it were a stream of commits.

If you haven’t seen code_swarm before, each dot represents an upload, and the dots “swarm” around their corresponding committers’ names; more active committers have larger swarms of dots and brighter names. I assigned a colour to each of our archive components (uploads aren’t really at the C code vs. Python code vs. translations vs. whatever kind of granularity that you see in other code_swarm videos), which mostly means that people who predominantly upload to main are in roughly an Ubuntu tan colour, people who predominantly upload to universe are coloured bluish, and people with a good mixture tend to come out coloured green. If I get a bit more time I may try to figure out enough about video editing software to add some captions.

Here’s the video (194 MB).

Bug triage, redux

2009-03-05T11:04:02+00:00

I’ve been a bit surprised by the strong positive response to my previous post. People generally seemed to think it was quite non-ranty; maybe I should clean the rust off my flamethrower. :-) My hope was that I’d be able to persuade people to change some practices, so I guess that’s a good thing.

Of course, there are many very smart people doing bug triage very well, and I don’t want to impugn their fine work. Like its medical namesake, bug triage is a skilled discipline. While it’s often repetitive, and there are lots of people showing up with similar symptoms, a triage nurse can really make a difference by spotting urgent cases, cleaning up some of the initial blood, and referring the patient quickly to a doctor for attention. Or, if a pattern of cases suddenly appears, a triage nurse might be able to warn of an incipient epidemic. [Note: I have no medical experience, so please excuse me if I’m talking crap here. :-)] The bug triagers who do this well are an absolute godsend; especially when they respond to repetitive tasks with tremendously useful pieces of automation like bughelper. The cases I have trouble with are more like somebody showing up untrained, going through everyone in the waiting room, and telling each of them that they just need to go home, get some rest, and stop complaining so much. Sometimes of course they’ll be right, but without taking the time to understand the problem they’re probably going to do more harm than good.

Ian Jackson reminded me that it’s worth mentioning the purpose of bug reports on free software: namely, to improve the software. The GNU Project has some advice to maintainers on this. I think sometimes we stray into regarding bug reports more like support tickets. In that case it would be appropriate to focus on resolving each case as quickly as possible, if necessary by means of a workaround rather than by a software change, and only bother the developers when necessary. This is the wrong way to look at bug reports, though. The reason that we needed to set up a bug triage community in Ubuntu was that we had a relatively low developer-to-package ratio and a very high user-to-developer ratio, and we were getting a lot of bug reports that weren’t fleshed out enough for a developer to investigate them without spending a lot of time in back-and-forth with the reporter, so a number of people volunteered to take care of the initial back-and-forth so that good clear bug reports could be handed over to developers. This is all well and good, and indeed I encouraged it because I was personally finding myself unable to keep up with incoming bugs and actually fix anything at the same time. Somewhere along the way, though, some people got the impression that what we wanted was a first-line support firewall to try to defend developers from users, which of course naturally leads to ideas such as closing wishlist bugs containing ideas because obviously those important developers wouldn’t want to be bothered by them, and closing old bugs because clearly they must just be getting in developers’ way. Let me be clear about this now: I absolutely appreciate help getting bug reports into a state where I can deal with them efficiently, but I do not want to be defended from my users! I don’t have a basis from which to state that all developers feel the same way, but my guess is that most do.

Antti-Juhani Kaijanaho said he’d experienced most of these problems in Debian. I hadn’t actually intended my post to go to Planet Debian - I’d forgotten that the “ubuntu” category on my blog goes there too, which generally I see as a feature, but if I’d remembered that I would have been a little clearer that I was talking about Ubuntu bug triage. If I had been talking about Debian bug triage I’d probably have emphasised different things. Nevertheless, it’s interesting that at least one Debian (and non-Ubuntu) developer had experienced similar problems.

Justin Dugger mentions a practice of marking duplicate bugs invalid that he has problems with. I agree that this is suboptimal and try not to do it myself. That said, this is not something I object to to the same extent. Given that the purpose of bugs is to improve the software, the real goal is to be able to spend more time fixing bugs, not to get bugs into the ideal state when the underlying problem has already been solved. If it’s a choice between somebody having to spend time tracking down the exact duplicate bug number versus fixing another bug, I know which I’d take. Obviously, when doing this, it’s worth apologising that you weren’t able to find the original bug number, and explaining what the user can do if they believe that you’re mistaken (particularly if it’s a bug that’s believed to be fixed); the stock text people often use for this doesn’t seem informative enough to me.

Sebastien Bacher commented that preferred bug triage practices differ among teams: for instance, the Ubuntu desktop team deals with packages that are very much to the forefront of users’ attention and so get a lot of duplicate bugs. Indeed - and bug triagers who are working closely with the desktop team on this are almost certainly doing things the way the developers on the desktop team prefer, so I have no problem with that. The best advice I can give bug triagers is that their ultimate aim is to help developers, and so they should figure out which developers they need to work with and go and talk to them! That way, rather than duplicating work or being counterproductive, they can tailor their work to be most effective. Everybody wins.

Bug triage rants

2009-03-02T14:51:37+00:00

I hate to say this, but often when somebody does lots of bug triage on a package I work on, I find it to be a net loss for me. I end up having to go through all the things that were changed, correct a bunch of them, occasionally pacify angry bug submitters, and all the rest of it, and often the benefits are minimal at best.

I would very much like this not to be the case. Bug triage is supposed to help developers be more efficient, and I think most people who do bug triage are generally well-intentioned and eager to help. Accordingly, here is a series of mini-rants intended to have educational value.

Bugs are not like fruit.

Fruit goes bad if you leave it too long. By and large, bugs don’t, especially if they’re on software that doesn’t change very much. There is no reason why a bug filed against a package in Ubuntu 4.10 where the relevant code hasn’t changed much since shouldn’t still be perfectly valid. Even if it isn’t, it deserves proper consideration.

My biggest single annoyance with bug triage is people coming around and asking if bugs are still valid when they haven’t put any effort into reproducing them themselves. This annoys bug submitters too; every so often somebody replies and says “didn’t you even bother to check?”. This gives a very bad impression of us as a project - wouldn’t it be better if we looked as if we knew what we were talking about? There is a good reason to do this kind of check, of course: random undiagnosed crash reports and the like may well go away due to related changes, and it is occasionally worth checking. But if the bug is already well-understood and/or well-described, you should just go and check whether it’s still there rather than asking.

As I understand it, the intended workflow is that people file bugs, then if they aren’t clear enough bug triagers work with the submitter to gather information until they are, then they’re passed to developers for further work. We seem to have added an extra step wherein submitters must periodically give their bug a health-check, and if they don’t then it gets closed as being out of date. In a small minority of cases this is useful; in most cases, frankly, it makes us look a bit clueless. Can we please stop doing this? The more we waste people’s time doing this, the less likely it is that they’ll bother to respond to us, and this might help our statistics but doesn’t help the project as a whole.

I know that there’s a problem with bug count. I think every project of non-trivial size has that problem. But, honestly, the right answer is to fix more bugs - and, personally, I would be able to spend more time doing that if I weren’t often running around trying to make sure that bugs I care about aren’t getting overenthusiastically closed just because somebody thinks they’ve been lying around too long.

There is a good way to expire bugs like this, of course. It goes something like this: “I’ve read through your bug and tried to reproduce it with a current release, but I’m afraid I can’t do so. Are you still experiencing it? If not, then I think it might have been fixed by [this change I found in the package’s history that seems to be related].” You can’t do this en masse, but you’ll get a much better response from submitters, you’ll learn more doing it, and in the process of doing the necessary investigation of each bug you’ll find that there are many cases you don’t have to ask about at all.
Wishlist bugs are not intrinsically bad.

There are certainly cases where something is far too broad or vague for a bug report; but there are also plenty of cases, probably far more, where the wish in question is a relatively small change to the program, or doesn’t need any more sophisticated tracking, and a wishlist bug is just right. If you don’t know the program very well, it may be difficult to tell whether a wishlist bug is appropriate or not; in that case, just leave the bug alone.

Please, for the love of all that’s holy, don’t close wishlist bugs saying that people should use Brainstorm or write a specification instead! If you don’t want to see wishlist bugs in your statistics, just filter them out; it’s quite easy to do. Even worse, don’t tell people that something probably isn’t a good idea when you aren’t familiar with the software; people who have gone to the effort of writing up their idea for us deserve a response from somebody who knows the software well. I’ve encountered cases where friends of mine submitted a bug report (sometimes even at my request) and then a triager told them it was a bad idea and closed their bug. This sort of thing puts people off Ubuntu.

Specifications are software design documents. As such, they are best written by software designers. People who tell other people to go and write a specification may not realise that as a result of doing this for three years it’s now essentially impossible to find anything in the specification system! The intent was never that every user of Ubuntu would need to write a specification to get anything changed; specifications are used by developers to document the results of discussions and write up plans. They are not a straightforward alternative to wishlist bugs, nor do they turn out to work very well as what many formal processes call “requirements documents”; the process of refining the latter in the context of Ubuntu might involve wishlist bugs, mailing list threads, wiki pages, private discussions with developers, or things of that nature, and probably shouldn’t involve creating a specification until the requirements-gathering process is well underway.
Closing a bug is taking an item off somebody’s to-do list.

You wouldn’t go up to a colleague’s whiteboard and take an eraser to it unless you were sure that was OK, would you? Yet people seem to do that all the time with bugs. It’s OK when the bug is really just like a support request - “help, it crashed, what do I do?” - and either you’re pretty sure it’s user error or there’s just no way to get enough information to fix it. But once the initial triage process is done, now it’s on somebody’s to-do list.

This is closely related to …
If a developer has accepted it, leave it alone.

Every so often I find that there’s a bug that I have accepted by way of a bug comment or setting to Triaged or whatever, or even a bug that I filed on a package I work on as a reminder to myself, and somebody comes along and asks for more information or asks if we can still reproduce it or something. The hit rate on this kind of thing is extraordinarily low. There’s a good chance that the developer went and verified the bug against the code, and in that case it certainly doesn’t need more information (or they would have asked for it) and it probably isn’t going to go away without anyone noticing.

In most other free software projects, developers file bug reports themselves as a reminder about things that need to be done, and people leave them alone unless they’re intending to help with the fix. In Ubuntu, developers also have to spend time making sure that those to-do items don’t get expired. Nobody is helped by this.

launchpad-gm-scripts includes a Greasemonkey script called lp_karma_suffix, which can help you to identify developers without having to spend lots of time clicking around.
Check whether the package is being actively worked on.

Some packages are actively worked on in Ubuntu; some aren’t (e.g. we just sync packages from Debian, or they’re basically orphaned, or whatever). It’s worth checking which is which before doing any kind of extensive triage work. If it’s being actively worked on, why not go and talk to the developer(s) in question first? It’s only polite, and it will probably help you to do a better job.

Re: Perl is strange

2008-06-23T15:58:08+00:00

Christoph: That’s because =~ binds more tightly than +. This does what you meant:

$ perl -le 'print "yoo" if (1 + 1) =~ /3/'

perlop(1) has a useful table of precedence.

Don’t use sshkeygen.com to generate keys!

2008-06-23T09:31:57+00:00

To my horror, I recently saw this online SSH key generator.

I hope nobody reading this needs to be told why this is a bad idea. However, in case you do, here are a few reasons:

Every SSH implementation I know of - certainly all the major ones - that support public key authentication also provide a key generation utility. Even aside from all the good reasons not to, there is simply no reason why you should need to use a web-based tool in the first place.
How can you trust the person running this site? Without implying that I know he or she is untrustworthy (I don’t), and with the best will in the world, it’s a big Internet with a lot of nasty people on it. Do you really want somebody you don’t know in a position to keep a copy of all your private keys?
Even if the person is trustworthy, the server running sshkeygen.com is now a giant blinking target. If lots of people use it, there is every incentive in the world for the bad guys to try to take control of it so that they can keep a copy of all your private keys. (Or, as we know from recent bitter experience, they can just give out keys from a limited set and it will probably take a couple of years before anyone notices …)
The front page of sshkeygen.com says that the keys are escrowed. The plain English meaning of this would be that the operator of that site keeps a copy of the private key, to be held in trust in case (presumably) you lose it and need to retrieve it. Normally this sort of thing depends on a legal trust relationship, perhaps linked to a contract. What does it mean here? Is it just a buzzword? If it isn’t, then this just makes sshkeygen.com even more of a target.
sshkeygen.com delivers keys to you over unencrypted HTTP. Yes, this is on its to-do list. That isn’t really an excuse.
Even if keys were delivered over HTTPS, that still relies on people diligently checking the authenticity of the certificate. A self-signature (as suggested as an alternative in the to-do list) would be impossible to check with any reliability; and will people who have trouble with non-web-based key generation software really be able or inclined to confirm the signature chain? Browsers typically don’t enforce this very strictly, or if they do they provide fairly simple ways to bypass the enforcement, simply because so many sites have broken or poorly-signed SSL certificates, and keeping up with all the CAs is pretty hard work too.
Furthermore, delivering private keys over HTTPS makes that SSL certificate a single giant blinking target. Might it be compromised? How would you tell? What servers would need to be compromised in order to get a copy of the private SSL key?
Sure, Debian is in an awkward position here given the recent OpenSSL random number generation vulnerability. However, how do you know that sshkeygen.com is running on a system that doesn’t suffer from this? (As it happens, I have checked, and it doesn’t appear to suffer from this vulnerability - but most people won’t check and won’t know how to check.)

I think this is probably being done in innocent seriousness (although I kind of hope it’s a joke in poor taste), and have e-mailed the contact address offering to explain why it’s a bad idea.

Vim omni completion for Launchpad bugs

2008-01-31T11:17:58+00:00

I hacked together a little timesaver for developers this morning: omni completion for Launchpad bugs in Vim’s debchangelog mode. To use it, install vim 7.1-138+1ubuntu3 once it hits the mirrors, open up a debian/changelog file, type “LP: #”, and hit Ctrl-X Ctrl-O. It’ll think for a while and then give you a list of all the bugs open in Launchpad against the package in question, from which you can select to insert the bug number into your changelog.

Here’s a screenshot to make it clearer:

Thanks to Stefano Zacchiroli for doing the same for Debian bugs back in July.

UTF-8 manual pages

2008-01-29T01:57:51+00:00

See Encodings in man-db for context.

Yesterday, I uploaded man-db 2.5.1-1 to unstable. With this version, not only is it possible to install manual pages in UTF-8 (as with 2.5.0, although with fewer bugs), but it’s also possible to ask man to produce a version of an arbitrary page in the encoding of your choice, and have it guess the source encoding for you fairly reliably. This finally provides enough support to have debhelper automatically recode manual pages to UTF-8.

It’ll probably take a little while to shake out the corner-case bugs, but I’m generally pretty happy with this. Once the new man-db and debhelper land in testing, I’ll send a note to debian-devel-announce and push harder on my policy amendment.

Considering the historical state of man-db when it comes to localisation, and all of the dependencies and general yak-shaving that had to be tackled to get here, this represents the end of probably several hundred hours of work, so I’m pretty happy that this is out the door. The only remaining step is to add UTF-8 input support to groff, which fortunately Brian M. Carlson is working on. After that, we can reasonably claim to have dragged manual pages kicking and screaming into the 21st century.

aptitude safe-upgrade

2007-11-29T20:51:23+00:00

Erich: I do sometimes wonder why we don’t relax the definition of “safe” upgrades to include installing new packages but still not removing old ones. I know that many of my uses of dist-upgrade are just for when something grows a new dependency that I didn’t previously have installed.

(Of course this wouldn’t always help as it wouldn’t account for a new dependency that conflicted with an old dependency, but never mind. It would certainly do wonders for the metapackage case.)

Encodings in man-db

2007-09-17T07:28:20+00:00

I’ve spent some quality upstream time lately with man-db. Specifically, I’ve been upgrading its locale support. I recently published a pre-release, man-db 2.5.0-pre2 mainly for translators, but other people may be interested in having a look at it as well. I hope to release 2.5.0 quite soon so that all of this can land in Debian.

Firstly, man-db now supports creating and using databases for per-locale hierarchies of manual pages, not just English. This means that apropos and whatis can now display information about localised manual pages.

Secondly, I’ve been working on the transition to UTF-8 manual pages. Now, modulo some hacks, groff can’t yet deal with Unicode input; some possible input characters are reserved for its internal use which makes full 32-bit input difficult to do properly until that’s fixed. However, with a few exceptions, manual pages generally just need the subset of Unicode that corresponds to their language’s usual legacy character set, so for now it’s good enough to just recode on the fly from UTF-8 to some appropriate 8-bit character set and use groff’s support for that.

man-db has actually supported doing this kind of thing for a while, but it’s been difficult to use since it only applies to /usr/share/man/ll_CC.UTF-8/ directories, while manual pages usually aren’t country-specific. So, man-db 2.5.0 supports using /usr/share/man/ll.UTF-8/ instead, which is a bit more appropriate. Also, following a discussion with Adam Borowski, man-db can now try decoding manual pages as UTF-8 and fall back to 8-bit encodings even in directories without an explicit encoding tag; if this fails for some reason, you can put a '\" -*- coding: UTF-8 -*- line at the top of the page.

I’m still debating whether Debian policy should recommend installing UTF-8 manual pages in /usr/share/man/ll.UTF-8/ or just in /usr/share/man/ll/. Initially I was very strongly in favour of an encoding declaration, but now that man-db can do a pretty good job of guesswork I’m coming round to Adam Borowski’s position that people should be able to forget about character sets with UTF-8. Opinions here would be welcome. One thing I haven’t moved on is that any design that assumes that the encoding of manual pages on the filesystem has anything to do with the user’s locale is demonstrably incorrect and broken; I’m not going to use LC_CTYPE for anything except output. However, maybe “UTF-8 or the usual legacy encoding provided that the latter is not typically confused for the former” is a good enough specification, and that still has the desirable property of not requiring a flag day. I’ll try to come down from the fence before unleashing this code on the world.

Keysigning public service announcement

2007-07-04T17:45:39+00:00

If your key has so many UIDs and such a combinatorially exploded number of signatures on it that it takes gpg minutes just to start up in --edit-key mode, then I probably won’t bother signing it. HTH, HAND.

Moving conffiles between packages, redux

2006-12-23T23:37:08+00:00

I spent far too much of today cleaning up an upgrade bug to do with conffiles, which I suspect also affects other packages that have attempted to work around dpkg conffile prompts when moving conffiles between packages. If you maintain such a package, please review your code to make sure that it works properly when upgrading both with sarge’s dpkg and with etch’s dpkg. See my debian-devel post

for a full description.

Google Summer of Code project started (Debian)

2006-05-26T17:23:00+00:00

I’m mentoring Matheus Morais in the Google Summer of Code, porting d-i to the Hurd. We’ve exchanged a few mails and he has in hand all the preliminary (but not yet functional; wouldn’t want to make it too easy :-)) patches I’ve put together in the past. I think I should be reasonably well-placed to judge his progress.

Best of luck, Matheus!

Unix tools: sponge

2006-02-06T20:45:38+00:00

Joey writes about the lack of new tools that fit into the Unix philosophy. My favourite of such things I’ve written is sponge. It addresses the problem of editing files in-place with Unix tools, namely that if you just redirect output to the file you’re trying to edit then the redirection takes effect (clobbering the contents of the file) before the first command in the pipeline gets round to reading from the file. Switches like sed -i and perl -i work around this, but not every command you might want to use in a pipeline has such an option, and you can’t use that approach with multiple-command pipelines anyway.

I normally use sponge a bit like this:

sed '...' file | grep '...' | sponge file

Since it’s so trivial I imagine lots of other people have written something similar (another common name for it seems to be inplace; my name indicates soaking up all the input and then squeezing it all out again); but I do keep meaning to try to get a rewritten version into coreutils at some point.

debconf/cdebconf coinstallability

2006-01-27T02:55:06+00:00

Joey has been campaigning for a while to get everything in the archive changed to depend on debconf | debconf-2.0 or similar rather than just debconf, in order that we can start rolling out cdebconf as its replacement. Like most jobs that involve touching the bulk of the archive, this looks set to take quite a while, as the list of bugs should indicate.

Recently it occurred to me that we didn’t necessarily have to do it that way round. In a bout of late-night hacking while staying awake to look after a sick child (he seems mostly OK now, although the rushed trip to the hospital earlier was a bit on the nerve-wracking side), I’ve shuffled things around in the cdebconf package so that it no longer has any file conflicts with debconf or debconf-doc, and changed the debconf confmodule to fire up the cdebconf frontend rather than its own if the DEBCONF_USE_CDEBCONF environment variable is non-empty. (The details of this may change before it actually gets uploaded, as I’d like to get Joey to look it over and approve it first.) This allows you to install cdebconf, set that environment variable, and play around with cdebconf with relative ease; when we come to switch to cdebconf for real, instead of a huge conflicting mess that apt will probably have trouble resolving, it’ll just be a matter of changing a couple of lines in /usr/share/debconf/confmodule.

Of course, don’t expect cdebconf to be a complete working replacement for debconf just yet; if you try using it for a dist-upgrade run it’ll fall over. Due to its d-i heritage, it doesn’t yet load templates automatically; that has to be done by hand. Frontend names differ from debconf’s, which will need some migration code. At the moment it can only handle UTF-8 templates, which are mandated in the installer but only optional in the rest of the system. It doesn’t have all of debconf’s rich array of database modules. I haven’t adapted the Perl or Python confmodules yet. The list goes on. However, I think we at least stand a chance of getting a handle on the problem now.

(I’ll post this article to debian-devel once the changes have been reviewed and uploaded.)

Killer apps: bzr shelve

2006-01-09T16:47:43+00:00

Working on free software has made me fairly revision control system-agnostic; I can’t afford to get too wedded to any one system because as soon as I do somebody will invent something new and I’ll have to convert again, so I just work with whatever other people on the same project are using. Even CVS doesn’t make a lot of difference to the way I work as long as I’m working online and have cvsps handy. And of course I usually don’t bother with revision control if I’m just tweaking somebody else’s Debian source package a bit (in which case I just use debdiff for paranoia).

Using bzr at work, though, I think I just found my killer app in Michael Ellerman’s shelve plugin. My working style generally involves alternating between doing lots and lots of stuff in the one working copy and (after testing) going through and committing it in logical chunks. This is fine if everything’s in separate files (most revision control systems let you commit just some files), but if several of the chunks are in the one file then I’m reduced to saving diffs and manually editing out the bits I don’t want to commit yet, which is obviously pretty tedious and error-prone.

bzr shelve presents each diff hunk in your working copy to you in turn and asks you whether you want to keep it. If you say no, that hunk gets unapplied and goes into a “shelf”, where bzr unshelve can later reapply it. In the meantime commits act as though the shelved hunks didn’t exist. This doesn’t help if you want to defer only one of two immediately adjacent changes that end up in the same hunk, of course, but it vastly reduces the scale of the problem.

I suppose it would be easy enough to write a shelve-a-like for any other system; it’s just that I haven’t seen it for any other system yet. If working with systems that lack it really starts to annoy me, I may have to rip out the guts of shelve and figure out how to make it generic.

Single-stage installer

2006-01-03T15:32:27+00:00

Hot on the heels of Joey’s tale of getting rid of base-config (the second stage of the installer) in Debian, we’ve now pretty much got rid of it in Ubuntu Dapper too. The upshot of this is that rather than asking a bunch of questions, installing the base system, and rebooting to install everything else, we now just install everything in one go and reboot into a completed system.

This does mean that, if your system doesn’t boot, you don’t get to find out about it for a bit longer. However, it has lots of advantages in terms of speed (the much-maligned archive-copier mostly goes away), reducing code duplication (base-config had a bunch of infrastructure of its own which was done better in the core installer anyway), comprehensibility, and killing off some annoying bugs like #13561 (duplicate mirror questions in netboot installs), #15213 (second stage hangs if you skip archive-copier in the first stage), and #19571 (kernel messages scribble over base-config’s UI).

To go with Joey’s Debian timeline, the Ubuntu history looks a bit like this:

2004 (Jul): First base-config modifications for Ubuntu; we need to install the default desktop rather than dropping into tasksel.
2004 (Aug): Mark phones me up and asks if I can make the installer not need the CD in the second stage by copying all the packages across beforehand. Although it’s a bit awkward, I can see the UI advantages in that, so I write archive-copier at the Canonical conference in Oxford.
2004 (Sep): Mark asks me if we can ask the timezone, user/password, and apt configuration questions before the first reboot. With less than a month to go until our first release, I have a heart-attack at how much needs to be done, and it eventually gets deferred to Hoary.
2005 (Jan): Matt fixes up debconf’s passthrough frontend for use on the live CD, and we realise that this is an obvious way to run bits of base-config before the first reboot. It’s rather messy and takes until March or so before it really works right, but we get there in the end.
2005 (Apr): I get “put a progress bar in front of the dpkg output in the second stage” as a goal for Breezy. Naïvely, I think it’s a simple matter of programming, since I’d already done something similar for debootstrap and base-installer the previous year.
2005 (May): I hack progress bar support into debconf. Nothing actually uses it for anything yet, except as a convenient passthrough stub.
2005 (Jul/Aug): I actually try to implement the second-stage progress bar and realise that it’s about an order of magnitude harder than I thought, requiring a whole load of extra infrastructure in apt. Fortunately Michael Vogt saves the day here by writing lots of working code, and the progress bar works by early August.
2005 (Sep-Dec): Upstream d-i development ramps back up again, with tzsetup, clock-setup, apt-setup, and user-setup all being cranked out in short order and the corresponding pieces removed from base-config. I merge these as they mature, and manage to get agreement on including the Ubuntu debconf template changes in upstream apt-setup, which helps the diff size a lot.
2005 (Nov/Dec): Joey and I chat one evening about the Ubuntu second-stage progress bar work, and we end up designing and writing debconf-apt-progress based on its ideas, after which Joey knocks up pkgsel in no time flat.
2006 (Jan): The rest of the pieces land in Ubuntu, and we drop base-config out of the installer. To my surprise, nearly everything still just works.

Although it caused some friction, I’m glad that we did the first cuts of many of these things outside Debian and got to try things out before landing version-2-quality code in Debian. The end result is much nicer than the intermediate ones ever were.

Forwarding bugs to the IETF

2006-01-03T13:16:06+00:00

Sometimes following up on a bug takes you a lot further than you expected. Debian bug #337041 looked like it was going to be fairly straightforward once I upgraded coreutils to figure out what the new IUTF8 flag actually did, since the SSH2 protocol already supports transferring termios flags around.

Unfortunately, since IUTF8 is relatively new, it doesn’t have a number assigned in the draft connection protocol Moreover, that Internet-Draft is in the last stages before becoming an RFC and can’t be modified any more, and it doesn’t include any facility for private-use extensions. D’oh. To add further complication, since IUTF8 is Linux-specific, it’s not hard to imagine that some other OS might introduce something with the same name but subtly different semantics, and so the SSH protocols can’t just defer to POSIX for the definition but instead have to spell out exactly what they mean.

As a result of all of this, it looks like the best way to make progress might be for me to write an I-D myself that creates a channel extension to set or clear IUTF8, and attempt to enlist support from some upstream implementors. I didn’t expect bug triage to lead me into the Internet standardisation process quite so quickly!

Hello!

2006-01-03T13:11:31+00:00

New year, new blog. I’ve had a LiveJournal for a while, but don’t write very much in it, and many of its readers wouldn’t be interested in me talking about Debian and such anyway. I think the best solution is for me to keep technical posts here …