I fixed a bug in Launchpad recently that led me deeper than I expected.
Launchpad uses Buildout as its build system for
Python packages, and it’s served us well for many years. However, we’re
using 1.7.1, which doesn’t support ensuring that packages required using
setup_requires only ever come from the local index URL when one is
specified; that’s an
essential constraint we need to be able to impose so that our build system
isn’t immediately sensitive to downtime or changes in PyPI. There are
various issues/PRs about this in Buildout (e.g.
#238), but even if those
are fixed it’ll almost certainly only be in Buildout v2, and upgrading to
that is its own kettle of fish for other reasons. All this is a serious
problem for us because newer versions of many of our vital dependencies
(testtools among them) use
setup_requires to pull in pbr, and so
we’ve been stuck on old versions for some time; this is part of why
Launchpad doesn’t yet support newer SSH key types, for instance. This
situation obviously isn’t sustainable.
To deal with this, I’ve been working for some time on switching to virtualenv and pip. This is harder than you might think: Launchpad is a long-lived and complicated project, and it had quite a number of explicit and implicit dependencies on Buildout’s configuration and behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has helped a lot (12.04’s baseline virtualenv and pip have some deficiencies that would have required a more complicated bootstrapping procedure). I’ve dealt with most of these: for example, I had to reorganise a lot of our helper scripts (1, 2, 3), but there are still a few more things to go.
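For what it’s worth, pip does support the constraint that was missing from Buildout 1.7.1: setting index-url replaces the default PyPI index outright rather than adding to it, so resolution only ever touches the index you name. A minimal sketch of the configuration, with the file location and internal index URL both being placeholders:

    # ~/.config/pip/pip.conf (hypothetical internal index)
    [global]
    index-url = https://pypi.example.com/simple/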
One remaining problem was that our Buildout configuration relied on building several different environments with different Python paths for various things. While this would technically be possible by way of building multiple virtualenvs, this would inflate our build time even further (we’re already going to have to cope with some slowdown as a result of using virtualenv, because the build system now has to do a lot more than constructing a glorified link farm to a bunch of cached eggs), and it seems like unnecessary complexity. The obvious thing to do seemed to be to collapse these into a single environment, since there was no obvious reason why it should actually matter if txpkgupload and txlongpoll were carefully kept off the path when running most of Launchpad: so I did that.
Then our build system got very sad.
Hmm, I thought. To keep our test times somewhat manageable, we run them in parallel across 20 containers, and we randomise the order in which they run to try to shake out test isolation bugs. It’s not completely unknown for there to be some oddities resulting from that. So I ran it again. Nope, but slightly differently sad this time. Furthermore, I couldn’t reproduce these failures locally no matter how hard I tried. Oh dear. This was obviously not going to be a good day.
In fact I spent a while on various different guesswork-based approaches. I
found bug 571334 in
Ampoule, an AMP-based process pool implementation that we use for some job
runners, and proposed a fix
for that, but cherry-picking that fix into Launchpad didn’t help matters. I
tried backing out subsets of my changes and determined that if both
txpkgupload and txlongpoll were absent from the Python module path in
the context of the tests in question then everything was fine. I tried
strace locally and staring at the output for some time in the hope
of enlightenment: that reminded me that the two packages in question install
plugins into the twisted.plugins namespace package, which did at least
establish a reason they
might affect the environment that was more plausible than magic, but nothing
much more specific than that.
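For context, a Twisted plugin drop-in is nothing more than a module placed in the twisted.plugins namespace package whose module-level attributes provide IPlugin. A minimal sketch of the shape, with all of the names being hypothetical:

    # Shipped as twisted/plugins/example_plugin.py (hypothetical);
    # Twisted finds it by scanning the twisted.plugins namespace.
    from zope.interface import implementer

    from twisted.application import service
    from twisted.plugin import IPlugin
    from twisted.python import usage


    class Options(usage.Options):
        """Command-line options for the hypothetical service."""


    @implementer(IPlugin, service.IServiceMaker)
    class ExampleServiceMaker(object):
        tapname = "example"  # enables `twistd example`
        description = "A do-nothing example service."
        options = Options

        def makeService(self, options):
            return service.Service()


    # The module-level instance is what the plugin system collects
    # and records in dropin.cache.
    serviceMaker = ExampleServiceMaker()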
On Friday I was fiddling about with this again and trying to insert some
more debugging when I noticed some interesting behaviour around plugin
caching. If I caused the
txpkgupload plugin to raise an exception when loaded, the
Twisted plugin system would remove its
dropin.cache (because it was stale)
and not create a new one (because there was now no content to put in it).
After that, running the relevant tests would fail as I’d seen in our
buildbot. Aha! This meant that I could also reproduce it by doing an even
cleaner build than I’d previously tried to do, by removing the cached
txpkgupload and txlongpoll eggs and allowing the build system to
recreate them. When they were recreated, they didn’t contain
dropin.cache, instead allowing that to be created on first use.
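That regeneration is easy to trigger by hand, since any full plugin scan refreshes stale caches as a side effect; a sketch:

    from twisted.plugin import IPlugin, getPlugins

    # getPlugins defaults to scanning the twisted.plugins namespace
    # package; for each member it reads dropin.cache if it's fresh, or
    # else imports the plugin modules and rewrites the cache (where the
    # directory is writable).
    list(getPlugins(IPlugin))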
Based on this clue I was able to get to the answer relatively quickly. Ampoule has a specialised bootstrapping sequence for its worker processes that starts by doing this:
    from twisted.application import reactors
    reactors.installReactor(reactor)
twisted.application.reactors.installReactor works by way of
twisted.plugin.getPlugins, so the very start of this bootstrapping
sequence is going to involve loading all plugins found on the module path
(I assume it’s possible to write a plugin that adds an alternative reactor
implementation, hence the scan). If a package’s
dropin.cache is up to date, then it will just get the
information it needs from that; but if it isn’t, it will go ahead and import
the plugin. If the plugin happens (as Twisted code often does) to run
from twisted.internet import reactor at some point while being imported, then
that will install the platform’s default reactor, and then
twisted.application.reactors.installReactor will raise
ReactorAlreadyInstalledError. Since Ampoule turns this into an info-level
log message for some reason, and the tests in question only passed through
error-level messages or higher, this meant that all we could see was that a
worker process had exited non-zero but not why.
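Once you know the mechanism, the collision is easy to reproduce in isolation; a sketch, with the reactor name chosen arbitrarily:

    # Importing the reactor installs the platform's default reactor as
    # a side effect...
    from twisted.internet import reactor

    from twisted.application import reactors

    # ...so explicitly installing a different one afterwards raises
    # twisted.internet.error.ReactorAlreadyInstalledError.
    reactors.installReactor("poll")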
There are still a few more things needed to get us onto pip, but we’re now pretty close. After that we can finally start bringing our dependencies up to date.