I fixed a bug in Launchpad recently that led me deeper than I expected.
Launchpad uses Buildout as its build system for
Python packages, and it’s served us well for many years. However, we’re
using 1.7.1, which doesn’t support ensuring that packages required using
setup_requires only ever come from the local index URL when one is
specified; that’s an
essential constraint we need to be able to impose so that our build system
isn’t immediately sensitive to downtime or changes in PyPI. There are
various issues/PRs about this in Buildout (e.g.
#238), but even if those
are fixed it’ll almost certainly only be in Buildout v2, and upgrading to
that is its own kettle of fish for other reasons. All this is a serious
problem for us because newer versions of many of our vital dependencies
(testtools among them) use
setup_requires to pull in pbr, and so
we’ve been stuck on old versions for some time; this is part of why
Launchpad doesn’t yet support newer SSH key types, for instance. This
situation obviously isn’t sustainable.
To deal with this, I’ve been working for some time on switching to virtualenv and pip. This is harder than you might think: Launchpad is a long-lived and complicated project, and it had quite a number of explicit and implicit dependencies on Buildout’s configuration and behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has helped a lot (12.04’s baseline virtualenv and pip have some deficiencies that would have required a more complicated bootstrapping procedure). I’ve dealt with most of these: for example, I had to reorganise a lot of our helper scripts (1, 2, 3), but there are still a few more things to go.
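For what it’s worth, pip does support the constraint that was missing from Buildout 1.7.1: setting index-url replaces the default PyPI index outright rather than adding to it, so resolution only ever touches the index you name. A minimal sketch of the configuration, with the file location and internal index URL both being placeholders:

    # ~/.config/pip/pip.conf (hypothetical internal index)
    [global]
    index-url = https://pypi.example.com/simple/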
One remaining problem was that our Buildout configuration relied on building several different environments with different Python paths for various things. While this would technically be possible by way of building multiple virtualenvs, this would inflate our build time even further (we’re already going to have to cope with some slowdown as a result of using virtualenv, because the build system now has to do a lot more than constructing a glorified link farm to a bunch of cached eggs), and it seems like unnecessary complexity. The obvious thing to do seemed to be to collapse these into a single environment, since there was no obvious reason why it should actually matter if txpkgupload and txlongpoll were carefully kept off the path when running most of Launchpad: so I did that.
Then our build system got very sad.
Hmm, I thought. To keep our test times somewhat manageable, we run them in parallel across 20 containers, and we randomise the order in which they run to try to shake out test isolation bugs. It’s not completely unknown for there to be some oddities resulting from that. So I ran it again. Nope, but slightly differently sad this time. Furthermore, I couldn’t reproduce these failures locally no matter how hard I tried. Oh dear. This was obviously not going to be a good day.
In fact I spent a while on various different guesswork-based approaches. I
found bug 571334 in
Ampoule, an AMP-based process pool implementation that we use for some job
runners, and proposed a fix
for that, but cherry-picking that fix into Launchpad didn’t help matters. I
tried backing out subsets of my changes and determined that if both
txpkgupload and txlongpoll were absent from the Python module path in
the context of the tests in question then everything was fine. I tried
strace locally and staring at the output for some time in the hope
of enlightenment: that reminded me that the two packages in question install
plugins into the twisted.plugins namespace package, which did at least
establish a reason they
might affect the environment that was more plausible than magic, but nothing
much more specific than that.
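For context, a Twisted plugin drop-in is nothing more than a module placed in the twisted.plugins namespace package whose module-level attributes provide IPlugin. A minimal sketch of the shape, with all of the names being hypothetical:

    # Shipped as twisted/plugins/example_plugin.py (hypothetical);
    # Twisted finds it by scanning the twisted.plugins namespace.
    from zope.interface import implementer

    from twisted.application import service
    from twisted.plugin import IPlugin
    from twisted.python import usage


    class Options(usage.Options):
        """Command-line options for the hypothetical service."""


    @implementer(IPlugin, service.IServiceMaker)
    class ExampleServiceMaker(object):
        tapname = "example"  # enables `twistd example`
        description = "A do-nothing example service."
        options = Options

        def makeService(self, options):
            return service.Service()


    # The module-level instance is what the plugin system collects
    # and records in dropin.cache.
    serviceMaker = ExampleServiceMaker()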
On Friday I was fiddling about with this again and trying to insert some
more debugging when I noticed some interesting behaviour around plugin
caching. If I caused the
txpkgupload plugin to raise an exception when loaded, the
Twisted plugin system would remove its
dropin.cache (because it was stale)
and not create a new one (because there was now no content to put in it).
After that, running the relevant tests would fail as I’d seen in our
buildbot. Aha! This meant that I could also reproduce it by doing an even
cleaner build than I’d previously tried to do, by removing the cached
txpkgupload and txlongpoll eggs and allowing the build system to
recreate them. When they were recreated, they didn’t contain
dropin.cache, instead allowing that to be created on first use.
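That regeneration is easy to trigger by hand, since any full plugin scan refreshes stale caches as a side effect; a sketch:

    from twisted.plugin import IPlugin, getPlugins

    # getPlugins defaults to scanning the twisted.plugins namespace
    # package; for each member it reads dropin.cache if it's fresh, or
    # else imports the plugin modules and rewrites the cache (where the
    # directory is writable).
    list(getPlugins(IPlugin))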
Based on this clue I was able to get to the answer relatively quickly. Ampoule has a specialised bootstrapping sequence for its worker processes that starts by doing this:
    from twisted.application import reactors
    reactors.installReactor(reactor)
twisted.application.reactors.installReactor works by way of
twisted.plugin.getPlugins, so the very start of this bootstrapping
sequence is going to involve loading all plugins found on the module path
(I assume it’s possible to write a plugin that adds an alternative reactor
implementation, hence the scan). If a package’s
dropin.cache is up to date, then it will just get the
information it needs from that; but if it isn’t, it will go ahead and import
the plugin. If the plugin happens (as Twisted code often does) to run
from twisted.internet import reactor at some point while being imported, then
that will install the platform’s default reactor, and then
twisted.application.reactors.installReactor will raise
ReactorAlreadyInstalledError. Since Ampoule turns this into an info-level
log message for some reason, and the tests in question only passed through
error-level messages or higher, this meant that all we could see was that a
worker process had exited non-zero but not why.
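Once you know the mechanism, the collision is easy to reproduce in isolation; a sketch, with the reactor name chosen arbitrarily:

    # Importing the reactor installs the platform's default reactor as
    # a side effect...
    from twisted.internet import reactor

    from twisted.application import reactors

    # ...so explicitly installing a different one afterwards raises
    # twisted.internet.error.ReactorAlreadyInstalledError.
    reactors.installReactor("poll")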
There are still a few more things needed to get us onto pip, but we’re now pretty close. After that we can finally start bringing our dependencies up to date.