--- /dev/null
+Title: A mysterious bug with Twisted plugins
+Slug: mysterious-bug-with-twisted-plugins
+Date: 2017-09-26 11:20:14 -0400
+Category: launchpad
+Tags: launchpad, planet-debian, planet-ubuntu
+
+I fixed a bug in Launchpad recently that led me deeper than I expected.
+
+Launchpad uses [Buildout](http://www.buildout.org/) as its build system for
+Python packages, and it's served us well for many years. However, we're
+using 1.7.1, which doesn't support ensuring that packages required using
+setuptools' [setup_requires
+keyword](https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords)
+only ever come from the local index URL when one is specified; that's an
+essential constraint we need to be able to impose so that our build system
+isn't immediately sensitive to downtime or changes in PyPI. There are
+various issues/PRs about this in Buildout (e.g.
+[#238](https://github.com/buildout/buildout/pull/238)), but even if those
+are fixed it'll almost certainly only be in Buildout v2, and upgrading to
+that is its own kettle of fish for other reasons. All this is a serious
+problem for us because newer versions of many of our vital dependencies
+([Twisted](http://twistedmatrix.com/) and
+[testtools](https://pypi.python.org/pypi/testtools), to name but two) use
+`setup_requires` to pull in [pbr](https://pypi.python.org/pypi/pbr), and so
+we've been stuck on old versions for some time; this is part of why
+Launchpad doesn't yet support newer SSH key types, for instance. This
+situation obviously isn't sustainable.
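+
+To make that concrete, the pattern looks roughly like this (a sketch with a
+made-up project name, not the real setup.py of any of the dependencies
+above): setuptools resolves `setup_requires` itself, via its easy_install
+machinery, while setup.py is still running, and Buildout 1.7.1 gives us no
+way to confine that step to our local index.
+
+    :::python
+    # Hypothetical setup.py, for illustration only.
+    from setuptools import setup
+
+    setup(
+        name='example-project',
+        version='1.0',
+        # Fetched by setuptools itself while setup.py runs, before the
+        # build system's index configuration can be applied to it.
+        setup_requires=['pbr'],
+    )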
+
+To deal with this, I've been working for some time on switching to
+[virtualenv](https://virtualenv.pypa.io/en/stable/) and
+[pip](https://pip.pypa.io/en/stable/). This is harder than you might think:
+Launchpad is a long-lived and complicated project, and it had quite a number
+of explicit and implicit dependencies on Buildout's configuration and
+behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has
+helped a lot (12.04's baseline virtualenv and pip have some deficiencies
+that would have required a more complicated bootstrapping procedure). I've
+dealt with most of these: for example, I had to reorganise a lot of our
+helper scripts
+([1](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-python-easy/+merge/314976),
+[2](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-shell/+merge/314973),
+[3](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-test/+merge/323743)),
+but there are still a few more things to go.
+
+One remaining problem was that our Buildout configuration relied on building
+several different environments with different Python paths for various
+things. While that would technically be possible by building multiple
+virtualenvs, it would inflate our build time even further (we're already
+going to have to cope with some slowdown as a result of using virtualenv,
+because the build system now has to do a lot more than constructing a
+glorified link farm to a bunch of cached eggs), and it seemed like
+unnecessary complexity. The obvious thing to do seemed to be to
+collapse these into a single environment, since there was no obvious reason
+why it should actually matter if
+[txpkgupload](https://pypi.python.org/pypi/txpkgupload) and
+[txlongpoll](https://pypi.python.org/pypi/txlongpoll) were carefully kept
+off the path when running most of Launchpad: so [I did
+that](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-recipes/+merge/330159).
+
+Then our build system [got very
+sad](http://lpbuildbot.canonical.com/builders/lp-devel-precise/builds/1582/steps/shell_9/logs/summary).
+
+Hmm, I thought. To keep our test times somewhat manageable, we run them in
+parallel across 20 containers, and we randomise the order in which they run
+to try to shake out test isolation bugs. It's not completely unknown for
+there to be some oddities resulting from that. So I ran it again. [Nope,
+but slightly differently sad this
+time](http://lpbuildbot.canonical.com/builders/lp-devel-precise/builds/1583/steps/shell_9/logs/summary).
+Furthermore, I couldn't reproduce these failures locally no matter how hard
+I tried. Oh dear. This was obviously not going to be a good day.
+
+In fact I spent a while on various different guesswork-based approaches. I
+found [bug 571334](https://bugs.launchpad.net/ampoule/+bug/571334) in
+Ampoule, an AMP-based process pool implementation that we use for some job
+runners, and proposed a
+[fix](https://code.launchpad.net/~cjwatson/ampoule/process-error-not-ready/+merge/330848)
+for that, but cherry-picking that fix into Launchpad didn't help matters. I
+tried backing out subsets of my changes and determined that if both
+`txlongpoll` and `txpkgupload` were absent from the Python module path in
+the context of the tests in question then everything was fine. I tried
+running `strace` locally and staring at the output for some time in the hope
+of enlightenment: that reminded me that the two packages in question install
+modules under `twisted.plugins`, which did at least establish a reason they
+might affect the environment that was more plausible than magic, but nothing
+much more specific than that.
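+
+For what it's worth, a module installed under `twisted.plugins` usually
+looks something like the sketch below (the names are invented, not taken
+from txpkgupload or txlongpoll); the point is that merely having such a
+package on the module path makes its dropin visible to Twisted's plugin
+scan.
+
+    :::python
+    # Sketch of a typical twisted/plugins/example_plugin.py dropin.
+    from zope.interface import implementer
+
+    from twisted.application.service import IServiceMaker
+    from twisted.plugin import IPlugin
+    from twisted.python import usage
+
+
+    class Options(usage.Options):
+        optParameters = [['config', 'c', None, 'Path to a config file.']]
+
+
+    @implementer(IServiceMaker, IPlugin)
+    class ExampleServiceMaker(object):
+        tapname = 'example'
+        description = 'Hypothetical service, for illustration only.'
+        options = Options
+
+        def makeService(self, options):
+            # Imported lazily so that merely scanning the plugin is cheap.
+            from example.service import make_service
+            return make_service(options)
+
+
+    # A provider instantiated at module level is what getPlugins() finds.
+    serviceMaker = ExampleServiceMaker()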
+
+On Friday I was fiddling about with this again and trying to insert some
+more debugging when I noticed some interesting behaviour around [plugin
+caching](https://twistedmatrix.com/documents/current/core/howto/plugin.html#plugin-caching).
+If I caused the `txpkgupload` plugin to raise an exception when loaded, the
+Twisted plugin system would remove its `dropin.cache` (because it was stale)
+and not create a new one (because there was now no content to put in it).
+After that, running the relevant tests would fail as I'd seen in our
+buildbot. Aha! This meant that I could also reproduce it by doing an even
+cleaner build than I'd previously tried to do, by removing the cached
+`txpkgupload` and `txlongpoll` eggs and allowing the build system to
+recreate them. When they were recreated, they didn't contain
+`dropin.cache`, instead allowing that to be created on first use.
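+
+As I understand the caching mechanism, it all hangs off
+`twisted.plugin.getPlugins`: enumerating plugins reads the `dropin.cache`
+next to each `twisted/plugins` directory if it's up to date, and otherwise
+imports the plugin modules and tries to write a fresh cache. Something like
+this is enough to watch that happen in a scratch environment:
+
+    :::python
+    from twisted.plugin import IPlugin, getPlugins
+
+    # Reads dropin.cache where it's current; otherwise imports the plugin
+    # modules and rewrites the cache (where the directory is writable).
+    for plugin in getPlugins(IPlugin):
+        print(plugin)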
+
+Based on this clue I was able to get to the answer relatively quickly.
+Ampoule has a specialised bootstrapping sequence for its worker processes
+that starts by doing this:
+
+    :::python
+    from twisted.application import reactors
+    reactors.installReactor(reactor)
+
+Now, `twisted.application.reactors.installReactor` calls
+`twisted.plugin.getPlugins`, so the very start of this bootstrapping
+sequence is going to involve loading all plugins found on the module path (I
+assume it's possible to write a plugin that adds an alternative reactor
+implementation). If `dropin.cache` is up to date, then it will just get the
+information it needs from that; but if it isn't, it will go ahead and import
+the plugin. If the plugin happens (as Twisted code often does) to run `from
+twisted.internet import reactor` at some point while being imported, then
+that will install the platform's default reactor, and *then*
+`twisted.application.reactors.installReactor` will raise
+`ReactorAlreadyInstalledError`. Since Ampoule turns this into an info-level
+log message for some reason, and the tests in question only passed through
+error-level messages or higher, this meant that all we could see was that a
+worker process had exited non-zero but not why.
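+
+The collision itself is easy to reproduce outside Ampoule. Here's a minimal
+sketch (Ampoule's real bootstrap takes the reactor name from its caller
+rather than hard-coding one):
+
+    :::python
+    from twisted.application import reactors
+    from twisted.internet.error import ReactorAlreadyInstalledError
+
+    # Importing anything that does this as a side effect (as plugin modules
+    # often do) installs the platform's default reactor.
+    from twisted.internet import reactor
+
+    # A later attempt to install a specific reactor then fails.
+    try:
+        reactors.installReactor('select')
+    except ReactorAlreadyInstalledError as e:
+        print(e)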
+
+The Twisted documentation
+[recommends](https://twistedmatrix.com/documents/current/core/howto/plugin.html#plugin-caching)
+generating the plugin cache at build time for other reasons, but we weren't
+doing that. [Fixing
+that](https://code.launchpad.net/~cjwatson/launchpad/build-twisted-plugin-cache/+merge/331240)
+makes everything work again.
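+
+The change boils down to enumerating the plugins once as a build step, so
+that each `twisted/plugins` directory ships with a current `dropin.cache`
+and worker processes never have to regenerate it. The general shape is
+something like this (not necessarily the exact invocation in the branch
+above):
+
+    :::python
+    # Run once at build time, after all the eggs are in place.
+    from twisted.plugin import IPlugin, getPlugins
+
+    list(getPlugins(IPlugin))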
+
+There are still a few more things needed to get us onto pip, but we're now
+pretty close. After that we can finally start bringing our dependencies up
+to date.