From d5f3abfea84b7cd435d46e4916f7b3fb6c52b825 Mon Sep 17 00:00:00 2001 From: Colin Watson Date: Tue, 26 Sep 2017 11:27:04 -0400 Subject: [PATCH] A mysterious bug with Twisted plugins --- .../mysterious-bug-with-twisted-plugins.md | 131 ++++++++++++++++++ 1 file changed, 131 insertions(+) create mode 100644 content/mysterious-bug-with-twisted-plugins.md diff --git a/content/mysterious-bug-with-twisted-plugins.md b/content/mysterious-bug-with-twisted-plugins.md new file mode 100644 index 00000000..d24616fd --- /dev/null +++ b/content/mysterious-bug-with-twisted-plugins.md @@ -0,0 +1,131 @@ +Title: A mysterious bug with Twisted plugins +Slug: mysterious-bug-with-twisted-plugins +Date: 2017-09-26 11:20:14 -0400 +Category: launchpad +Tags: launchpad, planet-debian, planet-ubuntu + +I fixed a bug in Launchpad recently that led me deeper than I expected. + +Launchpad uses [Buildout](http://www.buildout.org/) as its build system for +Python packages, and it's served us well for many years. However, we're +using 1.7.1, which doesn't support ensuring that packages required using +setuptools' [setup_requires +keyword](https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords) +only ever come from the local index URL when one is specified; that's an +essential constraint we need to be able to impose so that our build system +isn't immediately sensitive to downtime or changes in PyPI. There are +various issues/PRs about this in Buildout (e.g. +[#238](https://github.com/buildout/buildout/pull/238)), but even if those +are fixed it'll almost certainly only be in Buildout v2, and upgrading to +that is its own kettle of fish for other reasons. All this is a serious +problem for us because newer versions of many of our vital dependencies +([Twisted](http://twistedmatrix.com/) and +[testtools](https://pypi.python.org/pypi/testtools), to name but two) use +`setup_requires` to pull in [pbr](https://pypi.python.org/pypi/pbr), and so +we've been stuck on old versions for some time; this is part of why +Launchpad doesn't yet support newer SSH key types, for instance. This +situation obviously isn't sustainable. + +To deal with this, I've been working for some time on switching to +[virtualenv](https://virtualenv.pypa.io/en/stable/) and +[pip](https://pip.pypa.io/en/stable/). This is harder than you might think: +Launchpad is a long-lived and complicated project, and it had quite a number +of explicit and implicit dependencies on Buildout's configuration and +behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has +helped a lot (12.04's baseline virtualenv and pip have some deficiencies +that would have required a more complicated bootstrapping procedure). I've +dealt with most of these: for example, I had to reorganise a lot of our +helper scripts +([1](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-python-easy/+merge/314976), +[2](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-shell/+merge/314973), +[3](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-bin-test/+merge/323743)), +but there are still a few more things to go. + +One remaining problem was that our Buildout configuration relied on building +several different environments with different Python paths for various +things. While this would technically be possible by way of building +multiple virtualenvs, this would inflate our build time even further (we're +already going to have to cope with some slowdown as a result of using +virtualenv, because the build system now has to do a lot more than +constructing a glorified link farm to a bunch of cached eggs), and it seems +like unnecessary complexity. The obvious thing to do seemed to be to +collapse these into a single environment, since there was no obvious reason +why it should actually matter if +[txpkgupload](https://pypi.python.org/pypi/txpkgupload) and +[txlongpoll](https://pypi.python.org/pypi/txlongpoll) were carefully kept +off the path when running most of Launchpad: so [I did +that](https://code.launchpad.net/~cjwatson/launchpad/simplify-buildout-recipes/+merge/330159). + +Then our build system [got very +sad](http://lpbuildbot.canonical.com/builders/lp-devel-precise/builds/1582/steps/shell_9/logs/summary). + +Hmm, I thought. To keep our test times somewhat manageable, we run them in +parallel across 20 containers, and we randomise the order in which they run +to try to shake out test isolation bugs. It's not completely unknown for +there to be some oddities resulting from that. So I ran it again. [Nope, +but slightly differently sad this +time](http://lpbuildbot.canonical.com/builders/lp-devel-precise/builds/1583/steps/shell_9/logs/summary). +Furthermore, I couldn't reproduce these failures locally no matter how hard +I tried. Oh dear. This was obviously not going to be a good day. + +In fact I spent a while on various different guesswork-based approaches. I +found [bug 571334](https://bugs.launchpad.net/ampoule/+bug/571334) in +Ampoule, an AMP-based process pool implementation that we use for some job +runners, and proposed a +[fix](https://code.launchpad.net/~cjwatson/ampoule/process-error-not-ready/+merge/330848) +for that, but cherry-picking that fix into Launchpad didn't help matters. I +tried backing out subsets of my changes and determined that if both +`txlongpoll` and `txpkgupload` were absent from the Python module path in +the context of the tests in question then everything was fine. I tried +running `strace` locally and staring at the output for some time in the hope +of enlightenment: that reminded me that the two packages in question install +modules under `twisted.plugins`, which did at least establish a reason they +might affect the environment that was more plausible than magic, but nothing +much more specific than that. + +On Friday I was fiddling about with this again and trying to insert some +more debugging when I noticed some interesting behaviour around [plugin +caching](https://twistedmatrix.com/documents/current/core/howto/plugin.html#plugin-caching). +If I caused the `txpkgupload` plugin to raise an exception when loaded, the +Twisted plugin system would remove its `dropin.cache` (because it was stale) +and not create a new one (because there was now no content to put in it). +After that, running the relevant tests would fail as I'd seen in our +buildbot. Aha! This meant that I could also reproduce it by doing an even +cleaner build than I'd previously tried to do, by removing the cached +`txpkgupload` and `txlongpoll` eggs and allowing the build system to +recreate them. When they were recreated, they didn't contain +`dropin.cache`, instead allowing that to be created on first use. + +Based on this clue I was able to get to the answer relatively quickly. +Ampoule has a specialised bootstrapping sequence for its worker processes +that starts by doing this: + + :::python + from twisted.application import reactors + reactors.installReactor(reactor) + +Now, `twisted.application.reactors.installReactor` calls +`twisted.plugin.getPlugins`, so the very start of this bootstrapping +sequence is going to involve loading all plugins found on the module path (I +assume it's possible to write a plugin that adds an alternative reactor +implementation). If `dropin.cache` is up to date, then it will just get the +information it needs from that; but if it isn't, it will go ahead and import +the plugin. If the plugin happens (as Twisted code often does) to run `from +twisted.internet import reactor` at some point while being imported, then +that will install the platform's default reactor, and *then* +`twisted.application.reactors.installReactor` will raise +`ReactorAlreadyInstalledError`. Since Ampoule turns this into an info-level +log message for some reason, and the tests in question only passed through +error-level messages or higher, this meant that all we could see was that a +worker process had exited non-zero but not why. + +The Twisted documentation +[recommends](https://twistedmatrix.com/documents/current/core/howto/plugin.html#plugin-caching) +generating the plugin cache at build time for other reasons, but we weren't +doing that. [Fixing +that](https://code.launchpad.net/~cjwatson/launchpad/build-twisted-plugin-cache/+merge/331240) +makes everything work again. + +There are still a few more things needed to get us onto pip, but we're now +pretty close. After that we can finally start bringing our dependencies up +to date. -- 2.30.2