Hacking on grub2

Various people observed in a long thread on debian-devel that the grub2 package was in a bit of a mess in terms of its release-critical bug count, and Jordi and Stefano both got in touch with me directly to gently point out that I probably ought to be doing something about it as one of the co-maintainers.

Actually, I don’t think grub2 was in quite as bad a state as its 18 RC bugs suggested. Of course every boot loader failure is critical to the person affected by it, not to mention that GRUB 2 offers more complex functionality than any other boot loader (e.g. LVM and RAID), and so it tends to accumulate RC bugs at rather a high rate. That said, we’d been neglecting its bug list for some time; Robert and Felix have both been taking some time off, Jordi mostly only cared about PowerPC and can’t do that any more due to hardware failure, and I hadn’t been able to pick up the slack.

Most of my projects at work for the next while involve GRUB in one way or another, so I decided it was a perfectly reasonable use of work time to do something about this; I was going to need fully up-to-date snapshots anyway, and practically all the Debian grub2 bugs affect Ubuntu too. Thus, with the exception of some other little things like releasing the first Maverick alpha, I’ve spent pretty much the last week and a half solidly trying to get the grub2 package back into shape, with four uploads so far.

The RC issues that remain are:

  • upgrade-from-grub-legacy problems (#547944, #550477):

    I think this has just been traditionally undertested. I’m setting up a KVM image now with GRUB Legacy which I can snapshot just before and after running upgrade-from-grub-legacy, and I should be able to unpick the bugs this way.

  • LVM snapshots break GRUB’s LVM module (#574863):

    Sean has been working on this and seems to be nearly there. Yay.

  • RAID metadata version 1.x not supported (#492897):

    This became rather more of an issue recently since mdadm switched its default away from the old 0.90 format, which GRUB understood. Felix put together a branch implementing the hard parts of this a while back, and I’ve been trying to finish it off. The hard bit is dealing with device naming, especially as the new-format and rather more useful names under /dev/md/ don’t show up during d-i after creating RAID volumes; I think this is because we always create them as /dev/md0 etc. It’s looking tractable, though.

  • Another odd problem probing RAID (#548648):

    Not sure about this one, and I’ll need to work with Josip on it as soon as I get a chance.

  • Stable device naming (#554790) and consequential problems due to grub-install not being properly run (#557425 and many other sub-RC bugs):

    Ubuntu’s been carrying a patch to rearrange device presentation in the postinst, which Robert OKed in principle ages ago and so I’ve been intending to merge it for a while, but there are a few known problems with it that I need to fix first. One known unfixable problem is that it will have to ask some people which devices they want GRUB to be installed on, even if they’d answered that question before: this will be one-time, and it’s because it recorded the answer using unstable device names and so has in some sense forgotten. Simple cases (e.g. single-disk) can be handled without needing to ask again, though.

  • Alignment errors on SPARC (#560823):

    I have no idea what’s going on here, I’m afraid. I’ll try to trace it, but may have to downgrade it at some point since after all we don’t install GRUB by default on SPARC yet.

  • Fonts not shown in gfxmenu (#564844):

    Apparently fixed upstream, but I couldn’t find the responsible commit so I want to make sure I can get gfxmenu working before closing this.

  • Sensitivity to out-of-date device.map files (#575076 and other sub-RC bugs):

    We’re trying to get rid of device.map in general. It was fine in the 1990s but it’s hopeless now. Unfortunately there are still a small number of problems with running entirely without one, and one of my patches to help is controversial upstream, so we probably won’t get to that for squeeze. In the meantime we’ll probably just need some extra sanity-checking and robustness in the event that there’s an incorrect or out-of-date device.map lying around, which we may just be able to do in the maintainer scripts or something if necessary.

  • Seriously weird failures to load initramfs (#582342):

    If anyone can produce a reproduction recipe for this, that would really help me out. There are too many reports to discount as user error, but I haven’t seen this myself yet.

  • Build failure on sparc (unfiled):

    We’ve been discussing this upstream, but for the time being I’m just going to stop building grub-emu on sparc as a workaround.

If we can fix that lot, or even just the ones that are reasonably well-understood, I think we’ll be in reasonable shape. I’d also like to make grub-mkconfig a bit more robust in the event that the root filesystem isn’t one that GRUB understands (#561855, #562672), and I’d quite like to write some more documentation.
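To give a flavour of the kind of sanity-checking I have in mind for out-of-date device.map files, here is a minimal sketch in Python. It is not what the maintainer scripts actually do, and the function names are made up for illustration. It assumes the usual device.map layout of one "(hd0) /dev/sda" style entry per line, with lines starting with # treated as comments, and it only flags entries whose device node has vanished; it cannot tell when a name such as /dev/sda still exists but now refers to a different disk.

```python
import os
import re

# One entry per line, e.g. "(hd0) /dev/sda". Hypothetical helper names throughout.
ENTRY_RE = re.compile(r"^\((?P<drive>[^)]+)\)\s+(?P<device>\S+)$")


def parse_device_map(text):
    """Return a list of (grub_drive, os_device) pairs from device.map text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        match = ENTRY_RE.match(line)
        # Lines that do not look like an entry are silently ignored here;
        # a real check would probably want to warn about them.
        if match:
            entries.append((match.group("drive"), match.group("device")))
    return entries


def stale_entries(entries, exists=os.path.exists):
    """Return the entries whose OS device node does not exist.

    The `exists` parameter is only there so the check can be exercised
    without touching the real /dev.
    """
    return [(drive, device) for drive, device in entries if not exists(device)]
```

Feeding the contents of /boot/grub/device.map through parse_device_map() and then stale_entries() gives the list of entries that deserve a warning before anything tries to trust them.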

On the upside, progress has been good. We have multiple terminal support thanks to a new upstream snapshot (#506707), update-grub runs much faster (#508834, #574088), we have DM-RAID support with a following wind (#579919), the new scheme with symlinks under /dev/mapper/ works (#550704), we have basic support for a btrfs root filesystem (/) as long as you have something GRUB understands properly on /boot (#540786), we have full info documentation covering all the user-adjustable settings in /etc/default/grub, and there are a host of other smaller fixes. I’m hoping we can keep this up.

If you’d like to help, contact me, especially if there’s something particular that isn’t being handled that you think you could work on. GRUB 2 is actually quite a pleasant codebase to work on once you get used to its layout; it’s certainly much easier to fix bugs in than GRUB Legacy ever was, as far as I’m concerned. Thanks to tools like grub-probe and grub-fstest, it’s very often possible to fix problems without needing to reboot for anything other than a final sanity check (although KVM certainly helps), and you can often debug very substantial bits of the boot loader - the bits that actually go wrong - using standard tools such as strace and gdb. Upstream is helpful and I’ve been able to get many of the problems above fixed directly there. If you have a sound knowledge of C and a decent level of understanding of the environment a boot loader needs to operate in - or for that matter specialist knowledge of interesting device types - then you should be able to find something to do.