GRUB 2 boot problems

(This is partly a repost of material I’ve posted to bug reports and to debian-release, put together with some more detail for a wider audience.)

You could be forgiven for looking at the RC bug activity on grub2 over the last couple of days and thinking that it’s all gone to hell in a handbasket with recent uploads. In fact, aside from an interesting case which turned out to be due to botched handling of the GRUB Legacy to GRUB 2 chainloading setup (which prompted me to fix three other RC bugs along the way), all the recent problems people have been having have been duplicates of a handful of bugs which have existed essentially forever.

When GRUB boots, its boot sector first loads its “core image”, which is usually embedded in the gap between the boot sector and the first partition on the same disk as the boot sector. This core image then figures out where to find /boot/grub, and loads grub.cfg from it as well as more GRUB modules.

The thing that tends to go wrong here is that the core image must be from the same version of GRUB as any modules it loads. /boot/grub/*.mod are updated only by grub-install, so this normally works OK. However, for various reasons (deliberate or accidental) some people install GRUB to multiple disks. In this case, grub-install might update /boot/grub/*.mod along with the core image on one disk, but your BIOS might actually be booting from a different disk. The effect is that you’ll have an old core image and new modules, which will probably blow up in any number of possible ways.

Quite often, this problem lies dormant for a while because GRUB happens not to change in a way that causes incompatibility between the core image and modules, but then we get massive spikes of bug reports any time the interface does change. Since these bugs sometimes bite people upgrading from testing to unstable, they get interpreted as regressions from the version in testing even though that isn’t strictly true (but it tends not to be very productive to argue this line; after all, people’s computers suddenly don’t boot!). Any problem that causes the core image to be installed to a disk other than the one actually being booted from, or not to be installed at all, will show up this way sooner or later.
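To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch of the reasoning above, not anything GRUB itself does: the `Disk` class, the function names, and the version labels "old" and "new" are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    """Hypothetical stand-in for a disk that may carry an embedded core image."""
    name: str
    core_version: Optional[str]  # None means no GRUB core image on this disk


def simulate_grub_install(disks: List[Disk], target: str, new_version: str) -> str:
    """Model grub-install: refresh the core image on ONE disk only.

    Returns the version that the shared /boot/grub/*.mod files now have.
    """
    for disk in disks:
        if disk.name == target:
            disk.core_version = new_version
    return new_version


def stale_disks(disks: List[Disk], modules_version: str) -> List[str]:
    """Names of disks whose core image no longer matches the modules."""
    return [
        d.name
        for d in disks
        if d.core_version is not None and d.core_version != modules_version
    ]


# Both disks start out in sync; grub-install is then run against sdb only.
disks = [Disk("sda", "old"), Disk("sdb", "old")]
modules = simulate_grub_install(disks, "sdb", "new")
print(stale_disks(disks, modules))  # ['sda']
```

If the BIOS happens to boot from sda here, it loads the old core image, which then tries to load the new modules. Nothing goes wrong until the interface between the two changes.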

On 2010-06-10, there was a substantial upstream change to the handling of list iterators (to reduce core image size and make code clearer and faster) which introduced an incompatibility between old core images and newer modules. This caused a bunch of dormant problems to flare up again, and so there was a flood of reports of booting problems with 1.98+20100614-1 and newer, often described as “the unaligned pointer bug” due to how it happened to manifest this time round. In previous cases, GRUB reported undefined symbols on boot, but it’s all essentially the same problem even though there are different symptoms.

The confusing bit when handling bug reports is that not only are there different symptoms with the same cause, but there are also multiple causes for the same symptom! This takes a certain amount of untangling, especially when lots of people have thought “ooh, that bug looks a bit like mine” and jumped in with their own comments. Working through this was a worthwhile exercise, as it turned up an entirely new cause for a problem I thought was fairly well-understood (thanks to debugging assistance from Sedat Dilek).

If you had set up GRUB 2 to be chainloaded from GRUB Legacy (which happens automatically on upgrade from the latter to the former), never got round to running upgrade-from-grub-legacy once you confirmed it worked, and then later ran grub-install by hand for one reason or another, then the core image you installed by hand would never be updated and would eventually fall over the next time the core/modules interface changed.

Fixing future cases of this was easy enough, but fixing existing cases involved figuring out how to detect whether an installed GRUB boot sector came from GRUB Legacy or GRUB 2, which isn’t as easy as you might think. Fortunately, it turns out that there are a limited number of jump offsets that have ever been used in the second byte of the boot sector, and none of the GRUB 2 values clash with the only value ever used in GRUB Legacy. So, if you still have /boot/grub/stage2 et al on upgrade, we scan all disks for a GRUB 2 boot sector, and if we find one then we offer to complete the upgrade to GRUB 2.
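The jump-offset check can be sketched roughly as follows. This is an illustration in Python rather than the code in the grub-pc maintainer scripts, and it leans on several assumptions that are not stated above: the specific offset values in the two sets, the idea that both boot sectors contain the literal string "GRUB", and the function name itself. Treat the constants as placeholders to be checked against real boot sectors.

```python
from typing import Optional

JMP_SHORT = 0xEB  # x86 short-jump opcode; its one-byte operand is the jump offset

# ASSUMED values, for illustration only; the text above does not list them.
LEGACY_JMP_OFFSETS = frozenset({0x48})
GRUB2_JMP_OFFSETS = frozenset({0x47, 0x4B, 0x4C, 0x63})


def classify_boot_sector(sector: bytes) -> Optional[str]:
    """Return "grub2", "legacy", or None for a 512-byte boot sector."""
    if len(sector) < 512:
        return None
    sector = sector[:512]
    # Assumption: any GRUB boot sector embeds the string "GRUB" somewhere.
    if b"GRUB" not in sector:
        return None
    # The first byte must be a short jump; the second byte is its offset.
    if sector[0] != JMP_SHORT:
        return None
    offset = sector[1]
    if offset in GRUB2_JMP_OFFSETS:
        return "grub2"
    if offset in LEGACY_JMP_OFFSETS:
        return "legacy"
    return None


def read_boot_sector(path: str) -> bytes:
    """Read the first 512 bytes of a disk or disk image."""
    with open(path, "rb") as f:
        return f.read(512)
```

Something like `classify_boot_sector(read_boot_sector("/dev/sda"))` would then be run for each disk (as root). The check only works because the two sets of offsets are disjoint, which is exactly the property described above.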

Unless anything new shows up, that just leaves the problems that were already understood. Today, I posted a patch to generate stable device names by default. If this is accepted, then we can do something or other to fix existing setups up on upgrade, switch over to /dev/disk/by-id names in grub-pc/install_devices at the same time, and that should take care of the vast majority of this kind of upgrade bug. I think at that point it should be feasible to get a new version into testing, and we should be down from 18 RC bugs towards the end of last month to around 6. We can then start attacking things like the lack of support for mdadm 1.x metadata.
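For anyone wondering what a stable name looks like in practice: the entries under /dev/disk/by-id are symlinks maintained by udev that point at whatever kernel name (/dev/sda, /dev/sdb, ...) the disk has on this boot. A minimal Python sketch of mapping a kernel name to its stable aliases might look like this; the function name and its parameters are hypothetical, and this is not the patch mentioned above.

```python
import os
from typing import List


def stable_names(device: str, by_id_dir: str = "/dev/disk/by-id") -> List[str]:
    """Return the by-id symlinks that currently resolve to `device`."""
    target = os.path.realpath(device)
    names = []
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        # Each entry is a symlink; follow it and compare with the device node.
        if os.path.realpath(link) == target:
            names.append(link)
    return names
```

Calling `stable_names("/dev/sda")` on a typical system returns one or more paths under /dev/disk/by-id. Recording one of those instead of /dev/sda means the setting still points at the same physical disk even if the kernel enumerates disks in a different order next time.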

Since my last blog entry on GRUB 2, improvements have included:

  • Substantial work on info grub, with, among other things, new sections on /etc/default/grub and on configuring authentication.
  • A workaround for GRUB’s inability to probe dm-crypt devices, thanks to Marc Haber.
  • Several build fixes for architectures I wasn’t testing, and a fix for broken nested partition handling on Debian GNU/kFreeBSD. I’m now testing GNU/kFreeBSD locally.
  • Rather less cruft in fs.lst, partmap.lst, and video.lst, which should speed up booting a bit by e.g. avoiding unnecessary filesystem probing.
  • upgrade-from-grub-legacy actually now installs GRUB 2 to the boot sector (!).
  • A prompt for confirmation if grub-pc/install_devices is left empty.

The next upstream snapshot will bring several improvements to EFI video support, mainly thanks to Vladimir Serbinenko. I’ve been working on making grub-install actually work on UEFI systems as one of my goals for the next Ubuntu release, and I hope to get this landed in the not-too-distant future.