Mon, 21 Jun 2010
(This is partly a repost of material I've posted to bug reports and to debian-release, put together with some more detail for a wider audience.)
You could be forgiven for looking at the RC bug activity on grub2 over the last couple of days and thinking that it's all gone to hell in a handbasket with recent uploads. In fact, aside from an interesting case which turned out to be due to botched handling of the GRUB Legacy to GRUB 2 chainloading setup (which prompted me to fix three other RC bugs along the way), all the recent problems people have been having have been duplicates of one of these bugs which have existed essentially forever:
When GRUB boots, its boot sector first loads its "core image", which is usually embedded in the gap between the boot sector and the first partition on the same disk as the boot sector. This core image then figures out where to find /boot/grub, and loads grub.cfg from it as well as more GRUB modules.
The thing that tends to go wrong here is that the core image must be from the same version of GRUB as any modules it loads. /boot/grub/*.mod are updated only by grub-install, so this normally works OK. However, for various reasons (deliberate or accidental) some people install GRUB to multiple disks. In this case, grub-install might update /boot/grub/*.mod along with the core image on one disk, but your BIOS might actually be booting from a different disk. The effect of this will be that you'll have an old core image and new modules, which will probably blow up in any number of possible ways. Quite often, this problem lies dormant for a while because GRUB happens not to change in a way that causes incompatibility between the core image and modules, but then we get massive spikes of bug reports any time the interface does change. Since these bugs sometimes bite people upgrading from testing to unstable, they get interpreted as regressions from the version in testing even though that isn't strictly true (but it tends not to be very productive to argue this line; after all, people's computers suddenly don't boot!). Any problem that causes the core image to be installed to a disk other than the one actually being booted from, or not to be installed at all, will show up this way sooner or later.
On 2010-06-10, there was a substantial upstream change to the handling of list iterators (to reduce core image size and make code clearer and faster) which introduced an incompatibility between old core images and newer modules. This caused a bunch of dormant problems to flare up again, and so there was a flood of reports of booting problems with 1.98+20100614-1 and newer, often described as "the unaligned pointer bug" due to how it happened to manifest this time round. In previous cases, GRUB reported undefined symbols on boot, but it's all essentially the same problem even though there are different symptoms.
The confusing bit when handling bug reports is that not only are there different symptoms with the same cause, but there are also multiple causes for the same symptom! This takes a certain amount of untangling, especially when lots of people have thought "ooh, that bug looks a bit like mine" and jumped in with their own comments. Working through this was a worthwhile exercise, as it came up with an entirely new cause for a problem I thought was fairly well-understood (thanks to debugging assistance from Sedat Dilek). If you had set up GRUB 2 to be automatically chainloaded from GRUB Legacy (which happens automatically on upgrade from the latter to the former), never got round to running
Unless anything new shows up, that just leaves the problems that were already understood. Today, I posted a patch to generate stable device names in device.map by default. If this is accepted, then we can do something or other to fix up device.map on upgrade, switch over to
Since my last blog entry on GRUB 2, improvements have included:
The next upstream snapshot will bring several improvements to EFI video support, mainly thanks to Vladimir Serbinenko. I've been working on making