GRUB 2 boot problems
(This is partly a repost of material I’ve posted to bug reports and to debian-release, put together with some more detail for a wider audience.)
You could be forgiven for looking at the RC bug activity on grub2 over the last couple of days and thinking that it’s all gone to hell in a handbasket with recent uploads. In fact, aside from an interesting case which turned out to be due to botched handling of the GRUB Legacy to GRUB 2 chainloading setup (which prompted me to fix three other RC bugs along the way), all the recent problems people have been having have been duplicates of one of these bugs which have existed essentially forever:
- #554790 - grub-pc/install_devices uses unstable device names
- #583271 - device.map uses unstable device names
When GRUB boots, its boot sector first loads its “core image”, which is usually embedded in the gap between the boot sector and the first partition on the same disk as the boot sector. This core image then figures out where to find /boot/grub, and loads grub.cfg from it as well as more GRUB modules.
The thing that tends to go wrong here is that the core image must be from
the same version of GRUB as any modules it loads. /boot/grub/*.mod are
updated only by grub-install, so this normally works OK. However, for
various reasons (deliberate or accidental) some people install GRUB to
multiple disks. In this case, grub-install might update /boot/grub/*.mod
along with the core image on one disk, but your BIOS might actually be
booting from a different disk. The effect of this will be that you’ll have
an old core image and new modules, which will probably blow up in any number
of possible ways. Quite often, this problem lies dormant for a while
because GRUB happens not to change in a way that causes incompatibility
between the core image and modules, but then we get massive spikes of bug
reports any time the interface does change. Since these bugs sometimes bite
people upgrading from testing to unstable, they get interpreted as
regressions from the version in testing even though that isn’t strictly true
(but it tends not to be very productive to argue this line; after all,
people’s computers suddenly don’t boot!). Any problem that causes the core
image to be installed to a disk other than the one actually being booted
from, or not to be installed at all, will show up this way sooner or later.
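To see whether a machine is exposed to this, it helps to know which disks carry a GRUB boot sector at all. The following is a minimal sketch, not what the packaging does: it relies on the assumption that a GRUB boot sector contains the ASCII string "GRUB" somewhere in its first 512 bytes, and the function names are made up for illustration.

```python
from typing import Dict, List

SECTOR_SIZE = 512


def read_first_sector(device: str) -> bytes:
    """Read the first sector of a block device (normally needs root)."""
    with open(device, "rb") as f:
        return f.read(SECTOR_SIZE)


def has_grub_boot_sector(sector: bytes) -> bool:
    """Heuristic (assumption): a GRUB boot sector embeds the string 'GRUB'."""
    return len(sector) >= SECTOR_SIZE and b"GRUB" in sector[:SECTOR_SIZE]


def disks_with_grub(sectors: Dict[str, bytes]) -> List[str]:
    """Return the names of all disks whose first sector looks like GRUB."""
    return sorted(name for name, data in sectors.items()
                  if has_grub_boot_sector(data))
```

If you feed it something like `{dev: read_first_sector(dev) for dev in ("/dev/sda", "/dev/sdb")}` and more than one disk comes back, it is worth checking that all of them are actually being kept up to date, since any one of them might be the disk your BIOS boots from.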
On 2010-06-10, there was a substantial upstream change to the handling of list iterators (to reduce core image size and make code clearer and faster) which introduced an incompatibility between old core images and newer modules. This caused a bunch of dormant problems to flare up again, and so there was a flood of reports of booting problems with 1.98+20100614-1 and newer, often described as “the unaligned pointer bug” due to how it happened to manifest this time round. In previous cases, GRUB reported undefined symbols on boot, but it’s all essentially the same problem even though there are different symptoms.
The confusing bit when handling bug reports is that not only are there
different symptoms with the same cause, but there are also multiple causes
for the same symptom! This takes a certain amount of untangling, especially
when lots of people have thought “ooh, that bug looks a bit like mine” and
jumped in with their own comments. Working through this was a worthwhile
exercise, as it came up with an entirely new cause for a problem I thought
was fairly well-understood (thanks to debugging assistance from Sedat
Dilek). If you had set up GRUB 2 to be automatically chainloaded from GRUB
Legacy (which happens automatically on upgrade from the latter to the
former), never got round to running upgrade-from-grub-legacy once you
confirmed it worked, and then later ran grub-install by hand for one
reason or another, then the core image you installed by hand would never be
updated and would eventually fall over the
next time the core/modules interface changed. Fixing future cases of this
was easy enough, but fixing existing cases involved figuring out how to
detect whether an installed GRUB boot sector came from GRUB Legacy or GRUB
2, which isn’t as easy as you might think. Fortunately, it turns out that
there are a limited number of jump offsets that have ever been used in the
second byte of the boot sector, and none of the GRUB 2 values clash with the
only value ever used in GRUB Legacy; so, if you still have
/boot/grub/stage2 et al on upgrade, we scan all disks for a GRUB 2 boot
sector, and if we find one then we offer to complete the upgrade to GRUB 2.
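The boot sector check can be sketched roughly as follows. This is an illustration of the idea rather than the actual maintainer script. It assumes the sector starts with an x86 short JMP (opcode 0xEB) whose operand is the "jump offset" in the second byte, and the specific offset values in the constants are assumptions for illustration only, since the exact values are not listed here; verify them against the grub-pc maintainer scripts before relying on them.

```python
from typing import Optional

SECTOR_SIZE = 512
JMP_SHORT = 0xEB  # x86 "JMP rel8" opcode

# Assumed values, for illustration only; not taken from this post.
LEGACY_OFFSET = 0x48
GRUB2_OFFSETS = frozenset({0x47, 0x4B, 0x4C, 0x63})


def classify_boot_sector(sector: bytes) -> Optional[str]:
    """Return 'legacy', 'grub2', or None if the sector is not recognised."""
    if len(sector) < SECTOR_SIZE or b"GRUB" not in sector[:SECTOR_SIZE]:
        return None  # no GRUB of any kind, as far as we can tell
    if sector[0] != JMP_SHORT:
        return None  # does not start with the expected short jump
    offset = sector[1]  # the second byte: the jump offset
    if offset == LEGACY_OFFSET:
        return "legacy"
    if offset in GRUB2_OFFSETS:
        return "grub2"
    return None
```

Returning None for unknown offsets is deliberate: guessing wrong here would mean offering to overwrite a boot sector that belongs to something else.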
Unless anything new shows up, that just leaves the problems that were
already understood. Today, I posted a patch to generate stable device
names in device.map by default.
If this is accepted, then we can do something or other to fix up device.map
on upgrade, switch over to /dev/disk/by-id names in grub-pc/install_devices
at the same time, and that should take care of the
vast majority of this kind of upgrade bug. I think at that point it should
be feasible to get a new version into testing, and we should be down from 18
RC bugs towards the end of last month to around 6. We can then start
attacking things like the lack of support for mdadm 1.x metadata.
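For anyone wondering what a "stable" name for their disk looks like, here is a small sketch that maps each device node to the /dev/disk/by-id symlinks pointing at it. It only illustrates the naming scheme and is not the patch itself; the function name and the by_id_dir parameter are hypothetical, the latter existing mainly so the code can be tried against a scratch directory.

```python
import os
from typing import Dict, List


def stable_names(by_id_dir: str = "/dev/disk/by-id") -> Dict[str, List[str]]:
    """Map each resolved device path to the by-id symlinks that point at it."""
    mapping: Dict[str, List[str]] = {}
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        target = os.path.realpath(link)  # e.g. /dev/sda
        mapping.setdefault(target, []).append(link)
    return mapping
```

Looking up `stable_names().get("/dev/sda", [])` gives the names that should keep referring to the same physical disk even if the kernel decides to call it /dev/sdb on the next boot.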
Since my last blog entry on GRUB 2, improvements have included:
- Substantial work on info grub, with, among other things, new sections on /etc/default/grub and on configuring authentication.
- A workaround for GRUB’s inability to probe dm-crypt devices, thanks to Marc Haber.
- Several build fixes for architectures I wasn’t testing, and a fix for broken nested partition handling on Debian GNU/kFreeBSD. I’m now testing GNU/kFreeBSD locally.
- Rather less cruft in fs.lst, partmap.lst, and video.lst, which should speed up booting a bit by e.g. avoiding unnecessary filesystem probing.
- upgrade-from-grub-legacy actually now installs GRUB 2 to the boot sector (!).
- Ask for confirmation if grub-pc/install_devices is left empty.
The next upstream snapshot will bring several improvements to EFI video
support, mainly thanks to Vladimir Serbinenko. I’ve been working on making
grub-install actually work on UEFI systems as one of my goals for the next
Ubuntu release, and I hope to get this landed in the not-too-distant future.