Colin Watson's blog
Subscribe to a syndicated feed of my blog.
Powered by Blosxom
Windows applications making GRUB 2 unbootable
If you find that running Windows makes a GRUB 2-based system unbootable (Debian bug, Ubuntu bug), then I'd like to hear from you. This is a bug in which some proprietary Windows-based software overwrites particular sectors in the gap between the master boot record and the first partition, sometimes called the "embedding area". GRUB Legacy and GRUB 2 both normally use this part of the disk to store one of their key components: GRUB Legacy calls this component Stage 1.5, while GRUB 2 calls it the core image (comparison). However, Stage 1.5 is less useful than the core image (for example, the latter provides a rescue shell which can be used to recover from some problems), and is therefore rather smaller: somewhere around 10KB vs. 24KB for the common case of ext on plain block devices. It seems that the Windows-based software writes to a sector which is after the end of Stage 1.5, but before the end of the core image. This is why the problem appears to be new with GRUB 2.
At least some occurrences of this are with software which writes a signature to the embedding area which hangs around even after uninstallation (even with one of those tools that tracks everything the installation process did and reverses it, I gather), so that you cannot uninstall and reinstall the application to defeat a trial period. This seems like a fine example of an antifeature, especially given its destructive consequences for free software, and is in general a poor piece of engineering; what happens if multiple such programs want to use the same sector, I wonder? They clearly aren't doing much checking that the sector is unused, not that that's really possible anyway. While I do not normally think that GRUB should go to any great lengths to accommodate proprietary software, this is a case where we need to defend ourselves against the predatory practices of some companies making us look bad: a relatively small number of people do enough detective work to realise that it's the fault of a particular Windows application, but many more simply blame our operating system because it won't start any more.
I believe that it may be possible to assemble a collection of signatures of such software, and arrange to avoid the disk sectors they have stolen. Indeed, I have a first draft of the necessary code. This is not a particularly pleasant solution, but it seems to be the most practical way around the problem; I'm hoping that several of the programs at fault are using common "licence manager" code or something like that, so that we can address most of the problems with a relatively small number of signatures. In order to do this, I need to hear from as many people as possible who are affected by this problem.
If you suffer from this problem, then please do the following:
- Save the output of
fdisk -lu to a file. In this output, take note of the start sector of the first partition (usually 63, but might also be 2048 on recent installations, or occasionally something else). If this is something other than 63, then replace 63 in the following items with your number.
- Save the contents of the embedding area to a file (replace
/dev/sda with your disk device if it's something else):
dd if=/dev/sda of=sda.1 count=63
- Do whatever you do to make GRUB unbootable (presumably starting Windows), then boot into a recovery environment. Before you reinstall GRUB, save the new contents of the embedding area to a different file:
dd if=/dev/sda of=sda.2 count=63
- Follow up to either the Debian or the Ubuntu bug with these three files (the output of
fdisk -lu, and the embedding area before and after making GRUB unbootable.
I hope that this will help me to assemble enough information to fix this bug at least for most people, and of course if you provide this information then I can make sure to fix your particular version of this problem. Thanks in advance!
debhelper statistics, redux
Apropos of my previous post, I see that dh has now overtaken CDBS as the most popular rules helper system of its kind in Debian unstable, and shows no particular sign of slowing its rate of uptake any time soon. The resolution of the graph is such that you can't see it yet, but dh drew dead level with CDBS on Thursday, and today 3836 packages are using dh as opposed to 3823 using CDBS.
GRUB 2: With luck ...
... this version, or something not too far away from it, might actually stand a chance of getting into testing.
I've just uploaded grub2 1.98+20100702-1. The most significant set of changes in this release is that it switches
/boot/grub/device.map and the
grub-pc/install_devices debconf question over to stable device names under
/dev/disk/by-id (on Linux kernels). The code implementing this is reasonably careful, and it should make it quite difficult for people to accidentally fail to upgrade their installed GRUB core image; I explained the problems that tends to cause in the previous post in this series. There will probably be a few small glitches I need to clear up, but I've given this much more extensive testing than usual so I hope I won't break too many people's computers (again).
I did this work first in Ubuntu as one of my major goals for 10.04 LTS, which exposed a few problems that I wanted to fix before inflicting it on Debian as well (fixes for those are now under testing for 10.04.1). Most significantly, I felt it was necessary to start offering partitions in the select list for
grub-pc/install_devices, but I went a bit overboard and offered all partitions in a giant list. This seemed like a good idea at the time, but it tended to confuse people into just selecting everything in the list, which in particular tended to make Windows unbootable! So I dialled that back a bit, and in the version I just merged it will only offer the partitions mounted on
/boot/grub (de-duplicating if necessary). This seems like a reasonable compromise between confusing people too much and forcing them to install only to MBRs.
My next priority will be making whatever fixes are necessary to get this version into testing, since the problems with
/dev/mapper symlinks in testing aren't getting any less urgent, and this is finally a version that shouldn't break for most people due to the kernel's switch to libata. I expect that I'll try to get mdadm 1.x metadata sorted out immediately after that.
Other improvements since my last entry have included:
- Further documentation work. Thanks to Vladimir Serbinenko (and to Jordan Uggla for hosting it temporarily), there's now an HTML version of the GRUB manual from trunk online, which includes new sections on embedded configuration files, the various GRUB image files,
device.map, and (shortly) a summary of changes from GRUB Legacy.
- Video improvements: among other things, UEFI systems whose firmware uses the Graphics Output Protocol should now work rather better, and GRUB now includes specific support for some cards often used with minimal firmware support under emulation.
- A fix to handle large memory maps exposed by some UEFI firmware.
- Automatic configuration support for Fedora 13. You may need os-prober 1.39 from unstable as well.
- Automatic configuration support for Linux on Xen.
- Skip LVM snapshots rather than failing when they're present.
GRUB 2 boot problems
(This is partly a repost of material I've posted to bug reports and to debian-release, put together with some more detail for a wider audience.)
You could be forgiven for looking at the RC bug activity on grub2 over the last couple of days and thinking that it's all gone to hell in a handbasket with recent uploads. In fact, aside from an interesting case which turned out to be due to botched handling of the GRUB Legacy to GRUB 2 chainloading setup (which prompted me to fix three other RC bugs along the way), all the recent problems people have been having have been duplicates of one of these bugs which have existed essentially forever:
When GRUB boots, its boot sector first loads its "core image", which is usually embedded in the gap between the boot sector and the first partition on the same disk as the boot sector. This core image then figures out where to find /boot/grub, and loads grub.cfg from it as well as more GRUB modules.
The thing that tends to go wrong here is that the core image must be from the same version of GRUB as any modules it loads. /boot/grub/*.mod are updated only by grub-install, so this normally works OK. However, for various reasons (deliberate or accidental) some people install GRUB to multiple disks. In this case, grub-install might update /boot/grub/*.mod along with the core image on one disk, but your BIOS might actually be booting from a different disk. The effect of this will be that you'll have an old core image and new modules, which will probably blow up in any number of possible ways. Quite often, this problem lies dormant for a while because GRUB happens not to change in a way that causes incompatibility between the core image and modules, but then we get massive spikes of bug reports any time the interface does change. Since these bugs sometimes bite people upgrading from testing to unstable, they get interpreted as regressions from the version in testing even though that isn't strictly true (but it tends not to be very productive to argue this line; after all, people's computers suddenly don't boot!). Any problem that causes the core image to be installed to a disk other than the one actually being booted from, or not to be installed at all, will show up this way sooner or later.
On 2010-06-10, there was a substantial upstream change to the handling of list iterators (to reduce core image size and make code clearer and faster) which introduced an incompatibility between old core images and newer modules. This caused a bunch of dormant problems to flare up again, and so there was a flood of reports of booting problems with 1.98+20100614-1 and newer, often described as "the unaligned pointer bug" due to how it happened to manifest this time round. In previous cases, GRUB reported undefined symbols on boot, but it's all essentially the same problem even though there are different symptoms.
The confusing bit when handling bug reports is that not only are there different symptoms with the same cause, but there are also multiple causes for the same symptom! This takes a certain amount of untangling, especially when lots of people have thought "ooh, that bug looks a bit like mine" and jumped in with their own comments. Working through this was a worthwhile exercise, as it came up with an entirely new cause for a problem I thought was fairly well-understood (thanks to debugging assistance from Sedat Dilek). If you had set up GRUB 2 to be automatically chainloaded from GRUB Legacy (which happens automatically on upgrade from the latter to the former), never got round to running
upgrade-from-grub-legacy once you confirmed it worked, and then later ran
grub-install by hand for one reason or another, then the core image you installed by hand would never be updated and would eventually fall over the next time the core/modules interface changed. Fixing future cases of this was easy enough, but fixing existing cases involved figuring out how to detect whether an installed GRUB boot sector came from GRUB Legacy or GRUB 2, which isn't as easy as you might think. Fortunately, it turns out that there are a limited number of jump offsets that have ever been used in the second byte of the boot sector, and none of the GRUB 2 values clash with the only value ever used in GRUB Legacy; so, if you still have
/boot/grub/stage2 et al on upgrade, we scan all disks for a GRUB 2 boot sector, and if we find one then we offer to complete the upgrade to GRUB 2.
Unless anything new shows up, that just leaves the problems that were already understood. Today, I posted a patch to generate stable device names in device.map by default. If this is accepted, then we can do something or other to fix up device.map on upgrade, switch over to
/dev/disk/by-id names in
grub-pc/install_devices at the same time, and that should take care of the vast majority of this kind of upgrade bug. I think at that point it should be feasible to get a new version into testing, and we should be down from 18 RC bugs towards the end of last month to around 6. We can then start attacking things like the lack of support for mdadm 1.x metadata.
Since my last blog entry on GRUB 2, improvements have included:
- Substantial work on
info grub, with, among other things, new sections on
/etc/default/grub and on configuring authentication.
- A workaround for GRUB's inability to probe dm-crypt devices, thanks to Marc Haber.
- Several build fixes for architectures I wasn't testing, and a fix for broken nested partition handling on Debian GNU/kFreeBSD. I'm now testing GNU/kFreeBSD locally.
- Rather less cruft in
video.lst, which should speed up booting a bit by e.g. avoiding unnecessary filesystem probing.
upgrade-from-grub-legacy actually now installs GRUB 2 to the boot sector (!).
- Ask for confirmation if
grub-pc/install_devices is left empty.
The next upstream snapshot will bring several improvements to EFI video support, mainly thanks to Vladimir Serbinenko. I've been working on making
grub-install actually work on UEFI systems as one of my goals for the next Ubuntu release, and I hope to get this landed in the not-too-distant future.
Hacking on grub2
Various people observed in a long thread on debian-devel that the grub2 package was in a bit of a mess in terms of its release-critical bug count, and Jordi and Stefano both got in touch with me directly to gently point out that I probably ought to be doing something about it as one of the co-maintainers.
Actually, I don't think grub2 was in quite as bad a state as its 18 RC bugs suggested. Of course every boot loader failure is critical to the person affected by it, not to mention that GRUB 2 offers more complex functionality than any other boot loader (e.g. LVM and RAID), and so it tends to accumulate RC bugs at rather a high rate. That said, we'd been neglecting its bug list for some time; Robert and Felix have both been taking some time off, Jordi mostly only cared about PowerPC and can't do that any more due to hardware failure, and I hadn't been able to pick up the slack.
Most of my projects at work for the next while involve GRUB in one way or another, so I decided it was a perfectly reasonable use of work time to do something about this; I was going to need fully up-to-date snapshots anyway, and practically all the Debian grub2 bugs affect Ubuntu too. Thus, with the exception of some other little things like releasing the first Maverick alpha, I've spent pretty much the last week and a half solidly trying to get the grub2 package back into shape, with four uploads so far.
The RC issues that remain are:
upgrade-from-grub-legacy problems (#547944, #550477):
I think this has just been traditionally undertested. I'm setting up a KVM image now with GRUB Legacy which I can snapshot just before and after running
upgrade-from-grub-legacy, and I should be able to unpick the bugs this way.
LVM snapshots break GRUB's LVM module (#574863):
Sean has been working on this and seems to be nearly there. Yay.
RAID metadata version 1.x not supported (#492897):
This became rather more of an issue recently since
mdadm switched its default from the old 0.90 format which GRUB understood. Felix put together a branch implementing the hard parts of this a while back, and I've been trying to finish it off. The hard bit is dealing with device naming, especially as the new-format and rather more useful names under
/dev/md/ don't show up during d-i after creating RAID volumes; I think this is because we always create them as
/dev/md0 etc. It's looking tractable, though.
Another odd problem probing RAID (#548648):
Not sure about this one, and I'll need to work with Josip on it as soon as I get a chance.
Stable device naming #554790) and consequential problems due to
grub-install not being properly run (#557425 and many other sub-RC bugs):
Ubuntu's been carrying a patch to rearrange device presentation in the postinst, which Robert OKed in principle ages ago and so I've been intending to merge it for a while, but there are a few known problems with it that I need to fix first. One known unfixable problem is that it will have to ask some people which devices they want GRUB to be installed on, even if they'd answered that question before: this will be one-time, and it's because it recorded the answer using unstable device names and so has in some sense forgotten. Simple cases (e.g. single-disk) can be handled without needing to ask again, though.
Alignment errors on SPARC (#560823):
I have no idea what's going on here, I'm afraid. I'll try to trace it, but may have to downgrade it at some point since after all we don't install GRUB by default on SPARC yet.
Fonts not shown in gfxmenu (#564844):
Apparently fixed upstream, but I couldn't find the responsible commit so I want to make sure I can get gfxmenu working before closing this.
Sensitivity to out-of-date
device.map files (#575076 and other sub-RC bugs):
We're trying to get rid of
device.map in general. It was fine in the 1990s but it's hopeless now. Unfortunately there are still a small number of problems with running entirely without one, and one of my patches to help is controversial upstream, so we probably won't get to that for squeeze. In the meantime we'll probably just need some extra sanity-checking and robustness in the event that there's an incorrect or out-of-date
device.map lying around, which we may just be able to do in the maintainer scripts or something if necessary.
Seriously weird failures to load initramfs (#582342):
If anyone can produce a reproduction recipe for this, that would really help me out. There are too many reports to discount as user error, but I haven't seen this myself yet.
Build failure on sparc (unfiled):
We've been discussing this upstream, but for the time being I'm just going to stop building
grub-emu on sparc as a workaround.
If we can fix that lot, or even just the ones that are reasonably well-understood, I think we'll be in reasonable shape. I'd also like to make
grub-mkconfig a bit more robust in the event that the root filesystem isn't one that GRUB understands (#561855, #562672), and I'd quite like to write some more documentation.
On the upside, progress has been good. We have multiple terminal support thanks to a new upstream snapshot (#506707),
update-grub runs much faster (#508834, #574088), we have DM-RAID support with a following wind (#579919), the new scheme with symlinks under
/dev/mapper/ works (#550704), we have basic support for btrfs
/ as long as you have something GRUB understands properly on
/boot (#540786), we have full info documentation covering all the user-adjustable settings in
/etc/default/grub, and a host of other smaller fixes. I'm hoping we can keep this up.
If you'd like to help, contact me, especially if there's something particular that isn't being handled that you think you could work on. GRUB 2 is actually quite a pleasant codebase to work on once you get used to its layout; it's certainly much easier to fix bugs in than GRUB Legacy ever was, as far as I'm concerned. Thanks to tools like
grub-fstest, it's very often possible to fix problems without needing to reboot for anything other than a final sanity check (although KVM certainly helps), and you can often debug very substantial bits of the boot loader - the bits that actually go wrong - using standard tools such as
gdb. Upstream is helpful and I've been able to get many of the problems above fixed directly there. If you have a sound knowledge of C and a decent level of understanding of the environment a boot loader needs to operate in - or for that matter specialist knowledge of interesting device types - then you should be able to find something to do.
Thoughts on 3.0 (quilt) format
Note: I wrote most of this before Neil Williams' recent comments on the 3.0 family of formats, so despite the timing this isn't really a reaction to that although I do have a couple of responses. On the whole I think I agree that the Lintian message is a bit heavy-handed and I'm not sure I'm thrilled about the idea of the default source format being changed (though I can see why the dpkg maintainers are interested in that). That said, as far as I personally am concerned, there is a vast cognitive benefit to me in having as much as possible be common to all my packages. Once I have more than a couple of packages that require patching and benefit from the
3.0 (quilt) format as a result, I find it in my interest to use it for all my non-native packages even if they're patchless right now, so that for instance if they need patches in the future I can handle them the same way. It's not unheard of for me to apply temporary patches even to packages I actively maintain upstream, so I don't discount those either. I haven't decided what to do with my native packages yet; unless they're big enough for bzip2 compression to be worthwhile, there doesn't seem to be much immediate advantage to
Anyway, on to the main body of this post:
I've been one of the holdouts resisting use of patch systems for a long time, on the basis that I felt strongly that
dpkg-source -x ought to give you the source that's actually built, rather than having to mess around with
debian/rules targets in order to see it. Now that the
3.0 (quilt) format is available to fix this bug, I felt that I ought to revisit my resistance and start trying to use it. Migrating to it from monolithic diffs is of course a bit more work than migrating to it from other patch systems, so it's taken me a little while to get round to it. I'd been thinking about holding off until there was better integration with revision control (e.g. bzr looms), as I feel that patch files really ought to be an export format, but I eventually decided that I shouldn't let the perfect be the enemy of the good. I have enough experience with co-maintaining packages that use build-time patch systems to be able to compare my reactions.
After experimenting with a couple of small packages, I moved over to the deep end and converted openssh a few weekends ago, since quite a few people have requested over the years that the Debian changes to openssh be easier to audit. This was a substantial job - over 6000 lines of upstream patches - but not actually as much work as I expected. I took a fairly simplistic approach: first, I unapplied all the upstream patches from my tree; then I ran
bzr di | interdiff -q /dev/stdin /dev/null >x, reduced it to a single logically-discrete patch, applied it to a new quilt patch using
quilt fold, and repeated until
x was empty. This was maybe an hour or two of work, and then I went through and tagged all the patches according to DEP-3, which took another few hours. After the first pass, I ended up with 38 patches and a much clearer idea of what has been forwarded upstream and what hasn't; I currently have 5 patches to forward or eliminate, down from 18.
- I don't lose any of my history. Since all the patches remain applied to the tree in revision control (this is what
dpkg-source -x gives you, so it's the natural representation in revision control too),
bzr blame works just as you'd expect and displays both upstream and Debian changes at once. I rely on tools like blame a lot, and I really hate the way build-time patch systems make it hard to use revision control when the tree is in a built state, so this was a hard requirement for me.
- I've used patch tagging before, so I was expecting some benefits, but viscerally I feel much more in control. It's so much less laborious now to see what I need to do by way of forwarding. I don't regret waiting for 3.0 (quilt) to become available, but I hadn't realised quite how much I was being held back beforehand.
- Adding new patches is pretty natural, much more so than with build-time patch systems. You can create and apply the patch, test-build, and commit when it works. I much prefer this over having to clean the tree before committing (or commit just part of the tree, which is error-prone). The more that committing to a Debian package feels like committing to an upstream project, the better.
- There's definitely something to be said for patch-tracker being more useful. It deals with DEP-3 to the extent of linkifying URLs, although it might be nice if patch descriptions were displayed on the overview page for each version.
- It's a bit awkward to set things up when checking out from revision control; I didn't really want to check in the
.pc directory, and the tree checks out in the patched state (as it should), so I needed some way for developers to get quilt working easily after a checkout. This is sort of the reverse of the previous problem, where users had to do something special after
dpkg-source -x, and I consider it less serious so I'm willing to put up with it. I ended up with a rune in debian/rules that ought to live somewhere more common.
- Everything ends up represented twice in revision control: the patch files, plus the changes to the patched files themselves. I'm OK with this although it is a little inelegant.
- Although I haven't had to do it yet, I expect that merging new upstream releases will be a bit harder. bzr will deal with resolving conflicts in the patched files themselves, and that's why I use a revision control system after all, but then I'll have to go and refresh all the patches and will probably end up doing some of the same conflict resolution a second time. I think the best answer right now is to
quilt pop -a, force a merge despite the modified working tree, and then
quilt push && quilt refresh -pab until I get back to the top of the stack, modulo slight fiddliness when a patch disappears entirely; thus effectively using quilt's conflict resolution rather than bzr's. I suppose this will serve as additional incentive to reduce my patch count. I know that people have been working on making this work nicely with topgit, although I'm certainly not going to put up with the rest of git due to that; I'm happy to wait for looms to become usable and integrated. :-)
- It would be nice if there were some standard DEP-3 way to note that a patch has been accepted or rejected upstream, beyond just putting it in the description. In particular, it seems to me that listing patches accepted upstream could be used to speed up the process of merging new upstream releases.
On the whole I'm satisfied with this, and the benefits definitely outweigh the costs. Thanks to the dpkg team for all their work on this!
parted 2.2 transition
I've started the transition of parted 2.2 to unstable. This is a major update needed for sensible support of newer hard disks with alignment requirements different from the archaic cylinder alignment tradition. I posted to debian-boot with a summary of the partman changes involved.
I don't know if anyone else has been tracking this recently, but a while back I got curious about the relative proportions of dh(1) and CDBS in the archive, and started running some daily analysis on the Lintian lab. Apologies for my poor graphing abilities, but the graph is here (occasionally updated):
Although dh is still a bit behind CDBS, the steady upward trend is quite striking - it looks set to break 20% soon, up from under 13% in September - compared with CDBS which has been sitting within half a percentage point of 25% the whole time.
Incidentally, was that an ftpmaster trying to sign his name in the graph over Christmas or something? :-)
I did a bit of catching up on my Debian backlog over the last week or so. Among the things I got round to:
- I released man-db 2.5.7. This was mostly an "I've been meaning to do this for ages" kind of thing to reduce the bug list a bit, closing ten Debian bugs, but there were a few interesting things in there as well, such as always saving cat pages in UTF-8 and recoding to the user's locale at display time (long overdue), adjusting the search order for localised manual pages by request of quite a few non-native English speakers to prefer a page in the right section over a page in the right language, and a cute gimmick to make things like
man /usr/bin/time display the appropriate manual page rather than the text of the executable. See the NEWS file for more details.
- binfmt-support now installs cleanly on non-Linux systems, even if it doesn't do anything useful yet.
- I fixed a couple of shell bugs in groff.
- halibut now complies with the Debian Vim policy, even though I can't say I entirely agree with it in this case.
- I fixed a really odd build failure in troffcvt. Yay imake, or something.
- All Debian patches to putty are now upstream, or will be once I upload a new snapshot. Thanks to Simon Tatham and Jacob Nevins.
- I did a few bits and pieces of packaging cleanup with an eye on my DDPO list, and added some watch files where they were missing.
- Responded to an offer to take over icoutils maintenance.
So nothing really earth-shaking, and as ever openssh could use some attention, but I feel a bit better about my backlog now. I do still have a critical bug in makepasswd to fix, and a sponsored upload of parrot; those are the next two things on my to-do list.