Wubi bug 693671
I spent most of last week working on Ubuntu bug 693671 (“wubi install will not boot - phase 2 stops with: Try (hd0,0): NTFS5”), which was quite a challenge to debug since it involved digging into parts of the Wubi boot process I’d never really touched before. Since I don’t think much of this is very well-documented, I’d like to spend a bit of time explaining what was involved, in the hope that it will help other developers in the future.
Wubi is a system for installing Ubuntu into a file in a Windows filesystem, so that it doesn’t require separate partitions and can be uninstalled like any other Windows application. The purpose of this is to make it easy for Windows users to try out Ubuntu without the need to worry about repartitioning, before they commit to a full installation. Wubi started out as an external project, and initially patched the installer on the fly to do all the rather unconventional things it needed to do; we integrated it into Ubuntu 8.04 LTS, which involved turning these patches into proper installer facilities that could be accessed using preseeding, so that Wubi only needs to handle the Windows user interface and other Windows-specific tasks.
Anyone familiar with a GNU/Linux system’s boot process will immediately see that this isn’t as simple as it sounds. Of course, ntfs-3g is a pretty solid piece of software so we can handle the Windows filesystem without too much trouble, and loopback mounts are well-understood so we can just have the initramfs loop-mount the root filesystem. Where are you going to get the kernel and initramfs from, though? Well, we used to copy them out to the NTFS filesystem so that GRUB could read them, but this was overly complicated and error-prone. When we switched to GRUB 2, we could instead use its built-in loopback facilities, and we were able to simplify this. So all was more or less well, except for the elephant in the room. How are you going to load GRUB?
In a Wubi installation, NTLDR (or BOOTMGR in Windows Vista and newer) still
owns the boot process. Ubuntu is added as a boot menu option using BCDEdit.
You might then think that you can just have the Windows boot loader
chain-load GRUB. Unfortunately, NTLDR only loads 16 sectors - 8192 bytes -
from disk. GRUB won’t fit in that: the smallest core.img you can generate
at the moment is over 18 kilobytes. Thus, you need something that is small
enough to be loaded by NTLDR, but that is intelligent enough to understand
NTFS to the point where it can find a particular file in the root directory
of a filesystem, load boot loader code from it, and jump to that. The
answer for this was GRUB4DOS. Most of GRUB4DOS is based on GRUB Legacy,
which is not of much interest to us any more, but it includes an
assembly-language program called GRLDR that supports doing this very thing
for FAT, NTFS, and ext2. In Wubi, we build GRLDR as wubildr.mbr, and build
a specially-configured GRUB core image as wubildr.
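To make the size constraint concrete, here is a minimal sketch of the arithmetic. It assumes traditional 512-byte sectors and uses 18 KiB purely as an illustrative lower bound for core.img; the function and constant names are my own, not anything from Wubi or GRUB.

```python
SECTOR_SIZE = 512      # assumption: traditional 512-byte disk sectors
NTLDR_SECTORS = 16     # NTLDR only loads this many sectors of a chain-loaded image


def fits_in_ntldr_window(image_size: int) -> bool:
    """Return True if an image of image_size bytes can be loaded whole by NTLDR."""
    return image_size <= NTLDR_SECTORS * SECTOR_SIZE


ntldr_limit = NTLDR_SECTORS * SECTOR_SIZE
core_img_lower_bound = 18 * 1024   # illustrative: "over 18 kilobytes"

print(ntldr_limit)                                  # 8192
print(fits_in_ntldr_window(core_img_lower_bound))   # False
```

Anything that has to be loaded directly by NTLDR therefore needs to stay within that 8192-byte window and fetch the rest itself.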
Now, the messages shown in the bug report suggested a failure either within
GRLDR or very early in GRUB. The first thing I did was to remember that
GRLDR has been integrated into the grub-extras ntldr-img module suitable
for use with GRUB 2, so I tried building wubildr.mbr from that; no change,
but this gave me a modern baseline to work on. OK; now to try QEMU (you can
use tricks like qemu -hda /dev/sda if you’re very careful not to do
anything that might involve writing to the host filesystem from within the
guest, such as recursively booting your host OS … [update: Tollef Fog
Heen and Zygmunt Krynicki both point out that you can use the -snapshot
option to make this safer]). No go; it hung somewhere in the middle of
NTLDR. Still, I could at least insert debug statements, copy the built
wubildr.mbr over to my test machine, and reboot for each test, although it
would be slow and tedious. Couldn’t I?
Well, yes, I mostly could, but that 8192-byte limit came back to bite me, along with an internal 2048-byte limit that GRLDR allocates for its NTFS bootstrap code. There were only a few spare bytes. Something like this would more or less fit, to print a single mark character at various points so that I could see how far it was getting:
pushal
xorw %bx, %bx /* video page 0 */
movw $0x0e4d, %ax /* print 'M' */
int $0x10
popal
In a few places, if I removed some code I didn’t need on my test machine
(say, CHS compatibility), I could even fit in cheap and nasty code to print
a single register in hex (as long as you didn’t mind ‘A’ to ‘F’ actually
being ‘:’ to ‘?’ in ASCII; and note that this is real-mode code, so the loop
counter is %cx not %ecx):
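/* print %edx in dumbed-down hex */
pushal
xorw %bx, %bx /* video page 0 */
movb $0xe, %ah /* BIOS teletype output */
movw $8, %cx /* eight nibbles in %edx */
1:
roll $4, %edx /* rotate the next nibble into the low bits */
movb %dl, %al
andb $0xf, %al
addb $0x30, %al /* offset from '0', so 10 to 15 come out as ':' to '?' */
int $0x10
loop 1b
popal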
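Reading that output back takes a little care, so here is a small Python model of the loop plus a decoder. It is a sketch under the assumption that each nibble is offset from ASCII '0' (0x30), which is what the ':' to '?' behaviour described above implies; the function names are hypothetical.

```python
def dumb_hex(value: int) -> str:
    """Model the assembly loop: rotate left by 4, print the low nibble + '0'."""
    value &= 0xFFFFFFFF
    chars = []
    for _ in range(8):                       # movw $8, %cx ... loop 1b
        value = ((value << 4) | (value >> 28)) & 0xFFFFFFFF   # roll $4, %edx
        chars.append(chr(0x30 + (value & 0xF)))               # low nibble, offset from '0'
    return "".join(chars)


def decode_dumb_hex(text: str) -> int:
    """Turn the on-screen characters back into the register value."""
    result = 0
    for ch in text:
        result = (result << 4) | (ord(ch) - 0x30)
    return result


print(dumb_hex(0x00007C00))                   # 00007<00
print(dumb_hex(0xDEADBEEF))                   # =>:=;>>?
print(hex(decode_dumb_hex("=>:=;>>?")))       # 0xdeadbeef
```

Because the rotation goes all the way round after eight steps, the register ends up with its original value even without the surrounding pushal/popal.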
After a considerable amount of work tracking down problems by bisection like
this, I also observed that GRLDR’s NTFS code bears quite a bit of
resemblance in its logical flow to GRUB 2’s NTFS module, and indeed the same
person wrote much of both. Since I knew that the latter worked, I could use
it to relieve my brain of trying to understand assembly code logic directly,
and could compare the two to look for discrepancies. I did find a few of
these, and corrected a simple one. Testing at this point suggested that the
boot process was getting as far as GRUB but still wasn’t printing anything.
I removed some Ubuntu patches which quieten down GRUB’s startup: still
nothing - so I switched my attentions to grub-core/kern/i386/pc/startup.S,
which contains the first code executed from GRUB’s core image. Code before
the first call to real_to_prot (which switches the processor into
protected mode) succeeded, while code after that point failed. Even more
mysteriously, code added to real_to_prot before the actual switch to
protected mode failed too. Now I was clearly getting somewhere interesting,
but what was going on? What I really wanted was to be able to single-step,
or at least see what was at the memory location it was supposed to be
jumping to.
Around this point I was venting on IRC, and somebody asked if it was
reproducible in QEMU. Although I’d tried that already, I went back and
tried again. Ubuntu’s qemu is actually built from qemu-kvm, and if I used
qemu -no-kvm then it worked much better. Excellent! Now I could use GDB:
(gdb) target remote | qemu -gdb stdio -no-kvm -hda /dev/sda
This let me run until the point when NTLDR was about to hand over control,
then interrupt and set a breakpoint at 0x8200 (the entry point of
startup.S). This revealed that the address that should have been
real_to_prot was in fact garbage. I set a breakpoint at 0x7c00 (GRLDR’s
entry point) and stepped all the way through to ensure it was doing the
right thing. In the process it was helpful to know that GDB and QEMU don’t
handle real mode very well between them. Useful tricks here were:
- Use set architecture i8086 before disassembling real-mode code (and
  set architecture i386 to switch back).
- GDB prints addresses relative to the current segment base, but if you
  want to enter an address then you need to calculate a linear address
  yourself. For example, breakpoints must be set at (CS << 4) + IP, rather
  than just at IP.
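The segment arithmetic is easy to get wrong at two in the morning, so a tiny helper can do it for you. This is a minimal sketch; the function name is my own, and the segment:offset pairs are just examples that happen to map to the two breakpoint addresses mentioned above.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode linear address: (segment << 4) + offset, both 16-bit values."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


# Different segment:offset pairs can name the same linear address.
for cs, ip in [(0x0000, 0x7C00), (0x07C0, 0x0000), (0x0820, 0x0000)]:
    print(f"{cs:04x}:{ip:04x} -> break *{linear_address(cs, ip):#x}")
```

The printed break *ADDRESS lines can be pasted straight into GDB.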
Single-stepping showed that GRLDR was loading the entirety of wubildr
correctly and jumping to it. The first instruction it jumped to wasn’t in
startup.S, though, and then I remembered that we prefix the core image
with grub-core/boot/i386/pc/lnxboot.S.
Stepping through this required a clear head since it copies itself around
and changes segment registers a few times. The interesting part was at
real_code_2, where it copies a sector of the kernel to the target load
address, and then checks a known offset to find out whether the “kernel” is
in fact GRUB rather than a Linux kernel. I checked that offset by hand, and
there was the smoking gun. GRUB recently acquired Reed-Solomon error
correction on its core image, to allow it to recover from other software
writing over sectors in the boot track. This moved the magic number
lnxboot.S was checking somewhat further into the core image, after the
first sector. lnxboot.S couldn’t find it because it hadn’t copied it yet!
A bit of adjustment and all was well again.
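The shape of that failure can be modelled in a few lines of Python. This is only an illustration: the magic bytes, both offsets, and the image contents below are invented for the example and are not the real values from lnxboot.S or the GRUB core image, and it makes no claim about what the actual fix looked like.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def magic_visible(image: bytes, magic: bytes, magic_offset: int,
                  sectors_copied: int) -> bool:
    """Check for the magic number, looking only at what has been copied so far."""
    copied = image[: sectors_copied * SECTOR_SIZE]
    return copied[magic_offset : magic_offset + len(magic)] == magic


def make_image(magic: bytes, magic_offset: int, size: int = 4 * SECTOR_SIZE) -> bytes:
    """Build a dummy image of zeros with the magic number at magic_offset."""
    image = bytearray(size)
    image[magic_offset : magic_offset + len(magic)] = magic
    return bytes(image)


MAGIC = b"\xde\xad\xbe\xef"   # hypothetical magic number
OLD_OFFSET = 0x1F0            # hypothetical: inside the first sector
NEW_OFFSET = 0x230            # hypothetical: just past the first sector

old_image = make_image(MAGIC, OLD_OFFSET)
new_image = make_image(MAGIC, NEW_OFFSET)

print(magic_visible(old_image, MAGIC, OLD_OFFSET, sectors_copied=1))  # True
print(magic_visible(new_image, MAGIC, NEW_OFFSET, sectors_copied=1))  # False
print(magic_visible(new_image, MAGIC, NEW_OFFSET, sectors_copied=2))  # True
```

With only one sector copied, a check at an offset beyond 512 bytes reads data that is not there yet, which is the same kind of mismatch I found by inspecting the offset by hand.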
The lesson for me from all of this has been to try hard to get an interactive debugger working. Really hard. It’s worth quite a bit of up-front effort if it saves you from killing neurons stepping through pages of code by hand. I think the real-mode debugging tricks I picked up should be useful for working on GRUB in the future.