Stephen C. Tweedie
2006-Nov-09 03:49 UTC
[Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Hi, We have been debugging a nasty VMX issue recently, where certain kernel builds would fail to boot under VMX from syslinux, whereas others booted fine and the same kernel booted OK under grub. In practice, this means that we have only ever seen this on bootable ISO images. The underlying problem turned out to be vmxassist mishandling the setting up of segment registers on entry into VM86_PROTECTED mode. The bug occurs when the guest is entering protected mode and reaches the far jump to set up the 32-bit segment registers for the virtual VMENTER we''re about to perform. vmxassist, in protected_mode(), looks at the segment registers, which might be 16-bit segments or might be 32-bit segment selectors at this stage (depending on whether the OS has reloaded them since entering protected mode); and it tries to load the segments from the GDT. Unconditionally. Even if the segment register still actually contains a real mode 16-bit segment. Whoops. Now, enter the *second* bug, this time in the main Linux kernel itself. On x86_64, the kernel boot sequence sets up a bogus GDT: arch/x86_64/boot/setup.S has gdt_48: .word 0x8000 # gdt limit=2048, # 256 GDT entries .word 0, 0 # gdt base (filled in later) which shows that we think we have a 2048-byte GDT, with 256 entries (that all adds up fine); but 2048 is 0x800, not 0x8000. So when we do an lgdt to set this gdt up, we are making it 16 times too big. Unfortunately, when we enter this code from syslinux, SS has a 16-bit value still, 0x3000. That should be way off the end of the GDT and hence illegal as a descriptor even if we mistakenly tried to load it, but because the GDT has been loaded with the wrong size, vmxassist thinks that 0x3000 *IS* a valid segment descriptor, and loads it into the VMCS for the guest''s protected mode VMENTER. And so, if, by chance, the 8 bytes at (GDT+0x3000) in the kernel image pass the VMENTER instruction''s simple sanity tests, we ignore the problem and shortly afterwards the kernel will load up a valid SS; but if we fail those sanity tests then the VMENTER fails with "bad guest state". It''s just luck whether a given vmlinuz works or not. The reason that some kernels show this problem and others do not under Xen is because of the 0x8000 GDT-size kernel bug. But the blame lies squarely with Xen, because the kernel has never loaded any segments from the undefined GDT area above 0x800, and yet vmxassist tried to set up a VMCS segment from it anyway. So, while we would still like to fix the kernel GDT, the *real* problem here is in vmxassist''s mishandling of segments during the move to protected mode. Now, vmxassist already has code to detect 16-bit segments that survived unmodified from a transition into and out of protected mode, and to save and restore those appropriately. It does this using "saved_rm_regs", which get cleared on entry to protected mode, and then set to the old segment value if we fail to set a given 32-bit segment correctly. The fix is to save the 16-bit segments *always*, on entry to protected mode when %CR0(PE) is first set; and to clear the saved 16-bit segment and set the 32-bit variant in oldctx whenever a 32-bit segment descriptor is set during the transition to 32-bit CS. Then, when we finally do the VMENTER, we will set up the VMCS from only the 32-bit segments, clearing the VMCS entries for segments that have not been assigned valid 32-bit segments yet. Tested on various RHEL-5 boot.isos, including ones which worked before and ones which triggered the bug; all now boot correctly. Signed-off-by: Stephen Tweedie <sct@redhat.com> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2006-Nov-09 08:56 UTC
Re: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote:> The fix is to save the 16-bit segments *always*, on entry to protected > mode when %CR0(PE) is first set; and to clear the saved 16-bit segment > and set the 32-bit variant in oldctx whenever a 32-bit segment > descriptor is set during the transition to 32-bit CS. Then, when we > finally do the VMENTER, we will set up the VMCS from only the 32-bit > segments, clearing the VMCS entries for segments that have not been > assigned valid 32-bit segments yet.So, after setting CR0.PE but before doing a jump to complete the transition to protected mode, is the guest now running with zeroed data selectors but with the old ''shadow segment state''? So it can still use its data segments until the long jump? Not that I know whether any bootloader would actually *want* to access data in that critical region.... -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stephen C. Tweedie
2006-Nov-09 12:32 UTC
Re: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Hi, On Thu, 2006-11-09 at 08:56 +0000, Keir Fraser wrote:> On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote: > > > The fix is to save the 16-bit segments *always*, on entry to protected > > mode when %CR0(PE) is first set; and to clear the saved 16-bit segment > > and set the 32-bit variant in oldctx whenever a 32-bit segment > > descriptor is set during the transition to 32-bit CS. Then, when we > > finally do the VMENTER, we will set up the VMCS from only the 32-bit > > segments, clearing the VMCS entries for segments that have not been > > assigned valid 32-bit segments yet. > > So, after setting CR0.PE but before doing a jump to complete the transition > to protected mode, is the guest now running with zeroed data selectors but > with the old ''shadow segment state''?I tried to change the code as little as possible to minimise regressions, as I don''t have the full battery of guest OSes to test. But the intention at least was that the guest shouldn''t change much as a result of the patch. We''re still running entirely in vmxassist VM86_REAL_TO_PROTECTED mode during the change; the actual vmcs does not get changed until we go to VM86_PROTECTED mode (and not at all if we return to real mode before getting that far.) So the guest selectors as far as the CPU is concerned are still the selectors we set up to run vmxassist; and the selectors in effect as far as the emulated guest OS code is concerned remains the old 16-bit segments that vmxassist uses during that emulation. The numerical offset that the guest sees as a selector when reading es/ds etc. should not change, as the patch doesn''t touch regs at all. This actually highlights the fact that vmxassist''s emulation for real- to-protected mode is pretty minimal, and in fact it passes the buck to VMENTER earlier than it should. It does so as soon as CS goes 32-bit, but the other registers are still 16 bit and there''s no theoretical reason why a guest couldn''t try to access 16-bit data/stack segments from a 32-bit CS. I haven''t touched any of that --- the logic is still that when we drop to true 32-bit VMENTER, segments that were 16-bit are cleared. I just changed the logic to make sure that we are better at detecting which segments still contain 16-bit descriptors, and which ones have been reloaded from the GDT and hence should survive into 32- bit mode. The fact that each segment has multiple different interpretations here --- in the vmxassist context itself, in the guest VMCS, and in the guest visible register set, doesn''t help the discussion so I may have misunderstood your question or been unclear in my reply! Cheers, Stephen _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Steven Rostedt
2006-Nov-09 12:56 UTC
Re: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Keir Fraser wrote:> On 9/11/06 3:49 am, "Stephen C. Tweedie" <sct@redhat.com> wrote: > >> The fix is to save the 16-bit segments *always*, on entry to protected >> mode when %CR0(PE) is first set; and to clear the saved 16-bit segment >> and set the 32-bit variant in oldctx whenever a 32-bit segment >> descriptor is set during the transition to 32-bit CS. Then, when we >> finally do the VMENTER, we will set up the VMCS from only the 32-bit >> segments, clearing the VMCS entries for segments that have not been >> assigned valid 32-bit segments yet. > > So, after setting CR0.PE but before doing a jump to complete the transition > to protected mode, is the guest now running with zeroed data selectors but > with the old ''shadow segment state''? So it can still use its data segments > until the long jump? Not that I know whether any bootloader would actually > *want* to access data in that critical region.... > >Keir, oldctx.*_sel is basically the shadow selector. They are not used in emulating the code during the single step. So zeroing them out is fine. The regs structure is what is used in calculating getting data in the data segments. The main thing here that Stephen''s approach is different to the previous approach, is that if a segment descriptor *is* changed (ie. %es for Vista) then we have to update the oldctx explicitly for that selector. So far the only place we see this is with the "pop %es". And Stephen put in the proper code to update that descriptor. The old approach would just update all the selectors if they fit inside the GDT table, whether or not they were changed or loaded. So when Linux had a bad (extremely large) GDT limit, we loaded all the selectors into the shadow selector, even though they were bogus. -- Steve _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Steven Rostedt
2006-Nov-09 14:28 UTC
Re: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Keir Fraser wrote:> On 9/11/06 12:56, "Steven Rostedt" <srostedt@redhat.com> wrote: > >> The old approach would just update all the selectors if they fit inside >> the GDT table, whether or not they were changed or loaded. So when Linux >> had a bad (extremely large) GDT limit, we loaded all the selectors into >> the shadow selector, even though they were bogus. > > A smaller patch might have been to implement those VMENTER sanity checks > inside load_seg(). I''m sure they''re quite simple. However, this new approach > is arguably saner than the old, and it''s worked on a few images I''ve tested > it with, so I checked it in and we''ll see how it holds up under wider > testing. >I believed Stephen looked at the sanity check approach, and determined that the code to simulate the sanity checks would be quite complex and error prone. But Stephen can correct me if I''m wrong. -- Steve _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stephen C. Tweedie
2006-Nov-09 17:31 UTC
Re: [Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Hi, On Thu, 2006-11-09 at 14:08 +0000, Keir Fraser wrote:> A smaller patch might have been to implement those VMENTER sanity checks > inside load_seg(). I''m sure they''re quite simple.I checked, as that was going to be my first approach. Turns out that they are not actually all that simple, with all sorts of different rules for base, size, access rights, type, flags and so on, dependent on which segment it is, whether it''s 16 or 32-bit, etc. They aren''t _too_ tricky, but after looking at it some, there are enough rules there that I was actually more worried about introducing a regression going down that route than in the patch I ended up with. Cheers, Stephen _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel