Hi, Jan,

There is a bug when booting 3 guests in no-EPT (shadow paging) mode: when Xen
handles a guest page fault it walks the guest's page tables for the shadow
code, fails to get the l4e from the top-level table, and then triggers a
fatal page fault and panics Xen.

I looked at this issue and found the bug was introduced by this patch:

changeset 26523:fd997a96d448
x86: debugging code for testing 16Tb support on smaller memory systems

I'm not entirely clear what the reason is for the modification (#ifdef
NDEBUG) in xen/arch/x86/domain_page.c in this patch. Removing the macro
guard does fix the fatal page fault in shadow paging mode - can it simply be
removed? Or could you take a look, since you are most familiar with this
code?

--- a/xen/arch/x86/domain_page.c	Fri Feb 08 11:06:04 2013 +0100
+++ b/xen/arch/x86/domain_page.c	Fri Mar 29 14:21:02 2013 +0800
@@ -422,10 +422,8 @@ void *map_domain_page_global(unsigned lo
 
     ASSERT(!in_irq() && local_irq_is_enabled());
 
-#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
-#endif
 
     spin_lock(&globalmap_lock);

Best Regards,
Xudong Hao
>>> On 29.03.13 at 07:39, "Hao, Xudong" <xudong.hao@intel.com> wrote:
> There is a bug when booting 3 guests in no-EPT (shadow paging) mode: when
> Xen handles a guest page fault it walks the guest's page tables for the
> shadow code, fails to get the l4e from the top-level table, and then
> triggers a fatal page fault and panics Xen.
>
> I looked at this issue and found the bug was introduced by this patch:
> changeset 26523:fd997a96d448
> x86: debugging code for testing 16Tb support on smaller memory systems
>
> I'm not entirely clear what the reason is for the modification (#ifdef
> NDEBUG) in xen/arch/x86/domain_page.c in this patch.

The point is to make sure the domain page mapping code actually gets
tested. The shortcut is a performance optimization.

> Removing the macro guard does fix the fatal page fault in shadow paging
> mode - can it simply be removed? Or could you take a look, since you are
> most familiar with this code?

While it could be removed, this would be a fix only for the problem you're
seeing - you'd just later see someone else run into it on a system with
more than 5Tb.

Hence we need you to provide technical details of the crash you observed
(register and stack dump as well as call trace, plus any details on the
specific guest you're running that distinguishes it from other guests that
don't exhibit this problem).

Jan
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, April 02, 2013 3:38 PM
> To: Hao, Xudong
> Cc: xen-devel (xen-devel@lists.xen.org)
> Subject: Re: Bug on shadow page mode
>
> >>> On 29.03.13 at 07:39, "Hao, Xudong" <xudong.hao@intel.com> wrote:
> > There is a bug when booting 3 guests in no-EPT (shadow paging) mode:
> > when Xen handles a guest page fault it walks the guest's page tables
> > for the shadow code, fails to get the l4e from the top-level table,
> > and then triggers a fatal page fault and panics Xen.
> >
> > I looked at this issue and found the bug was introduced by this patch:
> > changeset 26523:fd997a96d448
> > x86: debugging code for testing 16Tb support on smaller memory systems
> >
> > I'm not entirely clear what the reason is for the modification (#ifdef
> > NDEBUG) in xen/arch/x86/domain_page.c in this patch.
>
> The point is to make sure the domain page mapping code actually
> gets tested. The shortcut is a performance optimization.

So on a smaller memory system the original code would take the shortcut.
Now, with NDEBUG not defined, it skips the early mfn_to_virt(mfn) return
and runs the code below it; our case panics Xen under this condition.

We have a system with 2G of physical memory, running 3 RHEL6u3 guests in
shadow paging mode. Each guest is allocated 400MB of memory and dom0 is
given 512M. While the 3rd guest is booting, Xen panics while walking the
shadow page tables.

> > Removing the macro guard does fix the fatal page fault in shadow
> > paging mode - can it simply be removed? Or could you take a look,
> > since you are most familiar with this code?
>
> While it could be removed, this would be a fix only for the problem
> you're seeing - you'd just later see someone else run into it on a
> system with more than 5Tb.
>
> Hence we need you to provide technical details of the crash you
> observed (register and stack dump as well as call trace, plus any
> details on the specific guest you're running that distinguishes it
> from other guests that don't exhibit this problem).
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    4
(XEN) RIP:    e008:[<ffff82c4c01e637f>] guest_walk_tables_4_levels+0x135/0x6a6
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: ffff83007f087c98   rcx: 0000000000000005
(XEN) rdx: 0000000000000002   rsi: 0000000000800000   rdi: ffff83007f087ce0
(XEN) rbp: ffff83007f087ab8   rsp: ffff83007f087a38   r8:  0000000000000004
(XEN) r9:  000000000002b650   r10: 0000000000000022   r11: 0000000000000206
(XEN) r12: ffff830047374000   r13: 0000003a0f388718   r14: ffff82c4c025f90c
(XEN) r15: ffff82c406a00000   cr0: 000000008005003b   cr4: 00000000000426f0
(XEN) cr3: 000000004767e000   cr2: ffff82c406a00000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff83007f087a38:
(XEN)    000000007f087a88 ffff8300477df000 800000003e604025 000000007f087b10
(XEN)    ffff83007f087ac8 ffff8300477c4820 0000000528dd40f8 0080000000000004
(XEN)    ffff830028dd4ff8 ffff830000000000 ffff83007f087b48 ffff830047374000
(XEN)    ffff8300477df000 ffff83007f087f18 0000000000000000 000000000000000e
(XEN)    ffff83007f087d18 ffff82c4c020d8cc ffff82c406a00000 ffff83007f087b18
(XEN)    ffff83007f080000 ffff82c4c0311cc8 0000000000000c40 ffff83007f080000
(XEN)    ffff82c4c0311cc8 00000000000003c8 0000000000000740 0000000000000000
(XEN)    0000000000000001 0000000000028dd4 ffff8300477dfb58 0000000000000108
(XEN)    ffff8300477dfb58 ffff83007f080000 ffff82c4c0311cc8 ffff82c4c0311cd8
(XEN)    0000000003a0f388 ffff82c4c011759c ffff8300477dfae8 ffff830047374980
(XEN)    ffff8300477df000 000000000000652f 0000003a0f388718 000000000002b650
(XEN)    ffff83007f087c18 ffff82c4c01ef4d7 ffff83007f087be0 ffff83007f080000
(XEN)    ffff83007f087c18 0000000100000100 ffff83007f087c18 ffff830047374000
(XEN)    0000000000006a00 000000370002b650 000000000002b650 ffff830047374000
(XEN)    ffff830047374980 0000000000000000 ffff83007f087c18 ffff82c4c0125e3d
(XEN)    ffff83007f087c78 ffff82c4c021017d 0c00000000000000 ffff8300477df000
(XEN)    ffff82e0008e8640 ffff830047374000 0000000013650000 0000000013650000
(XEN)    0000000000000000 0000000000000000 ffff83007f087c78 ffff82c4c01ae218
(XEN)    ffff83007f087cc8 ffff82c4c01b4abc 0000000028dd4027 ffff82c4c0207f86
(XEN)    0000003a0f388718 0000000000000000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4c01e637f>] guest_walk_tables_4_levels+0x135/0x6a6
(XEN)    [<ffff82c4c020d8cc>] sh_page_fault__guest_4+0x505/0x2015
(XEN)    [<ffff82c4c01d2135>] vmx_vmexit_handler+0x86c/0x1748
(XEN)
(XEN) Pagetable walk from ffff82c406a00000:
(XEN)  L4[0x105] = 000000007f26e063 ffffffffffffffff
(XEN)  L3[0x110] = 000000005ce30063 ffffffffffffffff
(XEN)  L2[0x035] = 0000000014aab063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff82c406a00000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG
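For context, a guest configuration matching the repro described above might look roughly like this (an illustrative xl/xm config sketch - the name, disk path, and builder line are assumptions; the 400MB memory size and disabled HAP reflect the setup Xudong describes, with three such guests started on a 2G host booted with dom0_mem=512M):

```
# Illustrative repro config (values marked above as assumptions):
name    = "rhel6u3-shadow-1"   # also -2 and -3
builder = "hvm"
memory  = 400                  # MB per guest, as in the repro
vcpus   = 1
hap     = 0                    # no EPT/NPT -> guest runs on shadow page tables
disk    = [ 'file:/path/to/rhel6u3.img,hda,w' ]   # path is illustrative
```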
>>> On 02.04.13 at 10:40, "Hao, Xudong" <xudong.hao@intel.com> wrote:
> (XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
> [...]
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4c01e637f>] guest_walk_tables_4_levels+0x135/0x6a6
> (XEN)    [<ffff82c4c020d8cc>] sh_page_fault__guest_4+0x505/0x2015
> (XEN)    [<ffff82c4c01d2135>] vmx_vmexit_handler+0x86c/0x1748
> (XEN)
> (XEN) Pagetable walk from ffff82c406a00000:
> (XEN)  L4[0x105] = 000000007f26e063 ffffffffffffffff
> (XEN)  L3[0x110] = 000000005ce30063 ffffffffffffffff
> (XEN)  L2[0x035] = 0000000014aab063 ffffffffffffffff
> (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff

Tim,

I'm afraid this is something for you. From what I can tell, despite
sh_walk_guest_tables() being called from sh_page_fault() without
the paging lock held, there doesn't appear to be a way for this to
race sh_update_cr3(). And with the way the latter updates
guest_vtable, the only way for a page fault to happen upon use
of that cached mapping would be between the call to
sh_unmap_domain_page_global() and the immediately following
one to sh_map_domain_page_global() (i.e. while the pointer is
stale).

What I do note is

    /* PAGING_LEVELS==4 implies 64-bit, which means that
     * map_domain_page_global can't fail */
    BUG_ON(v->arch.paging.shadow.guest_vtable == NULL);

which is no longer true. Sadly the 2-level paging case also
doesn't really handle the similar error there, so it's not really
clear to me how this would best be fixed. And that's not the
reason for the problem here anyway.
At 12:50 +0100 on 02 Apr (1364907054), Jan Beulich wrote:
> [...]
>
> Tim,
>
> I'm afraid this is something for you. From what I can tell, despite
> sh_walk_guest_tables() being called from sh_page_fault() without
> the paging lock held, there doesn't appear to be a way for this to
> race sh_update_cr3(). And with the way the latter updates
> guest_vtable, the only way for a page fault to happen upon use
> of that cached mapping would be between the call to
> sh_unmap_domain_page_global() and the immediately following
> one to sh_map_domain_page_global() (i.e. while the pointer is
> stale).

I'll have a look at it on Thursday; swapping the map and the unmap
should be trivial, anyway. Is this bug easily reproducible, or was it
only hit once? I'd expect a race like this to be nigh impossible,
especially considering that 32-bit Xen had the same code for years.

> What I do note is
>
>     /* PAGING_LEVELS==4 implies 64-bit, which means that
>      * map_domain_page_global can't fail */
>     BUG_ON(v->arch.paging.shadow.guest_vtable == NULL);
>
> which is no longer true. Sadly the 2-level paging case also
> doesn't really handle the similar error there, so it's not really
> clear to me how this would best be fixed. And that's not the
> reason for the problem here anyway.

I'll look at that too -- it may be that we can avoid the _global() map
altogether. HAP seems to manage without it, but it has far fewer
lookups. Maybe I could add a per-vcpu fixmap for it, which would cover
most cases (i.e. local lookups).

Cheers,

Tim.
>>> On 02.04.13 at 18:45, Tim Deegan <tim@xen.org> wrote:
> Is this bug easily reproducible, or was it only hit once? I'd expect a
> race like this to be nigh impossible, especially considering that
> 32-bit Xen had the same code for years.

I understood that this is reproducible - Xudong?

Jan
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, April 03, 2013 3:34 PM
> To: Hao, Xudong; Tim Deegan
> Cc: xen-devel (xen-devel@lists.xen.org)
> Subject: Re: [Xen-devel] Bug on shadow page mode
>
> >>> On 02.04.13 at 18:45, Tim Deegan <tim@xen.org> wrote:
> > Is this bug easily reproducible, or was it only hit once? I'd expect a
> > race like this to be nigh impossible, especially considering that
> > 32-bit Xen had the same code for years.
>
> I understood that this is reproducible - Xudong?

Yes, it's easy to reproduce with a debug (debug=y) build of Xen.
Hi,

At 12:50 +0100 on 02 Apr (1364907054), Jan Beulich wrote:
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82c4c01e637f>] guest_walk_tables_4_levels+0x135/0x6a6
> > (XEN)    [<ffff82c4c020d8cc>] sh_page_fault__guest_4+0x505/0x2015
> > (XEN)    [<ffff82c4c01d2135>] vmx_vmexit_handler+0x86c/0x1748
> > (XEN)
> > (XEN) Pagetable walk from ffff82c406a00000:
> > (XEN)  L4[0x105] = 000000007f26e063 ffffffffffffffff
> > (XEN)  L3[0x110] = 000000005ce30063 ffffffffffffffff
> > (XEN)  L2[0x035] = 0000000014aab063 ffffffffffffffff
> > (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
>
> Tim,
>
> I'm afraid this is something for you. From what I can tell, despite
> sh_walk_guest_tables() being called from sh_page_fault() without
> the paging lock held, there doesn't appear to be a way for this to
> race sh_update_cr3(). And with the way the latter updates
> guest_vtable, the only way for a page fault to happen upon use
> of that cached mapping would be between the call to
> sh_unmap_domain_page_global() and the immediately following
> one to sh_map_domain_page_global() (i.e. while the pointer is
> stale).

Hmmm. So the only way I can see that happening is if some foreign agent
resets the vcpu's state while it's actually running, which AFAICT
shouldn't happen. I looked at reversing the unmap and map, but realised
that won't solve the problem -- in such a race sh_walk_guest_tables()
could cache the old value.

For testing, here's a patch that gets rid of guest_vtable altogether.
I'm not entirely happy with it, partly because it will have a perf hit
on large-RAM machines, and partly because I'm worried that whatever path
caused this crash might cause other problems. But it would be good to
confirm that it stops the crashes.

Cheers,

Tim.

commit 5469971c30eca85f270cbd9bc46f63ec0fd543c0
Author: Tim Deegan <tim@xen.org>
Date:   Thu Apr 4 11:16:28 2013 +0100

    x86/mm/shadow: Don't keep a permanent mapping of the top-level
    guest table.

    This addresses two issues.  First, a crash seen where
    sh_walk_guest_tables() found that guest_vtable was not a valid
    mapping.  Second, the lack of any error handling if
    map_domain_page_global() were to fail.

    Signed-off-by: Tim Deegan <tim@xen.org>

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index a593f76..8ca4b9b 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -174,15 +174,26 @@ static inline uint32_t
 sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
                      uint32_t pfec)
 {
-    return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
+    uint32_t rc;
+    mfn_t top_mfn;
+    void *top_map;
+
 #if GUEST_PAGING_LEVELS == 3 /* PAE */
-                             _mfn(INVALID_MFN),
-                             v->arch.paging.shadow.gl3e
+    top_mfn = _mfn(INVALID_MFN);
+    top_map = v->arch.paging.shadow.gl3e;
 #else /* 32 or 64 */
-                             pagetable_get_mfn(v->arch.guest_table),
-                             v->arch.paging.shadow.guest_vtable
+    top_mfn = pagetable_get_mfn(v->arch.guest_table);
+    top_map = sh_map_domain_page(top_mfn);
+#endif
+
+    rc = guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
+                           top_mfn, top_map);
+
+#if GUEST_PAGING_LEVELS != 3
+    sh_unmap_domain_page(top_map);
 #endif
-                             );
+
+    return rc;
 }
 
 /* This validation is called with lock held, and after write permission
@@ -219,24 +230,20 @@ shadow_check_gwalk(struct vcpu *v, unsigned long va, walk_t *gw, int version)
      * logic simple.
      */
     perfc_incr(shadow_check_gwalk);
-#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
-#if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
-    l4p = (guest_l4e_t *)v->arch.paging.shadow.guest_vtable;
+#if GUEST_PAGING_LEVELS >= 4 /* 64-bit... */
+    l4p = sh_map_domain_page(gw->l4mfn);
     mismatch |= (gw->l4e.l4 != l4p[guest_l4_table_offset(va)].l4);
+    sh_unmap_domain_page(l4p);
     l3p = sh_map_domain_page(gw->l3mfn);
     mismatch |= (gw->l3e.l3 != l3p[guest_l3_table_offset(va)].l3);
     sh_unmap_domain_page(l3p);
-#else
+#elif GUEST_PAGING_LEVELS >= 3 /* PAE... */
     mismatch |= (gw->l3e.l3 !=
                  v->arch.paging.shadow.gl3e[guest_l3_table_offset(va)].l3);
 #endif
     l2p = sh_map_domain_page(gw->l2mfn);
     mismatch |= (gw->l2e.l2 != l2p[guest_l2_table_offset(va)].l2);
     sh_unmap_domain_page(l2p);
-#else
-    l2p = (guest_l2e_t *)v->arch.paging.shadow.guest_vtable;
-    mismatch |= (gw->l2e.l2 != l2p[guest_l2_table_offset(va)].l2);
-#endif
     if ( !(guest_supports_superpages(v) &&
            (guest_l2e_get_flags(gw->l2e) & _PAGE_PSE)) )
     {
@@ -3784,7 +3791,7 @@ sh_update_linear_entries(struct vcpu *v)
 }
 
 
-/* Removes vcpu->arch.paging.shadow.guest_vtable and vcpu->arch.shadow_table[].
+/* Removes vcpu->arch.shadow_table[].
  * Does all appropriate management/bookkeeping/refcounting/etc...
  */
 static void
@@ -3794,24 +3801,6 @@ sh_detach_old_tables(struct vcpu *v)
     int i = 0;
 
     ////
-    //// vcpu->arch.paging.shadow.guest_vtable
-    ////
-
-#if GUEST_PAGING_LEVELS == 3
-    /* PAE guests don't have a mapping of the guest top-level table */
-    ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
-#else
-    if ( v->arch.paging.shadow.guest_vtable )
-    {
-        struct domain *d = v->domain;
-        if ( shadow_mode_external(d) || shadow_mode_translate(d) )
-            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
-        v->arch.paging.shadow.guest_vtable = NULL;
-    }
-#endif // !NDEBUG
-
-
-    ////
     //// vcpu->arch.shadow_table[]
     ////
 
@@ -3970,26 +3959,12 @@ sh_update_cr3(struct vcpu *v, int do_locking)
 
 
     ////
-    //// vcpu->arch.paging.shadow.guest_vtable
+    //// vcpu->arch.paging.shadow.gl3e[]
     ////
-#if GUEST_PAGING_LEVELS == 4
-    if ( shadow_mode_external(d) || shadow_mode_translate(d) )
-    {
-        if ( v->arch.paging.shadow.guest_vtable )
-            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
-        v->arch.paging.shadow.guest_vtable = sh_map_domain_page_global(gmfn);
-        /* PAGING_LEVELS==4 implies 64-bit, which means that
-         * map_domain_page_global can't fail */
-        BUG_ON(v->arch.paging.shadow.guest_vtable == NULL);
-    }
-    else
-        v->arch.paging.shadow.guest_vtable = __linear_l4_table;
-#elif GUEST_PAGING_LEVELS == 3
+#if GUEST_PAGING_LEVELS == 3
     /* On PAE guests we don't use a mapping of the guest's own top-level
      * table.  We cache the current state of that table and shadow that,
      * until the next CR3 write makes us refresh our cache. */
-    ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
-
     if ( shadow_mode_external(d) )
         /* Find where in the page the l3 table is */
         guest_idx = guest_index((void *)v->arch.hvm_vcpu.guest_cr[3]);
@@ -4005,20 +3980,6 @@ sh_update_cr3(struct vcpu *v, int do_locking)
     for ( i = 0; i < 4 ; i++ )
         v->arch.paging.shadow.gl3e[i] = gl3e[i];
     sh_unmap_domain_page(gl3e);
-#elif GUEST_PAGING_LEVELS == 2
-    if ( shadow_mode_external(d) || shadow_mode_translate(d) )
-    {
-        if ( v->arch.paging.shadow.guest_vtable )
-            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
-        v->arch.paging.shadow.guest_vtable = sh_map_domain_page_global(gmfn);
-        /* Does this really need map_domain_page_global?  Handle the
-         * error properly if so. */
-        BUG_ON(v->arch.paging.shadow.guest_vtable == NULL); /* XXX */
-    }
-    else
-        v->arch.paging.shadow.guest_vtable = __linear_l2_table;
-#else
-#error this should never happen
 #endif

diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 6f9744a..4921c99 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -122,8 +122,6 @@ struct shadow_vcpu {
     l3_pgentry_t l3table[4] __attribute__((__aligned__(32)));
     /* PAE guests: per-vcpu cache of the top-level *guest* entries */
     l3_pgentry_t gl3e[4] __attribute__((__aligned__(32)));
-    /* Non-PAE guests: pointer to guest top-level pagetable */
-    void *guest_vtable;
     /* Last MFN that we emulated a write to as unshadow heuristics. */
     unsigned long last_emulated_mfn_for_unshadow;
     /* MFN of the last shadow that we shot a writeable mapping in */
At 11:18 +0100 on 04 Apr (1365074288), Tim Deegan wrote:
> Hi,
>
> At 12:50 +0100 on 02 Apr (1364907054), Jan Beulich wrote:
> > [...]
> >
> > I'm afraid this is something for you. From what I can tell, despite
> > sh_walk_guest_tables() being called from sh_page_fault() without
> > the paging lock held, there doesn't appear to be a way for this to
> > race sh_update_cr3(). And with the way the latter updates
> > guest_vtable, the only way for a page fault to happen upon use
> > of that cached mapping would be between the call to
> > sh_unmap_domain_page_global() and the immediately following
> > one to sh_map_domain_page_global() (i.e. while the pointer is
> > stale).
>
> Hmmm. So the only way I can see that happening is if some foreign agent
> resets the vcpu's state while it's actually running, which AFAICT
> shouldn't happen.

OTOH, looking at map_domain_page_global, there doesn't seem to be any
locking preventing two CPUs from populating a page of global-map l1es at
the same time. So, here's a different patch to test -- it would be good
to know if this patch by itself fixes the crash.

Tim.

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 7421e03..efda6af 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -354,9 +354,10 @@ void *map_domain_page_global(unsigned long mfn)
         set_bit(idx, inuse);
         inuse_cursor = idx + 1;
 
+    pl1e = virt_to_xen_l1e(va);
+
     spin_unlock(&globalmap_lock);
 
-    pl1e = virt_to_xen_l1e(va);
     if ( !pl1e )
         return NULL;
     l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
>>> On 04.04.13 at 12:34, Tim Deegan <tim@xen.org> wrote:
> At 11:18 +0100 on 04 Apr (1365074288), Tim Deegan wrote:
> > [...]
> >
> > Hmmm. So the only way I can see that happening is if some foreign agent
> > resets the vcpu's state while it's actually running, which AFAICT
> > shouldn't happen.
>
> OTOH, looking at map_domain_page_global, there doesn't seem to be any
> locking preventing two CPUs from populating a page of global-map l1es at
> the same time. So, here's a different patch to test -- it would be good
> to know if this patch by itself fixes the crash.

That's a very good point. The code flow was taken from the old 32-bit
implementation, and I didn't pay attention to this aspect (which is a
non-issue for the 32-bit code) when porting it over.

So feel free to put my ack on this patch once you put it into proper
shape.

Thanks, and sorry for bothering you with a bug I introduced,
Jan

> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -354,9 +354,10 @@ void *map_domain_page_global(unsigned long mfn)
>          set_bit(idx, inuse);
>          inuse_cursor = idx + 1;
>
> +    pl1e = virt_to_xen_l1e(va);
> +
>      spin_unlock(&globalmap_lock);
>
> -    pl1e = virt_to_xen_l1e(va);
>      if ( !pl1e )
>          return NULL;
>      l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, April 04, 2013 6:35 PM
> To: Jan Beulich
> Cc: Hao, Xudong; xen-devel (xen-devel@lists.xen.org)
> Subject: Re: [Xen-devel] Bug on shadow page mode
>
> [...]
>
> OTOH, looking at map_domain_page_global, there doesn't seem to be any
> locking preventing two CPUs from populating a page of global-map l1es at
> the same time. So, here's a different patch to test -- it would be good
> to know if this patch by itself fixes the crash.

Holding the lock during the l1e populate fixes the crash on my side.

Thanks,
-Xudong

> Tim.
>
> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> index 7421e03..efda6af 100644
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -354,9 +354,10 @@ void *map_domain_page_global(unsigned long mfn)
>          set_bit(idx, inuse);
>          inuse_cursor = idx + 1;
>
> +    pl1e = virt_to_xen_l1e(va);
> +
>      spin_unlock(&globalmap_lock);
>
> -    pl1e = virt_to_xen_l1e(va);
>      if ( !pl1e )
>          return NULL;
>      l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
> -----Original Message----- > From: Tim Deegan [mailto:tim@xen.org] > Sent: Thursday, April 04, 2013 6:18 PM > To: Jan Beulich > Cc: Hao, Xudong; xen-devel (xen-devel@lists.xen.org) > Subject: Re: [Xen-devel] Bug on shadow page mode > > Hi, > > At 12:50 +0100 on 02 Apr (1364907054), Jan Beulich wrote: > > > (XEN) Xen call trace: > > > (XEN) [<ffff82c4c01e637f>] guest_walk_tables_4_levels+0x135/0x6a6 > > > (XEN) [<ffff82c4c020d8cc>] sh_page_fault__guest_4+0x505/0x2015 > > > (XEN) [<ffff82c4c01d2135>] vmx_vmexit_handler+0x86c/0x1748 > > > (XEN) > > > (XEN) Pagetable walk from ffff82c406a00000: > > > (XEN) L4[0x105] = 000000007f26e063 ffffffffffffffff > > > (XEN) L3[0x110] = 000000005ce30063 ffffffffffffffff > > > (XEN) L2[0x035] = 0000000014aab063 ffffffffffffffff > > > (XEN) L1[0x000] = 0000000000000000 ffffffffffffffff > > > > Tim, > > > > I''m afraid this is something for you. From what I can tell, despite > > sh_walk_guest_tables() being called from sh_page_fault() without > > the paging lock held, there doesn''t appear to be a way for this to > > race sh_update_cr3(). And with the way the latter updates > > guest_vtable, the only way for a page fault to happen upon use > > of that cached mapping would be between the call to > > sh_unmap_domain_page_global() and the immediately following > > one to sh_map_domain_page_global() (i.e. while the pointer is > > stale). > > Hmmm. So the only way I can see that happening is if some foreign agent > resets the vcpu''s state while it''s actually running, which AFAICT > shouldn''t happen. I looked at reversing the unmap and map, but realised > that won''t solve the problem -- in such a race sh_walk_guest_tables() > could cache the old value. > > For testing, here''s a patch that gets rid of guest_vtable altogether. > I''m not entirely happy with it, partly because it will have a perf hit > on large-RAM machines, and partly because I''m worried that whatever path > caused this crash might cause other problems. 
> But it would be good to
> confirm that it stops the crashes.

This fixed the crash too, but I think you will use the other fix. :)

> Cheers,
>
> Tim.
>
> commit 5469971c30eca85f270cbd9bc46f63ec0fd543c0
> Author: Tim Deegan <tim@xen.org>
> Date:   Thu Apr 4 11:16:28 2013 +0100
>
>     x86/mm/shadow: Don't keep a permanent mapping of the top-level guest
>     table.
>
>     This addresses two issues.  First, a crash seen where
>     sh_walk_guest_tables() found that guest_vtable was not a valid
>     mapping.  Second, the lack of any error handling if
>     map_domain_page_global() were to fail.
>
>     Signed-off-by: Tim Deegan <tim@xen.org>
>
> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
> index a593f76..8ca4b9b 100644
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -174,15 +174,26 @@ static inline uint32_t
>  sh_walk_guest_tables(struct vcpu *v, unsigned long va, walk_t *gw,
>                       uint32_t pfec)
>  {
> -    return guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
> +    uint32_t rc;
> +    mfn_t top_mfn;
> +    void *top_map;
> +
>  #if GUEST_PAGING_LEVELS == 3 /* PAE */
> -                             _mfn(INVALID_MFN),
> -                             v->arch.paging.shadow.gl3e
> +    top_mfn = _mfn(INVALID_MFN);
> +    top_map = v->arch.paging.shadow.gl3e;
>  #else /* 32 or 64 */
> -                             pagetable_get_mfn(v->arch.guest_table),
> -                             v->arch.paging.shadow.guest_vtable
> +    top_mfn = pagetable_get_mfn(v->arch.guest_table);
> +    top_map = sh_map_domain_page(top_mfn);
> +#endif
> +
> +    rc = guest_walk_tables(v, p2m_get_hostp2m(v->domain), va, gw, pfec,
> +                           top_mfn, top_map);
> +
> +#if GUEST_PAGING_LEVELS != 3
> +    sh_unmap_domain_page(top_map);
>  #endif
> -                             );
> +
> +    return rc;
>  }
>
>  /* This validation is called with lock held, and after write permission
> @@ -219,24 +230,20 @@ shadow_check_gwalk(struct vcpu *v, unsigned long va, walk_t *gw, int version)
>       * logic simple.
>       */
>      perfc_incr(shadow_check_gwalk);
> -#if GUEST_PAGING_LEVELS >= 3 /* PAE or 64... */
> -#if GUEST_PAGING_LEVELS >= 4 /* 64-bit only... */
> -    l4p = (guest_l4e_t *)v->arch.paging.shadow.guest_vtable;
> +#if GUEST_PAGING_LEVELS >= 4 /* 64-bit... */
> +    l4p = sh_map_domain_page(gw->l4mfn);
>      mismatch |= (gw->l4e.l4 != l4p[guest_l4_table_offset(va)].l4);
> +    sh_unmap_domain_page(l4p);
>      l3p = sh_map_domain_page(gw->l3mfn);
>      mismatch |= (gw->l3e.l3 != l3p[guest_l3_table_offset(va)].l3);
>      sh_unmap_domain_page(l3p);
> -#else
> +#elif GUEST_PAGING_LEVELS >= 3 /* PAE... */
>      mismatch |= (gw->l3e.l3 !=
>                   v->arch.paging.shadow.gl3e[guest_l3_table_offset(va)].l3);
>  #endif
>      l2p = sh_map_domain_page(gw->l2mfn);
>      mismatch |= (gw->l2e.l2 != l2p[guest_l2_table_offset(va)].l2);
>      sh_unmap_domain_page(l2p);
> -#else
> -    l2p = (guest_l2e_t *)v->arch.paging.shadow.guest_vtable;
> -    mismatch |= (gw->l2e.l2 != l2p[guest_l2_table_offset(va)].l2);
> -#endif
>      if ( !(guest_supports_superpages(v) &&
>             (guest_l2e_get_flags(gw->l2e) & _PAGE_PSE)) )
>      {
> @@ -3784,7 +3791,7 @@ sh_update_linear_entries(struct vcpu *v)
>  }
>
>
> -/* Removes vcpu->arch.paging.shadow.guest_vtable and vcpu->arch.shadow_table[].
> +/* Removes vcpu->arch.shadow_table[].
>   * Does all appropriate management/bookkeeping/refcounting/etc...
>   */
>  static void
> @@ -3794,24 +3801,6 @@ sh_detach_old_tables(struct vcpu *v)
>      int i = 0;
>
>      ////
> -    //// vcpu->arch.paging.shadow.guest_vtable
> -    ////
> -
> -#if GUEST_PAGING_LEVELS == 3
> -    /* PAE guests don't have a mapping of the guest top-level table */
> -    ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
> -#else
> -    if ( v->arch.paging.shadow.guest_vtable )
> -    {
> -        struct domain *d = v->domain;
> -        if ( shadow_mode_external(d) || shadow_mode_translate(d) )
> -            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
> -        v->arch.paging.shadow.guest_vtable = NULL;
> -    }
> -#endif // !NDEBUG
> -
> -
> -    ////
>      //// vcpu->arch.shadow_table[]
>      ////
>
> @@ -3970,26 +3959,12 @@ sh_update_cr3(struct vcpu *v, int do_locking)
>
>
>      ////
> -    //// vcpu->arch.paging.shadow.guest_vtable
> +    //// vcpu->arch.paging.shadow.gl3e[]
>      ////
> -#if GUEST_PAGING_LEVELS == 4
> -    if ( shadow_mode_external(d) || shadow_mode_translate(d) )
> -    {
> -        if ( v->arch.paging.shadow.guest_vtable )
> -            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
> -        v->arch.paging.shadow.guest_vtable = sh_map_domain_page_global(gmfn);
> -        /* PAGING_LEVELS==4 implies 64-bit, which means that
> -         * map_domain_page_global can't fail */
> -        BUG_ON(v->arch.paging.shadow.guest_vtable == NULL);
> -    }
> -    else
> -        v->arch.paging.shadow.guest_vtable = __linear_l4_table;
> -#elif GUEST_PAGING_LEVELS == 3
> +#if GUEST_PAGING_LEVELS == 3
>      /* On PAE guests we don't use a mapping of the guest's own top-level
>       * table.  We cache the current state of that table and shadow that,
>       * until the next CR3 write makes us refresh our cache.
>       */
> -    ASSERT(v->arch.paging.shadow.guest_vtable == NULL);
> -
>      if ( shadow_mode_external(d) )
>          /* Find where in the page the l3 table is */
>          guest_idx = guest_index((void *)v->arch.hvm_vcpu.guest_cr[3]);
> @@ -4005,20 +3980,6 @@ sh_update_cr3(struct vcpu *v, int do_locking)
>      for ( i = 0; i < 4 ; i++ )
>          v->arch.paging.shadow.gl3e[i] = gl3e[i];
>      sh_unmap_domain_page(gl3e);
> -#elif GUEST_PAGING_LEVELS == 2
> -    if ( shadow_mode_external(d) || shadow_mode_translate(d) )
> -    {
> -        if ( v->arch.paging.shadow.guest_vtable )
> -            sh_unmap_domain_page_global(v->arch.paging.shadow.guest_vtable);
> -        v->arch.paging.shadow.guest_vtable = sh_map_domain_page_global(gmfn);
> -        /* Does this really need map_domain_page_global?  Handle the
> -         * error properly if so. */
> -        BUG_ON(v->arch.paging.shadow.guest_vtable == NULL); /* XXX */
> -    }
> -    else
> -        v->arch.paging.shadow.guest_vtable = __linear_l2_table;
> -#else
> -#error this should never happen
>  #endif
>
>
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index 6f9744a..4921c99 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -122,8 +122,6 @@ struct shadow_vcpu {
>      l3_pgentry_t l3table[4] __attribute__((__aligned__(32)));
>      /* PAE guests: per-vcpu cache of the top-level *guest* entries */
>      l3_pgentry_t gl3e[4] __attribute__((__aligned__(32)));
> -    /* Non-PAE guests: pointer to guest top-level pagetable */
> -    void *guest_vtable;
>      /* Last MFN that we emulated a write to as unshadow heuristics. */
>      unsigned long last_emulated_mfn_for_unshadow;
>      /* MFN of the last shadow that we shot a writeable mapping in */
>>> On 04.04.13 at 12:34, Tim Deegan <tim@xen.org> wrote:
> OTOH, looking at map_domain_page_global, there doesn't seem to be any
> locking preventing two CPUs from populating a page of global-map l1es at
> the same time.  So, here's a different patch to test -- it would be good
> to know if this patch by itself fixes the crash.

With Xudong having verified the fix, I would convert it into full
patch form and commit it, but I'd need your confirmation that I
can put your S-o-b on it.

Jan

> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -354,9 +354,10 @@ void *map_domain_page_global(unsigned long mfn)
>      set_bit(idx, inuse);
>      inuse_cursor = idx + 1;
>
> +    pl1e = virt_to_xen_l1e(va);
> +
>      spin_unlock(&globalmap_lock);
>
> -    pl1e = virt_to_xen_l1e(va);
>      if ( !pl1e )
>          return NULL;
>      l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
At 09:12 +0100 on 08 Apr (1365412350), Jan Beulich wrote:
> >>> On 04.04.13 at 12:34, Tim Deegan <tim@xen.org> wrote:
> > OTOH, looking at map_domain_page_global, there doesn't seem to be any
> > locking preventing two CPUs from populating a page of global-map l1es at
> > the same time.  So, here's a different patch to test -- it would be good
> > to know if this patch by itself fixes the crash.
>
> With Xudong having verified the fix, I would convert it into full
> patch form and commit it, but I'd need your confirmation that I
> can put your S-o-b on it.

Sure - thanks!

Signed-off-by: Tim Deegan <tim@xen.org>

I might commit the other patch as well, just to turn a possible BUG()
into a perf regression.  Or maybe I'll see about some middle ground.

Tim.