thr3ads.net - Xen devel - [Xen-devel] Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes" [Dec 2009]

Simon Horman

2009-Dec-06 23:53 UTC

[Xen-devel] Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

changeset:   20236:d6f4089f0f8c
	user:        Keir Fraser <keir.fraser@citrix.com>
	date:        Tue Sep 22 08:16:49 2009 +0100
	description:
	x86-64: reduce range spanned by 1:1 mapping and frame table indexes

	Introduces a virtual space conserving transformation on the MFN thus
	far used to index 1:1 mapping and frame table, removing the largest
	range of contiguous bits (below the most significant one) which are
	zero for all valid MFNs from the MFN representation, to be used to
	index into those arrays, thereby cutting the virtual range these
	tables must cover approximately by half with each bit removed.

	Since this should account for hotpluggable memory (in order to not
	requiring a re-write when that gets supported), the determination of
	which bits are candidates for removal must not be based on the E820
	information, but instead has to use the SRAT. That in turn requires a
	change to the ordering of steps done during early boot.

	Signed-off-by: Jan Beulich <jbeulich@novell.com>
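
The index transformation described in the changeset can be pictured with a small sketch. This is only an illustration under stated assumptions, not the code from the changeset: the names toy_hole, toy_compress and toy_expand are hypothetical, and a single run of always-zero bits is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one removable run of always-zero MFN bits:
 * bits [bottom_bits, bottom_bits + hole_bits) are dropped from the index.
 * Assumes bottom_bits + hole_bits < 64. */
struct toy_hole {
    unsigned int bottom_bits;
    unsigned int hole_bits;
};

/* Squeeze the hole out of an MFN to get a compact table index. */
static uint64_t toy_compress(struct toy_hole h, uint64_t mfn)
{
    uint64_t bottom_mask = (UINT64_C(1) << h.bottom_bits) - 1;
    uint64_t hole_mask = (UINT64_C(1) << h.hole_bits) - 1;
    uint64_t top = mfn >> (h.bottom_bits + h.hole_bits);

    /* Only meaningful when the removed bits really are zero. */
    assert(((mfn >> h.bottom_bits) & hole_mask) == 0);
    return (top << h.bottom_bits) | (mfn & bottom_mask);
}

/* Inverse of toy_compress(): re-insert the zero bits. */
static uint64_t toy_expand(struct toy_hole h, uint64_t idx)
{
    uint64_t bottom_mask = (UINT64_C(1) << h.bottom_bits) - 1;
    uint64_t top = idx >> h.bottom_bits;

    return (top << (h.bottom_bits + h.hole_bits)) | (idx & bottom_mask);
}
```

With a hole of 8 bits starting at bit 12, toy_compress() maps 0x10000123 to 0x100123, so a table indexed by the compressed value needs to span 2^8 times less virtual space. An MFN with non-zero bits inside the hole has no valid index at all, which is why such frames need a validity check before any table lookup.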

The change above seems to cause the panic below when used
in conjunction with the following qemu-xen commit:

	commit f09a5ba89434bb3f28172640354258d1d6cd8579
	Author: Ian Jackson <ian.jackson@eu.citrix.com>
	Date:   Fri Sep 18 16:41:42 2009 +0100

	passthrough: basic graphics passthrough support

	basic gfx passthrough support:
	  - add a vga type for gfx passthrough
	  - retrieve VGA bios from host 0xC0000, then load it to guest 0xC0000
	  - register/unregister legacy VGA I/O ports and MMIOs for
            passthroughed gfx

	Signed-off-by: Ben Lin <ben.y.lin@intel.com>
	Signed-off-by: Weidong Han <weidong.han@intel.com>
	Acked-by: Jean Guyader <jean.guyader@critix.com>
	Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

This is the qemu-xen version listed in Config.mk. I am also using
the following version of etherboot, with support for the
Intel 82576 (igb) added by me:

	commit d06ae11df9a2ce94518c4665c9315d3514908dd8
	Author: Thomas Miletich <thomas.miletich@gmail.com>
	Date:   Wed Nov 25 18:02:59 2009 +0100

	[e1000] Enable interrupts in a more UNDI compatible way

	Signed-off-by: Marty Connor <mdc@etherboot.org>

The panic occurs when the 82576 is passed through at boot time
and used for PXE boot. It does not occur when an 82572 (a very
different card, but one that uses the same gPXE driver) is used.
Nor does it occur when xen-unstable is rolled back to the previous
revision. The problem still occurs with the latest xen-unstable.


(XEN) PCI add device 01:00.0
pciback 0000:01:00.0: seizing device
ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI interrupt for device 0000:01:00.0 disabled
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 1:0.0
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 1:0.0
(XEN) [VT-D]io.c:280:d0 VT-d irq bind: m_irq = 37 device = 4 intx = 0
(XEN) HVM1: HVM Loader
(XEN) HVM1: Detected Xen v3.5-unstable
(XEN) HVM1: CPU speed is 2827 MHz
(XEN) irq.c:243: Dom1 PCI link 0 changed 0 -> 5
(XEN) HVM1: PCI-ISA link 0 routed to IRQ5
(XEN) irq.c:243: Dom1 PCI link 1 changed 0 -> 10
(XEN) HVM1: PCI-ISA link 1 routed to IRQ10
(XEN) irq.c:243: Dom1 PCI link 2 changed 0 -> 11
(XEN) HVM1: PCI-ISA link 2 routed to IRQ11
(XEN) irq.c:243: Dom1 PCI link 3 changed 0 -> 5
(XEN) HVM1: PCI-ISA link 3 routed to IRQ5
(XEN) HVM1: pci dev 01:3 INTA->IRQ10
(XEN) HVM1: pci dev 03:0 INTA->IRQ5
(XEN) HVM1: pci dev 04:0 INTA->IRQ5
(XEN) HVM1: pci dev 02:0 bar 10 size 02000000: f0000008
(XEN) HVM1: pci dev 03:0 bar 14 size 01000000: f2000008
(XEN) HVM1: pci dev 04:0 bar 14 size 00400000: f3000000
(XEN) HVM1: pci dev 04:0 bar 30 size 00400000: f3400000
(XEN) domctl.c:836:d0 memory_map:add: gfn=f3800 mfn=f0800 nr_mfns=20
(XEN) HVM1: pci dev 04:0 bar 10 size 00020000: f3800000
(XEN) domctl.c:836:d0 memory_map:add: gfn=f3820 mfn=f0840 nr_mfns=4
(XEN) domctl.c:846:d0 memory_map:remove: gfn=f3820 mfn=f0840 nr_mfns=1
(XEN) HVM1: pci dev 04:0 bar 1c size 00004000: f3820000
(XEN) HVM1: pci dev 02:0 bar 14 size 00001000: f3824000
(XEN) HVM1: pci dev 03:0 bar 10 size 00000100: 0000c001
(XEN) HVM1: pci dev 04:0 bar 18 size 00000020: 0000c101
(XEN) domctl.c:887:d0 ioport_map:add f_gport=c100 f_mport=3100 np=20
(XEN) HVM1: pci dev 01:1 bar 20 size 00000010: 0000c121
(XEN) HVM1: Multiprocessor initialisation:
(XEN) HVM1:  - CPU0 ... 36-bit phys ... fixed MTRRs ... var MTRRs [2/8] ... done.
(XEN) HVM1: Testing HVM environment:
(XEN) HVM1:  - REP INSB across page boundaries ... passed
(XEN) HVM1:  - GS base MSRs and SWAPGS ... passed
(XEN) HVM1: Passed 2 of 2 tests
(XEN) HVM1: Writing SMBIOS tables ...
(XEN) HVM1: Loading ROMBIOS ...
(XEN) HVM1: 9788 bytes of ROMBIOS high-memory extensions:
(XEN) HVM1:   Relocating to 0xfc000000-0xfc00263c ... done
(XEN) HVM1: Creating MP tables ...
(XEN) HVM1: Loading Cirrus VGABIOS ...
(XEN) HVM1: Loading PCI Option ROM ...
(XEN) HVM1:  - Manufacturer: http://etherboot.org
(XEN) HVM1:  - Product name: gPXE
(XEN) HVM1: Loading ACPI ...
(XEN) HVM1:  - Lo data: 000ea020-000ea04f
(XEN) HVM1:  - Hi data: fc002800-fc011dcf
(XEN) HVM1: vm86 TSS at fc012000
(XEN) HVM1: BIOS map:
(XEN) HVM1:  c0000-c8fff: VGA BIOS
(XEN) HVM1:  c9000-da7ff: gPXE ROM
(XEN) HVM1:  eb000-eb14e: SMBIOS tables
(XEN) HVM1:  f0000-fffff: Main BIOS
(XEN) HVM1: Invoking ROMBIOS ...
(XEN) HVM1: $Revision: 1.221 $ $Date: 2008/12/07 17:32:29 $
(XEN) stdvga.c:147:d1 entering stdvga and caching modes
(XEN) HVM1: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp $
(XEN) HVM1: Bochs BIOS - build: 06/23/99
(XEN) HVM1: $Revision: 1.221 $ $Date: 2008/12/07 17:32:29 $
(XEN) HVM1: Options: apmbios pcibios eltorito PMM 
(XEN) HVM1: 
(XEN) HVM1: ata0-0: PCHS=4161/16/63 translation=large LCHS=520/128/63
(XEN) HVM1: ata0 master: QEMU HARDDISK ATA-7 Hard-Disk (2048 MBytes)
(XEN) HVM1: IDE time out
(XEN) HVM1: 
(XEN) HVM1: 
(XEN) HVM1: 
(XEN) HVM1: Press F12 for boot menu.
(XEN) HVM1: 
(XEN) HVM1: Booting from Network...
(XEN) HVM1: Booting from c900:02ed
(XEN) ----[ Xen-3.5-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff8301140fa000   rbx: ffff82f601e10000   rcx: 0000000000000000
(XEN) rdx: ffff82c4801fb9e0   rsi: 00000000000f0800   rdi: ffff8301140fae1c
(XEN) rbp: ffff82c4802dfb68   rsp: ffff82c4802dfb38   r8:  00000000000f0800
(XEN) r9:  ffff8301186a8000   r10: ffff8300ddc08000   r11: 0000000000000000
(XEN) r12: ffff8300ddc08000   r13: 00000000000f0800   r14: ffff82c4802dff28
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 00000001186a3000   cr2: ffff82f601e1000f
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802dfb38:
(XEN)    0000000100000001 ffff82c4802dfce8 ffff8301140fa000 ffff8300ddc08000
(XEN)    ffff82c4802dff28 0000000000000000 ffff82c4802dfe18 ffff82c4801cbfff
(XEN)    ffff82c4802dff28 ffff82c4802dff28 ffff82c480257ba0 ffff82c4802dff28
(XEN)    ffff82c4802dff28 ffff82c4802dff28 ffff82c4802dff28 ffff82c4802dfdc8
(XEN)    ffff82c4802dfcc8 ffff82c4802dff28 ffff82c4802dfce0 ffff82c4802dff28
(XEN)    00000000f38000d8 ffff8301140faeb0 ffff8301140fae18 ffff8301140fa210
(XEN)    ffff81800079c000 0000004400000000 0000000000000000 0000000000000f38
(XEN)    00000000000f3800 00000000000f0800 0000000000000048 ffff8300ddc09a38
(XEN)    ffff8140c0003ce0 0000000000000030 ffffffffffffffff 00000020f81dc348
(XEN)    ffff81800079c0f8 000000000000001f 000000000000ad8e ffffffffffffffff
(XEN)    00000000000f3800 000000000011602c ffff81800079c100 00000000000f3800
(XEN)    00000000000f0800 00000000000f0800 0000000000000000 00000020802dfde8
(XEN)    0000000000000000 00000000000f3800 00000000f081f037 ffff8300ddc08000
(XEN)    000000000011869f 0000000500000005 ffff82c4802dff28 0000002000000020
(XEN)    ffff82c4802d0000 ffff000000d880c7 1cbb8308408bffff 7d83387512000001
(XEN)    1f0014ba32740044 0000000000000527 0c9f000800000007 07ec3880ffffffff
(XEN)    0c93001000000000 07ec3880ffffffff 0c93001000000000 07ec3880ffffffff
(XEN)    0000c09300000000 0000000000000005 ffff82c4802dfde8 ffff82c4802dff28
(XEN)    000000000000000a ffff82c480230ee0 ffff82c4802dfdc8 fffffffffffffffe
(XEN)    ffff8300ddc08000 ffff82c4802dfde8 00000000f0800037 ffff82c4802dff28
(XEN) Xen call trace:
(XEN)    [<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
(XEN)    [<ffff82c4801cbfff>] sh_page_fault__guest_2+0x184d/0x1bf3
(XEN)    [<ffff82c4801b2c1d>] vmx_vmexit_handler+0x717/0x1a68
(XEN)    
(XEN) Pagetable walk from ffff82f601e1000f:
(XEN)  L4[0x105] = 00000000decea027 5555555555555555
(XEN)  L3[0x1d8] = 000000011bffb063 5555555555555555
(XEN)  L2[0x00f] = 0000000000000000 ffffffffffffffff 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff82f601e1000f
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Jan Beulich

2009-Dec-07 10:37 UTC

[Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

>(XEN) ----[ Xen-3.5-unstable  x86_64  debug=y  Not tainted ]----
>(XEN) CPU:    0
>(XEN) RIP:    e008:[<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
>(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
>(XEN) rax: ffff8301140fa000   rbx: ffff82f601e10000   rcx: 0000000000000000
>(XEN) rdx: ffff82c4801fb9e0   rsi: 00000000000f0800   rdi: ffff8301140fae1c
>(XEN) rbp: ffff82c4802dfb68   rsp: ffff82c4802dfb38   r8:  00000000000f0800
>(XEN) r9:  ffff8301186a8000   r10: ffff8300ddc08000   r11: 0000000000000000
>(XEN) r12: ffff8300ddc08000   r13: 00000000000f0800   r14: ffff82c4802dff28
>(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000026f0
>...
>(XEN) Xen call trace:
>(XEN)    [<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
>(XEN)    [<ffff82c4801cbfff>] sh_page_fault__guest_2+0x184d/0x1bf3
>(XEN)    [<ffff82c4801b2c1d>] vmx_vmexit_handler+0x717/0x1a68
>(XEN)    
>(XEN) Pagetable walk from ffff82f601e1000f:
>(XEN)  L4[0x105] = 00000000decea027 5555555555555555
>(XEN)  L3[0x1d8] = 000000011bffb063 5555555555555555
>(XEN)  L2[0x00f] = 0000000000000000 ffffffffffffffff 
>(XEN) 
>(XEN) ****************************************
>(XEN) Panic on CPU 0:
>(XEN) FATAL PAGE FAULT
>(XEN) [error_code=0000]
>(XEN) Faulting linear address: ffff82f601e1000f
>(XEN) ****************************************

While I can't determine the exact source location corresponding to the
crash (without the disassembly of the function), the page table walk
suggests this is a read from the M2P table, which raises a couple
of questions: How can this be a non-quad-word aligned access? Is the
access, if it makes sense, guarded by an mfn_valid() check? Is the
memory address corresponding to the M2P slot (mfn 0x3c20001) in a
physical memory hole?

Jan
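
For readers following the page table walk in the panic output, here is a minimal sketch of how the L4/L3/L2 indices and the alignment question relate to the faulting linear address. It assumes standard x86-64 4-level paging with 4 KiB pages; the helper names pt_index and is_qword_aligned are hypothetical and are not taken from Xen.

```c
#include <stdint.h>

/* Index into page-table level `level` (1 = L1 ... 4 = L4) for a linear
 * address: 9 index bits per level, stacked above the 12-bit page offset. */
static unsigned int pt_index(uint64_t addr, unsigned int level)
{
    return (unsigned int)((addr >> (12 + 9 * (level - 1))) & 0x1ff);
}

/* Non-zero when the address is a multiple of 8 bytes (quad-word aligned). */
static int is_qword_aligned(uint64_t addr)
{
    return (addr & 7) == 0;
}
```

Applied to the faulting address 0xffff82f601e1000f, pt_index() yields 0x105, 0x1d8 and 0x00f for levels 4, 3 and 2, matching the walk printed by the hypervisor, and is_qword_aligned() returns 0 because the low three bits are all set.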


Simon Horman

2009-Dec-07 10:48 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Mon, Dec 07, 2009 at 10:37:43AM +0000, Jan Beulich wrote:
> >(XEN) ----[ Xen-3.5-unstable  x86_64  debug=y  Not tainted ]----
> >(XEN) CPU:    0
> >(XEN) RIP:    e008:[<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
> >(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
> >(XEN) rax: ffff8301140fa000   rbx: ffff82f601e10000   rcx: 0000000000000000
> >(XEN) rdx: ffff82c4801fb9e0   rsi: 00000000000f0800   rdi: ffff8301140fae1c
> >(XEN) rbp: ffff82c4802dfb68   rsp: ffff82c4802dfb38   r8:  00000000000f0800
> >(XEN) r9:  ffff8301186a8000   r10: ffff8300ddc08000   r11: 0000000000000000
> >(XEN) r12: ffff8300ddc08000   r13: 00000000000f0800   r14: ffff82c4802dff28
> >(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000026f0
> >...
> >(XEN) Xen call trace:
> >(XEN)    [<ffff82c4801c0a72>] sh_remove_shadows+0x149/0x86a
> >(XEN)    [<ffff82c4801cbfff>] sh_page_fault__guest_2+0x184d/0x1bf3
> >(XEN)    [<ffff82c4801b2c1d>] vmx_vmexit_handler+0x717/0x1a68
> >(XEN)
> >(XEN) Pagetable walk from ffff82f601e1000f:
> >(XEN)  L4[0x105] = 00000000decea027 5555555555555555
> >(XEN)  L3[0x1d8] = 000000011bffb063 5555555555555555
> >(XEN)  L2[0x00f] = 0000000000000000 ffffffffffffffff
> >(XEN)
> >(XEN) ****************************************
> >(XEN) Panic on CPU 0:
> >(XEN) FATAL PAGE FAULT
> >(XEN) [error_code=0000]
> >(XEN) Faulting linear address: ffff82f601e1000f
> >(XEN) ****************************************
>
> While I can't determine the exact source location corresponding to the
> crash (without the disassembly of the function), the page table walk
> suggests this is a read from the M2P table, which raises a couple
> of questions: How can this be a non-quad-word aligned access? Is the
> access, if it makes sense, guarded by an mfn_valid() check? Is the
> memory address corresponding to the M2P slot (mfn 0x3c20001) in a
> physical memory hole?

Any tips on how I could investigate those questions?


Jan Beulich

2009-Dec-07 11:06 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

>>> Simon Horman <horms@verge.net.au> 07.12.09 11:48 >>>
>On Mon, Dec 07, 2009 at 10:37:43AM +0000, Jan Beulich wrote:
>> While I can't determine the exact source location corresponding to the
>> crash (without the disassembly of the function), the page table walk
>> suggests this is a read from the M2P table, which raises a couple
>> of questions: How can this be a non-quad-word aligned access? Is the
>> access, if it makes sense, guarded by an mfn_valid() check? Is the
>> memory address corresponding to the M2P slot (mfn 0x3c20001) in a
>> physical memory hole?
>
>Any tips on how I could investigate those questions?

For the first two, you'd have to connect register/stack values to
source variables (by analyzing the disassembly) to understand what
access it is that causes the issue, and where the values come from.
Or alternatively just add debugging printk()-s to the function in
question (but that could be a lot of output depending on how long the
guest survives). Or whatever else debugging technique you like...

For the third, all it takes is looking up the memory map in the hypervisor
(boot) log.

Jan



Simon Horman

2009-Dec-07 12:13 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Mon, Dec 07, 2009 at 11:06:44AM +0000, Jan Beulich wrote:
> >>> Simon Horman <horms@verge.net.au> 07.12.09 11:48 >>>
> >On Mon, Dec 07, 2009 at 10:37:43AM +0000, Jan Beulich wrote:
> >> While I can't determine the exact source location corresponding to the
> >> crash (without the disassembly of the function), the page table walk
> >> suggests this is a read from the M2P table, which raises a couple
> >> of questions: How can this be a non-quad-word aligned access? Is the
> >> access, if it makes sense, guarded by an mfn_valid() check? Is the
> >> memory address corresponding to the M2P slot (mfn 0x3c20001) in a
> >> physical memory hole?
> >
> >Any tips on how I could investigate those questions?
>
> For the first two, you'd have to connect register/stack values to
> source variables (by analyzing the disassembly) to understand what
> access it is that causes the issue, and where the values come from.
> Or alternatively just add debugging printk()-s to the function in
> question (but that could be a lot of output depending on how long the
> guest survives). Or whatever else debugging technique you like...

Thanks, I was fearing something along those lines.

> For the third, all it takes is looking up the memory map in the hypervisor
> (boot) log.

Ok, that's an easy one :-)

I'll poke some more tomorrow.


Simon Horman

2009-Dec-11 01:39 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Mon, Dec 07, 2009 at 11:13:34PM +1100, Simon Horman wrote:
> On Mon, Dec 07, 2009 at 11:06:44AM +0000, Jan Beulich wrote:
> > >>> Simon Horman <horms@verge.net.au> 07.12.09 11:48 >>>
> > >On Mon, Dec 07, 2009 at 10:37:43AM +0000, Jan Beulich wrote:
> > >> While I can't determine the exact source location corresponding to the
> > >> crash (without the disassembly of the function), the page table walk
> > >> suggests this is a read from the M2P table, which raises a couple
> > >> of questions: How can this be a non-quad-word aligned access? Is the
> > >> access, if it makes sense, guarded by an mfn_valid() check? Is the
> > >> memory address corresponding to the M2P slot (mfn 0x3c20001) in a
> > >> physical memory hole?
> > >
> > >Any tips on how I could investigate those questions?
> >
> > For the first two, you'd have to connect register/stack values to
> > source variables (by analyzing the disassembly) to understand what
> > access it is that causes the issue, and where the values come from.
> > Or alternatively just add debugging printk()-s to the function in
> > question (but that could be a lot of output depending on how long the
> > guest survives). Or whatever else debugging technique you like...
>
> Thanks, I was fearing something along those lines.
>
> > For the third, all it takes is looking up the memory map in the hypervisor
> > (boot) log.
>
> Ok, that's an easy one :-)
>
> I'll poke some more tomorrow.

Hi Jan,

I haven't exactly followed your advice above, but I think I have
made a little bit of progress.

* The memory access is to a MMIO region - a control register for the NIC.
  The access is made by the e1000 gPXE driver.

* Interestingly, if there is a write to the page (or perhaps even anywhere
  in the entire 20-page MMIO region?) before a read, the problem doesn't
  occur. So a simple work-around in the gPXE driver is to just write to a
  register before reading any.

* In sh_page_fault() the call to gfn_to_mfn_guest_dbg(d, gfn, &p2mt) sets
  p2mt to p2m_mmio_direct, which seems to be correct. I'm a bit
  stuck in working out what goes wrong from there.

For reference, a panic with some more debug info is below.

106697 sh: sh_page_fault__guest_2(): fast path mmio 0x000000000b8f9a
106698 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xb8f9c err=15, rip=7a39
106699 sh: sh_page_fault__guest_2(): 2954: A
106700 sh: sh_page_fault__guest_2(): 2998: B
106701 sh: sh_page_fault__guest_2(): fast path mmio 0x000000000b8f9c
106702 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xb8f9e err=15, rip=7a39
106703 sh: sh_page_fault__guest_2(): 2954: A
106704 sh: sh_page_fault__guest_2(): 2998: B
106705 sh: sh_page_fault__guest_2(): fast path mmio 0x000000000b8f9e
106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
106707 sh: sh_page_fault__guest_2(): 2954: A
106708 sh: sh_page_fault__guest_2(): 2998: B
106709 sh: sh_page_fault__guest_2(): 3077: page_fault_slow_path:
106710 sh: sh_page_fault__guest_2(): 3081: C
106711 sh: sh_page_fault__guest_2(): 3095: rewalk:
106712 sh: sh_page_fault__guest_2(): 3097: D
106713 sh: sh_page_fault__guest_2(): 3106: E
106714 sh: sh_page_fault__guest_2(): 3122: F
(XEN) _gfn_to_mfn_type_dbg: current
106715 sh: sh_page_fault__guest_2(): 3141: gfn_to_mfn_guest_dbg: p2mt=5
106716 sh: sh_page_fault__guest_2(): 3143: G
106717 sh: sh_page_fault__guest_2(): 3181: H
106718 sh: sh_page_fault__guest_2(): 3193: I
106719 sh: sh_page_fault__guest_2(): 3204: J
106720 sh: sh_page_fault__guest_2(): 3216: K
106721 shdebug: make_fl1_shadow(): (f3000)=>118644
106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 0000000118644067
106724 sh: sh_page_fault__guest_2(): 3241: L
106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 00000000f0500037
106726 sh: sh_page_fault__guest_2(): 3263: M
106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 00000000f0501037
106728 shdebug: _sh_propagate(): prefetch level 1 guest f3002067 shadow 00000000f0502037
106729 shdebug: _sh_propagate(): prefetch level 1 guest f3003067 shadow 00000000f0503037
106730 shdebug: _sh_propagate(): prefetch level 1 guest f3004067 shadow 00000000f0504037
106731 shdebug: _sh_propagate(): prefetch level 1 guest f3005067 shadow 00000000f0505037
106732 shdebug: _sh_propagate(): prefetch level 1 guest f3006067 shadow 00000000f0506037
106733 shdebug: _sh_propagate(): prefetch level 1 guest f3007067 shadow 00000000f0507037
106734 shdebug: _sh_propagate(): prefetch level 1 guest f3008067 shadow 00000000f0508037
106735 shdebug: _sh_propagate(): prefetch level 1 guest f3009067 shadow 00000000f0509037
106736 shdebug: _sh_propagate(): prefetch level 1 guest f300a067 shadow 00000000f050a037
106737 shdebug: _sh_propagate(): prefetch level 1 guest f300b067 shadow 00000000f050b037
106738 shdebug: _sh_propagate(): prefetch level 1 guest f300c067 shadow 00000000f050c037
106739 shdebug: _sh_propagate(): prefetch level 1 guest f300d067 shadow 00000000f050d037
106740 shdebug: _sh_propagate(): prefetch level 1 guest f300e067 shadow 00000000f050e037
106741 shdebug: _sh_propagate(): prefetch level 1 guest f300f067 shadow 00000000f050f037
106742 shdebug: _sh_propagate(): prefetch level 1 guest f3010067 shadow 00000000f0510037
106743 shdebug: _sh_propagate(): prefetch level 1 guest f3011067 shadow 00000000f0511037
106744 shdebug: _sh_propagate(): prefetch level 1 guest f3012067 shadow 00000000f0512037
106745 shdebug: _sh_propagate(): prefetch level 1 guest f3013067 shadow 00000000f0513037
106746 shdebug: _sh_propagate(): prefetch level 1 guest f3014067 shadow 00000000f0514037
106747 shdebug: _sh_propagate(): prefetch level 1 guest f3015067 shadow 00000000f0515037
106748 shdebug: _sh_propagate(): prefetch level 1 guest f3016067 shadow 00000000f0516037
106749 shdebug: _sh_propagate(): prefetch level 1 guest f3017067 shadow 00000000f0517037
106750 shdebug: _sh_propagate(): prefetch level 1 guest f3018067 shadow 00000000f0518037
106751 shdebug: _sh_propagate(): prefetch level 1 guest f3019067 shadow 00000000f0519037
106752 shdebug: _sh_propagate(): prefetch level 1 guest f301a067 shadow 00000000f051a037
106753 shdebug: _sh_propagate(): prefetch level 1 guest f301b067 shadow 00000000f051b037
106754 shdebug: _sh_propagate(): prefetch level 1 guest f301c067 shadow 00000000f051c037
106755 shdebug: _sh_propagate(): prefetch level 1 guest f301d067 shadow 00000000f051d037
106756 shdebug: _sh_propagate(): prefetch level 1 guest f301e067 shadow 00000000f051e037
106757 shdebug: _sh_propagate(): prefetch level 1 guest f301f067 shadow 00000000f051f037
106758 sh: sh_page_fault__guest_2(): 3285: N
106759 sh: sh_page_fault__guest_2(): 3310: O
106760 sh: sh_page_fault__guest_2(): 3319: P
106761 sh: sh_page_fault__guest_2(): 3332: Q
106762 sh: sh_page_fault__guest_2(): 3343: goto emulate;
106763 sh: sh_page_fault__guest_2(): 3361: emulate:
106764 sh: sh_page_fault__guest_2(): 3367: R
106765 sh: sh_page_fault__guest_2(): 3390: emulate_readonly:
106766 sh: sh_page_fault__guest_2(): 3403: early_emulation:
106767 sh: sh_page_fault__guest_2(): 3405: S
106768 sh: sh_page_fault__guest_2(): emulate: eip=0x6f6 esp=0x3d264
106769 sh: sh_page_fault__guest_2(): 3446: T
106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 0xf0500
106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
(XEN) ----[ Xen-3.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82c4801c6cfa>] sh_remove_shadows+0x169/0x922
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff8300ded42000   rcx: 000000000000000a
(XEN) rdx: 00000000000003f8   rsi: 0000000000000282   rdi: ffff82c480235c64
(XEN) rbp: ffff830118fe7b58   rsp: ffff830118fe7b28   r8:  0000000000000000
(XEN) r9:  ffff82c480201820   r10: 00000000ffffffff   r11: 0000000000000005
(XEN) r12: ffff82f601e0a000   r13: 00000000000f0500   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 00000001164d3000   cr2: ffff82f601e0a00f
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830118fe7b28:
(XEN)    0000000100000001 0000000000000001 ffff8301164e0000 ffff8300ded42000
(XEN)    0000000000000000 ffff830118fe7f28 ffff830118fe7e18 ffff82c4801d3497
(XEN)    00000000000006f6 0000000000002c98 ffff830118fe7f28 ffff830118fe7f28
(XEN)    ffff82c480265b80 ffff830118fe7f28 ffff830118fe7f28 ffff830118fe7f28
(XEN)    ffff830118fe7f28 ffff830118fe7dc8 ffff830118fe7cc8 ffff830118fe7f28
(XEN)    ffff830118fe7ce0 ffff830118fe7f28 00000000f30000d8 ffff8301164e0eb0
(XEN)    ffff8301164e0e18 ffff8301164e0210 ffff818000798000 0000004300000000
(XEN)    0000000000000000 0000000000000f30 00000000000f3000 00000000000f0500
(XEN)    00000000000000d8 ffff8300ded43a38 ffff8140c0003cc0 0000000000000030
(XEN)    ffffffffffffffff 00000020f81dcee8 ffff8180007980f8 000000000000001f
(XEN)    000000000000c39e ffffffffffffffff 00000000000f3000 0000000000118644
(XEN)    ffff818000798100 00000000000f3000 00000000000f3000 00000000000f3000
(XEN)    0000000000000000 0000002018fe7de8 00000000000f0500 0000000000000000
(XEN)    00000000f051f037 ffff8300ded42000 00000000001164cf 0000000500000005
(XEN)    ffff830118fe7f28 0000002000000020 ffff830118fe0000 ffff000000d880c7
(XEN)    e405f608408bffff b3ff1275010003c2 1f9335680000011c 00000000000006f6
(XEN)    0c9f000800000007 07ec2ce0ffffffff 0c93001000000000 07ec2ce0ffffffff
(XEN)    0c93001000000000 07ec2ce0ffffffff 0000c09300000000 0000000000000005
(XEN)    ffff830118fe7de8 ffff830118fe7f28 000000000000000a ffff82c48023d5c0
(XEN)    ffff830118fe7dc8 fffffffffffffffe ffff8300ded42000 ffff830118fe7de8
(XEN) Xen call trace:
(XEN)    [<ffff82c4801c6cfa>] sh_remove_shadows+0x169/0x922
(XEN)    [<ffff82c4801d3497>] sh_page_fault__guest_2+0x1f2d/0x23fb
(XEN)    [<ffff82c4801b83cd>] vmx_vmexit_handler+0x716/0x19b4
(XEN)    
(XEN) Pagetable walk from ffff82f601e0a00f:
(XEN)  L4[0x105] = 00000000decfa027 5555555555555555
(XEN)  L3[0x1d8] = 000000011bffb063 5555555555555555
(XEN)  L2[0x00f] = 0000000000000000 ffffffffffffffff 
(XEN) debugtrace_dump() starting
(XEN) debugtrace_dump() finished
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff82f601e0a00f
(XEN) ****************************************
(XEN) 
(XEN) Manual reset required (''noreboot'' specified)


Tim Deegan

2009-Dec-11 10:15 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

Hi,

I seem to have missed this thread earlier; sorry about that. 

At 01:39 +0000 on 11 Dec (1260495591), Simon Horman wrote:
> * The memory access is to a MMIO region - a control register for the NIC.
>   The access is made by the e1000 gPXE driver.
>
> * Interestingly, if there is a write to the page (or perhaps even anywhere
>   in the entire 20-page MMIO region?) before a read, the problem doesn't
>   occur. So a simple work-around in the gPXE driver is to just write to a
>   register before reading any.
>
> * In sh_page_fault() the call to gfn_to_mfn_guest_dbg(d, gfn, &p2mt) sets
>   p2mt to p2m_mmio_direct, which seems to be correct. I'm a bit
>   stuck in working out what goes wrong from there.

So from the trace below it looks like the shadow fault handler is trying
to emulate an access to a passthrough MMIO address.  That's surprising.
And if it only happens when the first access is a read, that's a bit
suspect too.
> 106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
Oh, but this is a write, according to the error code. 
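
As an aside on reading the err= values in the trace, a minimal sketch of decoding an x86 page-fault error code follows. The bit positions are architectural. PFEC_page_present and PFEC_write_access are the names used in the Xen code quoted in this message; the other PFEC_* names and the pfec_is_* helpers are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural positions). */
#define PFEC_page_present  (1u << 0)  /* fault on a present page */
#define PFEC_write_access  (1u << 1)  /* access was a write */
#define PFEC_user_mode     (1u << 2)  /* access came from user mode */
#define PFEC_reserved_bit  (1u << 3)  /* reserved bit set in a PTE */
#define PFEC_insn_fetch    (1u << 4)  /* access was an instruction fetch */

/* Hypothetical helpers: test individual bits of an error code. */
static bool pfec_is_write(uint32_t ec)
{
    return (ec & PFEC_write_access) != 0;
}

static bool pfec_is_present(uint32_t ec)
{
    return (ec & PFEC_page_present) != 0;
}
```

For err=2, pfec_is_write() is true and pfec_is_present() is false. For the err=15 entries earlier in the trace, the present, write, user-mode and reserved-bit flags are all set.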
> 106707 sh: sh_page_fault__guest_2(): 2954: A
> 106708 sh: sh_page_fault__guest_2(): 2998: B
> 106709 sh: sh_page_fault__guest_2(): 3077: page_fault_slow_path:
> 106710 sh: sh_page_fault__guest_2(): 3081: C
> 106711 sh: sh_page_fault__guest_2(): 3095: rewalk:
> 106712 sh: sh_page_fault__guest_2(): 3097: D
> 106713 sh: sh_page_fault__guest_2(): 3106: E
> 106714 sh: sh_page_fault__guest_2(): 3122: F
> (XEN) _gfn_to_mfn_type_dbg: current
> 106715 sh: sh_page_fault__guest_2(): 3141: gfn_to_mfn_guest_dbg: p2mt=5
> 106716 sh: sh_page_fault__guest_2(): 3143: G
> 106717 sh: sh_page_fault__guest_2(): 3181: H
> 106718 sh: sh_page_fault__guest_2(): 3193: I
> 106719 sh: sh_page_fault__guest_2(): 3204: J
> 106720 sh: sh_page_fault__guest_2(): 3216: K
> 106721 shdebug: make_fl1_shadow(): (f3000)=>118644
FL1 shadow means we're building a shadow L1 that's equivalent to a
single PSE guest entry, pointing at guest frames f3000 to f31ff.
> 106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
> 106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 0000000118644067
> 106724 sh: sh_page_fault__guest_2(): 3241: L
> 106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 00000000f0500037
> 106726 sh: sh_page_fault__guest_2(): 3263: M
> 106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 00000000f0501037
> 106728 shdebug: _sh_propagate(): prefetch level 1 guest f3002067 shadow 00000000f0502037
[...] and the shadow PTE flags look OK: A, PCD, U/S, R/W, P.
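
The flag reading above can be reproduced with a small sketch that splits a page-table entry into its frame number and low flag bits. The bit positions are the architectural x86 ones; the PTE_* macro names and the pte_flags/pte_frame helpers are hypothetical and not Xen identifiers.

```c
#include <stdint.h>

/* Low flag bits of an x86 page-table entry (architectural positions). */
#define PTE_P    (1u << 0)  /* present */
#define PTE_RW   (1u << 1)  /* writable */
#define PTE_US   (1u << 2)  /* user accessible */
#define PTE_PWT  (1u << 3)  /* write-through */
#define PTE_PCD  (1u << 4)  /* cache disable */
#define PTE_A    (1u << 5)  /* accessed */
#define PTE_D    (1u << 6)  /* dirty */
#define PTE_PSE  (1u << 7)  /* large page, when set in an L2 entry */

/* Flag bits held in the low 12 bits of the entry. */
static uint32_t pte_flags(uint64_t pte)
{
    return (uint32_t)(pte & 0xfff);
}

/* Frame number held in bits 12..51 of the entry. */
static uint64_t pte_frame(uint64_t pte)
{
    return (pte & UINT64_C(0x000ffffffffff000)) >> 12;
}
```

For the shadow entry 00000000f0500037 this gives frame 0xf0500 with flags A, PCD, U/S, R/W and P, and for the level 2 guest entry f30000e7 the PSE bit is set, consistent with the FL1 shadow being built for a superpage mapping.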
> 106758 sh: sh_page_fault__guest_2(): 3285: N
> 106759 sh: sh_page_fault__guest_2(): 3310: O
> 106760 sh: sh_page_fault__guest_2(): 3319: P
> 106761 sh: sh_page_fault__guest_2(): 3332: Q
> 106762 sh: sh_page_fault__guest_2(): 3343: goto emulate;
That's the interesting one.  I guess your line numbers are different
because of the extra printout but it must be this:

    if ( is_hvm_domain(d)
         && unlikely(!hvm_wp_enabled(v))
         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
    {
        perfc_incr(shadow_fault_emulate_wp);
        goto emulate;
    }

Hah.  That explains why we're emulating.  The guest has CR0.WP clear,
and this was a write fault that wouldn't have happened on real hardware,
so we need to emulate it because we can't actually disable CR0.WP or the
shadow pagetables stop working altogether.

In fact in this case we don't need to emulate it because retrying would
be good enough, but in the general case we might.
> 106763 sh: sh_page_fault__guest_2(): 3361: emulate:
> 106764 sh: sh_page_fault__guest_2(): 3367: R
> 106765 sh: sh_page_fault__guest_2(): 3390: emulate_readonly:
> 106766 sh: sh_page_fault__guest_2(): 3403: early_emulation:
> 106767 sh: sh_page_fault__guest_2(): 3405: S
> 106768 sh: sh_page_fault__guest_2(): emulate: eip=0x6f6 esp=0x3d264
> 106769 sh: sh_page_fault__guest_2(): 3446: T
> 106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 0xf0500
The emulation fails (because the emulator can't/won't map MMIO space)
and we're about to bail out, try unshadowing everything and retry (at
which point everything will just work).  Except:
> 106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
sh_remove_shadows isn't ready to be called with an MMIO MFN.  We used to
get away with this before the m2p was made sparse, but now MMIO holes
mean m2p holes.

Two fixes suggest themselves.  The first is to gate the CR0.WP emulation
path on mfn_valid().  The emulator won''t handle it so it only leads to
confusion and delay if we try:

--- a/xen/arch/x86/mm/shadow/multi.c	Fri Dec 11 09:17:09 2009 +0000
+++ b/xen/arch/x86/mm/shadow/multi.c	Fri Dec 11 10:06:39 2009 +0000
@@ -3305,7 +3305,8 @@
      * fault was a non-user write to a present page.  */
     if ( is_hvm_domain(d) 
          && unlikely(!hvm_wp_enabled(v)) 
-         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
+         && regs->error_code == (PFEC_write_access|PFEC_page_present)
+         && mfn_valid(gmfn) )
     {
         perfc_incr(shadow_fault_emulate_wp);
         goto emulate;


The second is to make sh_remove_shadows() safe against being called with
a bogus MFN.  I'm not quite decided whether the right answer is to put
"if (!mfn_valid(gmfn)) return;" at the top of it (which would stop a
certain class of crashes) or just "ASSERT(mfn_valid(gmfn));" to make it
clearer what's gone wrong.
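
To make the difference between those two options concrete, here is a toy standalone model. Everything in it (MAX_MFN, frame_table, model_mfn_valid and the two remove_shadows_* functions) is hypothetical and only mimics the shape of the problem: a per-frame table with holes, and a function that must not index into a hole. It is not the Xen implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MFN 16u  /* toy table size, purely illustrative */

struct page_model {
    bool present;               /* false models a hole (e.g. MMIO) */
    unsigned int shadow_count;  /* shadows currently attached */
};

static struct page_model frame_table[MAX_MFN];

/* Stand-in for mfn_valid(): in range and backed by a table entry. */
static bool model_mfn_valid(unsigned long mfn)
{
    return mfn < MAX_MFN && frame_table[mfn].present;
}

/* Option 1: quietly do nothing for frames that have no entry. */
static unsigned int remove_shadows_tolerant(unsigned long mfn)
{
    unsigned int removed;

    if (!model_mfn_valid(mfn))
        return 0;
    removed = frame_table[mfn].shadow_count;
    frame_table[mfn].shadow_count = 0;
    return removed;
}

/* Option 2: treat a bad frame as a caller bug and stop right there
 * (in a debug build; the check compiles away under NDEBUG). */
static unsigned int remove_shadows_strict(unsigned long mfn)
{
    unsigned int removed;

    assert(model_mfn_valid(mfn));
    removed = frame_table[mfn].shadow_count;
    frame_table[mfn].shadow_count = 0;
    return removed;
}
```

The tolerant version hides the bad call; the strict version points straight at it, but only helps if callers are fixed as well.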

If neither of those works we may have to make the emulator handle actual
MMIO but that scares me.

I'm away for the next week, but if this is still broken after that I'll
have another go at it.

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Jan Beulich

2009-Dec-11 10:16 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

>>> Simon Horman <horms@verge.net.au> 11.12.09 02:39 >>>
>106705 sh: sh_page_fault__guest_2(): fast path mmio 0x000000000b8f9e
>106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
>...
>106721 shdebug: make_fl1_shadow(): (f3000)=>118644
>106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
>106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 0000000118644067
>106724 sh: sh_page_fault__guest_2(): 3241: L
>106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 00000000f0500037
>106726 sh: sh_page_fault__guest_2(): 3263: M
>106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 00000000f0501037
>...
>106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 0xf0500
>106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500

The thing is that calling sh_remove_shadows(), as it is implemented, is
invalid for mmio pages, as there's no guarantee that those would have
a struct page_info associated (and there never was such a guarantee;
the patches referenced just increased the likelihood that this would
happen).

A possible fix would seem to be to simply do nothing in
sh_remove_shadows() if !mfn_valid(gmfn), but me not really being
knowledgeable in shadow code means that this may be the entirely
wrong thing to do. Tim?

Jan



Simon Horman

2009-Dec-11 11:56 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Fri, Dec 11, 2009 at 10:15:32AM +0000, Tim Deegan wrote:
> Hi,
>
> I seem to have missed this thread earlier; sorry about that.
>
> At 01:39 +0000 on 11 Dec (1260495591), Simon Horman wrote:
> > * The memory access is to a MMIO region - a control register for the NIC.
> >   The access is made by the e1000 gPXE driver.
> >
> > * Interestingly, if there is a write to the page (or perhaps even anywhere
> >   in the entire 20-page MMIO region?) before a read, the problem doesn't
> >   occur. So a simple work-around in the gPXE driver is to just write to a
> >   register before reading any.
> >
> > * In sh_page_fault() the call to gfn_to_mfn_guest_dbg(d, gfn, &p2mt) sets
> >   p2mt to p2m_mmio_direct, which seems to be correct. I'm a bit
> >   stuck in working out what goes wrong from there.
>
> So from the trace below it looks like the shadow fault handler is trying
> to emulate an access to a passthrough MMIO address.  That's surprising.
> And if it only happens when the first access is a read, that's a bit
> suspect too.

Sorry, I wasn't as clear as I might have been. The problem occurs
if the first access is a write. The work-around is to make
the first access a read.

> > 106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
>
> Oh, but this is a write, according to the error code.
> 
> > 106707 sh: sh_page_fault__guest_2(): 2954: A
> > 106708 sh: sh_page_fault__guest_2(): 2998: B
> > 106709 sh: sh_page_fault__guest_2(): 3077: page_fault_slow_path:
> > 106710 sh: sh_page_fault__guest_2(): 3081: C
> > 106711 sh: sh_page_fault__guest_2(): 3095: rewalk:
> > 106712 sh: sh_page_fault__guest_2(): 3097: D
> > 106713 sh: sh_page_fault__guest_2(): 3106: E
> > 106714 sh: sh_page_fault__guest_2(): 3122: F
> > (XEN) _gfn_to_mfn_type_dbg: current
> > 106715 sh: sh_page_fault__guest_2(): 3141: gfn_to_mfn_guest_dbg: p2mt=5
> > 106716 sh: sh_page_fault__guest_2(): 3143: G
> > 106717 sh: sh_page_fault__guest_2(): 3181: H
> > 106718 sh: sh_page_fault__guest_2(): 3193: I
> > 106719 sh: sh_page_fault__guest_2(): 3204: J
> > 106720 sh: sh_page_fault__guest_2(): 3216: K
> > 106721 shdebug: make_fl1_shadow(): (f3000)=>118644
> 
> FL1 shadow means we're building a shadow L1 that's equivalent to a
> single PSE guest entry, pointing at guest frames f3000 to f31ff.
>
> > 106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
> > 106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 0000000118644067
> > 106724 sh: sh_page_fault__guest_2(): 3241: L
> > 106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 00000000f0500037
> > 106726 sh: sh_page_fault__guest_2(): 3263: M
> > 106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 00000000f0501037
> > 106728 shdebug: _sh_propagate(): prefetch level 1 guest f3002067 shadow 00000000f0502037
>
> [...] and the shadow PTE flags look OK: A, PCD, U/S, R/W, P.
> 
> > 106758 sh: sh_page_fault__guest_2(): 3285: N
> > 106759 sh: sh_page_fault__guest_2(): 3310: O
> > 106760 sh: sh_page_fault__guest_2(): 3319: P
> > 106761 sh: sh_page_fault__guest_2(): 3332: Q
> > 106762 sh: sh_page_fault__guest_2(): 3343: goto emulate;
> 
> That's the interesting one.  I guess your line numbers are different
> because of the extra printout but it must be this:
>
>     if ( is_hvm_domain(d) 
>          && unlikely(!hvm_wp_enabled(v)) 
>          && regs->error_code == (PFEC_write_access|PFEC_page_present) )
>     {
>         perfc_incr(shadow_fault_emulate_wp);
>         goto emulate;
>     }

Yes, it is that.

> Hah.  That explains why we're emulating.  The guest has CR0.WP clear,
> and this was a write fault that wouldn't have happened on real hardware,
> so we need to emulate it because we can't actually disable CR0.WP or the
> shadow pagetables stop working altogether.
>
> In fact in this case we don't need to emulate it because retrying would
> be good enough, but in the general case we might.

I'm not sure that I understand why that is true.

> > 106763 sh: sh_page_fault__guest_2(): 3361: emulate:
> > 106764 sh: sh_page_fault__guest_2(): 3367: R
> > 106765 sh: sh_page_fault__guest_2(): 3390: emulate_readonly:
> > 106766 sh: sh_page_fault__guest_2(): 3403: early_emulation:
> > 106767 sh: sh_page_fault__guest_2(): 3405: S
> > 106768 sh: sh_page_fault__guest_2(): emulate: eip=0x6f6 esp=0x3d264
> > 106769 sh: sh_page_fault__guest_2(): 3446: T
> > 106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 0xf0500
>
> The emulation fails (because the emulator can't/won't map MMIO space)
> and we're about to bail out, try unshadowing everything and retry (at
> which point everything will just work).  Except:
> 
> > 106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
> 
> sh_remove_shadows isn't ready to be called with an MMIO MFN.  We used to
> get away with this before the m2p was made sparse, but now MMIO holes
> mean m2p holes.
>
> Two fixes suggest themselves.  The first is to gate the CR0.WP emulation
> path on mfn_valid().  The emulator won't handle it so it only leads to
> confusion and delay if we try:
>
> --- a/xen/arch/x86/mm/shadow/multi.c	Fri Dec 11 09:17:09 2009 +0000
> +++ b/xen/arch/x86/mm/shadow/multi.c	Fri Dec 11 10:06:39 2009 +0000
> @@ -3305,7 +3305,8 @@
>       * fault was a non-user write to a present page.  */
>      if ( is_hvm_domain(d) 
>           && unlikely(!hvm_wp_enabled(v)) 
> -         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> +         && regs->error_code == (PFEC_write_access|PFEC_page_present)
> +         && mfn_valid(gmfn) )
>      {
>          perfc_incr(shadow_fault_emulate_wp);
>          goto emulate;

That resolves the problem that I observed.

> The second is to make sh_remove_shadows() safe against being called with
> a bogus MFN.  I'm not quite decided whether the right answer is to put
> "if (!mfn_valid(gmfn)) return;" at the top of it (which would stop a
> certain class of crashes) or just "ASSERT(mfn_valid(gmfn));" to make it
> clearer what's gone wrong.

Adding "if (!mfn_valid(gmfn)) return;" near the top of sh_remove_shadows()
also resolves the problem that I was seeing.

Thanks (x2) !

Although I'm not really that familiar with the code in question,
I think that I have a preference for both guarding the goto emulate
and adding an ASSERT to sh_remove_shadows(). That is:

Index: xen-unstable.hg/xen/arch/x86/mm/shadow/common.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:31:48.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:48:48.000000000 +0900
@@ -2752,6 +2752,7 @@ void sh_remove_shadows(struct vcpu *v, m
     };
 
     ASSERT(!(all && fast));
+    ASSERT(mfn_valid(gmfn));
 
     /* Although this is an externally visible function, we do not know
      * whether the shadow lock will be held when it is called (since it
Index: xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c
===================================================================
--- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:47:17.000000000 +0900
+++ xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:48:17.000000000 +0900
@@ -3305,7 +3305,8 @@ static int sh_page_fault(struct vcpu *v,
      * fault was a non-user write to a present page.  */
     if ( is_hvm_domain(d) 
          && unlikely(!hvm_wp_enabled(v)) 
-         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
+         && regs->error_code == (PFEC_write_access|PFEC_page_present)
+         && mfn_valid(gmfn) )
     {
         perfc_incr(shadow_fault_emulate_wp);
         goto emulate;



Simon Horman

2009-Dec-11 11:57 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Fri, Dec 11, 2009 at 10:16:36AM +0000, Jan Beulich wrote:
> >>> Simon Horman <horms@verge.net.au> 11.12.09 02:39 >>>
> >106705 sh: sh_page_fault__guest_2(): fast path mmio 0x000000000b8f9e
> >106706 sh: sh_page_fault__guest_2(): d:v=1:0 va=0xf30000d8 err=2, rip=6f6
> >...
> >106721 shdebug: make_fl1_shadow(): (f3000)=>118644
> >106722 sh: set_fl1_shadow_status(): gfn=f3000, type=00000002, smfn=118644
> >106723 shdebug: _sh_propagate(): demand write level 2 guest f30000e7 shadow 0000000118644067
> >106724 sh: sh_page_fault__guest_2(): 3241: L
> >106725 shdebug: _sh_propagate(): demand write level 1 guest f3000067 shadow 00000000f0500037
> >106726 sh: sh_page_fault__guest_2(): 3263: M
> >106727 shdebug: _sh_propagate(): prefetch level 1 guest f3001067 shadow 00000000f0501037
> >...
> >106770 sh: sh_page_fault__guest_2(): emulator failure, unshadowing mfn 0xf0500
> >106771 sh: sh_remove_shadows(): d=1, v=0, gmfn=f0500
>
> The thing is that calling sh_remove_shadows(), as it is implemented, is
> invalid for mmio pages, as there's no guarantee that those would have
> a struct page_info associated (and there never was such a guarantee,
> with the patches referenced the likelihood just increased that this would
> happen).
>
> A possible fix would seem to be to simply do nothing in
> sh_remove_shadows() if !mfn_valid(gmfn), but me not really being
> knowledgeable in shadow code means that this may be the entirely
> wrong thing to do. Tim?

That does seem to work :-)

Tim seems to have a few other ideas too.


Tim Deegan

2009-Dec-11 12:15 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

At 11:56 +0000 on 11 Dec (1260532578), Simon Horman wrote:
> > Hah.  That explains why we're emulating.  The guest has CR0.WP clear,
> > and this was a write fault that wouldn't have happened on real hardware,
> > so we need to emulate it because we can't actually disable CR0.WP or the
> > shadow pagetables stop working altogether.
> >
> > In fact in this case we don't need to emulate it because retrying would
> > be good enough, but in the general case we might.
>
> I'm not sure that I understand why that is true.

Which part?  We need to emulate writes when the guest's CR0.WP is 0 because
- we can't just set CR0.WP=0 or our write-protection of shadowed
  pagetables goes away.
- we can't just use writeable shadow PTEs for read-only guest PTEs
  because (a) other vcpus might have CR0.WP==1 (this does happen
  with virus scanner software on Windows) and (b) you can't
  properly express "read-only in userspace, read-write in kernel".

All we need to do is retry because in this particular case although
WP=0 the PTE is in fact read/write; it was just missing from the
shadows.
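
Put as a decision table, the reasoning above looks roughly like the sketch below. It is a deliberately reduced model: it only covers a supervisor-mode write to a present guest page that is not itself a shadowed pagetable, and the enum, classify_supervisor_write and self_check are names invented for this illustration rather than anything in the shadow code.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_action {
    FAULT_RETRY,    /* fill in the shadow entry and let the guest retry */
    FAULT_EMULATE,  /* guest allows the write, shadows cannot express it */
    FAULT_INJECT    /* genuine protection fault, hand it to the guest */
};

/* Assumes: supervisor write, guest PTE present, target not a pagetable. */
static enum fault_action classify_supervisor_write(bool guest_pte_rw,
                                                   bool guest_cr0_wp)
{
    if (guest_pte_rw)
        return FAULT_RETRY;    /* the mapping was only missing from the shadows */
    if (!guest_cr0_wp)
        return FAULT_EMULATE;  /* read-only PTE, but WP=0 lets the kernel write */
    return FAULT_INJECT;
}

static void self_check(void)
{
    /* The case in this thread: WP=0 but the guest PTE is read/write. */
    assert(classify_supervisor_write(true, false) == FAULT_RETRY);
    /* The general case that needs emulation. */
    assert(classify_supervisor_write(false, false) == FAULT_EMULATE);
    assert(classify_supervisor_write(false, true) == FAULT_INJECT);
}
```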
> Although I'm not really that familiar with the code in question,
> I think that I have a preference for both guarding the goto emulate
> and adding an ASSERT to sh_remove_shadows(). That is:
>
> Index: xen-unstable.hg/xen/arch/x86/mm/shadow/common.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:31:48.000000000 +0900
> +++ xen-unstable.hg/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:48:48.000000000 +0900
> @@ -2752,6 +2752,7 @@ void sh_remove_shadows(struct vcpu *v, m
>      };
>  
>      ASSERT(!(all && fast));
> +    ASSERT(mfn_valid(gmfn));
>  
>      /* Although this is an externally visible function, we do not know
>       * whether the shadow lock will be held when it is called (since it
> Index: xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c
> ===================================================================
> --- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:47:17.000000000 +0900
> +++ xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:48:17.000000000 +0900
> @@ -3305,7 +3305,8 @@ static int sh_page_fault(struct vcpu *v,
>       * fault was a non-user write to a present page.  */
>      if ( is_hvm_domain(d) 
>           && unlikely(!hvm_wp_enabled(v)) 
> -         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> +         && regs->error_code == (PFEC_write_access|PFEC_page_present)
> +         && mfn_valid(gmfn) )
>      {
>          perfc_incr(shadow_fault_emulate_wp);
>          goto emulate;
>

Yes, that seems good to me.

Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>

Cheers,

Tim

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]


Simon Horman

2009-Dec-13 23:19 UTC

Re: [Xen-devel] Re: Possible regression in "x86-64: reduce range spanned by 1:1 mapping and frame table indexes"

On Fri, Dec 11, 2009 at 12:15:06PM +0000, Tim Deegan wrote:
> At 11:56 +0000 on 11 Dec (1260532578), Simon Horman wrote:
> > > Hah.  That explains why we're emulating.  The guest has CR0.WP clear,
> > > and this was a write fault that wouldn't have happened on real hardware,
> > > so we need to emulate it because we can't actually disable CR0.WP or the
> > > shadow pagetables stop working altogether.
> > >
> > > In fact in this case we don't need to emulate it because retrying would
> > > be good enough, but in the general case we might.
> >
> > I'm not sure that I understand why that is true.
>
> Which part?  We need to emulate writes when the guest's CR0.WP is 0 because
> - we can't just set CR0.WP=0 or our write-protection of shadowed
>   pagetables goes away.
> - we can't just use writeable shadow PTEs for read-only guest PTEs
>   because (a) other vcpus might have CR0.WP==1 (this does happen
>   with virus scanner software on Windows) and (b) you can't
>   properly express "read-only in userspace, read-write in kernel".
>
> All we need to do is retry because in this particular case although
> WP=0 the PTE is in fact read/write; it was just missing from the
> shadows.
>
> > Although I'm not really that familiar with the code in question,
> > I think that I have a preference for both guarding the goto emulate
> > and adding an ASSERT to sh_remove_shadows(). That is:
> >
> > Index: xen-unstable.hg/xen/arch/x86/mm/shadow/common.c
> > ===================================================================
> > --- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:31:48.000000000 +0900
> > +++ xen-unstable.hg/xen/arch/x86/mm/shadow/common.c	2009-12-11 20:48:48.000000000 +0900
> > @@ -2752,6 +2752,7 @@ void sh_remove_shadows(struct vcpu *v, m
> >      };
> >  
> >      ASSERT(!(all && fast));
> > +    ASSERT(mfn_valid(gmfn));
> >  
> >      /* Although this is an externally visible function, we do not know
> >       * whether the shadow lock will be held when it is called (since it
> > Index: xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c
> > ===================================================================
> > --- xen-unstable.hg.orig/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:47:17.000000000 +0900
> > +++ xen-unstable.hg/xen/arch/x86/mm/shadow/multi.c	2009-12-11 20:48:17.000000000 +0900
> > @@ -3305,7 +3305,8 @@ static int sh_page_fault(struct vcpu *v,
> >       * fault was a non-user write to a present page.  */
> >      if ( is_hvm_domain(d) 
> >           && unlikely(!hvm_wp_enabled(v)) 
> > -         && regs->error_code == (PFEC_write_access|PFEC_page_present) )
> > +         && regs->error_code == (PFEC_write_access|PFEC_page_present)
> > +         && mfn_valid(gmfn) )
> >      {
> >          perfc_incr(shadow_fault_emulate_wp);
> >          goto emulate;
> >

Oops, forgot

Signed-off-by: Simon Horman <horms@verge.net.au>
> Yes, that seems good to me.
> 
> Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>

