Lars Kurth
2013-Nov-04 19:54 UTC
Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
See http://xenproject.org/help/questions-and-answers/hypervisor-fatal-page-fault-xen-4-3-1.html
---
I have a 32 core system running XEN 4.3.1 with 30 Windows XP VMs. DOM0 is Centos 6.3 based with linux kernel 3.10.16. In my configuration all of the windows HVMs are running having been restored from xl save. VMs are destroyed or restored in an on-demand fashion. After some time XEN will experience a fatal page fault while restoring one of the windows HVM subjects. This does not happen very often, perhaps once in a 16 to 48 hour period. The stack trace from xen follows. Thanks in advance for any help.

(XEN) ----[ Xen-4.3.1 x86_64 debug=n Tainted: C ]----
(XEN) CPU: 52
(XEN) RIP: e008:[] domain_page_map_to_mfn+0x86/0xc0
(XEN) RFLAGS: 0000000000010246 CONTEXT: hypervisor
(XEN) rax: 000ffffffffff000 rbx: ffff8300bb163760 rcx: 0000000000000000
(XEN) rdx: ffff810000000000 rsi: 0000000000000000 rdi: 0000000000000000
(XEN) rbp: ffff8300bb163000 rsp: ffff8310333e7cd8 r8: 0000000000000000
(XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
(XEN) r12: ffff8310333e7f18 r13: 0000000000000000 r14: 0000000000000000
(XEN) r15: 0000000000000000 cr0: 0000000080050033 cr4: 00000000000426f0
(XEN) cr3: 000000211bee5000 cr2: ffff810000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) Xen stack trace from rsp=ffff8310333e7cd8:
(XEN) 0000000000000001 ffff82c4c01de869 ffff82c4c0182c70 ffff8300bb163000
(XEN) 0000000000000014 ffff8310333e7f18 0000000000000000 ffff82c4c01d7548
(XEN) ffff8300bb163490 ffff8300bb163000 ffff82c4c01c65b8 ffff8310333e7e60
(XEN) ffff82c4c01badef ffff8300bb163000 0000000000000003 ffff833144d8e000
(XEN) ffff82c4c01b4885 ffff8300bb163000 ffff8300bb163000 ffff8300bdff1000
(XEN) 0000000000000001 ffff82c4c02f2880 ffff82c4c02f2880 ffff82c4c0308440
(XEN) ffff82c4c01d0ea8 ffff8300bb163000 ffff82c4c015ad6c ffff82c4c02f2880
(XEN) ffff82c4c02cf800 00000000ffffffff ffff8310333f5060 ffff82c4c02f2880
(XEN) 0000000000000282 0010000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 ffff82c4c02f2880 ffff8300bdff1000 ffff8300bb163000
(XEN) 000031a10f2b16ca 0000000000000001 ffff82c4c02f2880 ffff82c4c0308440
(XEN) ffff82c4c0124444 0000000000000034 ffff8310333f5060 0000000001c9c380
(XEN) 00000000c0155965 ffff82c4c01c6146 0000000001c9c380 ffffffffffffff00
(XEN) ffff82c4c0128fa8 ffff8300bb163000 ffff8327d50e9000 ffff82c4c01bc490
(XEN) 0000000000000000 ffff82c4c01dd254 0000000080549ae0 ffff82c4c01cfc3c
(XEN) ffff8300bb163000 ffff82c4c01d6128 ffff82c4c0125db9 ffff82c4c0125db9
(XEN) ffff8310333e0000 ffff8300bb163000 000000000012ffc0 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 ffff82c4c01deaa3
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 000000000012ffc0 000000007ffdf000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN) [] domain_page_map_to_mfn+0x86/0xc0
(XEN) [] nvmx_handle_vmlaunch+0x49/0x160
(XEN) [] __update_vcpu_system_time+0x240/0x310
(XEN) [] vmx_vmexit_handler+0xb58/0x18c0
(XEN) [] pt_restore_timer+0xa8/0xc0
(XEN) [] hvm_io_assist+0xef/0x120
(XEN) [] hvm_do_resume+0x195/0x1c0
(XEN) [] vmx_do_resume+0x148/0x210
(XEN) [] context_switch+0x1bc/0xfc0
(XEN) [] schedule+0x254/0x5f0
(XEN) [] pt_update_irq+0x256/0x2b0
(XEN) [] timer_softirq_action+0x168/0x210
(XEN) [] hvm_vcpu_has_pending_irq+0x50/0xb0
(XEN) [] nvmx_switch_guest+0x54/0x1560
(XEN) [] vmx_intr_assist+0x6c/0x490
(XEN) [] vmx_vmenter_helper+0x88/0x160
(XEN) [] __do_softirq+0x69/0xa0
(XEN) [] __do_softirq+0x69/0xa0
(XEN) [] vmx_asm_do_vmentry+0/0xed
(XEN)
(XEN) Pagetable walk from ffff810000000000:
(XEN) L4[0x102] = 000000211bee5063 ffffffffffffffff
(XEN) L3[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 52:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff810000000000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
Andrew Cooper
2013-Nov-04 20:00 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 04/11/13 19:54, Lars Kurth wrote:
> See http://xenproject.org/help/questions-and-answers/hypervisor-fatal-page-fault-xen-4-3-1.html
> ---
> I have a 32 core system running XEN 4.3.1 with 30 Windows XP VMs.
> DOM0 is Centos 6.3 based with linux kernel 3.10.16.
> In my configuration all of the windows HVMs are running having been
> restored from xl save.
> VMs are destroyed or restored in an on-demand fashion. After some
> time XEN will experience a fatal page fault while restoring one of the
> windows HVM subjects. This does not happen very often, perhaps once in
> a 16 to 48 hour period.
> The stack trace from xen follows. Thanks in advance for any help.

Which version of Xen were these images saved on?

Are you expecting to be using nested-virt? (It is still very definitely experimental)

~Andrew
Ian Campbell
2013-Nov-05 09:53 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Mon, 2013-11-04 at 19:54 +0000, Lars Kurth wrote:
> See
> http://xenproject.org/help/questions-and-answers/hypervisor-fatal-page-fault-xen-4-3-1.html

TBH I think for this kind of thing (i.e. a bug not a user question) the
most appropriate thing to do would be to redirect them to xen-devel
themselves (with a reminder that they do not need to subscribe to post).

This is going to take some back and forth to get to the bottom of and
having you sit in the middle is just silly.

Ian.
Jan Beulich
2013-Nov-05 10:04 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 04.11.13 at 20:54, Lars Kurth <lars.kurth.xen@gmail.com> wrote:
> See
> http://xenproject.org/help/questions-and-answers/hypervisor-fatal-page-fault-xen-4-3-1.html
> ---
> I have a 32 core system running XEN 4.3.1 with 30 Windows XP VMs.
> DOM0 is Centos 6.3 based with linux kernel 3.10.16.
> In my configuration all of the windows HVMs are running having been
> restored from xl save.
> VMs are destroyed or restored in an on-demand fashion. After some time XEN
> will experience a fatal page fault while restoring one of the windows HVM
> subjects. This does not happen very often, perhaps once in a 16 to 48 hour
> period.
> The stack trace from xen follows. Thanks in advance for any help.
>
> (XEN) ----[ Xen-4.3.1 x86_64 debug=n Tainted: C ]----
> (XEN) CPU: 52
> (XEN) RIP: e008:[] domain_page_map_to_mfn+0x86/0xc0

Zapping addresses (here and below in the stack trace) is never
helpful when someone asks for help with a crash. Also, in order
to not just guess, the matching xen-syms or xen.efi should be
made available or pointed to.

> (XEN) Pagetable walk from ffff810000000000:
> (XEN) L4[0x102] = 000000211bee5063 ffffffffffffffff
> (XEN) L3[0x000] = 0000000000000000 ffffffffffffffff

This makes me suspect that domain_page_map_to_mfn() gets a
NULL pointer passed here. As said above, this is only guesswork
at this point, and as Ian already pointed out, directing the
reporter to xen-devel would seem to be the right thing to do
here anyway.

Jan
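To make the suspicion concrete: if the nested-VMX emulation path hands a never-initialised virtual-VMCS pointer to domain_page_map_to_mfn(), the lookup can end up dereferencing an unmapped linear address and Xen takes exactly this kind of fatal page fault. A minimal sketch of the sort of guard being discussed follows; this is an illustration only, not the actual fix, and the nvcpu/nv_vvmcx names are assumed from the nested-vCPU state:

    /* Hypothetical guard, for illustration only - not the real patch. */
    if ( nvcpu->nv_vvmcx == NULL )
        /* No virtual VMCS is mapped for this vCPU, so fail the emulated
         * VMX instruction instead of dereferencing a NULL pointer. */
        return X86EMUL_UNHANDLEABLE;

    mfn = domain_page_map_to_mfn(nvcpu->nv_vvmcx);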
Lars Kurth
2013-Nov-05 15:46 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
Jan, Andrew, Ian,

pulling in Jeff who raised the question. Snippets from misc replies attached. Jeff, please look through these (in particular Jan's answer) and answer any further questions on this thread.

On 05/11/2013 09:53, Ian Campbell wrote:
> TBH I think for this kind of thing (i.e. a bug not a user question) the most appropriate thing to
> do would be to redirect them to xen-devel themselves (with a reminder that they do not need
> to subscribe to post).
Agreed. Another option is for me to start the thread and pull in the raiser of the thread into it, if it is a bug. Was not sure this was a real bug at first, but it seems it is.

On 04/11/2013 20:00, Andrew Cooper wrote:
> Which version of Xen were these images saved on?
[Jeff] We were careful to regenerate all the images after upgrading to 4.3.1. Also saw the same problem on 4.3.0.

> Are you expecting to be using nested-virt? (It is still very definitely experimental)
[Jeff] Not using nested-virt.

On 05/11/2013 10:04, Jan Beulich wrote:
> Zapping addresses (here and below in the stack trace) is never
> helpful when someone asks for help with a crash. Also, in order
> to not just guess, the matching xen-syms or xen.efi should be
> made available or pointed to.
>
> This makes me suspect that domain_page_map_to_mfn() gets a
> NULL pointer passed here. As said above, this is only guesswork
> at this point, and as Ian already pointed out, directing the
> reporter to xen-devel would seem to be the right thing to do
> here anyway.
<Jeff_Zimmerman@McAfee.com>
2013-Nov-05 21:55 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
Lars,
I understand the mailing list limits attachment size to 512K. Where can I post the xen binary and symbols file?
Jeff

On Nov 5, 2013, at 7:46 AM, Lars Kurth <lars.kurth@xen.org> wrote:

> Jan, Andrew, Ian,
>
> pulling in Jeff who raised the question. Snippets from misc replies attached. Jeff, please look
> through these (in particular Jan's answer) and answer any further questions on this thread.
<Jeff_Zimmerman@McAfee.com>
2013-Nov-05 22:46 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
Asit,
I've attached two files: one is from dmesg | grep microcode, the second is the first processor from /proc/cpuinfo.
Jeff

On Nov 5, 2013, at 2:29 PM, "Mallick, Asit K" <asit.k.mallick@intel.com> wrote:

> Jeff,
> Could you check if you have the latest microcode updates installed on this system? Or, could
> you send me the microcode rev and I can check.
>
> Thanks,
> Asit
Mallick, Asit K
2013-Nov-05 23:17 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
It is running with the latest microcode revision 0x710.

Thanks,
Asit

From: Jeff_Zimmerman@McAfee.com
Date: Tuesday, November 5, 2013 3:46 PM
To: "Mallick, Asit K" <asit.k.mallick@intel.com>
Cc: xen-devel@lists.xenproject.org
Subject: Re: [Xen-devel] Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)

Asit,
I've attached two files: one is from dmesg | grep microcode, the second is the first processor from /proc/cpuinfo.
Jeff
Andrew Cooper
2013-Nov-06 00:23 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 05/11/2013 22:46, Jeff_Zimmerman@McAfee.com wrote:
> Asit,
> I've attached two files: one is from dmesg | grep microcode, the second is the first processor
> from /proc/cpuinfo.
> Jeff
>
> > (XEN) Xen call trace:
> > (XEN) [] domain_page_map_to_mfn+0x86/0xc0
> > (XEN) [] nvmx_handle_vmlaunch+0x49/0x160

As Jan said, the above censoring is almost completely defeating the purpose of trying to help you.

However, while you are not expecting to be using nested-virt, you clearly appear to be from the stack trace, so something is clearly up.

Which toolstack are you using for VMs? What is the configuration for the affected VM?

~Andrew
Ian Campbell
2013-Nov-06 10:05 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Wed, 2013-11-06 at 00:23 +0000, Andrew Cooper wrote:
> Which toolstack are you using for VMs? What is the configuration for
> the affected VM?

And what exact Windows OS? It's not entirely out of the question that a
modern one might try and use VMX for various things if it saw it.

And doesn't McAfee have a Windows product which does things along those
lines? :-)

Ian.
Jan Beulich
2013-Nov-06 14:09 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 05.11.13 at 22:36, <Jeff_Zimmerman@McAfee.com> wrote:
> Attaching the xen binary and symbols file.
> Hopefully they will come through.

Please give the attached patch a try - afaict it should eliminate
the host crash, but I'm pretty certain you'll then see the guest
misbehave. Depending on what other load you place on the
system as a whole, you're either overloading it (i.e. we're
running out of mapping space in the hypervisor) or there's a
mapping leak that - so far at least - I can't spot.

In any event I'd suggest you try running a debug build of the
hypervisor, so that eventual problems can be spotted earlier.

Jan
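For anyone wanting to follow the debug-build suggestion, one common way (assuming the usual "debug ?=" knob in the tree's Config.mk; paths and file names here are just an example) is:

    # Rebuild the hypervisor with debug checks enabled
    make -C xen clean
    make debug=y xen
    # Install the new hypervisor alongside the existing one
    cp xen/xen.gz /boot/xen-4.3.1-debug.gz

A debug=y hypervisor keeps its assertions, so problems such as a mapping leak tend to be caught closer to their origin than the eventual page fault seen above.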
<Jeff_Zimmerman@McAfee.com>
2013-Nov-06 16:05 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
Jan,

I will give your patch a try.
I have to recant my previous statement regarding not using nested-virt.
It seems some of the code that is being executed on the VM contains VMX instructions, and by virtue of running this code in an HVM subject we are effectively using nested-virt.

This raises a question: if this functionality is undesired, can we just disable nested virt by adding
nestedhvm=false to the configuration file? Should the cpuid and cpuid_check settings be changed as well?

Thanks,
Jeff

On Nov 6, 2013, at 6:09 AM, Jan Beulich <JBeulich@suse.com> wrote:

>>>> On 05.11.13 at 22:36, <Jeff_Zimmerman@McAfee.com> wrote:
>> Attaching the xen binary and symbols file.
>> Hopefully they will come through.
>
> Please give the attached patch a try - afaict it should eliminate
> the host crash, but I'm pretty certain you'll then see the guest
> misbehave. Depending on what other load you place on the
> system as a whole, you're either overloading it (i.e. we're
> running out of mapping space in the hypervisor) or there's a
> mapping leak that - so far at least - I can't spot.
>
> In any event I'd suggest you try running a debug build of the
> hypervisor, so that eventual problems can be spotted earlier.
>
> Jan
>
> <nVMX-map-errors.patch>
Jan Beulich
2013-Nov-06 16:16 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 06.11.13 at 17:05, <Jeff_Zimmerman@McAfee.com> wrote:
> This raises a question: if this functionality is undesired, can we just
> disable nested virt by adding
> nestedhvm=false to the configuration file?

Sure. And as that's supposedly the default, just deleting the line
should be fine too.

> Should the cpuid and cpuid_check settings be changed as well?

I don't think so, unless you manually override it to look like VMX
was available.

That said - it would still be nice if you could help us figure out
the bug's origin (and I assume you realize that it would be even
more helpful for us if you did all this on 4.4-unstable).

Jan
Ian Campbell
2013-Nov-06 16:18 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Wed, 2013-11-06 at 16:05 +0000, Jeff_Zimmerman@McAfee.com wrote:
> Jan,
>
> I will give your patch a try.
> I have to recant my previous statement regarding not using nested-virt.
> It seems some of the code that is being executed on the VM contains VMX instructions, and by
> virtue of running this code in an HVM subject we are effectively using nested-virt.
>
> This raises a question: if this functionality is undesired, can we just disable nested virt by adding
> nestedhvm=false to the configuration file? Should the cpuid and
> cpuid_check settings be changed as well?

I'm reasonably certain that nestedhvm=false will clear the relevant
flags in the guest visible cpuid. I'd say it was a bug if this doesn't
happen.

nestedhvm should be disabled by default, did you explicitly enable it?
Removing the line altogether ought to disable it too. Please let us know
if not.

Ian.
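For reference, the setting being discussed lives in the xl domain configuration file. A minimal illustrative HVM config (names and values are made up for the example) with nested HVM explicitly disabled might look like:

    # winxp-example.cfg - illustrative only
    builder   = "hvm"
    name      = "winxp-example"
    memory    = 1024
    vcpus     = 1
    disk      = [ "phy:/dev/vg0/winxp-example,hda,w" ]
    # Nested HVM is expected to be off by default; stating it explicitly
    # documents the intent and should keep VMX from being advertised:
    nestedhvm = 0

With nestedhvm disabled, a VMLAUNCH attempted by software inside the guest should be rejected within the guest rather than reaching the nested-VMX paths in the hypervisor.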
<Jeff_Zimmerman@McAfee.com>
2013-Nov-06 16:48 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Nov 6, 2013, at 8:18 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:

> I'm reasonably certain that nestedhvm=false will clear the relevant
> flags in the guest visible cpuid. I'd say it was a bug if this doesn't
> happen.
>
> nestedhvm should be disabled by default, did you explicitly enable it?
> Removing the line altogether ought to disable it too. Please let us know
> if not.
>
> Ian.

I did not enable nestedhvm, and when I run xl list -l the output shows nestedhvm=<default>.
I was not sure what the default was supposed to be. I will try setting it and re-run our test.

Jeff
Andrew Cooper
2013-Nov-06 16:54 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 06/11/13 16:48, Jeff_Zimmerman@McAfee.com wrote:
> I did not enable nestedhvm, and when I run xl list -l the output shows nestedhvm=<default>.
> I was not sure what the default was supposed to be. I will try setting it and re-run our test.
> Jeff

nested-virt is strictly experimental, and still has known bugs (and
clearly some unknown ones).

I looked over the xl code and thought that nestedhvm should default to
false, but I would prefer someone more familiar with libxl and the idl to
confirm what the default should be.

~Andrew
Ian Campbell
2013-Nov-06 17:06 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
> I looked over the xl code and thought that nestedhvm should default to
> false, but I would prefer someone more familiar with libxl and the idl to
> confirm what the default should be.

libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
in that case. Is there some way to query the hypervisor for what it
thinks the setting is?

Ian.
Andrew Cooper
2013-Nov-06 17:07 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 06/11/13 17:06, Ian Campbell wrote:
> On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
>> I looked over the xl code and thought that nestedhvm should default to
>> false, but I would prefer someone more familiar with libxl and the idl to
>> confirm what the default should be.
> libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
> in that case. Is there some way to query the hypervisor for what it
> thinks the setting is?
>
> Ian.
>

A get hvmparam hypercall will retrieve the value, but it is initialised
to 0 and only ever set by a set hvmparam hypercall.

~Andrew
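For reference, the query Ian asks about can be approximated from dom0 with a small libxc program. This is only a rough sketch against the Xen 4.3-era libxc API (xc_get_hvm_param and HVM_PARAM_NESTEDHVM); the domid used here is just a placeholder:

/* Rough sketch only, assuming the Xen 4.3-era libxc interface; domid 1 is a
 * placeholder.  Reads HVM_PARAM_NESTEDHVM for one HVM guest from dom0. */
#include <stdio.h>
#include <xenctrl.h>
#include <xen/hvm/params.h>

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    unsigned long val = 0;
    int domid = 1;                      /* placeholder guest domid */

    if ( !xch )
        return 1;

    if ( xc_get_hvm_param(xch, domid, HVM_PARAM_NESTEDHVM, &val) == 0 )
        printf("HVM_PARAM_NESTEDHVM for dom%d = %lu\n", domid, val);

    xc_interface_close(xch);
    return 0;
}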
Jan Beulich
2013-Nov-07 09:10 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 06.11.13 at 18:07, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 06/11/13 17:06, Ian Campbell wrote:
>> On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
>>> I looked over the xl code and thought that nestedhvm should default to
>>> false, but I would prefer someone more familiar with libxl and the idl to
>>> confirm what the default should be.
>> libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
>> in that case. Is there some way to query the hypervisor for what it
>> thinks the setting is?
>
> A get hvmparam hypercall will retrieve the value, but it is initialised
> to 0 and only ever set by a set hvmparam hypercall.

Which makes me start suspecting that the guest might be deriving
its information on VMX being available from something other than
CPUID. Of course we ought to confirm that we don't unintentionally
return the VMX flag set (and that the config file doesn't override it
in this way - I think we shouldn't be suppressing user overrides
here, but I didn't go check whether we do).

Jan
Ian Campbell
2013-Nov-07 09:30 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Thu, 2013-11-07 at 09:10 +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 18:07, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > On 06/11/13 17:06, Ian Campbell wrote:
> >> On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
> >>> I looked over the xl code and thought that nestedhvm should default to
> >>> false, but I would prefer someone more familiar with libxl and the idl to
> >>> confirm what the default should be.
> >> libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
> >> in that case. Is there some way to query the hypervisor for what it
> >> thinks the setting is?
> >
> > A get hvmparam hypercall will retrieve the value, but it is initialised
> > to 0 and only ever set by a set hvmparam hypercall.
>
> Which makes me start suspecting that the guest might be deriving
> its information on VMX being available from something other than
> CPUID. Of course we ought to confirm that we don't unintentionally
> return the VMX flag set (and that the config file doesn't override it
> in this way - I think we shouldn't be suppressing user overrides
> here, but I didn't go check whether we do).

I was also wondering about the behaviour of using vmx instructions in a
guest despite vmx not being visible in cpuid...

Ian.
<Jeff_Zimmerman@McAfee.com>
2013-Nov-07 15:41 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-11-07 at 09:10 +0000, Jan Beulich wrote:
>>>>> On 06.11.13 at 18:07, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 06/11/13 17:06, Ian Campbell wrote:
>>>> On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
>>>>> I looked over the xl code and thought that nestedhvm should default to
>>>>> false, but I would prefer someone more familiar with libxl and the idl to
>>>>> confirm what the default should be.
>>>> libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
>>>> in that case. Is there some way to query the hypervisor for what it
>>>> thinks the setting is?
>>>
>>> A get hvmparam hypercall will retrieve the value, but it is initialised
>>> to 0 and only ever set by a set hvmparam hypercall.
>>
>> Which makes me start suspecting that the guest might be deriving
>> its information on VMX being available from something other than
>> CPUID. Of course we ought to confirm that we don't unintentionally
>> return the VMX flag set (and that the config file doesn't override it
>> in this way - I think we shouldn't be suppressing user overrides
>> here, but I didn't go check whether we do).
>
> I was also wondering about the behaviour of using vmx instructions in a
> guest despite vmx not being visible in cpuid...
>
> Ian.
>

We have found in our situation this is exactly the case. To verify, we wrote some
test code that makes vmx calls without checking cpuid. On bare hardware the program
executes as expected. In a VM on Xen it causes the hypervisor to panic.

From a security standpoint this is very, very bad. It might be a good idea to provide either
a run-time or build-time option to disable nestedhvm. Just turning off the vmx bit is not enough,
as malicious or badly written code can cause a system crash.

For us it looks like we can disable these instructions and avoid the crash.

Jeff.
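For context, the feature test that well-behaved guest code would perform before issuing any VMX instruction looks roughly like the sketch below. This is only an illustration of the CPUID check (leaf 1, ECX bit 5 is the VMX flag), not the test program described above:

/* Illustrative only: the CPUID test a guest ought to perform before touching
 * any VMX instruction.  CPUID leaf 1, ECX bit 5 is the VMX feature flag. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if ( !__get_cpuid(1, &eax, &ebx, &ecx, &edx) )
        return 1;

    if ( ecx & (1u << 5) )
        printf("CPUID reports VMX - VMXON etc. may be attempted\n");
    else
        printf("CPUID reports no VMX - VMX instructions should raise #UD\n");

    return 0;
}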
Andrew Cooper
2013-Nov-07 15:54 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 07/11/13 15:41, Jeff_Zimmerman@McAfee.com wrote:
> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com>
> wrote:
>
>> On Thu, 2013-11-07 at 09:10 +0000, Jan Beulich wrote:
>>>>>> On 06.11.13 at 18:07, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>>> On 06/11/13 17:06, Ian Campbell wrote:
>>>>> On Wed, 2013-11-06 at 16:54 +0000, Andrew Cooper wrote:
>>>>>> I looked over the xl code and thought that nestedhvm should default to
>>>>>> false, but I would prefer someone more familiar with libxl and the idl to
>>>>>> confirm what the default should be.
>>>>> libxl thinks the default is false and will set HVM_PARAM_NESTEDHVM to 0
>>>>> in that case. Is there some way to query the hypervisor for what it
>>>>> thinks the setting is?
>>>> A get hvmparam hypercall will retrieve the value, but it is initialised
>>>> to 0 and only ever set by a set hvmparam hypercall.
>>> Which makes me start suspecting that the guest might be deriving
>>> its information on VMX being available from something other than
>>> CPUID. Of course we ought to confirm that we don't unintentionally
>>> return the VMX flag set (and that the config file doesn't override it
>>> in this way - I think we shouldn't be suppressing user overrides
>>> here, but I didn't go check whether we do).
>> I was also wondering about the behaviour of using vmx instructions in a
>> guest despite vmx not being visible in cpuid...
>>
>> Ian.
>>
> We have found in our situation this is exactly the case. To verify, we wrote some
> test code that makes vmx calls without checking cpuid. On bare hardware the program
> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>
> From a security standpoint this is very, very bad. It might be a good idea to provide either
> a run-time or build-time option to disable nestedhvm. Just turning off the vmx bit is not enough,
> as malicious or badly written code can cause a system crash.
>
> For us it looks like we can disable these instructions and avoid the crash.
>
> Jeff.

Hmm - that is very concerning. And there does look to be a bug.

Can you try the following patch and see whether it helps?

diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index c9afb56..7b1a349 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -359,7 +359,7 @@ static inline int hvm_event_pending(struct vcpu *v)
 /* These bits in CR4 cannot be set by the guest. */
 #define HVM_CR4_GUEST_RESERVED_BITS(_v) \
     (~((unsigned long) \
-       (X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | \
+       (X86_CR4_PVI | X86_CR4_TSD | \
         X86_CR4_DE | X86_CR4_PSE | X86_CR4_PAE | \
         X86_CR4_MCE | X86_CR4_PGE | X86_CR4_PCE | \
         X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT | \

~Andrew
Jan Beulich
2013-Nov-07 15:57 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> I was also wondering about the behaviour of using vmx instructions in a
>> guest despite vmx not being visible in cpuid...
>>
> We have found in our situation this is exactly the case. To verify, we wrote some
> test code that makes vmx calls without checking cpuid. On bare hardware the program
> executes as expected. In a VM on Xen it causes the hypervisor to panic.

You trying it doesn't yet imply that Windows also does so.

Also, you say "program" - are you using these from user mode code?

> From a security standpoint this is very, very bad. It might be a good idea to provide either
> a run-time or build-time option to disable nestedhvm. Just turning off the vmx bit is not enough,
> as malicious or badly written code can cause a system crash.

Yes, we will absolutely need to do that.

Jan
Jan Beulich
2013-Nov-07 16:00 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 07.11.13 at 16:54, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Can you try the following patch and see whether it helps?
>
> diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
> index c9afb56..7b1a349 100644
> --- a/xen/include/asm-x86/hvm/hvm.h
> +++ b/xen/include/asm-x86/hvm/hvm.h
> @@ -359,7 +359,7 @@ static inline int hvm_event_pending(struct vcpu *v)
>  /* These bits in CR4 cannot be set by the guest. */
>  #define HVM_CR4_GUEST_RESERVED_BITS(_v) \
>      (~((unsigned long) \
> -       (X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | \
> +       (X86_CR4_PVI | X86_CR4_TSD | \

Are you mixing up VME and VMXE perhaps?

Jan
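For readers following the mix-up, the two similarly named bits sit far apart in CR4. The values below are taken from the architecture manuals and are shown only to make the VME/VMXE confusion concrete:

/* CR4 bit positions (from the Intel SDM), for reference. */
#define X86_CR4_VME   0x00000001   /* bit 0:  virtual-8086 mode extensions */
#define X86_CR4_VMXE  0x00002000   /* bit 13: enable VMX operation         */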
<Jeff_Zimmerman@McAfee.com>
2013-Nov-07 16:02 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Nov 7, 2013, at 7:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
>> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> I was also wondering about the behaviour of using vmx instructions in a
>>> guest despite vmx not being visible in cpuid...
>>>
>> We have found in our situation this is exactly the case. To verify, we wrote some
>> test code that makes vmx calls without checking cpuid. On bare hardware the program
>> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>
> You trying it doesn't yet imply that Windows also does so.
>
> Also, you say "program" - are you using these from user mode code?

Yes, from windows run as a privileged user. Windows XP SP3 can cause the crash.
It seems Windows 7 has better security; we cannot crash the system from a Win7 guest.

>
>> From a security standpoint this is very, very bad. It might be a good idea to provide either
>> a run-time or build-time option to disable nestedhvm. Just turning off the vmx bit is not enough,
>> as malicious or badly written code can cause a system crash.
>
> Yes, we will absolutely need to do that.
>
> Jan
>
Andrew Cooper
2013-Nov-07 16:06 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 07/11/13 16:00, Jan Beulich wrote:
>>>> On 07.11.13 at 16:54, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Can you try the following patch and see whether it helps?
>>
>> diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
>> index c9afb56..7b1a349 100644
>> --- a/xen/include/asm-x86/hvm/hvm.h
>> +++ b/xen/include/asm-x86/hvm/hvm.h
>> @@ -359,7 +359,7 @@ static inline int hvm_event_pending(struct vcpu *v)
>>  /* These bits in CR4 cannot be set by the guest. */
>>  #define HVM_CR4_GUEST_RESERVED_BITS(_v) \
>>      (~((unsigned long) \
>> -       (X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | \
>> +       (X86_CR4_PVI | X86_CR4_TSD | \
> Are you mixing up VME and VMXE perhaps?
>
> Jan
>

I am indeed. Apologies for the noise, but I am still quite concerned.

I shall attempt to repro this on a XenRT machine.

Jeff: What system is this on (so I can pick a similar server to try with)?

~Andrew
<Jeff_Zimmerman@McAfee.com>
2013-Nov-07 16:12 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Nov 7, 2013, at 8:06 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 07/11/13 16:00, Jan Beulich wrote:
>>>>> On 07.11.13 at 16:54, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> Can you try the following patch and see whether it helps?
>>>
>>> diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
>>> index c9afb56..7b1a349 100644
>>> --- a/xen/include/asm-x86/hvm/hvm.h
>>> +++ b/xen/include/asm-x86/hvm/hvm.h
>>> @@ -359,7 +359,7 @@ static inline int hvm_event_pending(struct vcpu *v)
>>>  /* These bits in CR4 cannot be set by the guest. */
>>>  #define HVM_CR4_GUEST_RESERVED_BITS(_v) \
>>>      (~((unsigned long) \
>>> -       (X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | \
>>> +       (X86_CR4_PVI | X86_CR4_TSD | \
>> Are you mixing up VME and VMXE perhaps?
>>
>> Jan
>>
>
> I am indeed. Apologies for the noise, but I am still quite concerned.
>
> I shall attempt to repro this on a XenRT machine.
>
> Jeff: What system is this on (so I can pick a similar server to try with)?

It is an Intel S4600LH board.

>
> ~Andrew
Jan Beulich
2013-Nov-07 16:53 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 07.11.13 at 17:02, <Jeff_Zimmerman@McAfee.com> wrote:
> On Nov 7, 2013, at 7:57 AM, Jan Beulich <JBeulich@suse.com>
> wrote:
>
>>>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
>>> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>>> I was also wondering about the behaviour of using vmx instructions in a
>>>> guest despite vmx not being visible in cpuid...
>>>>
>>> We have found in our situation this is exactly the case. To verify, we wrote some
>>> test code that makes vmx calls without checking cpuid. On bare hardware the program
>>> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>>
>> You trying it doesn't yet imply that Windows also does so.
>>
>> Also, you say "program" - are you using these from user mode code?
>
> Yes, from windows run as a privileged user. Windows XP SP3 can cause the crash.
> It seems Windows 7 has better security; we cannot crash the system from a Win7 guest.

Which is sort of odd. Anyway - care to try the attached patch?

Jan
Andrew Cooper
2013-Nov-07 17:02 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 07/11/13 16:53, Jan Beulich wrote:
>>>> On 07.11.13 at 17:02, <Jeff_Zimmerman@McAfee.com> wrote:
>> On Nov 7, 2013, at 7:57 AM, Jan Beulich <JBeulich@suse.com>
>> wrote:
>>
>>>>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
>>>> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>>>> I was also wondering about the behaviour of using vmx instructions in a
>>>>> guest despite vmx not being visible in cpuid...
>>>>>
>>>> We have found in our situation this is exactly the case. To verify, we wrote some
>>>> test code that makes vmx calls without checking cpuid. On bare hardware the program
>>>> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>>> You trying it doesn't yet imply that Windows also does so.
>>>
>>> Also, you say "program" - are you using these from user mode code?
>> Yes, from windows run as a privileged user. Windows XP SP3 can cause the crash.
>> It seems Windows 7 has better security; we cannot crash the system from a Win7 guest.
> Which is sort of odd. Anyway - care to try the attached patch?
>
> Jan
>

While the patch does look plausible, there is still clearly an issue
that an HVM guest with nested_virt disabled can even use the VMX
instructions, rather than getting flat out #UD exceptions.

~Andrew
Andrew Cooper
2013-Nov-07 18:13 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On 07/11/13 16:53, Jan Beulich wrote:
>>>> On 07.11.13 at 17:02, <Jeff_Zimmerman@McAfee.com> wrote:
>> On Nov 7, 2013, at 7:57 AM, Jan Beulich <JBeulich@suse.com>
>> wrote:
>>
>>>>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
>>>> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>>>> I was also wondering about the behaviour of using vmx instructions in a
>>>>> guest despite vmx not being visible in cpuid...
>>>>>
>>>> We have found in our situation this is exactly the case. To verify, we wrote some
>>>> test code that makes vmx calls without checking cpuid. On bare hardware the program
>>>> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>>> You trying it doesn't yet imply that Windows also does so.
>>>
>>> Also, you say "program" - are you using these from user mode code?
>> Yes, from windows run as a privileged user. Windows XP SP3 can cause the crash.
>> It seems Windows 7 has better security; we cannot crash the system from a Win7 guest.
> Which is sort of odd. Anyway - care to try the attached patch?
>
> Jan
>

I have managed to reproduce the issue, and the patch appears to fix things.

I have to admit to being very surprised that the VMX hardware doesn't
check CR4.VMXE before causing a vmexit.

Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
<Jeff_Zimmerman@McAfee.com>
2013-Nov-07 18:33 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
On Nov 7, 2013, at 8:53 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 07.11.13 at 17:02, <Jeff_Zimmerman@McAfee.com> wrote:
>
>> On Nov 7, 2013, at 7:57 AM, Jan Beulich <JBeulich@suse.com>
>> wrote:
>>
>>>>>> On 07.11.13 at 16:41, <Jeff_Zimmerman@McAfee.com> wrote:
>>>> On Nov 7, 2013, at 1:30 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>>>> I was also wondering about the behaviour of using vmx instructions in a
>>>>> guest despite vmx not being visible in cpuid...
>>>>>
>>>> We have found in our situation this is exactly the case. To verify, we wrote some
>>>> test code that makes vmx calls without checking cpuid. On bare hardware the program
>>>> executes as expected. In a VM on Xen it causes the hypervisor to panic.
>>>
>>> You trying it doesn't yet imply that Windows also does so.
>>>
>>> Also, you say "program" - are you using these from user mode code?
>>
>> Yes, from windows run as a privileged user. Windows XP SP3 can cause the crash.
>> It seems Windows 7 has better security; we cannot crash the system from a Win7 guest.
>
> Which is sort of odd. Anyway - care to try the attached patch?
>
> Jan
>
> <xsa75.patch>

Just tried your patch. It seems to mitigate the problem. Thanks!

-jeff
Jan Beulich
2013-Nov-08 07:50 UTC
Re: Intermittent fatal page fault with XEN 4.3.1 (Centos 6.3 DOM0 with linux kernel 3.10.16.)
>>> On 07.11.13 at 18:02, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> While the patch does look plausible, there is still clearly an issue
> that an HVM guest with nested_virt disabled can even use the VMX
> instructions, rather than getting flat out #UD exceptions.

The real CR4.VMXE is (of course) set, and basing a decision on the
read shadow would clearly be wrong from an architectural pov (as
then this would no longer be just a read shadow). And this isn't the
problem here anyway - one problem is that the privilege level check
is done _after_ the VMX non-root mode one. I guess they do it that
way in order to allow the VMM maximum flexibility.

Jan
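Put differently, since the hardware delivers the VMX-instruction vmexit to Xen regardless of what the guest believes about CR4.VMXE, the check has to be made in Xen's own VMX-instruction exit path. The fragment below is only a sketch of that kind of guard, using identifiers as recalled from the Xen 4.3 tree; it is not the literal XSA-75 patch:

/* Illustrative guard only - not the literal XSA-75 fix.  When a plain HVM
 * guest (no nested virt, CR4.VMXE clear from its point of view) executes a
 * VMX instruction, Xen still receives the vmexit and must synthesise the
 * #UD the guest would have seen on bare metal. */
if ( !nestedhvm_enabled(current->domain) ||
     !(current->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VMXE) )
{
    hvm_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
    return X86EMUL_EXCEPTION;
}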