Hi,

I'm trying to get to the bottom of a SLES11 guest crash (stack trace below). This is an early page fault on a 64-bit SLES11 SP0 guest, and only appears in debug builds of Xen.

An automated git bisect run in xen.git pointed at the following commit. I realise that this is a debug commit, and is therefore likely not the culprit, but it probably exposes the real bug, which was hiding silently before.

commit f1bde87fc08ce8c818a1640a8fe4765d48923091
Author: Jan Beulich <jbeulich@suse.com>
Date:   Fri Feb 8 11:06:04 2013 +0100

    x86: debugging code for testing 16Tb support on smaller memory systems

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

I wondered if anyone had advice on the likely cause, or how to go about debugging from here? In particular, are there any bits of code that are likely culprits, or commits that might be candidates? One option is to rerun the bisect, and apply this debug patch on top of each point... I'll investigate that in the absence of alternatives.

Andrew.
(XEN) Pagetable walk from ffff8800027da000:
(XEN) L4[0x110] = 000000022ca5a067 00000000000015d0
(XEN) L3[0x000] = 000000022ca59067 00000000000015d1
(XEN) L2[0x013] = 000000022ca45067 00000000000015e5
(XEN) L1[0x1da] = 801000022b850065 00000000000027da
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 8 (vcpu#0) crashed on cpu#7:
(XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 7
(XEN) RIP: e033:[<ffffffff80217b91>]
(XEN) RFLAGS: 0000000000000246 EM: 1 CONTEXT: pv guest
(XEN) rax: ffff880000000000 rbx: ffff8800027da000 rcx: 0000000000000023
(XEN) rdx: 00003ffffffff000 rsi: 00000000027da000 rdi: 00000000deadbeef
(XEN) rbp: ffff8800027d9fd8 rsp: ffffffff80709e60 r8: 0000000000000000
(XEN) r9: 00000000027da000 r10: ffff880001871680 r11: 0000000000000100
(XEN) r12: ffff880000000000 r13: 00003ffffffff000 r14: ffffffffff600000
(XEN) r15: 0000000229523065 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 000000022d329000 cr2: ffff8800027da000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
(XEN) Guest stack trace from rsp=ffffffff80709e60:
(XEN) 0000000000000023 0000000000000100 0000000000000003 ffffffff80217b91
(XEN) 000000010000e030 0000000000010046 ffffffff80709ea8 000000000000e02b
(XEN) ffffffff80217988 0000000000b90000 0000000229523065 ffffffffff600000
(XEN) 000000000082e000 0000000000b90000 ffff880000000000 0000000000000000
(XEN) ffffffff8021c6bf 0000000000040000 0000000040000000 00000000013be000
(XEN) ffffffff80714233 0000000000000000 0000000000000000 0000000000000000
(XEN) ffffffff80463720 ffffffff00000008 ffffffff80709f88 ffffffff80709f48
(XEN) ffffffff8071d80f 0000000000000002 0000000000000000 0000000000000000
(XEN) ffffffffffffffff ffffffff80733eb0 0000000000000000 0000000000000000
(XEN) ffffffff807106ff ffffffff7fffffff ffffffff807100d2 ffffffff807362e0
(XEN) ffffffff815be000 0000000000000000 0000000000000000 0000000000000000
(XEN) ffffffff807101b0 ffff800000000000 ffff804000000000 00000007ffffffff
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) ffffffff807102d2 ffffffff807106b4 ffffffff807106bc ffffffff807106c4
(XEN) ffffffff8020906a ffffffff8020911a ffffffff80209e34 ffffffff80209ed9
(XEN) ffffffff80209fba ffffffff80209fcd ffffffff80209fd4 ffffffff80209ff1
(XEN) ffffffff8020a0c9 ffffffff8045854b ffffffff8020ab35 ffffffff8020acb6
(XEN) ffffffff8020ae8d ffffffff80466182 ffffffff804662a5 ffffffff804663e9
(XEN) ffffffff80466409 ffffffff8020d9e1 ffffffff8020da1e ffffffff8020da73
(XEN) ffffffff8020dcbf ffffffff8020e670 ffffffff80713fb0 ffffffff80713fb8
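As an aid to reading the pagetable walk in the trace, here is a minimal Python sketch (not part of the original report) that splits the faulting address from cr2 into its four page-table indices and decodes a few architectural flag bits of the L1 entry. The helper names `pt_indices` and `pte_flags` are made up for illustration; the shifts and bit positions assume the standard x86-64 4-level paging layout.

```python
# Split an x86-64 virtual address into 4-level page-table indices and
# decode a few architectural flag bits of a page-table entry.

def pt_indices(va):
    """Return the (L4, L3, L2, L1) indices for a 48-bit virtual address."""
    return tuple((va >> shift) & 0x1FF for shift in (39, 30, 21, 12))

# Subset of the architectural flag bits in an x86-64 PTE (bit -> name).
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    5: "accessed",
    6: "dirty",
    63: "no-execute",
}

def pte_flags(entry):
    """Return the names of the flags from PTE_FLAGS that are set in entry."""
    return [name for bit, name in PTE_FLAGS.items() if entry & (1 << bit)]

fault_va = 0xFFFF8800027DA000   # cr2 from the trace
l1_entry = 0x801000022B850065   # L1[0x1da] from the pagetable walk

print([hex(i) for i in pt_indices(fault_va)])  # ['0x110', '0x0', '0x13', '0x1da']
print(pte_flags(l1_entry))  # ['present', 'user', 'accessed', 'dirty', 'no-execute']
```

The indices match the ones Xen printed (0x110, 0x000, 0x013, 0x1da). The L1 entry is present but does not have the writable bit set, which would be consistent with a write to a read-only mapping; that reading is an inference from the flag bits, not something confirmed in the thread.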
Not by any chance running a 3.12 merge window kernel?

Friday, September 13, 2013, 12:49:42 PM, you wrote:

> Hi,

> I'm trying to get to the bottom of a SLES11 guest crash (stack trace
> below). This is an early page fault on a 64-bit SLES11 SP0 guest, and
> only appears in debug builds of Xen.
On 13/09/13 12:14, Sander Eikelenboom wrote:
> Not by any chance running a 3.12 merge window kernel ?

Nope, the regular SLES11 SP0 2.6.27 kernel.
On 13/09/2013 12:34, Andrew Bennieston wrote:
> On 13/09/13 12:14, Sander Eikelenboom wrote:
>> Not by any chance running a 3.12 merge window kernel ?
> Nope, the regular SLES11 SP0 2.6.27 kernel.

Some things should be pointed out.

This crash is because the guest takes a pagefault before setting up a pagefault handler.

9 times in 10, the guest boots perfectly well.

This crash is *only* visible with a debug build of Xen, and is a regression from 4.1.

The point of this query is to find out whether there is a bug in the identified Xen changeset, or whether it simply exposed some other bug.

As things currently stand, there is some regression causing crashes for older kernels, and we (XenServer) would like to understand what the regression is and whether it affects more guests than just SLES11 SP0 (irrespective of the fact that people using SLES11 SP0 should really upgrade).

~Andrew
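Because the guest boots cleanly about 9 times in 10, a single successful boot says little about whether a given commit is good during a bisect. The following is a rough Python sketch of that arithmetic, assuming each boot is independent and that a bad build crashes with a fixed probability of 0.10; both assumptions and the function name `boots_needed` are illustrative and do not come from the thread.

```python
import math

def boots_needed(crash_rate, confidence):
    """Clean boots needed before calling a build good at the given confidence.

    A bad build survives n boots with probability (1 - crash_rate) ** n,
    so we need the smallest n with (1 - crash_rate) ** n <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - crash_rate))

print(boots_needed(0.10, 0.95))  # 29
```

Under those assumptions, a bisect step would need about 29 clean boots before marking a commit good with 95% confidence, which is worth keeping in mind when judging how far to trust an automated bisect result for an intermittent crash.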
>>> On 13.09.13 at 12:49, Andrew Bennieston <andrew.bennieston@citrix.com> wrote:
> I'm trying to get to the bottom of a SLES11 guest crash (stack trace
> below). This is an early page fault on a 64-bit SLES11 SP0 guest, and
> only appears in debug builds of Xen.
>
> An automated git bisect run in xen.git pointed at the following commit.
> I realise that this is a debug commit, and is therefore likely not the
> culprit, but it probably exposes the real bug, which was hiding silently
> before.
>
> commit f1bde87fc08ce8c818a1640a8fe4765d48923091
> Author: Jan Beulich <jbeulich@suse.com>
> Date: Fri Feb 8 11:06:04 2013 +0100
>
> x86: debugging code for testing 16Tb support on smaller memory systems
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Keir Fraser <keir@xen.org>

This by itself indeed is rather unlikely to have introduced the bug. However, I'd also have thought that it's unlikely for it to introduce a guest side crash.

> I wondered if anyone had advice on the likely cause, or how to go about
> debugging from here? In particular, are there any bits of code that are
> likely culprits, or commits that might be candidates? One option is to
> rerun the bisect, and apply this debug patch on top of each point...
> I'll investigate that in the absence of alternatives.

There isn't that big of a window, since the patch here matters only after the non-trivial map_domain_page() et al got re-introduced.

> (XEN) Pagetable walk from ffff8800027da000:
> (XEN) L4[0x110] = 000000022ca5a067 00000000000015d0
> (XEN) L3[0x000] = 000000022ca59067 00000000000015d1
> (XEN) L2[0x013] = 000000022ca45067 00000000000015e5
> (XEN) L1[0x1da] = 801000022b850065 00000000000027da
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 8 (vcpu#0) crashed on cpu#7:
> (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Not tainted ]----
> (XEN) CPU: 7
> (XEN) RIP: e033:[<ffffffff80217b91>]

Did you try taking this apart, to understand what it is in the guest that crashes?

Jan
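One possible first step in taking the trace apart is to pick out the stack words that look like guest kernel text addresses, so they can be looked up in the guest kernel's System.map. The Python sketch below is only an illustration: `TEXT_START` and `TEXT_END` are placeholder bounds that are not taken from the thread, and the function name is hypothetical. The real bounds would come from the symbols of the SLES11 SP0 kernel in question.

```python
# Pick out stack words that look like guest kernel text addresses.
# TEXT_START and TEXT_END are assumed placeholder bounds, not values from
# the thread; take the real ones from the guest kernel's System.map.
TEXT_START = 0xFFFFFFFF80200000
TEXT_END = 0xFFFFFFFF80700000

def candidate_return_addresses(stack_lines):
    """Yield stack words within [TEXT_START, TEXT_END), in dump order."""
    for line in stack_lines:
        for word in line.replace("(XEN)", "").split():
            value = int(word, 16)
            if TEXT_START <= value < TEXT_END:
                yield value

# First three lines of the guest stack dump from the report.
stack = [
    "(XEN) 0000000000000023 0000000000000100 0000000000000003 ffffffff80217b91",
    "(XEN) 000000010000e030 0000000000010046 ffffffff80709ea8 000000000000e02b",
    "(XEN) ffffffff80217988 0000000000b90000 0000000229523065 ffffffffff600000",
]

for addr in candidate_return_addresses(stack):
    print(hex(addr))
```

With the placeholder bounds this prints 0xffffffff80217b91 (the faulting RIP, saved on the stack) and 0xffffffff80217988. Words further down the dump, such as ffffffff80714233, fall outside the assumed range and may still be code, so the bounds need adjusting against the actual symbol table before drawing conclusions.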
>>> On 13.09.13 at 13:50, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 13/09/2013 12:34, Andrew Bennieston wrote:
>> On 13/09/13 12:14, Sander Eikelenboom wrote:
>>> Not by any chance running a 3.12 merge window kernel ?
>> Nope, the regular SLES11 SP0 2.6.27 kernel.
>
> Some things should be pointed out.
>
> This crash is because the guest takes a pagefault before setting up a
> pagefault handler.
>
> 9 times in 10, the guest boots perfectly well.

Which suggests e.g. an issue with preemption handling. When it does crash, is the crash signature always the same?

> This crash is *only* visible with a debug build of Xen, and is a
> regression from 4.1
>
> The point of this query is to find out whether there is a bug in the
> identified Xen changeset, or whether it simply exposed some other bug.

Very likely the latter.

Jan
On Fri, Sep 13, 2013 at 12:50 PM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 13/09/2013 12:34, Andrew Bennieston wrote:
>> On 13/09/13 12:14, Sander Eikelenboom wrote:
>>> Not by any chance running a 3.12 merge window kernel ?
>> Nope, the regular SLES11 SP0 2.6.27 kernel.
>
> Some things should be pointed out.
>
> This crash is because the guest takes a pagefault before setting up a
> pagefault handler.
>
> 9 times in 10, the guest boots perfectly well.
>
> This crash is *only* visible with a debug build of Xen, and is a
> regression from 4.1
>
> The point of this query is to find out whether there is a bug in the
> identified Xen changeset, or whether it simply exposed some other bug.

AIUI, the purpose of this changeset was to expose, on <5TiB machines, bugs that would normally only happen in 5<x<16TiB machines. So it is likely to be the latter.

-George