Konrad Rzeszutek Wilk
2011-Mar-16 22:19 UTC
[Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
I am troubleshooting an issue where the Linux kernel tries to dereference a not-present entry. I have a fix for this in for-2.6.32/bug-fixes, but please read on.

Specifically, it tries to dereference the fixmapped value of APIC_BASE. The fixmapped value of APIC_BASE is actually not set due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284, which adds this in arch/x86/kernel/acpi/boot.c:

    static void __init acpi_register_lapic_address(unsigned long address)
    {
            /* Xen dom0 doesn't have usable lapics */
            if (xen_initial_domain())
                    return;

            mp_lapic_addr = address;

            set_fixmap_nocache(FIX_APIC_BASE, address);

Later on we use 'native_apic_read', which tries to use APIC_BASE as an address (it is preset to be at slot FIX_APIC_BASE of the fixmap API), and it fails (on some machines).

Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', this is what we get if we walk the pagetable:

[ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
[ 0.000000] mapped APIC to ffffffffff5fb000 (00000000)
(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffffff5fb020:
(XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
(XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
(XEN)  L2[0x1fa] = 0000000221771067 0000000000001771
(XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-110309 x86_64 debug=y Tainted: C ]----
(XEN) CPU: 0
(XEN) RIP: e033:[<ffffffff8102b5d1>]
(XEN) RFLAGS: 0000000000000292 EM: 1 CONTEXT: pv guest
(XEN) rax: ffffffff8164cf50 rbx: 000000026ec00000 rcx: 00000000ffffdd85
(XEN) rdx: 00000000ffffffff rsi: 0000000000000000 rdi: 0000000000000020
(XEN) rbp: ffffffff81643ea8 rsp: ffffffff81643e50 r8: 0000000000000002
(XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
(XEN) r12: ffff880013671800 r13: 00000000bff66000 r14: ffffffffffffffff
(XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
(XEN) cr3: 0000000221001000 cr2: ffffffffff5fb020
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
(XEN) Guest stack trace from rsp=ffffffff81643e50:

Which is to say that the L1 has this:

0000000115771fa0: 00000000 00000000 00000000 00000000
0000000115771fb0: 00000000 00000000 00000000 00000000
0000000115771fc0: 00000000 00000000 15770067 80100001
0000000115771fd0: 15770067 80100001 00000000 00000000
0000000115771fe0: 00000000 00000000 00000000 00000000
0000000115771ff0: 00000000 00000000 00000000 00000000

L1[0x1fb] is machine address 115771fd8, which has nothing in it.

OK, so I've come up with a fix that is a back-port of how 2.6.38 does it: it removes the check I mentioned above, and in xen_set_fixmap we set FIX_APIC_BASE to actually point to a dummy ioapic_mapping. It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.

Gianni, you might want to check this out in case it fixes the problem you are experiencing.

But one thing I can't understand is why I get this crash on one machine (IBM x3850), while on another one with the same pagetable contents (L1 has nothing for 0x1fb) it works just fine. I added a panic and used the Xen hypervisor kdb to manually inspect the pagetable, and it has the same contents as on the IBM x3850, but it boots fine with this invalid value. Any ideas?

FYI, it seems another user (Sven Sübert), on an IBM x3650, hits the same bug. With this fix he is able to boot.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
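As a cross-check of the pagetable walk quoted above, the table indices can be derived directly from the faulting address. The sketch below is a userspace illustration, not kernel code: the helper names pt_index and pt_entry_offset are made up here, and it assumes standard x86-64 4-level paging with 512 eight-byte entries per table.

```c
#include <stdint.h>

/* Assumed layout: x86-64 4-level paging, 4 KiB pages, 9 index bits
 * per level, 8 bytes per pagetable entry. */
#define PT_PAGE_SHIFT  12
#define PT_INDEX_BITS  9
#define PT_INDEX_MASK  0x1ffu
#define PT_ENTRY_BYTES 8u

/* Index into the table at the given level (1 = L1 ... 4 = L4)
 * for virtual address va. Hypothetical helper for illustration. */
static unsigned int pt_index(uint64_t va, int level)
{
    int shift = PT_PAGE_SHIFT + PT_INDEX_BITS * (level - 1);

    return (unsigned int)((va >> shift) & PT_INDEX_MASK);
}

/* Byte offset of entry `index` inside its 4 KiB table. */
static unsigned int pt_entry_offset(unsigned int index)
{
    return index * PT_ENTRY_BYTES;
}
```

For ffffffffff5fb020 this gives L4 0x1ff, L3 0x1ff, L2 0x1fa and L1 0x1fb, the same indices Xen prints. The L1 entry sits 0xfd8 bytes into its table, which agrees with the low bits of the machine address 115771fd8 quoted in the message.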
Keir Fraser
2011-Mar-16 22:32 UTC
[Xen-devel] Re: L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On 16/03/2011 22:19, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> OK, so I've come up a fix that is a back-port of how 2.6.38 does it
> which is that it removes the check I mentioned above and in xen_set_fixmap
> we set the FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
> It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes
>
> Gianni, you might want to check this out in case it fixes the problem you
> are experiencing.
>
> But one thing I can't understand is why on one machine (IBM x3850)
> I get this crash, while another one with the same pagetable contents
> (L1 has nothing for 0x1fb) it works just fine? I added a panic and used
> the Xen hypervisor kdb to manually inspect the pagetable, and it has
> the same contents as the IBM x3850 -but it boots fine with this invalid value.
> Any ideas?

Could the native_apic_read() come from the ACPI DSDT of that particular machine type (x3850)?

 K.
Jan Beulich
2011-Mar-17 10:25 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
>>> On 16.03.11 at 23:19, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> But one thing I can't understand is why on one machine (IBM x3850)
> I get this crash, while another one with the same pagetable contents
> (L1 has nothing for 0x1fb) it works just fine? I added a panic and used
> the Xen hypervisor kdb to manually inspect the pagetable, and it has
> the same contents as the IBM x3850 -but it boots fine with this invalid
> value.
> Any ideas?

Without seeing the full stack trace it's hard to tell. To me, it looks like a mistake for native_apic_read() to be called at all under Xen, and perhaps there's one lurking somewhere that gets hit only on those IBM (Summit?) machines.

Jan
Konrad Rzeszutek Wilk
2011-Mar-17 15:52 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On Thu, Mar 17, 2011 at 10:25:11AM +0000, Jan Beulich wrote:
> >>> On 16.03.11 at 23:19, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > But one thing I can't understand is why on one machine (IBM x3850)
> > I get this crash, while another one with the same pagetable contents
> > (L1 has nothing for 0x1fb) it works just fine? I added a panic and used
> > the Xen hypervisor kdb to manually inspect the pagetable, and it has
> > the same contents as the IBM x3850 -but it boots fine with this invalid
> > value.
> > Any ideas?
>
> Without seeing the full stack trace it's hard to tell. To me, it looks
> like a mistake for native_apic_read() to be called at all under Xen,
> and perhaps there's one lurking somewhere that gets hit only on
> those IBM (Summit?) machines.

That was it. When we boot up we call 'set_xen_basic_apic_ops', which sets apic->read to xen_apic_read. The default 'apic' is set to apic_flat, so in essence we change apic_flat->read from native_apic_read to xen_apic_read.

During bootup, default_acpi_madt_oem_check is run, which walks through the whole apic_probe[] array, the last entry of which is apic_physflat. And apic_physflat->probe() returns true on this IBM Summit box (and ES7000 boxes, and whatever has FADT set to ACPI_FADT_APIC_PHYSICAL), so apic is now set to apic_physflat and apic->read ends up being native_apic_read.

2.6.38 fixes this by allowing set_fixmap_nocache(FIX_APIC_BASE, address) to be called in acpi_register_lapic_address, so we can provide it with a dummy page and native_apic_read can happily read from that fake page.
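The sequence described in this message can be modelled in a few lines of userspace C. This is only a sketch of the mechanism, assuming a much-reduced stand-in for the kernel's struct apic; every name ending in _model, and the READ_* tags, are invented for illustration and do not exist in the kernel.

```c
#include <stddef.h>

/* Tags returned by the model read hooks so the caller can tell
 * which implementation handled the access. */
enum { READ_NATIVE = 1, READ_XEN = 2 };

/* Reduced stand-in for struct apic: only the fields the message
 * talks about. */
struct apic_model {
    const char *name;
    int (*probe)(void);
    unsigned int (*read)(unsigned int reg);
};

static unsigned int native_read_model(unsigned int reg) { (void)reg; return READ_NATIVE; }
static unsigned int xen_read_model(unsigned int reg)    { (void)reg; return READ_XEN; }

static int probe_no_model(void)  { return 0; }
static int probe_yes_model(void) { return 1; } /* e.g. a box using physical APIC mode */

static struct apic_model flat_model     = { "flat",     probe_no_model,  native_read_model };
static struct apic_model physflat_model = { "physflat", probe_yes_model, native_read_model };

/* The default driver, as with apic = &apic_flat on 64-bit. */
static struct apic_model *cur_apic = &flat_model;

/* Drivers tried at boot; physflat is the last (here: only) entry. */
static struct apic_model *probe_list_model[] = { &physflat_model };

/* Models set_xen_basic_apic_ops: patches whatever cur_apic points
 * at right now, i.e. only the flat driver. */
static void set_xen_ops_model(void)
{
    cur_apic->read = xen_read_model;
}

/* Models the probe loop: the first driver whose probe succeeds
 * replaces cur_apic, taking its unpatched read hook with it. */
static void run_probe_model(void)
{
    size_t i;

    for (i = 0; i < sizeof(probe_list_model) / sizeof(probe_list_model[0]); i++) {
        if (probe_list_model[i]->probe()) {
            cur_apic = probe_list_model[i];
            break;
        }
    }
}
```

Calling set_xen_ops_model() and then run_probe_model() leaves cur_apic->read pointing at the native hook again. That is the situation the message describes for machines where the physflat probe succeeds, while machines where no probe matches keep the patched flat driver and boot fine.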
Konrad Rzeszutek Wilk
2011-Mar-17 16:12 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On Thu, Mar 17, 2011 at 11:52:12AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 17, 2011 at 10:25:11AM +0000, Jan Beulich wrote:
> > >>> On 16.03.11 at 23:19, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > > But one thing I can't understand is why on one machine (IBM x3850)
> > > I get this crash, while another one with the same pagetable contents
> > > (L1 has nothing for 0x1fb) it works just fine? I added a panic and used
> > > the Xen hypervisor kdb to manually inspect the pagetable, and it has
> > > the same contents as the IBM x3850 -but it boots fine with this invalid
> > > value.
> > > Any ideas?
> >
> > Without seeing the full stack trace it's hard to tell. To me, it looks
> > like a mistake for native_apic_read() to be called at all under Xen,
> > and perhaps there's one lurking somewhere that gets hit only on
> > those IBM (Summit?) machines.
>
> That was it. When we bootup we call 'set_xen_basic_apic_ops' which

Forgot to mention it, but thank you for steering me in the right direction!

The patches are in git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git for-2.6.32/bug-fixes
Jan Beulich
2011-Mar-17 16:12 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
>>> On 17.03.11 at 16:52, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> 2.6.38 fixes this by allowing in acpi_register_lapic_address, the
> the set_fixmap_nocache(FIX_APIC_BASE, address) to be called and we
> can provide it with a dummy page and native_apic_read can happily
> read from that fake page.

I wonder whether that's going to be appropriate in cases...

Jan
Konrad Rzeszutek Wilk
2011-Mar-17 16:41 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On Thu, Mar 17, 2011 at 04:12:48PM +0000, Jan Beulich wrote:
> >>> On 17.03.11 at 16:52, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > 2.6.38 fixes this by allowing in acpi_register_lapic_address, the
> > the set_fixmap_nocache(FIX_APIC_BASE, address) to be called and we
> > can provide it with a dummy page and native_apic_read can happily
> > read from that fake page.
>
> I wonder whether that's going to be appropriate in cases...

If you boot 2.6.38 it works, but it does provide these ugly and untrue values:

[ 0.000000] ACPI: IOAPIC (id[0x0f] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 15, version 255, address 0xfec00000, GSI 0-255
[ 0.000000] ACPI: IOAPIC (id[0x0e] address[0xfec01000] gsi_base[36])
[ 0.000000] IOAPIC[1]: apic_id 14, version 255, address 0xfec01000, GSI 36-291
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] Int: type 0, pol 0, trig 0, bus 00, IRQ 00, APIC ID f, APIC INT 02
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 8 global_irq 8 low edge)
[ 0.000000] Int: type 0, pol 3, trig 1, bus 00, IRQ 08, APIC ID f, APIC INT 08
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 low edge)
[ 0.000000] Int: type 0, pol 3, trig 1, bus 00, IRQ 0e, APIC ID f, APIC INT 0e
[ 0.000000] Int: type 0, pol 3, trig 3, bus 00, IRQ 09, APIC ID f, APIC INT 09
[ 0.000000] ACPI: IRQ0 used by override.

I don't remember if it was suggested to hpa/ingo/tglx whether we could provide another 'struct apic' that would be Xen specific. Either its apic->probe() would provide a struct mostly filled with dummy functions that return nothing, or the Xen apic->probe() function would overwrite the current 'apic->read, write, etc.' with the Xen dummy functions.

However, we seem to achieve this already by providing a dummy page that is read from and written to by native_apic_[read|write].
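One possible reading of the "untrue" values in that log: if the dummy page reads back as all ones (an assumption, the thread does not say what the page contains), then decoding the IOAPIC version register from it yields exactly the version and GSI ranges printed. The sketch below uses the documented IOAPIC version register layout (bits 0-7 version, bits 16-23 index of the last redirection entry); the helper names are made up for illustration.

```c
#include <stdint.h>

/* Version field: bits 0-7 of the IOAPIC version register. */
static unsigned int ioapic_version_model(uint32_t reg)
{
    return reg & 0xffu;
}

/* Number of redirection entries: bits 16-23 hold the index of the
 * last entry, so the count is that value plus one. */
static unsigned int ioapic_entries_model(uint32_t reg)
{
    return ((reg >> 16) & 0xffu) + 1u;
}

/* Last GSI served by an IOAPIC whose first GSI is gsi_base. */
static unsigned int ioapic_gsi_end_model(unsigned int gsi_base, uint32_t reg)
{
    return gsi_base + ioapic_entries_model(reg) - 1u;
}
```

With a register value of 0xffffffff this gives version 255, and GSI ranges 0-255 for gsi_base 0 and 36-291 for gsi_base 36, which is what the boot log shows for IOAPIC[0] and IOAPIC[1].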
Jeremy Fitzhardinge
2011-Mar-17 17:21 UTC
Re: [Xen-devel] L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On 03/17/2011 09:41 AM, Konrad Rzeszutek Wilk wrote:
> I don't remember if it was suggested to hpa/ingo/tglx whether we could
> provide another 'struct apic' that would be Xen specific and the apic->probe()
> would either provide a struct mostly filled with dummy functions that return
> nothing, or the Xen apic->probe() function would over-write the current
> 'apic->read,write, etc' with the xen dummy functions.

I still maintain the "proper fix" is to just turn off the APIC CPU capability. There is no local apic, and trying to pretend otherwise just leads to a mass of hacks.

But of course, that's not particularly easy in practice...

 J
Konrad Rzeszutek Wilk
2011-Mar-17 19:56 UTC
[Xen-devel] [PATCH] xen/apic: Provide an 'apic_xen' to set the override the apic->[read|write] for all cases.
On Thu, Mar 17, 2011 at 12:41:43PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 17, 2011 at 04:12:48PM +0000, Jan Beulich wrote:
> > >>> On 17.03.11 at 16:52, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > > 2.6.38 fixes this by allowing in acpi_register_lapic_address, the
> > > the set_fixmap_nocache(FIX_APIC_BASE, address) to be called and we
> > > can provide it with a dummy page and native_apic_read can happily
> > > read from that fake page.
> >
> > I wonder whether that's going to be appropriate in cases...
>
> If you boot the 2.6.38 it works, but it does provide these ugly and untrue values:
>
> [ 0.000000] ACPI: IOAPIC (id[0x0f] address[0xfec00000] gsi_base[0])
> [ 0.000000] IOAPIC[0]: apic_id 15, version 255, address 0xfec00000, GSI 0-255
> [ 0.000000] ACPI: IOAPIC (id[0x0e] address[0xfec01000] gsi_base[36])
> [ 0.000000] IOAPIC[1]: apic_id 14, version 255, address 0xfec01000, GSI 36-291
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> [ 0.000000] Int: type 0, pol 0, trig 0, bus 00, IRQ 00, APIC ID f, APIC INT 02
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 8 global_irq 8 low edge)
> [ 0.000000] Int: type 0, pol 3, trig 1, bus 00, IRQ 08, APIC ID f, APIC INT 08
> [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 low edge)
> [ 0.000000] Int: type 0, pol 3, trig 1, bus 00, IRQ 0e, APIC ID f, APIC INT 0e
> [ 0.000000] Int: type 0, pol 3, trig 3, bus 00, IRQ 09, APIC ID f, APIC INT 09
> [ 0.000000] ACPI: IRQ0 used by override.
>
> I don't remember if it was suggested to hpa/ingo/tglx whether we could
> provide another 'struct apic' that would be Xen specific and the apic->probe()
> would either provide a struct mostly filled with dummy functions that return
> nothing, or the Xen apic->probe() function would over-write the current
> 'apic->read,write, etc' with the xen dummy functions.
>
> However we seem to achieve this already by providing a dummy page that
> is read/writen to by the native_apic_[read|write].

Except that mechanism seems to require some other back-ports from 2.6.38 that I am not so sure about. The patch worked great on the IBM box but broke all other ones. Stefano had sent me a couple of fixes where we remove some other "if (xen_initial_domain)" and move the "memset(ioapic_dummy_.." to another location, but it did not work completely right.

Instead of chasing the right combination, I went ahead with what I suggested about introducing another 'struct apic'. Here is the patch; if I revert the fix that I posted and apply this one (already on for-2.6.32/bug-fixes), I get all my machines to boot.

This is for 2.6.32 - don't know if we need to provide it for 2.6.38.

From a92e580fbb1ddae8aafed6360a105f274348d776 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 17 Mar 2011 14:17:52 -0400
Subject: [PATCH] xen/apic: Provide an 'apic_xen' to set the override the apic->[read|write] for all cases.

When we boot up we call 'set_xen_basic_apic_ops', which sets apic->read
to xen_apic_read. The default 'apic' is set to apic_flat, so in essence
we change apic_flat->read from native_apic_read to xen_apic_read.

During bootup, default_acpi_madt_oem_check is run, which walks through
the whole apic_probe[] array, the last entry of which is apic_physflat.
And apic_physflat->probe() returns true on this IBM Summit box (and
ES7000 boxes, and whatever has FADT set to ACPI_FADT_APIC_PHYSICAL), so
apic is now set to apic_physflat and apic->read ends up being
native_apic_read.

2.6.38 fixes this by allowing set_fixmap_nocache(FIX_APIC_BASE, address)
to be called in acpi_register_lapic_address, so we can provide it with
a dummy page and native_apic_read can happily read from that fake page.

However, the 2.6.38 approach is not that applicable here, as it crashes
on non-IBM machines. The patch "xen/ioapic: Allow set_fixmap to set
FIX_APIC_BASE to dummy mapping." (7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c)
tried this, and while it worked for IBM Summit machines it broke all
others. Moving the memset to other areas of the code did not help
either. The author thinks that there must be some extra back-ports
involved to use that mechanism.

This fix adds a 'struct apic' that is Xen specific. This 'apic_xen' is
the first item in apic_probe[] for both 32 and 64-bit systems. As the
first on the list, if it detects that it is running under Xen it will
short-circuit the iteration through apic_probe[], hence not allowing us
to set it to apic_flat (or bigsmp on 32). We populate 'apic_xen' with
the default values from 'apic' and set the members with the Xen
specific functions.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kernel/apic/probe_32.c |    4 ++++
 arch/x86/kernel/apic/probe_64.c |    4 ++++
 arch/x86/xen/enlighten.c        |   26 ++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index 88b9d22..798904d 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -174,11 +174,15 @@ extern struct apic apic_summit;
 extern struct apic apic_bigsmp;
 extern struct apic apic_es7000;
 extern struct apic apic_es7000_cluster;
+extern struct apic apic_xen;
 
 struct apic *apic = &apic_default;
 EXPORT_SYMBOL_GPL(apic);
 
 static struct apic *apic_probe[] __initdata = {
+#ifdef CONFIG_XEN
+	&apic_xen,
+#endif
 #ifdef CONFIG_X86_NUMAQ
 	&apic_numaq,
 #endif
diff --git a/arch/x86/kernel/apic/probe_64.c b/arch/x86/kernel/apic/probe_64.c
index 4c56f54..5ab12a4 100644
--- a/arch/x86/kernel/apic/probe_64.c
+++ b/arch/x86/kernel/apic/probe_64.c
@@ -28,11 +28,15 @@ extern struct apic apic_physflat;
 extern struct apic apic_x2xpic_uv_x;
 extern struct apic apic_x2apic_phys;
 extern struct apic apic_x2apic_cluster;
+extern struct apic apic_xen;
 
 struct apic __read_mostly *apic = &apic_flat;
 EXPORT_SYMBOL_GPL(apic);
 
 static struct apic *apic_probe[] __initdata = {
+#ifdef CONFIG_XEN
+	&apic_xen,
+#endif
 #ifdef CONFIG_X86_UV
 	&apic_x2apic_uv_x,
 #endif
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 070f138..c809938 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -750,6 +750,27 @@ static u32 xen_safe_apic_wait_icr_idle(void)
 	return 0;
 }
 
+static __init int xen_safe_flat_acpi_madt_oem_check(char *oem_id,
+						    char *oem_table_id)
+{
+	if (!xen_initial_domain())
+		return 0;
+
+	return 1;
+}
+
+static __init int xen_safe_probe(void) {
+
+	if (!xen_initial_domain())
+		return 0;
+
+	return 1;
+}
+
+struct apic apic_xen = {
+	.name = "xen",
+};
+
 static __init void set_xen_basic_apic_ops(void)
 {
 	apic->read = xen_apic_read;
@@ -758,6 +779,11 @@ static __init void set_xen_basic_apic_ops(void)
 	apic->icr_write = xen_apic_icr_write;
 	apic->wait_icr_idle = xen_apic_wait_icr_idle;
 	apic->safe_wait_icr_idle = xen_safe_apic_wait_icr_idle;
+	apic->probe = xen_safe_probe;
+	apic->acpi_madt_oem_check = xen_safe_flat_acpi_madt_oem_check;
+	/* Copy over the full contents of the newly modified apic into
+	 * our apic_xen, which is to be called first by apic_probe[]. */
+	memcpy(&apic_xen, apic, sizeof(struct apic));
 }
 
 #endif
-- 
1.7.1
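The effect of putting 'apic_xen' first in apic_probe[] can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the patch itself: the struct is a reduced stand-in for struct apic, all names ending in _model are invented, running_as_dom0_model stands in for xen_initial_domain(), and the NULL check in the loop is a choice made for the model (the thread does not show how the real probe loop treats an entry without hooks).

```c
#include <stddef.h>
#include <string.h>

enum { READ_NATIVE = 1, READ_XEN = 2 };

/* Reduced stand-in for struct apic. */
struct apic_model {
    const char *name;
    int (*probe)(void);
    unsigned int (*read)(unsigned int reg);
};

/* Stand-in for xen_initial_domain(); set by the caller. */
static int running_as_dom0_model;

static unsigned int native_read_model(unsigned int reg) { (void)reg; return READ_NATIVE; }
static unsigned int xen_read_model(unsigned int reg)    { (void)reg; return READ_XEN; }

static int probe_no_model(void)  { return 0; }
static int probe_yes_model(void) { return 1; }
static int xen_probe_model(void) { return running_as_dom0_model; }

static struct apic_model flat_model     = { "flat",     probe_no_model,  native_read_model };
static struct apic_model physflat_model = { "physflat", probe_yes_model, native_read_model };
static struct apic_model xen_model      = { "xen",      NULL,            NULL };

static struct apic_model *cur_apic = &flat_model;

/* The Xen entry comes first, ahead of physflat. */
static struct apic_model *probe_list_model[] = { &xen_model, &physflat_model };

/* Models the patched set_xen_basic_apic_ops: patch the current
 * driver, then copy the whole struct into the Xen entry. Note the
 * copy also overwrites the name, as the memcpy in the patch does. */
static void set_xen_ops_model(void)
{
    cur_apic->read = xen_read_model;
    cur_apic->probe = xen_probe_model;
    memcpy(&xen_model, cur_apic, sizeof(xen_model));
}

/* Probe loop: first successful probe wins. Entries without a probe
 * hook are skipped (a safety choice made for this model only). */
static void run_probe_model(void)
{
    size_t i;

    for (i = 0; i < sizeof(probe_list_model) / sizeof(probe_list_model[0]); i++) {
        if (probe_list_model[i]->probe && probe_list_model[i]->probe()) {
            cur_apic = probe_list_model[i];
            break;
        }
    }
}
```

With running_as_dom0_model set to 1, the loop stops at the Xen entry before physflat is ever consulted, so cur_apic->read stays on the Xen hook even on a machine where the physflat probe would succeed.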
Gianni Tedesco
2011-Mar-22 13:10 UTC
[Xen-devel] Re: L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On Wed, 2011-03-16 at 22:19 +0000, Konrad Rzeszutek Wilk wrote:
> I am troubleshooting an issue where the Linux kernel tries
> to dereference a not present entry. I have a fix for this
> in for-2.6.32/bug-fixes .. but please read on.

I'll give it a shot, I'll try anything at this point ;P

> Specifically it tries to derefence the fixmapped value of
> APIC_BASE. The fixmapped value of APIC_BASE is actually not set
> due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284
> which adds this in arch/x86/kernel/acpi/boot.c:
>
> static void __init acpi_register_lapic_address(unsigned long address)
> {
>         /* Xen dom0 doesn't have usable lapics */
>         if (xen_initial_domain())
>                 return;
>
>         mp_lapic_addr = address;
>
>         set_fixmap_nocache(FIX_APIC_BASE, address);
>
> Later on we use 'native_apic_read' which tries to use the APIC_BASE as
> address (it is present to be @ slot FIX_APIC_BASE of the fixmap
> API) and it fails (on some machines).
>
> Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)' and
> if one were to go through the pagetable this is what we get:
>
>
> [ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
> [ 0.000000] mapped APIC to ffffffffff5fb000 (00000000)
> (XEN) d0:v0: unhandled page fault (ec=0000)
> (XEN) Pagetable walk from ffffffffff5fb020:
> (XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
> (XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
> (XEN)  L2[0x1fa] = 0000000221771067 0000000000001771
> (XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.1-110309 x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 0
> (XEN) RIP: e033:[<ffffffff8102b5d1>]
> (XEN) RFLAGS: 0000000000000292 EM: 1 CONTEXT: pv guest
> (XEN) rax: ffffffff8164cf50 rbx: 000000026ec00000 rcx: 00000000ffffdd85
> (XEN) rdx: 00000000ffffffff rsi: 0000000000000000 rdi: 0000000000000020
> (XEN) rbp: ffffffff81643ea8 rsp: ffffffff81643e50 r8: 0000000000000002
> (XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
> (XEN) r12: ffff880013671800 r13: 00000000bff66000 r14: ffffffffffffffff
> (XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
> (XEN) cr3: 0000000221001000 cr2: ffffffffff5fb020
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
> (XEN) Guest stack trace from rsp=ffffffff81643e50:
>
> Which is to say that the L1 has this:
> 0000000115771fa0: 00000000 00000000 00000000 00000000
> 0000000115771fb0: 00000000 00000000 00000000 00000000
> 0000000115771fc0: 00000000 00000000 15770067 80100001
> 0000000115771fd0: 15770067 80100001 00000000 00000000
> 0000000115771fe0: 00000000 00000000 00000000 00000000
> 0000000115771ff0: 00000000 00000000 00000000 00000000
>
> L1[0x1fb] is machine address 115771fd8, which has nothing in it.
>
> OK, so I've come up a fix that is a back-port of how 2.6.38 does it
> which is that it removes the check I mentioned above and in xen_set_fixmap
> we set the FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
> It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes
>
> Gianni, you might want to check this out in case it fixes the problem you
> are experiencing.

Not sure, mine happens a lot earlier, sort of just after the very early memory initialisation. Also we're nowhere near trying to use APIC anything as an address afaict - just trying to reach the xen info page. The last thing I see is:

[ 0.000000] kernel direct mapping tables up to 2f000000 @ 100000-27a000
[ 0.000000] init_memory_mapping: 0000000100000000-00000002a7000000

> But one thing I can't understand is why on one machine (IBM x3850)
> I get this crash, while another one with the same pagetable contents
> (L1 has nothing for 0x1fb) it works just fine? I added a panic and used
> the Xen hypervisor kdb to manually inspect the pagetable, and it has
> the same contents as the IBM x3850 -but it boots fine with this invalid value.
> Any ideas?

A missing TLB flush? heh

>
> FYI, seems another user (Sven Sübert) IBM x3650 hits the same bug. And with
> this fix he is able to boot.

Very odd, if this isn't the bug I'm seeing it might be tangentially related. I'll let you know

Gianni