Recently our automated testing system has caught a curious assertion while testing Xen 4.1.5 on a HaswellDT system.

(XEN) Assertion '(sp == 0) || (peoi[sp-1].vector < vector)' failed at irq.c:1030
(XEN) ----[ Xen-4.1.5  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c48016b2b4>] do_IRQ+0x514/0x750
(XEN) RFLAGS: 0000000000010093   CONTEXT: hypervisor
(XEN) rax: 000000000000002f   rbx: ffff830249841e80   rcx: ffff82c4803127c0
(XEN) rdx: 0000000000000004   rsi: 0000000000000027   rdi: 0000000000000001
(XEN) rbp: 0000000000001e00   rsp: ffff82c4802bfd48   r8:  ffff82c480312abc
(XEN) r9:  ffff8302498a5948   r10: 0000000000000009   r11: ffff8302498c6c80
(XEN) r12: ffff830243b07f50   r13: ffff8300a24f8000   r14: 00000af8373788e3
(XEN) r15: ffff830249841e80   cr0: 000000008005003b   cr4: 00000000001026f0
(XEN) cr3: 00000002479e6000   cr2: 00000000e6d3c090
(XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff82c4802bfd48:
(XEN)    ffff830249841eb4 ffff82c480312ec0 000000000000001e 0000001e00000000
(XEN)    0000000000000000 00000000498a5670 ffff830249841d80 ffff830249840080
(XEN)    ffff830249841db4 0000000000000000 ffff8302498a55e0 ffff8302498a5670
(XEN)    ffff8300a24f8000 00000af8373788e3 00000af83736b8ed ffff82c480162ca0
(XEN)    00000af83736b8ed 00000af8373788e3 ffff8300a24f8000 ffff8302498a5670
(XEN)    ffff8302498a55e0 0000000000000000 ffff8302498c6c80 0000000000000009
(XEN)    ffff8302498a5948 ffff82c480313000 0000000000007f40 0000000000000001
(XEN)    0000000000000000 0000000000000000 00000af80db652fd 0000002700000000
(XEN)    ffff82c4801a50a0 000000000000e008 0000000000000246 ffff82c4802bfe78
(XEN)    0000000000000000 ffff8302498a5670 ffff82c4801a6a56 ffffffffffffffff
(XEN)    ffff830249818000 0000000000000000 ffff8300a24f8000 ffff82c480122c11
(XEN)    00000af839021119 0000000000000000 0000000000000000 00000000802bff18
(XEN)    0000025c0000013b ffff82c4802e7580 ffff82c4802bff18 ffff8300a2838000
(XEN)    ffff82c4802f61a0 ffff8300a24f8000 0000000000000002 00000af837304b45
(XEN)    ffff82c48015b67a 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00000000ee8a3f8c 0000000000000001
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00000000ee8a3f74 0000000000000af8
(XEN)    0000000000000001 0000010000000000 00000000c01013a7 0000000000000061
(XEN)    0000000000000246 00000000ee8a3f70 0000000000000069 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48016b2b4>] do_IRQ+0x514/0x750
(XEN)    15[<ffff82c480162ca0>] common_interrupt+0x20/0x30
(XEN)    32[<ffff82c4801a50a0>] lapic_timer_nop+0x0/0x10
(XEN)    38[<ffff82c4801a6a56>] acpi_processor_idle+0x376/0x740
(XEN)    43[<ffff82c480122c11>] do_block+0x71/0xd0
(XEN)    56[<ffff82c48015b67a>] idle_loop+0x1a/0x50
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion '(sp == 0) || (peoi[sp-1].vector < vector)' failed at irq.c:1030
(XEN) ****************************************

And the disassembly before the assertion:

ffff82c48016b29f:  48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
ffff82c48016b2a6:  00
ffff82c48016b2a7:  0f b6 44 11 ff          movzbl -0x1(%rcx,%rdx,1),%eax
ffff82c48016b2ac:  39 c6                   cmp    %eax,%esi
ffff82c48016b2ae:  0f 8f 5c ff ff ff       jg     ffff82c48016b210 <do_IRQ+0x470>
ffff82c48016b2b4:  0f 0b                   ud2

Xen has been woken up by an interrupt of vector 0x27, but has a vector 0x2f on the top of the pending EOI stack for the local APIC.
I have put in more debugging to dump the LAPIC state of the two interesting vectors and the IOAPIC state, but I have no idea if/when the problem might reoccur.

My understanding of LAPIC priority leads me to think that Xen really shouldn't be woken up by a lower priority vector if a higher priority one is still un-eoi'd. There is not yet sufficient information to tell whether this is truly the case, or whether Xen has simply gotten confused about which vectors it has eoi'd.

Having said that, we do keep line level interrupts un-eoi'd for extended periods while guests service the interrupt. Given that vectors are chosen at random, we could get into a situation where a line interrupt has a vector of 0xdf and stays pending for 150ms (which I measured as a not-overly-uncommon mean time to EOI for a line level interrupt). This would starve any other guest interrupts for an extended period.

Given directed-EOI support in the past few generations of processors, the requirement for the pending EOI stack has disappeared as far as I am aware. Would it be a sensible idea in general to make use of the pending EOI stack conditional on not having (or not using) directed EOI support?

~Andrew
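[For context, the assertion that fired encodes an ordering invariant on the per-CPU pending EOI stack. The sketch below is a minimal illustrative model of that invariant only; the struct layout and helper names are invented for illustration and are not the actual Xen irq.c code.]

/*
 * Minimal illustrative model of the per-CPU pending-EOI stack invariant.
 * Hypothetical names and layout, not Xen's.
 *
 * Interrupts that must stay un-EOI'd (e.g. line-level interrupts owned by
 * a guest) are pushed here instead of being acked immediately.  Since the
 * LAPIC should only deliver a new fixed interrupt whose priority class is
 * above everything still in service, each new entry's vector should be
 * numerically greater than the one below it.
 */
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS 256

struct pending_eoi {
    uint8_t vector;   /* vector awaiting its deferred EOI */
    uint8_t ready;    /* guest finished; safe to EOI once it reaches the top */
};

static struct pending_eoi peoi[NR_VECTORS];
static unsigned int sp;   /* stack pointer, 0 == empty */

static void peoi_push(uint8_t vector)
{
    /* The check that fired in the report: a newly delivered vector must
     * out-rank (be numerically larger than) whatever is on top. */
    assert(sp == 0 || peoi[sp - 1].vector < vector);

    peoi[sp].vector = vector;
    peoi[sp].ready  = 0;
    sp++;
}

[In the crash above the invariant does not hold: vector 0x27 arrived while 0x2f was still on top of the stack (rsi and rax in the register dump), so the jg in the disassembly falls through to ud2.]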
>>> On 31.05.13 at 22:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Xen has been woken up by an interrupt of vector 0x27, but has a vector
> 0x2f on the top of the pending EOI stack for the local APIC.
>
> I have put in more debugging to dump the LAPIC state of the two
> interesting vectors and the IOAPIC state, but I have no idea if/when the
> problem might reoccur.
>
> My understanding of LAPIC priority leads me to think that Xen really
> shouldn't be woken up by a lower priority vector if a higher priority
> one is still un-eoi'd. There is not yet sufficient information to tell
> whether this is truly the case, or whether Xen has simply gotten confused
> about which vectors it has eoi'd.

Considering that this was on a Haswell, and has so far not been reported by anyone else, I wonder whether that's related to some effect of (or flaw in) APIC virtualization. But of course without knowing the state of the LAPIC, that's hard to tell for sure. All the more so as a stray ack_APIC_irq() could lead to the same effect, and as EDX (holding "sp") has a value of 4 - quite a few lower priority vectors awaiting an EOI, considering that vector group 2x is the lowest possible one (i.e. the other entries on the stack ought to have even larger vector numbers).

> Having said that, we do keep line level interrupts un-eoi'd for extended
> periods while guests service the interrupt. Given that vectors are
> chosen at random, we could get into a situation where a line interrupt
> has a vector of 0xdf and stays pending for 150ms (which I measured as a
> not-overly-uncommon mean time to EOI for a line level interrupt). This
> would starve any other guest interrupts for an extended period.
>
> Given directed-EOI support in the past few generations of processors, the
> requirement for the pending EOI stack has disappeared as far as I am
> aware. Would it be a sensible idea in general to make use of the pending
> EOI stack conditional on not having (or not using) directed EOI support?

We don't use ACKTYPE_EOI in that case: setup_IO_APIC() only sets ioapic_level_type.ack to irq_complete_move (consumed by pirq_acktype()) when ioapic_ack_new, and directed EOI implies !ioapic_ack_new (see verify_local_APIC()). The only other case of using ACKTYPE_EOI is for non-maskable MSIs.

Jan
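[To make the ack-type logic described above concrete, here is a simplified, hypothetical sketch of the decision. It is a paraphrase of Jan's description only, not the real setup_IO_APIC()/pirq_acktype() code; the enum values, parameter names and flags are assumptions.]

/*
 * Hypothetical paraphrase of the ack-type selection described above.
 * Not the real Xen code; names are illustrative only.
 */
enum irq_acktype {
    ACKTYPE_NONE,    /* edge-triggered: ack at the LAPIC immediately */
    ACKTYPE_UNMASK,  /* old-style IO-APIC handling: mask at the IO-APIC,
                        unmask once the guest signals completion */
    ACKTYPE_EOI,     /* defer the LAPIC EOI => goes on the pending-EOI stack */
};

/* Inputs as described in the mail: directed EOI support forces
 * !ioapic_ack_new, so line-level interrupts then never take the
 * pending-EOI path. */
static enum irq_acktype guest_irq_acktype(int line_level, int ioapic_ack_new,
                                          int nonmaskable_msi)
{
    if (nonmaskable_msi)
        return ACKTYPE_EOI;            /* the only other ACKTYPE_EOI user */

    if (!line_level)
        return ACKTYPE_NONE;

    return ioapic_ack_new ? ACKTYPE_EOI : ACKTYPE_UNMASK;
}

[In other words, when directed EOI is available the line-level path already uses the mask/unmask model, so the starvation scenario Andrew describes should not involve the pending EOI stack in the first place.]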
Hello all,

I also have a Haswell system. I am running XenServer 6.2 (with Xen 4.1.5) on it and I am experiencing the same issue. Do you already have a solution for this problem?

Best regards
Thimo

(XEN) Assertion '(sp == 0) || (peoi[sp-1].vector < vector)' failed at irq.c:1027
(XEN) ----[ Xen-4.1.5.debug  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82c480169662>] do_IRQ+0x3ba/0x6d9
(XEN) RFLAGS: 0000000000010002   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: ffff83081f080f00   rcx: ffff83081f05b340
(XEN) rdx: 0000000000000001   rsi: 000000000000002b   rdi: 0000000000000001
(XEN) rbp: ffff83081f057d88   rsp: ffff83081f057d18   r8:  ffff83081f05b63c
(XEN) r9:  000070044fb97100   r10: ffff8300b858c060   r11: 000020f3f5a4dea5
(XEN) r12: 000000000000002b   r13: ffff83081f004e80   r14: 000000000000001d
(XEN) r15: 0000000000000002   cr0: 000000008005003b   cr4: 00000000001026f0
(XEN) cr3: 000000045915f000   cr2: 0000000000150008
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83081f057d18:
(XEN)    000000000000001d 000000000000001d ffff83081f080f00 0000000000000000
(XEN)    00000000ffffffea ffff83081f080f00 0000000000000000 0000000000000000
(XEN)    ffffffffffffffff ffff83081f057f18 ffff83081f06bb00 ffff83081f06bb90
(XEN)    ffff8300b858c000 0000000000000002 00007cf7e0fa8247 ffff82c480161a66
(XEN)    0000000000000002 ffff8300b858c000 ffff83081f06bb90 ffff83081f06bb00
(XEN)    ffff83081f057ef0 ffff83081f057f18 000020f3f5a4dea5 ffff8300b858c060
(XEN)    000070044fb97100 ffff83081f05bb80 0000000000007f40 0000000000000001
(XEN)    0000000000000000 000020f3c755a972 ffff83081f06bb90 0000002b00000000
(XEN)    ffff82c4801a21f0 000000000000e008 0000000000000246 ffff83081f057e48
(XEN)    000000000000e010 ffff83081f057ef0 ffff82c4801a3dc4 000020f3f595c09c
(XEN)    000020f3f596987e ffff8306383e3010 ffff83081f05b100 ffffffffffffffff
(XEN)    0000000000000001 0000000000000001 ffffffffffffffff ffff83081f057f18
(XEN)    00000000802d4680 0000000000000000 0000000000000000 ffff82c4802d4680
(XEN)    000002a80000024b ffff8300b8586000 ffff83081f057f18 ffff8300b8586000
(XEN)    ffff8300b858c000 ffff8300b858c000 0000000000000002 ffff83081f057f10
(XEN)    ffff82c48015a261 ffff82c480126ccd 0000000000000001 ffff83081f057d18
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000246 ffff88001a8093a0
(XEN)    0000000100885e0f 000000000000000f 0000000000000000 ffffffff802063aa
(XEN)    0000000000000001 00000000deadbeef 00000000deadbeef 0000010000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c480169662>] do_IRQ+0x3ba/0x6d9
(XEN)    [<ffff82c480161a66>] common_interrupt+0x26/0x30
(XEN)    [<ffff82c4801a21f0>] lapic_timer_nop+0x0/0x6
(XEN)    [<ffff82c48015a261>] idle_loop+0x48/0x59
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '(sp == 0) || (peoi[sp-1].vector < vector)' failed at irq.c:1027
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

On 31.05.2013 22:32, Andrew Cooper wrote:
> Recently our automated testing system has caught a curious assertion
> while testing Xen 4.1.5 on a HaswellDT system.
> [...]
On 31/07/13 09:30, Thimo E. wrote:
> Hello all,
>
> I also have a Haswell system. I am running XenServer 6.2 (with Xen
> 4.1.5) on it and I am experiencing the same issue. Do you already have
> a solution for this problem?
>
> Best regards
> Thimo

Hi,

We are still none the wiser on this issue. I have a debugging patch to get more information, but the problem hasn't reoccurred since. This is now two crashes on Xen 4.1 and a single crash on Xen 4.2 that I have seen.

For the benefit of anyone else who runs into this issue in the meantime, the patch (against Xen-4.3) is attached.

Thimo: I shall put a new version of the XenServer 6.2 Xen with the debugging patch on the forum thread.

~Andrew

> [...]
Hi,

I've posted it already in the forum thread, but to keep all of you up to date on this issue I am copying the logfile into this thread, too:

XenServer crashed again; attached you'll find the output with the verbose messages Andrew inserted into the code.

Best regards
Thimo

On 31.07.2013 11:47, Andrew Cooper wrote:
> On 31/07/13 09:30, Thimo E. wrote:
>> Hello all,
>>
>> I also have a Haswell system. I am running XenServer 6.2 (with Xen
>> 4.1.5) on it and I am experiencing the same issue. Do you already have
>> a solution for this problem?
>>
>> Best regards
>> Thimo
>
> Hi,
>
> We are still none the wiser on this issue. I have a debugging patch to
> get more information, but the problem hasn't reoccurred since. This is
> now two crashes on Xen 4.1 and a single crash on Xen 4.2 that I have seen.
>
> For the benefit of anyone else who runs into this issue in the meantime,
> the patch (against Xen-4.3) is attached.
>
> Thimo: I shall put a new version of the XenServer 6.2 Xen with the
> debugging patch on the forum thread.
>
> ~Andrew
>
>> [...]
On 02/08/2013 23:50, Thimo E. wrote:
> Hi,
>
> I've posted it already in the forum thread, but to keep all of you up
> to date on this issue I am copying the logfile into this thread, too:
>
> XenServer crashed again; attached you'll find the output with the
> verbose messages Andrew inserted into the code.
>
> Best regards
> Thimo

So I can see that I did screw up the debugging patch a tad, but the information is still salvageable.

Adjusted from my "interesting" idea of printk formatting:

(XEN) **Pending EOI error
(XEN) irq 29, vector 0x2e
(XEN) s[0] irq 29, vec 0x2e, ready 0, ISR 1, TMR 0, IRR 0
(XEN) All LAPIC state:
(XEN) [vector]      ISR        TMR        IRR
(XEN) [1f:01]   00000000   00000000   00000000
(XEN) [3f:20]   00016384 4095716568   00000000
(XEN) [5f:40]   00000000 4041382474   00000000
(XEN) [7f:60]   00000000 3967325758   00000000
(XEN) [9f:80]   00000000 2123395250   00000000
(XEN) [bf:a0]   00000000 1502837374   00000000
(XEN) [df:c0]   00000000 4270415335   00000000
(XEN) [ff:e0]   00000000   00000000   00000000

So Xen has been interrupted by an interrupt which it believes it has already seen, and which is outstanding on the PendingEOI stack, waiting for Dom0 to actually deal with it.

The In Service Register indicates (given the hex/dec snafu) that only vector 0x2e is in service.

I will update my debugging patch with some extra information tomorrow.

~Andrew
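[For anyone decoding dumps like the one above: the LAPIC exposes its 256-bit ISR, TMR and IRR as eight 32-bit registers each, one per group of 32 vectors, which is why the dump prints one row per [hi:lo] vector range. Below is a small, generic sketch of the lookup; the MMIO accessor is a placeholder and the register offsets are the standard xAPIC ones, nothing Xen-specific.]

#include <stdint.h>

/* Standard xAPIC MMIO register offsets (per the Intel SDM):
 * ISR occupies 0x100-0x170, TMR 0x180-0x1f0, IRR 0x200-0x270,
 * each as eight 32-bit registers spaced 0x10 apart. */
#define APIC_ISR 0x100
#define APIC_TMR 0x180
#define APIC_IRR 0x200

/* Placeholder accessor: read a 32-bit LAPIC register.  How the LAPIC is
 * mapped (or whether x2APIC MSRs are used instead) is platform-specific. */
extern uint32_t lapic_read(uint32_t reg);

/* Return non-zero if 'vector' is set in the 256-bit register bank whose
 * first 32-bit word lives at 'base' (APIC_ISR, APIC_TMR or APIC_IRR). */
static int lapic_test_vector(uint32_t base, uint8_t vector)
{
    uint32_t word = lapic_read(base + (vector >> 5) * 0x10);
    return (word >> (vector & 0x1f)) & 1;
}

[Applied to the dump above: the ISR column is printed in decimal, and row [3f:20] shows 16384 = bit 14 set, i.e. vector 0x20 + 14 = 0x2e, matching the "only vector 0x2e is in service" reading.]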
>>> On 03.08.13 at 01:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Adjusted from my "interesting" idea of printk formatting:
>
> (XEN) **Pending EOI error
> (XEN) irq 29, vector 0x2e
> (XEN) s[0] irq 29, vec 0x2e, ready 0, ISR 1, TMR 0, IRR 0
> (XEN) All LAPIC state:
> (XEN) [vector]      ISR        TMR        IRR
> (XEN) [1f:01]   00000000   00000000   00000000
> (XEN) [3f:20]   00016384 4095716568   00000000
> (XEN) [5f:40]   00000000 4041382474   00000000
> (XEN) [7f:60]   00000000 3967325758   00000000
> (XEN) [9f:80]   00000000 2123395250   00000000
> (XEN) [bf:a0]   00000000 1502837374   00000000
> (XEN) [df:c0]   00000000 4270415335   00000000
> (XEN) [ff:e0]   00000000   00000000   00000000
>
> So Xen has been interrupted by an interrupt which it believes it has
> already seen, and which is outstanding on the PendingEOI stack, waiting
> for Dom0 to actually deal with it.

And which hence should be masked. Is this perhaps a non-maskable MSI, and the device (erroneously?) issues a new interrupt before the old one is really finished with?

Jan
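[Whether an MSI is maskable at all can be read from the device's MSI capability: bit 8 of the Message Control word is the Per-Vector Masking Capable flag. A generic sketch follows, assuming some config-space read helper; the helper name and the caller-supplied capability offset are placeholders.]

#include <stdint.h>

/* PCI MSI capability layout (PCI Local Bus spec): the capability starts
 * with the ID/next pointer, followed by the 16-bit Message Control word.
 * Bit 8 of Message Control is "Per-Vector Masking Capable". */
#define PCI_MSI_FLAGS            2      /* offset of Message Control */
#define PCI_MSI_FLAGS_MASKBIT    0x0100 /* per-vector masking capable */

/* Placeholder: read a 16-bit value from the config space of bus/dev/fn at
 * 'offset'.  The real accessor depends on the environment. */
extern uint16_t pci_conf_read16(unsigned int bus, unsigned int dev,
                                unsigned int fn, unsigned int offset);

/* Returns non-zero if the MSI whose capability starts at 'msi_cap_pos'
 * supports per-vector masking.  Per the discussion above, an MSI that is
 * not maskable is the other case that ends up deferring its EOI. */
static int msi_is_maskable(unsigned int bus, unsigned int dev,
                           unsigned int fn, unsigned int msi_cap_pos)
{
    uint16_t control = pci_conf_read16(bus, dev, fn,
                                       msi_cap_pos + PCI_MSI_FLAGS);
    return !!(control & PCI_MSI_FLAGS_MASKBIT);
}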
On 05/08/13 13:45, Jan Beulich wrote:
>>>> On 03.08.13 at 01:32, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> [...]
>>
>> So Xen has been interrupted by an interrupt which it believes it has
>> already seen, and which is outstanding on the PendingEOI stack, waiting
>> for Dom0 to actually deal with it.
> And which hence should be masked. Is this perhaps a non-maskable
> MSI, and the device (erroneously?) issues a new interrupt before
> the old one is really finished with?
>
> Jan

All of these crashes are coming out of mwait_idle, so the CPU in question has literally just been in a lower power state.

I am wondering whether there is some caching issue where an update to the Pending EOI stack pointer got "lost", but this seems a little too specific to be reasonably explained as a caching issue.

A new debugging patch is on its way (sorry - it has been a very busy few days).

~Andrew
Next crash occured, debugging output included. One Remark: Over the last days (besides many linux PV guests) 1 Windows Guest (with PV drivers) was running, today I''ve started another Windows guest and during 3 hours two crashed occured, coincidence ? Best regards Thimo (XEN) **Pending EOI error (XEN) irq 29, vector 0x24 (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 (XEN) All LAPIC state: (XEN) [vector] ISR TMR IRR (XEN) [1f:00] 00000000 00000000 00000000 (XEN) [3f:20] 00000010 76efa12e 00000000 (XEN) [5f:40] 00000000 e6f0f2fc 00000000 (XEN) [7f:60] 00000000 32d096ca 00000000 (XEN) [9f:80] 00000000 78fcf87a 00000000 (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 (XEN) [df:c0] 00000000 ffdfe7ab 00000000 (XEN) [ff:e0] 00000000 00000000 00000000 (XEN) Peoi stack trace records: (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Guest interrupt information: (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge status=00000000 mapped, unbound (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(----), (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC status=00000000 mapped, unbound (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 5(----), (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(----), (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 9(----), (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 15 affinity:1 vec:a8 type=IO-APIC-edge status=00000002 mapped, unbound 
(XEN) IRQ: 16 affinity:1 vec:db type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 16(----), (XEN) IRQ: 18 affinity:1 vec:2c type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 18(----), (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level status=00000002 mapped, unbound (XEN) IRQ: 22 affinity:1 vec:bb type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 22(----), (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 23(----), (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI status=00000002 mapped, unbound (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI status=00000002 mapped, unbound (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI status=00000002 mapped, unbound (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), (XEN) IRQ: 30 affinity:4 vec:93 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:275(----), (XEN) IRQ: 31 affinity:2 vec:4a type=PCI-MSI status=00000050 in-flight=0 domain-list=0:274(----), (XEN) IRQ: 32 affinity:2 vec:73 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(----), (XEN) IRQ: 33 affinity:1 vec:49 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:272(----), (XEN) IRQ: 34 affinity:8 vec:5f type=PCI-MSI status=00000050 in-flight=0 domain-list=0:271(----), (XEN) IO-APIC interrupt information: (XEN) IRQ 0 Vec240: (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 1 Vec 56: (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 3 Vec 64: (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 4 Vec 72: (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 5 Vec 80: (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 6 Vec 88: (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 7 Vec 96: (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 8 Vec104: (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 9 Vec112: (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 10 Vec120: (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 11 Vec136: (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 12 Vec144: (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 13 Vec152: (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 14 Vec160: (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 15 Vec168: (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 16 Vec219: (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L status=0 polarity=1 irr=0 
trig=L mask=0 dest_id:0 (XEN) IRQ 18 Vec 44: (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 19 Vec 81: (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 20 Vec 41: (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 22 Vec187: (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 23 Vec194: (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) number of MP IRQ sources: 15. (XEN) number of IO-APIC #2 registers: 24. (XEN) testing the IO APIC....................... (XEN) IO APIC #2...... (XEN) .... register #00: 02000000 (XEN) ....... : physical APIC id: 02 (XEN) ....... : Delivery Type: 0 (XEN) ....... : LTS : 0 (XEN) .... register #01: 00170020 (XEN) ....... : max redirection entries: 0017 (XEN) ....... : PRQ implemented: 0 (XEN) ....... : IO APIC version: 0020 (XEN) .... IRQ redirection table: (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: (XEN) 00 000 00 1 0 0 0 0 0 0 00 (XEN) 01 000 00 0 0 0 0 0 1 1 38 (XEN) 02 000 00 0 0 0 0 0 1 1 F0 (XEN) 03 000 00 0 0 0 0 0 1 1 40 (XEN) 04 000 00 0 0 0 0 0 1 1 48 (XEN) 05 000 00 0 0 0 0 0 1 1 50 (XEN) 06 000 00 0 0 0 0 0 1 1 58 (XEN) 07 000 00 0 0 0 0 0 1 1 60 (XEN) 08 000 00 0 0 0 0 0 1 1 68 (XEN) 09 000 00 0 1 0 0 0 1 1 70 (XEN) 0a 000 00 0 0 0 0 0 1 1 78 (XEN) 0b 000 00 0 0 0 0 0 1 1 88 (XEN) 0c 000 00 0 0 0 0 0 1 1 90 (XEN) 0d 000 00 0 0 0 0 0 1 1 98 (XEN) 0e 000 00 0 0 0 0 0 1 1 A0 (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 (XEN) 10 000 00 0 1 0 1 0 1 1 DB (XEN) 11 000 00 1 0 0 0 0 0 0 00 (XEN) 12 000 00 0 1 0 1 0 1 1 2C (XEN) 13 000 00 1 1 0 1 0 1 1 51 (XEN) 14 000 00 1 1 0 1 0 1 1 29 (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4 (XEN) 16 000 00 0 1 0 1 0 1 1 BB (XEN) 17 000 00 0 1 0 1 0 1 1 C2 (XEN) Using vector-based indexing (XEN) IRQ to pin mappings: (XEN) IRQ240 -> 0:2 (XEN) IRQ56 -> 0:1 (XEN) IRQ64 -> 0:3 (XEN) IRQ72 -> 0:4 (XEN) IRQ80 -> 0:5 (XEN) IRQ88 -> 0:6 (XEN) IRQ96 -> 0:7 (XEN) IRQ104 -> 0:8 (XEN) IRQ112 -> 0:9 (XEN) IRQ120 -> 0:10 (XEN) IRQ136 -> 0:11 (XEN) IRQ144 -> 0:12 (XEN) IRQ152 -> 0:13 (XEN) IRQ160 -> 0:14 (XEN) IRQ168 -> 0:15 (XEN) IRQ219 -> 0:16 (XEN) IRQ44 -> 0:18 (XEN) IRQ81 -> 0:19 (XEN) IRQ41 -> 0:20 (XEN) IRQ187 -> 0:22 (XEN) IRQ194 -> 0:23 (XEN) .................................... done. (XEN) (XEN) **************************************** (XEN) Panic on CPU 1: (XEN) CA-107844**************************************** (XEN) (XEN) Reboot in five seconds... (XEN) Executing crash image Am 05.08.2013 16:51, schrieb Andrew Cooper:> All of these crashes are coming out of mwait_idle, so the cpu in > question has literally just been in an lower power state. > > I am wondering whether there is some caching issue where an update to > the Pending EOI stack pointer got "lost", but this seems like a little > too specific to be reasonably explained as a caching issue. > > A new debugging patch is on its way (Sorry - it has been a very busy few > days) > > ~Andrew >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On 09/08/13 22:27, Thimo E. wrote:> Next crash occured, debugging output included. > > One Remark: Over the last days (besides many linux PV guests) 1 > Windows Guest (with PV drivers) was running, today I''ve started > another Windows guest and during 3 hours two crashed occured, > coincidence ? > > Best regards > ThimoSo according to my debugging, we really have just pushed the same irq which we have subsequently seen again unexpectedly. This bug has only ever been seen on Haswell hardware, and appears linked to running HVM guests. So either there is an erroneous ACK the LAPIC which is clearing the ISR before the PEOI stack is expecting (which I obviously see, looking at the code), or something more funky is going on with the hardware. CC''ing in the Intel maintainers: Do you have any ideas? Could this be related to APICv? ~Andrew> > (XEN) **Pending EOI error > (XEN) irq 29, vector 0x24 > (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, > IRR 00000000 > (XEN) All LAPIC state: > (XEN) [vector] ISR TMR IRR > (XEN) [1f:00] 00000000 00000000 00000000 > (XEN) [3f:20] 00000010 76efa12e 00000000 > (XEN) [5f:40] 00000000 e6f0f2fc 00000000 > (XEN) [7f:60] 00000000 32d096ca 00000000 > (XEN) [9f:80] 00000000 78fcf87a 00000000 > (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 > (XEN) [df:c0] 00000000 ffdfe7ab 00000000 > (XEN) [ff:e0] 00000000 00000000 00000000 > (XEN) Peoi stack trace records: > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Guest interrupt information: > (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge > status=00000000 mapped, unbound > (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge > status=00000050 in-flight=0 domain-list=0: 1(----), > (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC > status=00000000 mapped, unbound > (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge > status=00000050 in-flight=0 domain-list=0: 5(----), > (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge > status=00000002 mapped, 
unbound > (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge > status=00000050 in-flight=0 domain-list=0: 8(----), > (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level > status=00000050 in-flight=0 domain-list=0: 9(----), > (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 15 affinity:1 vec:a8 type=IO-APIC-edge > status=00000002 mapped, unbound > (XEN) IRQ: 16 affinity:1 vec:db type=IO-APIC-level > status=00000010 in-flight=0 domain-list=0: 16(----), > (XEN) IRQ: 18 affinity:1 vec:2c type=IO-APIC-level > status=00000010 in-flight=0 domain-list=0: 18(----), > (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level > status=00000002 mapped, unbound > (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level > status=00000002 mapped, unbound > (XEN) IRQ: 22 affinity:1 vec:bb type=IO-APIC-level > status=00000050 in-flight=0 domain-list=0: 22(----), > (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level > status=00000050 in-flight=0 domain-list=0: 23(----), > (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI > status=00000000 mapped, unbound > (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI > status=00000000 mapped, unbound > (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI > status=00000002 mapped, unbound > (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI > status=00000002 mapped, unbound > (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI > status=00000002 mapped, unbound > (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI > status=00000010 in-flight=0 domain-list=0:276(----), > (XEN) IRQ: 30 affinity:4 vec:93 type=PCI-MSI > status=00000050 in-flight=0 domain-list=0:275(----), > (XEN) IRQ: 31 affinity:2 vec:4a type=PCI-MSI > status=00000050 in-flight=0 domain-list=0:274(----), > (XEN) IRQ: 32 affinity:2 vec:73 type=PCI-MSI > status=00000050 in-flight=0 domain-list=0:273(----), > (XEN) IRQ: 33 affinity:1 vec:49 type=PCI-MSI > status=00000050 in-flight=0 domain-list=0:272(----), > (XEN) IRQ: 34 affinity:8 vec:5f type=PCI-MSI > status=00000050 in-flight=0 domain-list=0:271(----), > (XEN) IO-APIC interrupt information: > (XEN) IRQ 0 Vec240: > (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 1 Vec 56: > (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 3 Vec 64: > (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 4 Vec 72: > (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 5 Vec 80: > (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 6 Vec 88: > (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 7 Vec 96: > (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 8 Vec104: > (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 9 Vec112: > (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=L mask=0 
dest_id:0 > (XEN) IRQ 10 Vec120: > (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 11 Vec136: > (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 12 Vec144: > (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 13 Vec152: > (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 14 Vec160: > (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 15 Vec168: > (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 16 Vec219: > (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 18 Vec 44: > (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 19 Vec 81: > (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 20 Vec 41: > (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 22 Vec187: > (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 23 Vec194: > (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) number of MP IRQ sources: 15. > (XEN) number of IO-APIC #2 registers: 24. > (XEN) testing the IO APIC....................... > (XEN) IO APIC #2...... > (XEN) .... register #00: 02000000 > (XEN) ....... : physical APIC id: 02 > (XEN) ....... : Delivery Type: 0 > (XEN) ....... : LTS : 0 > (XEN) .... register #01: 00170020 > (XEN) ....... : max redirection entries: 0017 > (XEN) ....... : PRQ implemented: 0 > (XEN) ....... : IO APIC version: 0020 > (XEN) .... IRQ redirection table: > (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: > (XEN) 00 000 00 1 0 0 0 0 0 0 00 > (XEN) 01 000 00 0 0 0 0 0 1 1 38 > (XEN) 02 000 00 0 0 0 0 0 1 1 F0 > (XEN) 03 000 00 0 0 0 0 0 1 1 40 > (XEN) 04 000 00 0 0 0 0 0 1 1 48 > (XEN) 05 000 00 0 0 0 0 0 1 1 50 > (XEN) 06 000 00 0 0 0 0 0 1 1 58 > (XEN) 07 000 00 0 0 0 0 0 1 1 60 > (XEN) 08 000 00 0 0 0 0 0 1 1 68 > (XEN) 09 000 00 0 1 0 0 0 1 1 70 > (XEN) 0a 000 00 0 0 0 0 0 1 1 78 > (XEN) 0b 000 00 0 0 0 0 0 1 1 88 > (XEN) 0c 000 00 0 0 0 0 0 1 1 90 > (XEN) 0d 000 00 0 0 0 0 0 1 1 98 > (XEN) 0e 000 00 0 0 0 0 0 1 1 A0 > (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 > (XEN) 10 000 00 0 1 0 1 0 1 1 DB > (XEN) 11 000 00 1 0 0 0 0 0 0 00 > (XEN) 12 000 00 0 1 0 1 0 1 1 2C > (XEN) 13 000 00 1 1 0 1 0 1 1 51 > (XEN) 14 000 00 1 1 0 1 0 1 1 29 > (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4 > (XEN) 16 000 00 0 1 0 1 0 1 1 BB > (XEN) 17 000 00 0 1 0 1 0 1 1 C2 > (XEN) Using vector-based indexing > (XEN) IRQ to pin mappings: > (XEN) IRQ240 -> 0:2 > (XEN) IRQ56 -> 0:1 > (XEN) IRQ64 -> 0:3 > (XEN) IRQ72 -> 0:4 > (XEN) IRQ80 -> 0:5 > (XEN) IRQ88 -> 0:6 > (XEN) IRQ96 -> 0:7 > (XEN) IRQ104 -> 0:8 > (XEN) IRQ112 -> 0:9 > (XEN) IRQ120 -> 0:10 > (XEN) IRQ136 -> 0:11 > (XEN) IRQ144 -> 0:12 > (XEN) IRQ152 -> 0:13 > (XEN) IRQ160 -> 0:14 > (XEN) IRQ168 -> 0:15 > (XEN) IRQ219 -> 0:16 > (XEN) IRQ44 -> 0:18 > (XEN) IRQ81 -> 0:19 > (XEN) IRQ41 -> 0:20 > (XEN) IRQ187 -> 0:22 > (XEN) IRQ194 -> 0:23 > (XEN) .................................... done. 
> (XEN) > (XEN) **************************************** > (XEN) Panic on CPU 1: > (XEN) CA-107844**************************************** > (XEN) > (XEN) Reboot in five seconds... > (XEN) Executing crash image > > > Am 05.08.2013 16:51, schrieb Andrew Cooper: >> All of these crashes are coming out of mwait_idle, so the cpu in >> question has literally just been in an lower power state. >> >> I am wondering whether there is some caching issue where an update to >> the Pending EOI stack pointer got "lost", but this seems like a little >> too specific to be reasonably explained as a caching issue. >> >> A new debugging patch is on its way (Sorry - it has been a very busy few >> days) >> >> ~Andrew >>_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
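The "Pushed" / "Poped" / "Marked ... ready" lines in the dump above are records of individual operations on the pending-EOI stack. As a minimal sketch (not the actual debugging patch from this thread, which is only attached later in the discussion; the names and ring size here are illustrative), instrumentation of roughly this shape would produce that kind of output:

/* Illustrative only -- not the debugging patch from this thread.  A small
 * per-CPU ring of records, appended to on every pending-EOI stack operation
 * and dumped when an inconsistency is detected, yields output of the
 * "Pushed/Poped/Marked ready" form quoted above. */
#include <stdint.h>

enum peoi_event { PEOI_PUSHED, PEOI_POPED, PEOI_MARKED_READY };

struct peoi_trace_rec {
    enum peoi_event ev;
    unsigned int    sp;      /* stack pointer at the time of the event */
    unsigned int    irq;
    uint8_t         vector;
};

#define PEOI_TRACE_LEN 32
static struct peoi_trace_rec peoi_trace[PEOI_TRACE_LEN];
static unsigned int peoi_trace_next;

static void peoi_trace_record(enum peoi_event ev, unsigned int sp,
                              unsigned int irq, uint8_t vector)
{
    struct peoi_trace_rec *rec =
        &peoi_trace[peoi_trace_next++ % PEOI_TRACE_LEN];

    rec->ev     = ev;
    rec->sp     = sp;
    rec->irq    = irq;
    rec->vector = vector;
}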
On 09/08/13 22:40, Andrew Cooper wrote:> On 09/08/13 22:27, Thimo E. wrote: >> Next crash occured, debugging output included. >> >> One Remark: Over the last days (besides many linux PV guests) 1 >> Windows Guest (with PV drivers) was running, today I''ve started >> another Windows guest and during 3 hours two crashed occured, >> coincidence ? >> >> Best regards >> Thimo > > So according to my debugging, we really have just pushed the same irq > which we have subsequently seen again unexpectedly. > > This bug has only ever been seen on Haswell hardware, and appears > linked to running HVM guests. > > So either there is an erroneous ACK the LAPIC which is clearing the > ISR before the PEOI stack is expecting (which I"can''t" Apologies for the confusion. ~Andrew> obviously see, looking at the code), or something more funky is going > on with the hardware. > > CC''ing in the Intel maintainers: Do you have any ideas? Could this > be related to APICv? > > ~Andrew > >> >> (XEN) **Pending EOI error >> (XEN) irq 29, vector 0x24 >> (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, >> IRR 00000000 >> (XEN) All LAPIC state: >> (XEN) [vector] ISR TMR IRR >> (XEN) [1f:00] 00000000 00000000 00000000 >> (XEN) [3f:20] 00000010 76efa12e 00000000 >> (XEN) [5f:40] 00000000 e6f0f2fc 00000000 >> (XEN) [7f:60] 00000000 32d096ca 00000000 >> (XEN) [9f:80] 00000000 78fcf87a 00000000 >> (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 >> (XEN) [df:c0] 00000000 ffdfe7ab 00000000 >> (XEN) [ff:e0] 00000000 00000000 00000000 >> (XEN) Peoi stack trace records: >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Guest interrupt information: >> (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge >> status=00000000 mapped, unbound >> (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge >> status=00000050 in-flight=0 domain-list=0: 1(----), >> (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC >> status=00000000 mapped, unbound >> (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge >> status=00000050 
in-flight=0 domain-list=0: 5(----), >> (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge >> status=00000050 in-flight=0 domain-list=0: 8(----), >> (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level >> status=00000050 in-flight=0 domain-list=0: 9(----), >> (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 15 affinity:1 vec:a8 type=IO-APIC-edge >> status=00000002 mapped, unbound >> (XEN) IRQ: 16 affinity:1 vec:db type=IO-APIC-level >> status=00000010 in-flight=0 domain-list=0: 16(----), >> (XEN) IRQ: 18 affinity:1 vec:2c type=IO-APIC-level >> status=00000010 in-flight=0 domain-list=0: 18(----), >> (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level >> status=00000002 mapped, unbound >> (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level >> status=00000002 mapped, unbound >> (XEN) IRQ: 22 affinity:1 vec:bb type=IO-APIC-level >> status=00000050 in-flight=0 domain-list=0: 22(----), >> (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level >> status=00000050 in-flight=0 domain-list=0: 23(----), >> (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI >> status=00000000 mapped, unbound >> (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI >> status=00000000 mapped, unbound >> (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI >> status=00000002 mapped, unbound >> (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI >> status=00000002 mapped, unbound >> (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI >> status=00000002 mapped, unbound >> (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI >> status=00000010 in-flight=0 domain-list=0:276(----), >> (XEN) IRQ: 30 affinity:4 vec:93 type=PCI-MSI >> status=00000050 in-flight=0 domain-list=0:275(----), >> (XEN) IRQ: 31 affinity:2 vec:4a type=PCI-MSI >> status=00000050 in-flight=0 domain-list=0:274(----), >> (XEN) IRQ: 32 affinity:2 vec:73 type=PCI-MSI >> status=00000050 in-flight=0 domain-list=0:273(----), >> (XEN) IRQ: 33 affinity:1 vec:49 type=PCI-MSI >> status=00000050 in-flight=0 domain-list=0:272(----), >> (XEN) IRQ: 34 affinity:8 vec:5f type=PCI-MSI >> status=00000050 in-flight=0 domain-list=0:271(----), >> (XEN) IO-APIC interrupt information: >> (XEN) IRQ 0 Vec240: >> (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 1 Vec 56: >> (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 3 Vec 64: >> (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 4 Vec 72: >> (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 5 Vec 80: >> (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 6 Vec 88: >> (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 7 Vec 96: >> (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 
trig=E mask=0 dest_id:0 >> (XEN) IRQ 8 Vec104: >> (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 9 Vec112: >> (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 10 Vec120: >> (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 11 Vec136: >> (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 12 Vec144: >> (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 13 Vec152: >> (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 14 Vec160: >> (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 15 Vec168: >> (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 16 Vec219: >> (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 18 Vec 44: >> (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 19 Vec 81: >> (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=1 dest_id:0 >> (XEN) IRQ 20 Vec 41: >> (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=1 dest_id:0 >> (XEN) IRQ 22 Vec187: >> (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 23 Vec194: >> (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) number of MP IRQ sources: 15. >> (XEN) number of IO-APIC #2 registers: 24. >> (XEN) testing the IO APIC....................... >> (XEN) IO APIC #2...... >> (XEN) .... register #00: 02000000 >> (XEN) ....... : physical APIC id: 02 >> (XEN) ....... : Delivery Type: 0 >> (XEN) ....... : LTS : 0 >> (XEN) .... register #01: 00170020 >> (XEN) ....... : max redirection entries: 0017 >> (XEN) ....... : PRQ implemented: 0 >> (XEN) ....... : IO APIC version: 0020 >> (XEN) .... 
IRQ redirection table: >> (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: >> (XEN) 00 000 00 1 0 0 0 0 0 0 00 >> (XEN) 01 000 00 0 0 0 0 0 1 1 38 >> (XEN) 02 000 00 0 0 0 0 0 1 1 F0 >> (XEN) 03 000 00 0 0 0 0 0 1 1 40 >> (XEN) 04 000 00 0 0 0 0 0 1 1 48 >> (XEN) 05 000 00 0 0 0 0 0 1 1 50 >> (XEN) 06 000 00 0 0 0 0 0 1 1 58 >> (XEN) 07 000 00 0 0 0 0 0 1 1 60 >> (XEN) 08 000 00 0 0 0 0 0 1 1 68 >> (XEN) 09 000 00 0 1 0 0 0 1 1 70 >> (XEN) 0a 000 00 0 0 0 0 0 1 1 78 >> (XEN) 0b 000 00 0 0 0 0 0 1 1 88 >> (XEN) 0c 000 00 0 0 0 0 0 1 1 90 >> (XEN) 0d 000 00 0 0 0 0 0 1 1 98 >> (XEN) 0e 000 00 0 0 0 0 0 1 1 A0 >> (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 >> (XEN) 10 000 00 0 1 0 1 0 1 1 DB >> (XEN) 11 000 00 1 0 0 0 0 0 0 00 >> (XEN) 12 000 00 0 1 0 1 0 1 1 2C >> (XEN) 13 000 00 1 1 0 1 0 1 1 51 >> (XEN) 14 000 00 1 1 0 1 0 1 1 29 >> (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4 >> (XEN) 16 000 00 0 1 0 1 0 1 1 BB >> (XEN) 17 000 00 0 1 0 1 0 1 1 C2 >> (XEN) Using vector-based indexing >> (XEN) IRQ to pin mappings: >> (XEN) IRQ240 -> 0:2 >> (XEN) IRQ56 -> 0:1 >> (XEN) IRQ64 -> 0:3 >> (XEN) IRQ72 -> 0:4 >> (XEN) IRQ80 -> 0:5 >> (XEN) IRQ88 -> 0:6 >> (XEN) IRQ96 -> 0:7 >> (XEN) IRQ104 -> 0:8 >> (XEN) IRQ112 -> 0:9 >> (XEN) IRQ120 -> 0:10 >> (XEN) IRQ136 -> 0:11 >> (XEN) IRQ144 -> 0:12 >> (XEN) IRQ152 -> 0:13 >> (XEN) IRQ160 -> 0:14 >> (XEN) IRQ168 -> 0:15 >> (XEN) IRQ219 -> 0:16 >> (XEN) IRQ44 -> 0:18 >> (XEN) IRQ81 -> 0:19 >> (XEN) IRQ41 -> 0:20 >> (XEN) IRQ187 -> 0:22 >> (XEN) IRQ194 -> 0:23 >> (XEN) .................................... done. >> (XEN) >> (XEN) **************************************** >> (XEN) Panic on CPU 1: >> (XEN) CA-107844**************************************** >> (XEN) >> (XEN) Reboot in five seconds... >> (XEN) Executing crash image >> >> >> Am 05.08.2013 16:51, schrieb Andrew Cooper: >>> All of these crashes are coming out of mwait_idle, so the cpu in >>> question has literally just been in an lower power state. >>> >>> I am wondering whether there is some caching issue where an update to >>> the Pending EOI stack pointer got "lost", but this seems like a little >>> too specific to be reasonably explained as a caching issue. >>> >>> A new debugging patch is on its way (Sorry - it has been a very busy few >>> days) >>> >>> ~Andrew >>> > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Hello again, attached you'll find another crash dump from today. Don't know if it gives you more information than the last one. Just FYI, this is a system with an Intel Mainboard (H87 chipset) and a Core i5-4670 CPU. Best regards Thimo

Am 09.08.2013 23:44, schrieb Andrew Cooper:> On 09/08/13 22:40, Andrew Cooper wrote: >> >> So according to my debugging, we really have just pushed the same irq >> which we have subsequently seen again unexpectedly. >> >> This bug has only ever been seen on Haswell hardware, and appears >> linked to running HVM guests. >> >> So either there is an erroneous ACK the LAPIC which is clearing the >> ISR before the PEOI stack is expecting (which I > > "can't" > > Apologies for the confusion. > > ~Andrew > >> obviously see, looking at the code), or something more funky is going >> on with the hardware. >> >> CC'ing in the Intel maintainers: Do you have any ideas? Could this >> be related to APICv? >> >> ~Andrew > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper wrote on 2013-08-10:> On 09/08/13 22:27, Thimo E. wrote: > > > Next crash occured, debugging output included. > > > One Remark: Over the last days (besides many linux PV guests) 1 > Windows Guest (with PV drivers) was running, today I''ve started > another Windows guest and during 3 hours two crashed occured, coincidence ? > > Best regards > Thimo > > > > So according to my debugging, we really have just pushed the same irq > which we have subsequently seen again unexpectedly. > > This bug has only ever been seen on Haswell hardware, and appears > linked to running HVM guests. > > So either there is an erroneous ACK the LAPIC which is clearing the > ISR before the PEOI stack is expecting (which I obviously see, looking > at the code), or something more funky is going on with the hardware. > > CC''ing in the Intel maintainers: Do you have any ideas? Could this > be related to APICv?Does your machine support APIC-v?> > ~Andrew > > > > > (XEN) **Pending EOI error (XEN) irq 29, vector 0x24 (XEN) s[0] > irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 > (XEN) All LAPIC state: (XEN) [vector] ISR TMR IRR > (XEN) [1f:00] 00000000 00000000 00000000 (XEN) [3f:20] 00000010 > 76efa12e 00000000 (XEN) [5f:40] 00000000 e6f0f2fc 00000000 (XEN) > [7f:60] 00000000 32d096ca 00000000 (XEN) [9f:80] 00000000 78fcf87a > 00000000 (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 (XEN) [df:c0] > 00000000 ffdfe7ab 00000000 (XEN) [ff:e0] 00000000 00000000 00000000 > (XEN) Peoi stack trace records: (XEN) Pushed {sp 0, irq 29, vec > 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp > 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq > 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) > Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec > 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped > entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} > ready (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp > 1, irq 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq > 29, vec 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) > Pushed {sp 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec > 0x24} (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp > 0, irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, > irq 29, vec 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) > Marked {sp 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec > 0x24} (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Marked {sp > 0, irq 29, vec 0x24} ready (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} (XEN) Guest interrupt > information: (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge > status=00000000 mapped, unbound (XEN) IRQ: 1 affinity:1 vec:38 > type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(----), > (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC status=00000000 mapped, > unbound (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge > status=00000002 mapped, unbound (XEN) IRQ: 4 affinity:1 vec:48 > type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 5 > affinity:1 vec:50 type=IO-APIC-edge status=00000050 in-flight=0 > domain-list=0: 5(----), (XEN) IRQ: 6 affinity:1 vec:58 > type=IO-APIC-edge status=00000002 mapped, unbound (XEN) 
IRQ: 7 > affinity:1 vec:60 type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge status=00000050 > in-flight=0 domain-list=0: 8(----), (XEN) IRQ: 9 affinity:1 > vec:70 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: > 9(----), (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge > status=00000002 mapped, unbound (XEN) IRQ: 11 affinity:1 vec:88 > type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 12 > affinity:1 vec:90 type=IO-APIC-edge status=00000002 mapped, unbound > (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge status=00000002 > mapped, unbound (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge > status=00000002 mapped, unbound (XEN) IRQ: 15 affinity:1 vec:a8 > type=IO-APIC-edge status=00000002 mapped, unbound (XEN) IRQ: 16 > affinity:1 vec:db type=IO-APIC-level status=00000010 in-flight=0 > domain-list=0: 16(----), (XEN) IRQ: 18 affinity:1 vec:2c > type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 18(----), > (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level status=00000002 > mapped, unbound (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level > status=00000002 mapped, unbound (XEN) IRQ: 22 affinity:1 vec:bb > type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 22(----), > (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level status=00000050 > in-flight=0 domain-list=0: 23(----), (XEN) IRQ: 24 affinity:1 > vec:28 type=DMA_MSI status=00000000 mapped, unbound (XEN) IRQ: 25 > affinity:1 vec:30 type=DMA_MSI status=00000000 mapped, unbound (XEN) > IRQ: 26 affinity:f vec:c0 type=PCI-MSI status=00000002 mapped, unbound > (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI status=00000002 > mapped, unbound (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI > status=00000002 mapped, unbound (XEN) IRQ: 29 affinity:2 vec:24 > type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), (XEN) > IRQ: 30 affinity:4 vec:93 type=PCI-MSI status=00000050 in-flight=0 > domain-list=0:275(----), (XEN) IRQ: 31 affinity:2 vec:4a > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:274(----), (XEN) > IRQ: 32 affinity:2 vec:73 type=PCI-MSI status=00000050 in-flight=0 > domain-list=0:273(----), (XEN) IRQ: 33 affinity:1 vec:49 > type=PCI-MSI status=00000050 in-flight=0 domain-list=0:272(----), (XEN) > IRQ: 34 affinity:8 vec:5f type=PCI-MSI status=00000050 in-flight=0 > domain-list=0:271(----), (XEN) IO-APIC interrupt information: (XEN) > IRQ 0 Vec240: (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri > dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ > 1 Vec 56: (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 3 Vec > 64: (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 4 Vec > 72: (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 5 Vec > 80: (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 6 Vec > 88: (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 7 Vec > 96: (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 8 > Vec104: (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 9 > Vec112: (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L > status=0 polarity=0 irr=0 
trig=L mask=0 dest_id:0 (XEN) IRQ 10 > Vec120: (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 11 > Vec136: (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 12 > Vec144: (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 13 > Vec152: (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 14 > Vec160: (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 15 > Vec168: (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L > status=0 polarity=0 irr=0 trig=E mask=0 dest_id:0 (XEN) IRQ 16 > Vec219: (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 18 Vec > 44: (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 19 Vec > 81: (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 20 Vec > 41: (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=1 dest_id:0 (XEN) IRQ 22 > Vec187: (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) IRQ 23 > Vec194: (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L > status=0 polarity=1 irr=0 trig=L mask=0 dest_id:0 (XEN) number of MP > IRQ sources: 15. (XEN) number of IO-APIC #2 registers: 24. (XEN) > testing the IO APIC....................... (XEN) IO APIC #2...... > (XEN) .... register #00: 02000000 (XEN) ....... : physical APIC id: > 02 (XEN) ....... : Delivery Type: 0 (XEN) ....... : LTS > : 0 (XEN) .... register #01: 00170020 (XEN) ....... : max > redirection entries: 0017 (XEN) ....... : PRQ implemented: 0 (XEN) > ....... : IO APIC version: 0020 (XEN) .... IRQ redirection table: > (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: (XEN) 00 000 > 00 1 0 0 0 0 0 0 00 (XEN) 01 000 00 0 0 0 > 0 0 1 1 38 (XEN) 02 000 00 0 0 0 0 0 1 1 > F0 (XEN) 03 000 00 0 0 0 0 0 1 1 40 (XEN) 04 > 000 00 0 0 0 0 0 1 1 48 (XEN) 05 000 00 0 0 > 0 0 0 1 1 50 (XEN) 06 000 00 0 0 0 0 0 1 > 1 58 (XEN) 07 000 00 0 0 0 0 0 1 1 60 (XEN) > 08 000 00 0 0 0 0 0 1 1 68 (XEN) 09 000 00 0 1 > 0 0 0 1 1 70 (XEN) 0a 000 00 0 0 0 0 0 1 > 1 78 (XEN) 0b 000 00 0 0 0 0 0 1 1 88 (XEN) > 0c 000 00 0 0 0 0 0 1 1 90 (XEN) 0d 000 00 0 > 0 0 0 0 1 1 98 (XEN) 0e 000 00 0 0 0 0 0 > 1 1 A0 (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 > (XEN) 10 000 00 0 1 0 1 0 1 1 DB (XEN) 11 000 00 > 1 0 0 0 0 0 0 00 (XEN) 12 000 00 0 1 0 1 > 0 1 1 2C (XEN) 13 000 00 1 1 0 1 0 1 1 > 51 (XEN) 14 000 00 1 1 0 1 0 1 1 29 (XEN) 15 07A > 0A 1 0 0 0 0 0 2 B4 (XEN) 16 000 00 0 1 0 > 1 0 1 1 BB (XEN) 17 000 00 0 1 0 1 0 1 1 > C2 (XEN) Using vector-based indexing (XEN) IRQ to pin mappings: > (XEN) IRQ240 -> 0:2 (XEN) IRQ56 -> 0:1 (XEN) IRQ64 -> 0:3 (XEN) > IRQ72 -> 0:4 (XEN) IRQ80 -> 0:5 (XEN) IRQ88 -> 0:6 (XEN) IRQ96 -> 0:7 > (XEN) IRQ104 -> 0:8 (XEN) IRQ112 -> 0:9 (XEN) IRQ120 -> 0:10 (XEN) > IRQ136 -> 0:11 (XEN) IRQ144 -> 0:12 (XEN) IRQ152 -> 0:13 (XEN) IRQ160 > -> 0:14 (XEN) IRQ168 -> 0:15 (XEN) IRQ219 -> 0:16 (XEN) IRQ44 -> 0:18 > (XEN) IRQ81 -> 0:19 (XEN) IRQ41 -> 0:20 (XEN) IRQ187 -> 0:22 (XEN) > IRQ194 -> 0:23 (XEN) .................................... done. 
(XEN) > (XEN) **************************************** (XEN) Panic on CPU 1: > (XEN) CA-107844**************************************** (XEN) (XEN) > Reboot in five seconds... (XEN) Executing crash image > > > Am 05.08.2013 16:51, schrieb Andrew Cooper: > > All of these crashes are coming out of mwait_idle, so the cpu in > question has literally just been in an lower power state. > > I am wondering whether there is some caching issue where an update to > the Pending EOI stack pointer got "lost", but this seems like a little > too specific to be reasonably explained as a caching issue. > > A new debugging patch is on its way (Sorry - it has been a very busy > few days) > > ~Andrew > >Best regards, Yang
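As background on that question: whether a CPU can use APICv is advertised through the VMX capability MSRs. The check below is only a sketch derived from the Intel SDM layout of IA32_VMX_PROCBASED_CTLS2 (MSR 0x48B), not code from this thread, and read_msr() is a hypothetical stand-in for the platform's MSR read primitive:

#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_VMX_PROCBASED_CTLS2 0x48b

/* Hypothetical helper standing in for the platform's MSR read primitive. */
extern uint64_t read_msr(uint32_t msr);

/* Sketch: the high 32 bits of IA32_VMX_PROCBASED_CTLS2 are the "allowed-1"
 * settings of the secondary execution controls; bits 8 and 9 cover
 * APIC-register virtualization and virtual-interrupt delivery, the core
 * APICv features. */
static bool cpu_may_use_apicv(void)
{
    uint32_t allowed1 = read_msr(MSR_IA32_VMX_PROCBASED_CTLS2) >> 32;

    return (allowed1 & (1u << 8)) &&   /* APIC-register virtualization */
           (allowed1 & (1u << 9));     /* virtual-interrupt delivery */
}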
Hi Thimo, Can you provide the xen boot log? Best regards, Yang From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Thimo E. Sent: Monday, August 12, 2013 1:47 AM To: Andrew Cooper Cc: Keir Fraser; Jan Beulich; Dong, Eddie; Xen-develList; Nakajima, Jun; Zhang, Xiantao Subject: Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic Hello again, attached you''ll find another crash dump from today. Don''t know if it gives you more information than the last one. Just FYI, this is a system with an Intel Mainboard (H87 chipset) and a Core i5-4670 CPU. Best regards Thimo Am 09.08.2013 23:44, schrieb Andrew Cooper: On 09/08/13 22:40, Andrew Cooper wrote: So according to my debugging, we really have just pushed the same irq which we have subsequently seen again unexpectedly. This bug has only ever been seen on Haswell hardware, and appears linked to running HVM guests. So either there is an erroneous ACK the LAPIC which is clearing the ISR before the PEOI stack is expecting (which I "can''t" Apologies for the confusion. ~Andrew obviously see, looking at the code), or something more funky is going on with the hardware. CC''ing in the Intel maintainers: Do you have any ideas? Could this be related to APICv? ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org<mailto:Xen-devel@lists.xen.org> http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: > (XEN) **Pending EOI error > (XEN) irq 29, vector 0x24 > (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 > (XEN) All LAPIC state: > (XEN) [vector] ISR TMR IRR > (XEN) [1f:00] 00000000 00000000 00000000 > (XEN) [3f:20] 00000010 76efa12e 00000000 > (XEN) [5f:40] 00000000 e6f0f2fc 00000000 > (XEN) [7f:60] 00000000 32d096ca 00000000 > (XEN) [9f:80] 00000000 78fcf87a 00000000 > (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 > (XEN) [df:c0] 00000000 ffdfe7ab 00000000 > (XEN) [ff:e0] 00000000 00000000 00000000 > (XEN) Peoi stack trace records:Mind providing (a link to) the patch that was used here, so that one can make sense of the printed information (and perhaps also suggest adjustments to that debugging code)? Nothing I was able to find on the list fully matches the output above... Jan> (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Marked {sp 0, irq 29, vec 0x24} ready > (XEN) Pushed {sp 0, irq 29, vec 0x24} > (XEN) Poped entry {sp 1, irq 29, vec 0x24} > (XEN) Guest interrupt information: > (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge status=00000000 > mapped, unbound > (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge status=00000050 > in-flight=0 domain-list=0: 1(----), > (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC status=00000000 mapped, > unbound > (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge status=00000050 > in-flight=0 domain-list=0: 5(----), > (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge status=00000050 > in-flight=0 domain-list=0: 8(----), > (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level status=00000050 > in-flight=0 domain-list=0: 9(----), > (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge 
status=00000002 > mapped, unbound > (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 15 affinity:1 vec:a8 type=IO-APIC-edge status=00000002 > mapped, unbound > (XEN) IRQ: 16 affinity:1 vec:db type=IO-APIC-level status=00000010 > in-flight=0 domain-list=0: 16(----), > (XEN) IRQ: 18 affinity:1 vec:2c type=IO-APIC-level status=00000010 > in-flight=0 domain-list=0: 18(----), > (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level status=00000002 > mapped, unbound > (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level status=00000002 > mapped, unbound > (XEN) IRQ: 22 affinity:1 vec:bb type=IO-APIC-level status=00000050 > in-flight=0 domain-list=0: 22(----), > (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level status=00000050 > in-flight=0 domain-list=0: 23(----), > (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI status=00000000 mapped, > unbound > (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI status=00000000 mapped, > unbound > (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI status=00000002 mapped, > unbound > (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI status=00000002 mapped, > unbound > (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI status=00000002 mapped, > unbound > (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI status=00000010 > in-flight=0 domain-list=0:276(----), > (XEN) IRQ: 30 affinity:4 vec:93 type=PCI-MSI status=00000050 > in-flight=0 domain-list=0:275(----), > (XEN) IRQ: 31 affinity:2 vec:4a type=PCI-MSI status=00000050 > in-flight=0 domain-list=0:274(----), > (XEN) IRQ: 32 affinity:2 vec:73 type=PCI-MSI status=00000050 > in-flight=0 domain-list=0:273(----), > (XEN) IRQ: 33 affinity:1 vec:49 type=PCI-MSI status=00000050 > in-flight=0 domain-list=0:272(----), > (XEN) IRQ: 34 affinity:8 vec:5f type=PCI-MSI status=00000050 > in-flight=0 domain-list=0:271(----), > (XEN) IO-APIC interrupt information: > (XEN) IRQ 0 Vec240: > (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 1 Vec 56: > (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 3 Vec 64: > (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 4 Vec 72: > (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 5 Vec 80: > (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 6 Vec 88: > (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 7 Vec 96: > (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 8 Vec104: > (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 9 Vec112: > (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 10 Vec120: > (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 11 Vec136: > (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 12 Vec144: > (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 13 Vec152: > (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > 
(XEN) IRQ 14 Vec160: > (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 15 Vec168: > (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L status=0 > polarity=0 irr=0 trig=E mask=0 dest_id:0 > (XEN) IRQ 16 Vec219: > (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 18 Vec 44: > (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 19 Vec 81: > (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 20 Vec 41: > (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=1 dest_id:0 > (XEN) IRQ 22 Vec187: > (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) IRQ 23 Vec194: > (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L status=0 > polarity=1 irr=0 trig=L mask=0 dest_id:0 > (XEN) number of MP IRQ sources: 15. > (XEN) number of IO-APIC #2 registers: 24. > (XEN) testing the IO APIC....................... > (XEN) IO APIC #2...... > (XEN) .... register #00: 02000000 > (XEN) ....... : physical APIC id: 02 > (XEN) ....... : Delivery Type: 0 > (XEN) ....... : LTS : 0 > (XEN) .... register #01: 00170020 > (XEN) ....... : max redirection entries: 0017 > (XEN) ....... : PRQ implemented: 0 > (XEN) ....... : IO APIC version: 0020 > (XEN) .... IRQ redirection table: > (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: > (XEN) 00 000 00 1 0 0 0 0 0 0 00 > (XEN) 01 000 00 0 0 0 0 0 1 1 38 > (XEN) 02 000 00 0 0 0 0 0 1 1 F0 > (XEN) 03 000 00 0 0 0 0 0 1 1 40 > (XEN) 04 000 00 0 0 0 0 0 1 1 48 > (XEN) 05 000 00 0 0 0 0 0 1 1 50 > (XEN) 06 000 00 0 0 0 0 0 1 1 58 > (XEN) 07 000 00 0 0 0 0 0 1 1 60 > (XEN) 08 000 00 0 0 0 0 0 1 1 68 > (XEN) 09 000 00 0 1 0 0 0 1 1 70 > (XEN) 0a 000 00 0 0 0 0 0 1 1 78 > (XEN) 0b 000 00 0 0 0 0 0 1 1 88 > (XEN) 0c 000 00 0 0 0 0 0 1 1 90 > (XEN) 0d 000 00 0 0 0 0 0 1 1 98 > (XEN) 0e 000 00 0 0 0 0 0 1 1 A0 > (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 > (XEN) 10 000 00 0 1 0 1 0 1 1 DB > (XEN) 11 000 00 1 0 0 0 0 0 0 00 > (XEN) 12 000 00 0 1 0 1 0 1 1 2C > (XEN) 13 000 00 1 1 0 1 0 1 1 51 > (XEN) 14 000 00 1 1 0 1 0 1 1 29 > (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4 > (XEN) 16 000 00 0 1 0 1 0 1 1 BB > (XEN) 17 000 00 0 1 0 1 0 1 1 C2 > (XEN) Using vector-based indexing > (XEN) IRQ to pin mappings: > (XEN) IRQ240 -> 0:2 > (XEN) IRQ56 -> 0:1 > (XEN) IRQ64 -> 0:3 > (XEN) IRQ72 -> 0:4 > (XEN) IRQ80 -> 0:5 > (XEN) IRQ88 -> 0:6 > (XEN) IRQ96 -> 0:7 > (XEN) IRQ104 -> 0:8 > (XEN) IRQ112 -> 0:9 > (XEN) IRQ120 -> 0:10 > (XEN) IRQ136 -> 0:11 > (XEN) IRQ144 -> 0:12 > (XEN) IRQ152 -> 0:13 > (XEN) IRQ160 -> 0:14 > (XEN) IRQ168 -> 0:15 > (XEN) IRQ219 -> 0:16 > (XEN) IRQ44 -> 0:18 > (XEN) IRQ81 -> 0:19 > (XEN) IRQ41 -> 0:20 > (XEN) IRQ187 -> 0:22 > (XEN) IRQ194 -> 0:23 > (XEN) .................................... done. > (XEN) > (XEN) **************************************** > (XEN) Panic on CPU 1: > (XEN) CA-107844**************************************** > (XEN) > (XEN) Reboot in five seconds... > (XEN) Executing crash image > > > Am 05.08.2013 16:51, schrieb Andrew Cooper: >> All of these crashes are coming out of mwait_idle, so the cpu in >> question has literally just been in an lower power state. 
>> >> I am wondering whether there is some caching issue where an update to >> the Pending EOI stack pointer got "lost", but this seems like a little >> too specific to be reasonably explained as a caching issue. >> >> A new debugging patch is on its way (Sorry - it has been a very busy few >> days) >> >> ~Andrew >>
Hi Thimo,

From your previous experience and logs, it shows:

1. The interrupt that triggers the issue is an MSI.

2. MSIs are normally treated as edge-triggered interrupts, except when there is no way to mask the device. In this case, your previous log indicates the device is unmaskable (what special device are you using? A modern PCI device should be maskable).

3. IRQ 29 belongs to dom0, so it seems this is not an HVM-related issue.

4. The status of IRQ 29 is 10, which means the guest has already issued the EOI (the IRQ_GUEST_EOI_PENDING bit is cleared), so there should be no pending EOI in the EOI stack. If possible, can you add some debug messages in the guest EOI code path (like _irq_guest_eoi()) to track the EOI?

5. Both of the logs show that when the issue occurred, most of the other interrupts owned by dom0 were in IRQ_MOVE_PENDING status. Is that a coincidence? Or does it happen only under special conditions like heavy IRQ migration? Perhaps you can disable irq balancing in dom0 and pin the IRQ manually.

6. I guess interrupt remapping is enabled on your machine. Can you try to disable IR to see whether it is still reproducible?

Also, please provide the whole Xen log.

Best regards, Yang

From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Thimo E. Sent: Monday, August 12, 2013 1:47 AM To: Andrew Cooper Cc: Keir Fraser; Jan Beulich; Dong, Eddie; Xen-devel List; Nakajima, Jun; Zhang, Xiantao Subject: Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic

Hello again, attached you'll find another crash dump from today. Don't know if it gives you more information than the last one. Just FYI, this is a system with an Intel Mainboard (H87 chipset) and a Core i5-4670 CPU. Best regards Thimo

Am 09.08.2013 23:44, schrieb Andrew Cooper: On 09/08/13 22:40, Andrew Cooper wrote: So according to my debugging, we really have just pushed the same irq which we have subsequently seen again unexpectedly. This bug has only ever been seen on Haswell hardware, and appears linked to running HVM guests. So either there is an erroneous ACK the LAPIC which is clearing the ISR before the PEOI stack is expecting (which I "can't" Apologies for the confusion. ~Andrew obviously see, looking at the code), or something more funky is going on with the hardware. CC'ing in the Intel maintainers: Do you have any ideas? Could this be related to APICv? ~Andrew

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
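A rough illustration of the extra tracing suggested in point 4 of the mail above: the hook point _irq_guest_eoi() is taken from that mail, but the function below and the values passed to it are placeholders that would need adapting to the actual Xen 4.1 IRQ structures.

/* Hypothetical debug aid, not code from this thread: called from the guest
 * EOI path (e.g. _irq_guest_eoi()) so that each guest-issued EOI for the
 * suspect IRQ can be correlated with the pending-EOI stack records. */
static void debug_trace_guest_eoi(unsigned int irq, uint8_t vector,
                                  unsigned int status)
{
    if ( irq == 29 )   /* the IRQ implicated in the crash dumps */
        printk("guest EOI: irq %u, vec %#x, status %08x\n",
               irq, vector, status);
}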
>>> On 12.08.13 at 10:49, "Zhang, Yang Z" <yang.z.zhang@intel.com> wrote: > 5. Both of the logs show that when the issue occurred, most of the other > interrupts owned by dom0 were in IRQ_MOVE_PENDING status. Is that a > coincidence? Or does it happen only under special conditions like heavy IRQ > migration? Perhaps you can disable irq balancing in dom0 and pin the IRQ > manually.

Since guest IRQs' affinities track the vCPU's placement on pCPU-s, suppressing IRQ movement would not only require IRQ balancing to be suppressed in the respective domain, but also that the vCPU be bound to a single pCPU.

Jan
On 11/08/13 18:46, Thimo E. wrote:
> Hello again,
>
> attached you'll find another crash dump from today. Don't know if it
> gives you more information than the last one.
>
> Just FYI, this is a system with an Intel Mainboard (H87 chipset) and a
> Core i5-4670 CPU.
>
> Best regards
> Thimo

It is still saying the same thing. irq 29 should already be in-service at the LAPIC (because it is present on the PEOI stack), but isn't, and we subsequently get reinterrupted with it, causing the assertion to fail.

~Andrew

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
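The in-service check being described here amounts to testing a single bit in the local APIC's ISR. A minimal sketch, assuming apic_read() and APIC_ISR follow the usual x86 APIC register layout (one 32-bit ISR word per 32 vectors, spaced 0x10 apart):

/* Minimal sketch, not the actual debug patch: returns whether the given
 * vector is currently marked in-service at the local APIC.  For irq 29's
 * vector 0x24 this reads the word covering vectors 0x20-0x3f (the "[3f:20]"
 * row in the dumps) and tests bit 4. */
static bool vector_in_service(uint8_t vector)
{
    uint32_t word = apic_read(APIC_ISR + ((vector & ~0x1fU) >> 1));

    return (word >> (vector & 0x1f)) & 1;
}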
On 12/08/13 09:20, Jan Beulich wrote:>>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: >> (XEN) **Pending EOI error >> (XEN) irq 29, vector 0x24 >> (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 >> (XEN) All LAPIC state: >> (XEN) [vector] ISR TMR IRR >> (XEN) [1f:00] 00000000 00000000 00000000 >> (XEN) [3f:20] 00000010 76efa12e 00000000 >> (XEN) [5f:40] 00000000 e6f0f2fc 00000000 >> (XEN) [7f:60] 00000000 32d096ca 00000000 >> (XEN) [9f:80] 00000000 78fcf87a 00000000 >> (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 >> (XEN) [df:c0] 00000000 ffdfe7ab 00000000 >> (XEN) [ff:e0] 00000000 00000000 00000000 >> (XEN) Peoi stack trace records: > Mind providing (a link to) the patch that was used here, so that > one can make sense of the printed information (and perhaps > also suggest adjustments to that debugging code)? Nothing I > was able to find on the list fully matches the output above... > > JanAttached ~Andrew> >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Marked {sp 0, irq 29, vec 0x24} ready >> (XEN) Pushed {sp 0, irq 29, vec 0x24} >> (XEN) Poped entry {sp 1, irq 29, vec 0x24} >> (XEN) Guest interrupt information: >> (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge status=00000000 >> mapped, unbound >> (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge status=00000050 >> in-flight=0 domain-list=0: 1(----), >> (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC status=00000000 mapped, >> unbound >> (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge status=00000050 >> in-flight=0 domain-list=0: 5(----), >> (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge status=00000050 >> in-flight=0 domain-list=0: 8(----), >> (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level status=00000050 >> in-flight=0 domain-list=0: 9(----), >> (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge status=00000002 >> mapped, unbound >> 
(XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 15 affinity:1 vec:a8 type=IO-APIC-edge status=00000002 >> mapped, unbound >> (XEN) IRQ: 16 affinity:1 vec:db type=IO-APIC-level status=00000010 >> in-flight=0 domain-list=0: 16(----), >> (XEN) IRQ: 18 affinity:1 vec:2c type=IO-APIC-level status=00000010 >> in-flight=0 domain-list=0: 18(----), >> (XEN) IRQ: 19 affinity:1 vec:51 type=IO-APIC-level status=00000002 >> mapped, unbound >> (XEN) IRQ: 20 affinity:1 vec:29 type=IO-APIC-level status=00000002 >> mapped, unbound >> (XEN) IRQ: 22 affinity:1 vec:bb type=IO-APIC-level status=00000050 >> in-flight=0 domain-list=0: 22(----), >> (XEN) IRQ: 23 affinity:8 vec:c2 type=IO-APIC-level status=00000050 >> in-flight=0 domain-list=0: 23(----), >> (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI status=00000000 mapped, >> unbound >> (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI status=00000000 mapped, >> unbound >> (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI status=00000002 mapped, >> unbound >> (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI status=00000002 mapped, >> unbound >> (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI status=00000002 mapped, >> unbound >> (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI status=00000010 >> in-flight=0 domain-list=0:276(----), >> (XEN) IRQ: 30 affinity:4 vec:93 type=PCI-MSI status=00000050 >> in-flight=0 domain-list=0:275(----), >> (XEN) IRQ: 31 affinity:2 vec:4a type=PCI-MSI status=00000050 >> in-flight=0 domain-list=0:274(----), >> (XEN) IRQ: 32 affinity:2 vec:73 type=PCI-MSI status=00000050 >> in-flight=0 domain-list=0:273(----), >> (XEN) IRQ: 33 affinity:1 vec:49 type=PCI-MSI status=00000050 >> in-flight=0 domain-list=0:272(----), >> (XEN) IRQ: 34 affinity:8 vec:5f type=PCI-MSI status=00000050 >> in-flight=0 domain-list=0:271(----), >> (XEN) IO-APIC interrupt information: >> (XEN) IRQ 0 Vec240: >> (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 1 Vec 56: >> (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 3 Vec 64: >> (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 4 Vec 72: >> (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 5 Vec 80: >> (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 6 Vec 88: >> (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 7 Vec 96: >> (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 8 Vec104: >> (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 9 Vec112: >> (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 10 Vec120: >> (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 11 Vec136: >> (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 12 Vec144: >> (XEN) Apic 0x00, Pin 12: 
vec=90 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 13 Vec152: >> (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 14 Vec160: >> (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 15 Vec168: >> (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L status=0 >> polarity=0 irr=0 trig=E mask=0 dest_id:0 >> (XEN) IRQ 16 Vec219: >> (XEN) Apic 0x00, Pin 16: vec=db delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 18 Vec 44: >> (XEN) Apic 0x00, Pin 18: vec=2c delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 19 Vec 81: >> (XEN) Apic 0x00, Pin 19: vec=51 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=1 dest_id:0 >> (XEN) IRQ 20 Vec 41: >> (XEN) Apic 0x00, Pin 20: vec=29 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=1 dest_id:0 >> (XEN) IRQ 22 Vec187: >> (XEN) Apic 0x00, Pin 22: vec=bb delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) IRQ 23 Vec194: >> (XEN) Apic 0x00, Pin 23: vec=c2 delivery=LoPri dest=L status=0 >> polarity=1 irr=0 trig=L mask=0 dest_id:0 >> (XEN) number of MP IRQ sources: 15. >> (XEN) number of IO-APIC #2 registers: 24. >> (XEN) testing the IO APIC....................... >> (XEN) IO APIC #2...... >> (XEN) .... register #00: 02000000 >> (XEN) ....... : physical APIC id: 02 >> (XEN) ....... : Delivery Type: 0 >> (XEN) ....... : LTS : 0 >> (XEN) .... register #01: 00170020 >> (XEN) ....... : max redirection entries: 0017 >> (XEN) ....... : PRQ implemented: 0 >> (XEN) ....... : IO APIC version: 0020 >> (XEN) .... IRQ redirection table: >> (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: >> (XEN) 00 000 00 1 0 0 0 0 0 0 00 >> (XEN) 01 000 00 0 0 0 0 0 1 1 38 >> (XEN) 02 000 00 0 0 0 0 0 1 1 F0 >> (XEN) 03 000 00 0 0 0 0 0 1 1 40 >> (XEN) 04 000 00 0 0 0 0 0 1 1 48 >> (XEN) 05 000 00 0 0 0 0 0 1 1 50 >> (XEN) 06 000 00 0 0 0 0 0 1 1 58 >> (XEN) 07 000 00 0 0 0 0 0 1 1 60 >> (XEN) 08 000 00 0 0 0 0 0 1 1 68 >> (XEN) 09 000 00 0 1 0 0 0 1 1 70 >> (XEN) 0a 000 00 0 0 0 0 0 1 1 78 >> (XEN) 0b 000 00 0 0 0 0 0 1 1 88 >> (XEN) 0c 000 00 0 0 0 0 0 1 1 90 >> (XEN) 0d 000 00 0 0 0 0 0 1 1 98 >> (XEN) 0e 000 00 0 0 0 0 0 1 1 A0 >> (XEN) 0f 000 00 0 0 0 0 0 1 1 A8 >> (XEN) 10 000 00 0 1 0 1 0 1 1 DB >> (XEN) 11 000 00 1 0 0 0 0 0 0 00 >> (XEN) 12 000 00 0 1 0 1 0 1 1 2C >> (XEN) 13 000 00 1 1 0 1 0 1 1 51 >> (XEN) 14 000 00 1 1 0 1 0 1 1 29 >> (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4 >> (XEN) 16 000 00 0 1 0 1 0 1 1 BB >> (XEN) 17 000 00 0 1 0 1 0 1 1 C2 >> (XEN) Using vector-based indexing >> (XEN) IRQ to pin mappings: >> (XEN) IRQ240 -> 0:2 >> (XEN) IRQ56 -> 0:1 >> (XEN) IRQ64 -> 0:3 >> (XEN) IRQ72 -> 0:4 >> (XEN) IRQ80 -> 0:5 >> (XEN) IRQ88 -> 0:6 >> (XEN) IRQ96 -> 0:7 >> (XEN) IRQ104 -> 0:8 >> (XEN) IRQ112 -> 0:9 >> (XEN) IRQ120 -> 0:10 >> (XEN) IRQ136 -> 0:11 >> (XEN) IRQ144 -> 0:12 >> (XEN) IRQ152 -> 0:13 >> (XEN) IRQ160 -> 0:14 >> (XEN) IRQ168 -> 0:15 >> (XEN) IRQ219 -> 0:16 >> (XEN) IRQ44 -> 0:18 >> (XEN) IRQ81 -> 0:19 >> (XEN) IRQ41 -> 0:20 >> (XEN) IRQ187 -> 0:22 >> (XEN) IRQ194 -> 0:23 >> (XEN) .................................... done. >> (XEN) >> (XEN) **************************************** >> (XEN) Panic on CPU 1: >> (XEN) CA-107844**************************************** >> (XEN) >> (XEN) Reboot in five seconds... 
>> (XEN) Executing crash image >> >> >> On 05.08.2013 16:51, Andrew Cooper wrote: >>> All of these crashes are coming out of mwait_idle, so the cpu in >>> question has literally just been in a lower power state. >>> >>> I am wondering whether there is some caching issue where an update to >>> the Pending EOI stack pointer got "lost", but this seems a little >>> too specific to be reasonably explained as a caching issue. >>> >>> A new debugging patch is on its way (Sorry - it has been a very busy few >>> days) >>> >>> ~Andrew >>> >
>>> On 12.08.13 at 11:28, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > On 12/08/13 09:20, Jan Beulich wrote: >>>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: >>> (XEN) **Pending EOI error >>> (XEN) irq 29, vector 0x24 >>> (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR > 00000000 >>> (XEN) All LAPIC state: >>> (XEN) [vector] ISR TMR IRR >>> (XEN) [1f:00] 00000000 00000000 00000000 >>> (XEN) [3f:20] 00000010 76efa12e 00000000 >>> (XEN) [5f:40] 00000000 e6f0f2fc 00000000 >>> (XEN) [7f:60] 00000000 32d096ca 00000000 >>> (XEN) [9f:80] 00000000 78fcf87a 00000000 >>> (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 >>> (XEN) [df:c0] 00000000 ffdfe7ab 00000000 >>> (XEN) [ff:e0] 00000000 00000000 00000000 >>> (XEN) Peoi stack trace records: >> Mind providing (a link to) the patch that was used here, so that >> one can make sense of the printed information (and perhaps >> also suggest adjustments to that debugging code)? Nothing I >> was able to find on the list fully matches the output above... > > Attached

Thanks. Actually, the second case he sent has an interesting difference: (XEN) s[0] irq 29, vec 0x26, ready 0, ISR 00000001, TMR 00000000, IRR 00000001 i.e. we in fact have _three_ instances of the interrupt (two in-service, and one requested). I don't see an explanation for this other than buggy hardware. Sadly we still don't know what device it is that is behaving that way (including the confirmation that it's a non-maskable MSI one). Jan
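A quick aid to reading the "All LAPIC state" table quoted above: each row covers a block of 32 vectors, and each column is the raw 32-bit ISR/TMR/IRR word for that block, so the 0x00000010 in the [3f:20] ISR column is bit 4, i.e. vector 0x24, the same vector sitting un-acknowledged on the pending-EOI stack. The following stand-alone sketch decodes such a row; the helper name and output format are mine, not taken from the debugging patch:

/*
 * Minimal sketch: decode one row of the "All LAPIC state" dump, where each
 * row covers 32 vectors and each column is the raw 32-bit ISR/TMR/IRR word
 * for that block.
 */
#include <stdio.h>
#include <stdint.h>

static void decode_lapic_row(unsigned int base, uint32_t isr,
                             uint32_t tmr, uint32_t irr)
{
    for (unsigned int bit = 0; bit < 32; bit++) {
        uint32_t mask = 1u << bit;
        if (!(isr & mask) && !(tmr & mask) && !(irr & mask))
            continue;
        printf("vector 0x%02x:%s%s%s\n", base + bit,
               (isr & mask) ? " ISR" : "",
               (tmr & mask) ? " TMR" : "",
               (irr & mask) ? " IRR" : "");
    }
}

int main(void)
{
    /* The [3f:20] row from the dump above: ISR bit 4 => vector 0x24. */
    decode_lapic_row(0x20, 0x00000010, 0x76efa12e, 0x00000000);
    return 0;
}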
On 12/08/13 11:05, Jan Beulich wrote:>>>> On 12.08.13 at 11:28, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >> On 12/08/13 09:20, Jan Beulich wrote: >>>>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: >>>> (XEN) **Pending EOI error >>>> (XEN) irq 29, vector 0x24 >>>> (XEN) s[0] irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR >> 00000000 >>>> (XEN) All LAPIC state: >>>> (XEN) [vector] ISR TMR IRR >>>> (XEN) [1f:00] 00000000 00000000 00000000 >>>> (XEN) [3f:20] 00000010 76efa12e 00000000 >>>> (XEN) [5f:40] 00000000 e6f0f2fc 00000000 >>>> (XEN) [7f:60] 00000000 32d096ca 00000000 >>>> (XEN) [9f:80] 00000000 78fcf87a 00000000 >>>> (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 >>>> (XEN) [df:c0] 00000000 ffdfe7ab 00000000 >>>> (XEN) [ff:e0] 00000000 00000000 00000000 >>>> (XEN) Peoi stack trace records: >>> Mind providing (a link to) the patch that was used here, so that >>> one can make sense of the printed information (and perhaps >>> also suggest adjustments to that debugging code)? Nothing I >>> was able to find on the list fully matches the output above... >> Attached > Thanks. Actually, the second case he sent has an interesting > difference: > > (XEN) s[0] irq 29, vec 0x26, ready 0, ISR 00000001, TMR 00000000, IRR 00000001 > > i.e. we in fact have _three_ instance of the interrupt (two in-service, > and one request). I don''t see an explanation for this other than > buggy hardware. Sadly we still don''t know what device it is that is > behaving that way (including the confirmation that it''s a non- > maskable MSI one). > > Jan >On the XenServer hardware where we have seen this issue, the problematic interrupt was from: 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 02) Subsystem: Intel Corporation Device 0000 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 1275 Region 0: Memory at c2700000 (32-bit, non-prefetchable) [size=128K] Region 1: Memory at c273e000 (32-bit, non-prefetchable) [size=4K] Region 2: I/O ports at 7080 [size=32] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee00318 Data: 0000 Capabilities: [e0] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- AFStatus: TP- Kernel driver in use: e1000e Kernel modules: e1000e I am still attempting to reproduce the issue, but we haven’t seen it again since my email at the root of this thread. ~Andrew
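The relevant detail in the lspci output above is "Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+": the I217-LM's MSI capability advertises no per-vector masking, and (as discussed elsewhere in this thread) an MSI that cannot be masked is the case that ends up on Xen's pending-EOI path. For anyone who wants to double-check that flag on their own hardware, here is a minimal sketch; the sysfs path is only an example, and the bit positions follow the PCI specification's MSI Message Control layout:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Example device: the onboard NIC from the lspci output above. */
    const char *path = "/sys/bus/pci/devices/0000:00:19.0/config";
    uint8_t cfg[256] = { 0 };
    FILE *f = fopen(path, "rb");
    size_t got;

    if (!f) {
        fprintf(stderr, "cannot open %s\n", path);
        return 1;
    }
    got = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (got < 0x40 || !(cfg[0x06] & 0x10)) {   /* need a capability list */
        fprintf(stderr, "no capability list readable\n");
        return 1;
    }

    /* Walk the capability chain looking for the MSI capability (ID 0x05). */
    for (unsigned int pos = cfg[0x34]; pos && pos < 0xfd; pos = cfg[pos + 1]) {
        if (cfg[pos] != 0x05)
            continue;
        uint16_t msgctl = cfg[pos + 2] | (cfg[pos + 3] << 8);
        printf("MSI: Enable%c 64bit%c Maskable%c\n",
               (msgctl & 0x0001) ? '+' : '-',   /* MSI enable         */
               (msgctl & 0x0080) ? '+' : '-',   /* 64-bit address     */
               (msgctl & 0x0100) ? '+' : '-');  /* per-vector masking */
        return 0;
    }
    fprintf(stderr, "no MSI capability found\n");
    return 1;
}

Run against 0000:00:19.0 it should print "MSI: Enable+ 64bit+ Maskable-" for the device shown above (root access is needed to read past the first 64 bytes of config space).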
Hello Yang, attached you''ll find the kernel dmesg, xen dmesg, lspci and output of /proc/interrupts. If you want to see further logfiles, please let me know. The processor is a Core i5-4670. The board is an Intel DH87MC Mainboard. I am really not sure if it supports APICv, but VT-d is supported enabled enabled.> 4.The status of IRQ 29 is 10 which means the guest already issues the > EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there should > be no pending EOI in the EOI stack. If possible, can you add some > debug message in the guest EOI code path(like _irq_guest_eoi())) to > track the EOI? >I don''t see the IRQ29 in /proc/interrupts, what I see is: cat xen-dmesg.txt |grep "29": (XEN) allocated vector 29 for irq 20 cat dmesg.txt | grep "eth0": [ 23.152355] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20 [ 23.330408] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection So is the ethernet irq the bad one ? That is an Onboard Intel network adapter.> 6.I guess the interrupt remapping is enabled in your machine. Can you > try to disable IR to see whether it still reproduceable? >Just to be sure, your proposal is to try the parameter "no-intremap" ? Best regards Thimo Am 12.08.2013 10:49, schrieb Zhang, Yang Z:> > Hi Thimo, > > From your previous experience and log, it shows: > > 1.The interrupt that triggers the issue is a MSI. > > 2.MSI are treated as edge-triggered interrupts nomally, except when > there is no way to mask the device. In this case, your previous log > indicates the device is unmaskable(What special device are you > using?Modern PCI devcie should be maskable). > > 3.The IRQ 29 is belong to dom0, it seems it is not a HVM related issue. > > 4.The status of IRQ 29 is 10 which means the guest already issues the > EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there should > be no pending EOI in the EOI stack. If possible, can you add some > debug message in the guest EOI code path(like _irq_guest_eoi())) to > track the EOI? > > 5.Both of the log show when the issue occured, most of the other > interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. Is it > a coincidence? Or it happened only on the special condition like heavy > of IRQ migration?Perhaps you can disable irq balance in dom0 and pin > the IRQ manually. >|6.I guess the interrupt remapping is enabled in your machine. Can you try to disable IR to see whether it still reproduceable?> > Also, please provide the whole Xen log. > > Best regards, > > Yang >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On 12/08/13 12:52, Thimo E wrote:> Hello Yang, > > attached you''ll find the kernel dmesg, xen dmesg, lspci and output of > /proc/interrupts. If you want to see further logfiles, please let me know. > > The processor is a Core i5-4670. The board is an Intel DH87MC > Mainboard. I am really not sure if it supports APICv, but VT-d is > supported enabled enabled. > > >> 4. The status of IRQ 29 is 10 which means the guest already >> issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so >> there should be no pending EOI in the EOI stack. If possible, can you >> add some debug message in the guest EOI code path(like >> _irq_guest_eoi())) to track the EOI? >> > I don''t see the IRQ29 in /proc/interrupts, what I see is: > cat xen-dmesg.txt |grep "29": (XEN) allocated vector 29 for irq 20 > cat dmesg.txt | grep "eth0": [ 23.152355] e1000e 0000:00:19.0: PCI > INT A -> GSI 20 (level, low) -> IRQ 20 > [ 23.330408] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection > > So is the ethernet irq the bad one ? That is an Onboard Intel network > adapter.That would be consistent with the crash seen with our hardware in XenServer> >> 6. I guess the interrupt remapping is enabled in your machine. >> Can you try to disable IR to see whether it still reproduceable? >> >> >> > Just to be sure, your proposal is to try the parameter "no-intremap" ?specifically, iommu=no-intremap> > Best regards > Thimo~Andrew> > Am 12.08.2013 10:49, schrieb Zhang, Yang Z: >> >> Hi Thimo, >> >> From your previous experience and log, it shows: >> >> 1. The interrupt that triggers the issue is a MSI. >> >> 2. MSI are treated as edge-triggered interrupts nomally, except >> when there is no way to mask the device. In this case, your previous >> log indicates the device is unmaskable(What special device are you >> using?Modern PCI devcie should be maskable). >> >> 3. The IRQ 29 is belong to dom0, it seems it is not a HVM >> related issue. >> >> 4. The status of IRQ 29 is 10 which means the guest already >> issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so >> there should be no pending EOI in the EOI stack. If possible, can you >> add some debug message in the guest EOI code path(like >> _irq_guest_eoi())) to track the EOI? >> >> 5. Both of the log show when the issue occured, most of the >> other interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. >> Is it a coincidence? Or it happened only on the special condition >> like heavy of IRQ migration?Perhaps you can disable irq balance in >> dom0 and pin the IRQ manually. >> > |6. I guess the interrupt remapping is enabled in your machine. > Can you try to disable IR to see whether it still reproduceable? >> >> Also, please provide the whole Xen log. >> >> >> >> Best regards, >> >> Yang >> >
On 12/08/13 14:54, Thimo E wrote:> Hello Yang, > > and attached the next crash dump which occured today, only some > minutes after I''ve created the logfiles I''ve sent in the mail just before. > Perhaps together with the logfiles of the former mail it gives you a > better understand of what is going on. > > I''ve disabled Interrupt remapping now. > > > 4..... > > can you add some debug message in the guest EOI code path(like > _irq_guest_eoi())) to track the EOI? > @Andrew: Is it possible for you to integrate the requested changes > from Yang into your Xen debugging version ?I already have. That would be "Marked {foo} ready" debugging in the PEOI stack section. ~Andrew> > Best regards > Thimo > > Am 12.08.2013 10:49, schrieb Zhang, Yang Z: >> >> Hi Thimo, >> >> From your previous experience and log, it shows: >> >> 1. The interrupt that triggers the issue is a MSI. >> >> 2. MSI are treated as edge-triggered interrupts nomally, except >> when there is no way to mask the device. In this case, your previous >> log indicates the device is unmaskable(What special device are you >> using?Modern PCI devcie should be maskable). >> >> 3. The IRQ 29 is belong to dom0, it seems it is not a HVM >> related issue. >> >> 4. The status of IRQ 29 is 10 which means the guest already >> issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so >> there should be no pending EOI in the EOI stack. If possible, can you >> add some debug message in the guest EOI code path(like >> _irq_guest_eoi())) to track the EOI? >> >> 5. Both of the log show when the issue occured, most of the >> other interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. >> Is it a coincidence? Or it happened only on the special condition >> like heavy of IRQ migration?Perhaps you can disable irq balance in >> dom0 and pin the IRQ manually. >> > |6. I guess the interrupt remapping is enabled in your machine. > Can you try to disable IR to see whether it still reproduceable? >> >> Also, please provide the whole Xen log. >> >> >> >> Best regards, >> >> Yang >> >
Andrew Cooper wrote on 2013-08-12:> On 12/08/13 14:54, Thimo E wrote: > > > Hello Yang, > > and attached the next crash dump which occured today, only some > minutes after I''ve created the logfiles I''ve sent in the mail just before. > Perhaps together with the logfiles of the former mail it gives you a > better understand of what is going on. > > I''ve disabled Interrupt remapping now. > > > 4..... > > can you add some debug message in the guest EOI code path(like > _irq_guest_eoi())) to track the EOI? > @Andrew: Is it possible for you to integrate the requested changes > from Yang into your Xen debugging version ? > > > > I already have. That would be "Marked {foo} ready" debugging in the > PEOI stack section.I didn''t find your debug patch that add PEOI stack tracing. Could you resend it? thanks.> > ~Andrew > > > > > Best regards > Thimo > > Am 12.08.2013 10:49, schrieb Zhang, Yang Z: > > > Hi Thimo, > > From your previous experience and log, it shows: > > 1. The interrupt that triggers the issue is a MSI. > > 2. MSI are treated as edge-triggered interrupts nomally, > except when there is no way to mask the device. In this case, your > previous log indicates the device is unmaskable(What special device > are you using?Modern PCI devcie should be maskable). > > 3. The IRQ 29 is belong to dom0, it seems it is not a HVM > related issue. > > 4. The status of IRQ 29 is 10 which means the guest already > issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so > there should be no pending EOI in the EOI stack. If possible, can you > add some debug message in the guest EOI code path(like _irq_guest_eoi())) to track the EOI? > > 5. Both of the log show when the issue occured, most of the > other interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. > Is it a coincidence? Or it happened only on the special condition like > heavy of IRQ migration?Perhaps you can disable irq balance in dom0 and > pin the IRQ manually. > > |6. I guess the interrupt remapping is enabled in your machine. > Can you try to disable IR to see whether it still reproduceable? > > Also, please provide the whole Xen log. > > > > Best regards, > > Yang > > >Best regards, Yang
Hello, Andrew sent it yesterday in another branch of this thread; attached you'll find the patch that corresponds to my debugging output. Best regards Thimo On 13.08.2013 03:43, Zhang, Yang Z wrote:> Andrew Cooper wrote on 2013-08-12: >> I already have. That would be "Marked {foo} ready" debugging in the >> PEOI stack section. > I didn''t find your debug patch that add PEOI stack tracing. Could you resend it? thanks. >
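For readers following along without the attachment: the "Pushed {sp 0, irq 30, vec 0x31}", "Poped entry {...}" and "Marked {...} ready" lines that appear later in this thread come from that debugging patch. Since the patch itself is only attached to the list and not quoted here, the following is just a guess at its shape (a small ring of pending-EOI stack events replayed when the consistency check fires); every name in it is hypothetical, not Xen's.

/*
 * Hypothetical sketch of the kind of tracing the debugging patch adds: a
 * small ring buffer of pending-EOI stack events that can be replayed when
 * the "Pending EOI error" check trips.
 */
#include <stdio.h>

#define PEOI_TRACE_LEN 16

enum peoi_action { PEOI_PUSH, PEOI_POP, PEOI_READY };

struct peoi_trace {
    enum peoi_action action;
    int sp, irq, vector;
};

static struct peoi_trace trace[PEOI_TRACE_LEN];
static unsigned int trace_idx;

static void trace_peoi_event(enum peoi_action action, int sp, int irq, int vector)
{
    struct peoi_trace *t = &trace[trace_idx++ % PEOI_TRACE_LEN];
    t->action = action;
    t->sp = sp;
    t->irq = irq;
    t->vector = vector;
}

static void dump_peoi_trace(void)
{
    static const char *const name[] = { "Pushed", "Poped entry", "Marked" };

    /* Print most recent first; the real patch's ordering may differ. */
    for (unsigned int i = 0; i < PEOI_TRACE_LEN && i < trace_idx; i++) {
        const struct peoi_trace *t = &trace[(trace_idx - 1 - i) % PEOI_TRACE_LEN];
        printf("%s {sp %d, irq %d, vec 0x%x}%s\n", name[t->action],
               t->sp, t->irq, t->vector,
               t->action == PEOI_READY ? " ready" : "");
    }
}

int main(void)
{
    /* Example events in the style of the log excerpts in this thread. */
    trace_peoi_event(PEOI_PUSH, 0, 29, 0x21);
    trace_peoi_event(PEOI_READY, 0, 29, 0x21);
    trace_peoi_event(PEOI_POP, 1, 29, 0x21);
    dump_peoi_trace();
    return 0;
}

The real patch presumably hooks the push, pop and mark-ready sites around the pending-EOI stack in xen/arch/x86/irq.c; the sketch only mimics the log format so the trace records above are easier to read.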
Hi Thimo, I am trying to reproduce this issue on my side, unfortunately, I failed to boot up the guest rhel6.4 on top of Xen-4.1.5 RC1 with 3.9.3 domain0 kernel. Since Xen-4.1.5 is a little old, could you please share the guest configuration file you used when this issue happened? Thanks a lot! Thanks, Feng From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Thimo E Sent: Monday, August 12, 2013 9:55 PM To: Zhang, Yang Z Cc: Keir Fraser; Jan Beulich; Andrew Cooper; Dong, Eddie; Xen-develList; Nakajima, Jun; Zhang, Xiantao Subject: Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic Hello Yang, and attached the next crash dump which occured today, only some minutes after I''ve created the logfiles I''ve sent in the mail just before. Perhaps together with the logfiles of the former mail it gives you a better understand of what is going on. I''ve disabled Interrupt remapping now.> 4..... > can you add some debug message in the guest EOI code path(like _irq_guest_eoi())) to track the EOI?@Andrew: Is it possible for you to integrate the requested changes from Yang into your Xen debugging version ? Best regards Thimo Am 12.08.2013 10:49, schrieb Zhang, Yang Z: Hi Thimo, From your previous experience and log, it shows: 1. The interrupt that triggers the issue is a MSI. 2. MSI are treated as edge-triggered interrupts nomally, except when there is no way to mask the device. In this case, your previous log indicates the device is unmaskable(What special device are you using?Modern PCI devcie should be maskable). 3. The IRQ 29 is belong to dom0, it seems it is not a HVM related issue. 4. The status of IRQ 29 is 10 which means the guest already issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there should be no pending EOI in the EOI stack. If possible, can you add some debug message in the guest EOI code path(like _irq_guest_eoi())) to track the EOI? 5. Both of the log show when the issue occured, most of the other interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. Is it a coincidence? Or it happened only on the special condition like heavy of IRQ migration?Perhaps you can disable irq balance in dom0 and pin the IRQ manually. |I guess the interrupt remapping is enabled in your machine. Can you try to disable IR to see whether it still reproduceable? Also, please provide the whole Xen log. Best regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
On 13/08/13 12:39, Wu, Feng wrote:> > Hi Thimo, > > > > I am trying to reproduce this issue on my side, unfortunately, I > failed to boot up the guest rhel6.4 on top of Xen-4.1.5 RC1 with 3.9.3 > domain0 kernel. Since Xen-4.1.5 is a little old, could you please > share the guest configuration file you used when this issue happened? > Thanks a lot! > > > > Thanks, > > Feng >Stepping in here for a moment, Thimo is running XenServer 6.2 This issue started on the XenServer forums but moved here. For reference, we found this once in XenServer testing (as seen at the root of this email thread), but I have been unable to reproduce the issue since. We have seen the crash on Xen 4.1 and 4.2 ~Andrew> > > *From:*xen-devel-bounces@lists.xen.org > [mailto:xen-devel-bounces@lists.xen.org] *On Behalf Of *Thimo E > *Sent:* Monday, August 12, 2013 9:55 PM > *To:* Zhang, Yang Z > *Cc:* Keir Fraser; Jan Beulich; Andrew Cooper; Dong, Eddie; > Xen-develList; Nakajima, Jun; Zhang, Xiantao > *Subject:* Re: [Xen-devel] cpuidle and un-eoid interrupts at the local > apic > > > > Hello Yang, > > and attached the next crash dump which occured today, only some > minutes after I''ve created the logfiles I''ve sent in the mail just before. > Perhaps together with the logfiles of the former mail it gives you a > better understand of what is going on. > > I''ve disabled Interrupt remapping now. > > > 4..... > > can you add some debug message in the guest EOI code path(like > _irq_guest_eoi())) to track the EOI? > @Andrew: Is it possible for you to integrate the requested changes > from Yang into your Xen debugging version ? > > Best regards > Thimo > > Am 12.08.2013 10:49, schrieb Zhang, Yang Z: > > Hi Thimo, > > From your previous experience and log, it shows: > > 1. The interrupt that triggers the issue is a MSI. > > 2. MSI are treated as edge-triggered interrupts nomally, > except when there is no way to mask the device. In this case, your > previous log indicates the device is unmaskable(What special > device are you using?Modern PCI devcie should be maskable). > > 3. The IRQ 29 is belong to dom0, it seems it is not a HVM > related issue. > > 4. The status of IRQ 29 is 10 which means the guest already > issues the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, > so there should be no pending EOI in the EOI stack. If possible, > can you add some debug message in the guest EOI code path(like > _irq_guest_eoi())) to track the EOI? > > 5. Both of the log show when the issue occured, most of the > other interrupts which owned by dom0 were in IRQ_MOVE_PENDING > status. Is it a coincidence? Or it happened only on the special > condition like heavy of IRQ migration?Perhaps you can disable irq > balance in dom0 and pin the IRQ manually. > > |I guess the interrupt remapping is enabled in your machine. Can you > try to disable IR to see whether it still reproduceable? > > Also, please provide the whole Xen log. > > > > Best regards, > > Yang > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper wrote on 2013-08-12:> On 12/08/13 11:05, Jan Beulich wrote: >>>>> On 12.08.13 at 11:28, Andrew Cooper <andrew.cooper3@citrix.com> > wrote: >>> On 12/08/13 09:20, Jan Beulich wrote: >>>>>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: >>>>> (XEN) **Pending EOI error (XEN) irq 29, vector 0x24 (XEN) s[0] >>>>> irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 >>>>> (XEN) All LAPIC state: (XEN) [vector] ISR TMR IRR >>>>> (XEN) [1f:00] 00000000 00000000 00000000 (XEN) [3f:20] 00000010 >>>>> 76efa12e 00000000 (XEN) [5f:40] 00000000 e6f0f2fc 00000000 (XEN) >>>>> [7f:60] 00000000 32d096ca 00000000 (XEN) [9f:80] 00000000 78fcf87a >>>>> 00000000 (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 (XEN) [df:c0] >>>>> 00000000 ffdfe7ab 00000000 (XEN) [ff:e0] 00000000 00000000 00000000 >>>>> (XEN) Peoi stack trace records: >>>> Mind providing (a link to) the patch that was used here, so that >>>> one can make sense of the printed information (and perhaps also >>>> suggest adjustments to that debugging code)? Nothing I was able to >>>> find on the list fully matches the output above... >>> Attached >> Thanks. Actually, the second case he sent has an interesting >> difference: >> >> (XEN) s[0] irq 29, vec 0x26, ready 0, ISR 00000001, TMR 00000000, IRR >> 00000001 >> >> i.e. we in fact have _three_ instance of the interrupt (two >> in-service, and one request). I don''t see an explanation for this >> other than buggy hardware. Sadly we still don''t know what device it >> is that is behaving that way (including the confirmation that it''s a >> non- maskable MSI one). >> >> Jan >> > > On the XenServer hardware where we have seen this issue, the > problematic interrupt was from: > > 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection > I217-LM (rev 02) Subsystem: Intel Corporation Device 0000 Control: I/O+ > Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- > FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast > >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin > A routed to IRQ 1275 Region 0: Memory at c2700000 (32-bit, > non-prefetchable) [size=128K] Region 1: Memory at c273e000 (32-bit, > non-prefetchable) [size=4K] Region 2: I/O ports at 7080 [size=32] > Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- > D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- > PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ > Count=1/1 Maskable- 64bit+ Address: 00000000fee00318 Data: 0000 > Capabilities: [e0] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- > AFStatus: TP- Kernel driver in use: e1000e Kernel modules: e1000e > > I am still attempting to reproduce the issue, but we haven''t seen it > again since my email at the root of this thread.Did you see the issue on other HSW machine without this NIC? Also, Thimo, have you tried to pin the vcpu and stop irqbalance in dom0?> > ~Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-develBest regards, Yang
Hello, on the last reboot I've disabled interrupt remapping, and since then it has not crashed (previously the crashes happened somewhere between 3 hours and 7 days apart). So I am still waiting to see whether that option helped. Here is my plan based on the conclusions from this thread: 1) If the server crashes again I'll enable dom0_max_vcpus=1 and dom0_vcpus_pin. Could the problem also be a driver error? Another idea is to update the e1000e driver from 2.3.2-NAPI to 2.4.14. 2) I have two Intel NICs in the server, one onboard and one PCIe NIC. The crash seems to come from the onboard device. So if the server crashes again I'll disable that internal Intel NIC and put in another PCIe network card. Best regards Thimo On 14.08.2013 04:53, Zhang, Yang Z wrote:> > Did you see the issue on other HSW machine without this NIC? Also, Thimo, have you tried to pin the vcpu and stop irqbalance in dom0? >
On 14/08/13 03:53, Zhang, Yang Z wrote:> Andrew Cooper wrote on 2013-08-12: >> On 12/08/13 11:05, Jan Beulich wrote: >>>>>> On 12.08.13 at 11:28, Andrew Cooper <andrew.cooper3@citrix.com> >> wrote: >>>> On 12/08/13 09:20, Jan Beulich wrote: >>>>>>>> On 09.08.13 at 23:27, "Thimo E." <abc@digithi.de> wrote: >>>>>> (XEN) **Pending EOI error (XEN) irq 29, vector 0x24 (XEN) s[0] >>>>>> irq 29, vec 0x24, ready 0, ISR 00000001, TMR 00000000, IRR 00000000 >>>>>> (XEN) All LAPIC state: (XEN) [vector] ISR TMR IRR >>>>>> (XEN) [1f:00] 00000000 00000000 00000000 (XEN) [3f:20] 00000010 >>>>>> 76efa12e 00000000 (XEN) [5f:40] 00000000 e6f0f2fc 00000000 (XEN) >>>>>> [7f:60] 00000000 32d096ca 00000000 (XEN) [9f:80] 00000000 78fcf87a >>>>>> 00000000 (XEN) [bf:a0] 00000000 f9b9fe4e 00000000 (XEN) [df:c0] >>>>>> 00000000 ffdfe7ab 00000000 (XEN) [ff:e0] 00000000 00000000 00000000 >>>>>> (XEN) Peoi stack trace records: >>>>> Mind providing (a link to) the patch that was used here, so that >>>>> one can make sense of the printed information (and perhaps also >>>>> suggest adjustments to that debugging code)? Nothing I was able to >>>>> find on the list fully matches the output above... >>>> Attached >>> Thanks. Actually, the second case he sent has an interesting >>> difference: >>> >>> (XEN) s[0] irq 29, vec 0x26, ready 0, ISR 00000001, TMR 00000000, IRR >>> 00000001 >>> >>> i.e. we in fact have _three_ instance of the interrupt (two >>> in-service, and one request). I don''t see an explanation for this >>> other than buggy hardware. Sadly we still don''t know what device it >>> is that is behaving that way (including the confirmation that it''s a >>> non- maskable MSI one). >>> >>> Jan >>> >> On the XenServer hardware where we have seen this issue, the >> problematic interrupt was from: >> >> 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection >> I217-LM (rev 02) Subsystem: Intel Corporation Device 0000 Control: I/O+ >> Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- >> FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >>> TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin >> A routed to IRQ 1275 Region 0: Memory at c2700000 (32-bit, >> non-prefetchable) [size=128K] Region 1: Memory at c273e000 (32-bit, >> non-prefetchable) [size=4K] Region 2: I/O ports at 7080 [size=32] >> Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- >> D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- >> PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ >> Count=1/1 Maskable- 64bit+ Address: 00000000fee00318 Data: 0000 >> Capabilities: [e0] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- >> AFStatus: TP- Kernel driver in use: e1000e Kernel modules: e1000e >> >> I am still attempting to reproduce the issue, but we haven''t seen it >> again since my email at the root of this thread. > Did you see the issue on other HSW machine without this NIC? Also, Thimo, have you tried to pin the vcpu and stop irqbalance in dom0?We do not have any Haswell hardware without this NIC. ~Andrew>> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > > Best regards, > Yang > >
Hello, after one week of testing an intermediate result: Since I''ve set iommu=no-intremap no crash occured so far. The server never ran longer without a crash. So a careful "it''s working", but, because only one 7 days passed so far, not a final horray. Even if this option really avoids the problem I classify it as nothing more than a workaround...obviously a good one because it''s working, but still a workaround. Where could the problem of the source be ? Bug in hardware ? Bug in software ? And what does interrupt remapping really do ? Does disabling remapping have a performance impact ? Best regards Thimo Am 12.08.2013 14:04, schrieb Andrew Cooper:> On 12/08/13 12:52, Thimo E wrote: >> Hello Yang, >> >> attached you''ll find the kernel dmesg, xen dmesg, lspci and output of >> /proc/interrupts. If you want to see further logfiles, please let me >> know. >> >> The processor is a Core i5-4670. The board is an Intel DH87MC >> Mainboard. I am really not sure if it supports APICv, but VT-d is >> supported enabled enabled. >> >> >>> 4.The status of IRQ 29 is 10 which means the guest already issues >>> the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there >>> should be no pending EOI in the EOI stack. If possible, can you add >>> some debug message in the guest EOI code path(like >>> _irq_guest_eoi())) to track the EOI? >>> >> I don''t see the IRQ29 in /proc/interrupts, what I see is: >> cat xen-dmesg.txt |grep "29": (XEN) allocated vector 29 for irq 20 >> cat dmesg.txt | grep "eth0": [ 23.152355] e1000e 0000:00:19.0: PCI >> INT A -> GSI 20 (level, low) -> IRQ 20 >> [ 23.330408] e1000e >> 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection >> >> So is the ethernet irq the bad one ? That is an Onboard Intel network >> adapter. > > That would be consistent with the crash seen with our hardware in > XenServer > >> >>> 6.I guess the interrupt remapping is enabled in your machine. Can >>> you try to disable IR to see whether it still reproduceable? >>> >> Just to be sure, your proposal is to try the parameter "no-intremap" ? > > specifically, iommu=no-intremap > >> >> Best regards >> Thimo > > ~Andrew > >> >> Am 12.08.2013 10:49, schrieb Zhang, Yang Z: >>> >>> Hi Thimo, >>> >>> From your previous experience and log, it shows: >>> >>> 1.The interrupt that triggers the issue is a MSI. >>> >>> 2.MSI are treated as edge-triggered interrupts nomally, except when >>> there is no way to mask the device. In this case, your previous log >>> indicates the device is unmaskable(What special device are you >>> using?Modern PCI devcie should be maskable). >>> >>> 3.The IRQ 29 is belong to dom0, it seems it is not a HVM related issue. >>> >>> 4.The status of IRQ 29 is 10 which means the guest already issues >>> the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there >>> should be no pending EOI in the EOI stack. If possible, can you add >>> some debug message in the guest EOI code path(like >>> _irq_guest_eoi())) to track the EOI? >>> >>> 5.Both of the log show when the issue occured, most of the other >>> interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. Is >>> it a coincidence? Or it happened only on the special condition like >>> heavy of IRQ migration?Perhaps you can disable irq balance in dom0 >>> and pin the IRQ manually. >>> >> |6.I guess the interrupt remapping is enabled in your machine. Can >> you try to disable IR to see whether it still reproduceable? >>> >>> Also, please provide the whole Xen log. 
>>> >>> Best regards, >>> Yang >>> >> > > > >
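On the "what does interrupt remapping really do" question above: as I understand it, with VT-d interrupt remapping enabled an MSI or IO-APIC entry no longer carries the vector and destination directly; it carries an index into a table in memory, and the IOMMU only delivers what a valid table entry describes. Booting with iommu=no-intremap goes back to trusting the device-programmed vector and destination, so what is lost is mainly that validation and isolation; I would not expect a measurable performance difference at ordinary interrupt rates. A purely conceptual sketch of the lookup follows; the struct is a simplification for illustration and is not the hardware IRTE layout (see the VT-d specification for the real 128-bit format).

/*
 * Conceptual sketch only: what a VT-d interrupt-remapping lookup does.
 */
#include <stdio.h>
#include <stdint.h>

struct irte {              /* simplified remapping table entry */
    int present;
    uint8_t vector;        /* vector actually delivered to the LAPIC */
    uint32_t dest_apic_id; /* destination CPU */
    int trigger_level;     /* edge vs. level as seen by the LAPIC */
};

static struct irte remap_table[256];

/* With IR enabled, the MSI address/data or IO-APIC RTE carries only an
 * index ("handle"); the IOMMU resolves vector and destination through the
 * table instead of trusting what the device programmed. */
static int remap_interrupt(unsigned int handle, uint8_t *vector, uint32_t *dest)
{
    const struct irte *e = &remap_table[handle & 0xff];

    if (!e->present)
        return -1;          /* blocked: no valid remapping entry */
    *vector = e->vector;
    *dest = e->dest_apic_id;
    return 0;
}

int main(void)
{
    uint8_t vec;
    uint32_t dest;

    remap_table[5] = (struct irte){ .present = 1, .vector = 0x21,
                                    .dest_apic_id = 4, .trigger_level = 0 };
    if (!remap_interrupt(5, &vec, &dest))
        printf("handle 5 -> vector 0x%02x, dest APIC %u\n",
               (unsigned int)vec, (unsigned int)dest);
    return 0;
}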
Thimo Eichstädt wrote on 2013-Aug-20 05:43 UTC (Re: cpuidle and un-eoid interrupts at the local apic):
Hello again, ok, I was happy too soon. Crashed again. Now I''ve set the following xen parameters: iommu=no-intremap dom0_max_vcpus=1-1 dom0_vcpus_pin noirqbalance Best regards Thimo Here the crash dump: (XEN) **Pending EOI error^M (XEN) irq 29, vector 0x21^M (XEN) s[0] irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR 00000000^M (XEN) All LAPIC state:^M (XEN) [vector] ISR TMR IRR^M (XEN) [1f:00] 00000000 00000000 00000000^M (XEN) [3f:20] 00020002 00000000 00000000^M (XEN) [5f:40] 00000000 00000000 00000000^M (XEN) [7f:60] 00000000 00000002 00000000^M (XEN) [9f:80] 00000000 00000000 00000000^M (XEN) [bf:a0] 00000000 01010000 00000000^M (XEN) [df:c0] 00000000 01000000 00000000^M (XEN) [ff:e0] 00000000 00000000 08000000^M (XEN) Peoi stack trace records:^M (XEN) Pushed {sp 0, irq 30, vec 0x31}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 31, vec 0x71}^M (XEN) Marked {sp 0, irq 31, vec 0x71} ready^M (XEN) Pushed {sp 0, irq 31, vec 0x71}^M (XEN) Poped entry {sp 1, irq 30, vec 0x31}^M (XEN) Marked {sp 0, irq 30, vec 0x31} ready^M (XEN) Pushed {sp 0, irq 30, vec 0x31}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Marked {sp 0, irq 29, vec 0x21} ready^M (XEN) Pushed {sp 0, irq 29, vec 0x21}^M (XEN) Poped entry {sp 1, irq 29, vec 0x21}^M (XEN) Guest interrupt information:^M (XEN) IRQ: 0 affinity:1 vec:f0 type=IO-APIC-edge status=00000000 mapped, unbound^M (XEN) IRQ: 1 affinity:1 vec:38 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 1(----),^M (XEN) IRQ: 2 affinity:f vec:00 type=XT-PIC status=00000000 mapped, unbound^M (XEN) IRQ: 3 affinity:1 vec:40 type=IO-APIC-edge status=00000006 mapped, unbound^M (XEN) IRQ: 4 affinity:1 vec:48 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 5 affinity:1 vec:50 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 5(----),^M (XEN) IRQ: 6 affinity:1 vec:58 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 7 affinity:1 vec:60 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 8 affinity:1 vec:68 type=IO-APIC-edge status=00000050 in-flight=0 domain-list=0: 8(----),^M (XEN) IRQ: 9 affinity:1 vec:70 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 9(----),^M (XEN) IRQ: 10 affinity:1 vec:78 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 11 affinity:1 vec:88 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 12 affinity:1 vec:90 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 13 affinity:1 vec:98 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 14 affinity:1 vec:a0 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 15 affinity:1 
vec:a8 type=IO-APIC-edge status=00000002 mapped, unbound^M (XEN) IRQ: 16 affinity:4 vec:b0 type=IO-APIC-level status=00000010 in-flight=0 domain-list=0: 16(----),^M (XEN) IRQ: 18 affinity:8 vec:b8 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 18(----),^M (XEN) IRQ: 19 affinity:f vec:29 type=IO-APIC-level status=00000002 mapped, unbound^M (XEN) IRQ: 20 affinity:f vec:39 type=IO-APIC-level status=00000002 mapped, unbound^M (XEN) IRQ: 22 affinity:8 vec:61 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 22(----),^M (XEN) IRQ: 23 affinity:4 vec:d8 type=IO-APIC-level status=00000050 in-flight=0 domain-list=0: 23(----),^M (XEN) IRQ: 24 affinity:1 vec:28 type=DMA_MSI status=00000000 mapped, unbound^M (XEN) IRQ: 25 affinity:1 vec:30 type=DMA_MSI status=00000000 mapped, unbound^M (XEN) IRQ: 26 affinity:f vec:c0 type=PCI-MSI status=00000002 mapped, unbound^M (XEN) IRQ: 27 affinity:f vec:c8 type=PCI-MSI status=00000002 mapped, unbound^M (XEN) IRQ: 28 affinity:f vec:d0 type=PCI-MSI status=00000002 mapped, unbound^M (XEN) IRQ: 29 affinity:4 vec:21 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),^M (XEN) IRQ: 30 affinity:4 vec:31 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:275(----),^M (XEN) IRQ: 31 affinity:8 vec:71 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:274(----),^M (XEN) IRQ: 32 affinity:4 vec:49 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:273(----),^M (XEN) IRQ: 33 affinity:8 vec:51 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:272(----),^M (XEN) IRQ: 34 affinity:1 vec:59 type=PCI-MSI status=00000050 in-flight=0 domain-list=0:271(----),^M (XEN) IO-APIC interrupt information:^M (XEN) IRQ 0 Vec240:^M (XEN) Apic 0x00, Pin 2: vec=f0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 1 Vec 56:^M (XEN) Apic 0x00, Pin 1: vec=38 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 3 Vec 64:^M (XEN) Apic 0x00, Pin 3: vec=40 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 4 Vec 72:^M (XEN) Apic 0x00, Pin 4: vec=48 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 5 Vec 80:^M (XEN) Apic 0x00, Pin 5: vec=50 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 6 Vec 88:^M (XEN) Apic 0x00, Pin 6: vec=58 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 7 Vec 96:^M (XEN) Apic 0x00, Pin 7: vec=60 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 8 Vec104:^M (XEN) Apic 0x00, Pin 8: vec=68 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 9 Vec112:^M (XEN) Apic 0x00, Pin 9: vec=70 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=L mask=0 dest_id:1^M (XEN) IRQ 10 Vec120:^M (XEN) Apic 0x00, Pin 10: vec=78 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 11 Vec136:^M (XEN) Apic 0x00, Pin 11: vec=88 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 12 Vec144:^M (XEN) Apic 0x00, Pin 12: vec=90 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 13 Vec152:^M (XEN) Apic 0x00, Pin 13: vec=98 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 14 Vec160:^M (XEN) Apic 0x00, Pin 14: vec=a0 delivery=LoPri dest=L status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 15 Vec168:^M (XEN) Apic 0x00, Pin 15: vec=a8 delivery=LoPri dest=L 
status=0 polarity=0 irr=0 trig=E mask=0 dest_id:1^M (XEN) IRQ 16 Vec176:^M (XEN) Apic 0x00, Pin 16: vec=b0 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:4^M (XEN) IRQ 18 Vec184:^M (XEN) Apic 0x00, Pin 18: vec=b8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:8^M (XEN) IRQ 19 Vec 41:^M (XEN) Apic 0x00, Pin 19: vec=29 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:15^M (XEN) IRQ 20 Vec 57:^M (XEN) Apic 0x00, Pin 20: vec=39 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=1 dest_id:15^M (XEN) IRQ 22 Vec 97:^M (XEN) Apic 0x00, Pin 22: vec=61 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:8^M (XEN) IRQ 23 Vec216:^M (XEN) Apic 0x00, Pin 23: vec=d8 delivery=LoPri dest=L status=0 polarity=1 irr=0 trig=L mask=0 dest_id:4^M (XEN) number of MP IRQ sources: 15.^M (XEN) number of IO-APIC #2 registers: 24.^M (XEN) testing the IO APIC.......................^M (XEN) IO APIC #2......^M (XEN) .... register #00: 02000000^M (XEN) ....... : physical APIC id: 02^M (XEN) ....... : Delivery Type: 0^M (XEN) ....... : LTS : 0^M (XEN) .... register #01: 00170020^M (XEN) ....... : max redirection entries: 0017^M (XEN) ....... : PRQ implemented: 0^M (XEN) ....... : IO APIC version: 0020^M (XEN) .... IRQ redirection table:^M (XEN) NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect: ^M (XEN) 00 000 00 1 0 0 0 0 0 0 00^M (XEN) 01 001 01 0 0 0 0 0 1 1 38^M (XEN) 02 001 01 0 0 0 0 0 1 1 F0^M (XEN) 03 001 01 0 0 0 0 0 1 1 40^M (XEN) 04 001 01 0 0 0 0 0 1 1 48^M (XEN) 05 001 01 0 0 0 0 0 1 1 50^M (XEN) 06 001 01 0 0 0 0 0 1 1 58^M (XEN) 07 001 01 0 0 0 0 0 1 1 60^M (XEN) 08 001 01 0 0 0 0 0 1 1 68^M (XEN) 09 001 01 0 1 0 0 0 1 1 70^M (XEN) 0a 001 01 0 0 0 0 0 1 1 78^M (XEN) 0b 001 01 0 0 0 0 0 1 1 88^M (XEN) 0c 001 01 0 0 0 0 0 1 1 90^M (XEN) 0d 001 01 0 0 0 0 0 1 1 98^M (XEN) 0e 001 01 0 0 0 0 0 1 1 A0^M (XEN) 0f 001 01 0 0 0 0 0 1 1 A8^M (XEN) 10 004 04 0 1 1 1 1 1 1 B0^M (XEN) 11 000 00 1 0 0 0 0 0 0 00^M (XEN) 12 008 08 0 1 0 1 0 1 1 B8^M (XEN) 13 00F 0F 1 1 0 1 0 1 1 29^M (XEN) 14 00F 0F 1 1 0 1 0 1 1 39^M (XEN) 15 07A 0A 1 0 0 0 0 0 2 B4^M (XEN) 16 008 08 0 1 0 1 0 1 1 61^M (XEN) 17 004 04 0 1 0 1 0 1 1 D8^M (XEN) Using vector-based indexing^M (XEN) IRQ to pin mappings:^M (XEN) IRQ240 -> 0:2^M (XEN) IRQ56 -> 0:1^M (XEN) IRQ64 -> 0:3^M (XEN) IRQ72 -> 0:4^M (XEN) IRQ80 -> 0:5^M (XEN) IRQ88 -> 0:6^M (XEN) IRQ96 -> 0:7^M (XEN) IRQ104 -> 0:8^M (XEN) IRQ112 -> 0:9^M (XEN) IRQ120 -> 0:10^M (XEN) IRQ136 -> 0:11^M (XEN) IRQ144 -> 0:12^M (XEN) IRQ152 -> 0:13^M (XEN) IRQ160 -> 0:14^M (XEN) IRQ168 -> 0:15^M (XEN) IRQ176 -> 0:16^M (XEN) IRQ184 -> 0:18^M (XEN) IRQ41 -> 0:19^M (XEN) IRQ57 -> 0:20^M (XEN) IRQ97 -> 0:22^M (XEN) IRQ216 -> 0:23^M (XEN) .................................... done.^M (XEN) ^M (XEN) ****************************************^M (XEN) Panic on CPU 3:^M (XEN) CA-107844****************************************^M (XEN) ^M (XEN) Reboot in five seconds...^M (XEN) Executing crash image^M Am 19.08.2013 17:14, schrieb Thimo E.:> Hello, > > after one week of testing an intermediate result: > > Since I''ve set iommu=no-intremap no crash occured so far. The server > never ran longer without a crash. So a careful "it''s working", but, > because only one 7 days passed so far, not a final horray. > > Even if this option really avoids the problem I classify it as nothing > more than a workaround...obviously a good one because it''s working, > but still a workaround. > > Where could the problem of the source be ? 
Bug in hardware ? Bug in > software ? > > And what does interrupt remapping really do ? Does disabling remapping > have a performance impact ? > > Best regards > Thimo > > Am 12.08.2013 14:04, schrieb Andrew Cooper: >> On 12/08/13 12:52, Thimo E wrote: >>> Hello Yang, >>> >>> attached you''ll find the kernel dmesg, xen dmesg, lspci and output >>> of /proc/interrupts. If you want to see further logfiles, please let >>> me know. >>> >>> The processor is a Core i5-4670. The board is an Intel DH87MC >>> Mainboard. I am really not sure if it supports APICv, but VT-d is >>> supported enabled enabled. >>> >>> >>>> 4.The status of IRQ 29 is 10 which means the guest already issues >>>> the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there >>>> should be no pending EOI in the EOI stack. If possible, can you add >>>> some debug message in the guest EOI code path(like >>>> _irq_guest_eoi())) to track the EOI? >>>> >>> I don''t see the IRQ29 in /proc/interrupts, what I see is: >>> cat xen-dmesg.txt |grep "29": (XEN) allocated vector 29 for irq 20 >>> cat dmesg.txt | grep "eth0": [ 23.152355] e1000e 0000:00:19.0: PCI >>> INT A -> GSI 20 (level, low) -> IRQ 20 >>> [ 23.330408] >>> e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection >>> >>> So is the ethernet irq the bad one ? That is an Onboard Intel >>> network adapter. >> >> That would be consistent with the crash seen with our hardware in >> XenServer >> >>> >>>> 6.I guess the interrupt remapping is enabled in your machine. Can >>>> you try to disable IR to see whether it still reproduceable? >>>> >>> Just to be sure, your proposal is to try the parameter "no-intremap" ? >> >> specifically, iommu=no-intremap >> >>> >>> Best regards >>> Thimo >> >> ~Andrew >> >>> >>> Am 12.08.2013 10:49, schrieb Zhang, Yang Z: >>>> >>>> Hi Thimo, >>>> >>>> From your previous experience and log, it shows: >>>> >>>> 1.The interrupt that triggers the issue is a MSI. >>>> >>>> 2.MSI are treated as edge-triggered interrupts nomally, except when >>>> there is no way to mask the device. In this case, your previous log >>>> indicates the device is unmaskable(What special device are you >>>> using?Modern PCI devcie should be maskable). >>>> >>>> 3.The IRQ 29 is belong to dom0, it seems it is not a HVM related issue. >>>> >>>> 4.The status of IRQ 29 is 10 which means the guest already issues >>>> the EOI because the bit IRQ_GUEST_EOI_PENDING is cleared, so there >>>> should be no pending EOI in the EOI stack. If possible, can you add >>>> some debug message in the guest EOI code path(like >>>> _irq_guest_eoi())) to track the EOI? >>>> >>>> 5.Both of the log show when the issue occured, most of the other >>>> interrupts which owned by dom0 were in IRQ_MOVE_PENDING status. Is >>>> it a coincidence? Or it happened only on the special condition like >>>> heavy of IRQ migration?Perhaps you can disable irq balance in dom0 >>>> and pin the IRQ manually. >>>> >>> |6.I guess the interrupt remapping is enabled in your machine. Can >>> you try to disable IR to see whether it still reproduceable? >>>> >>>> Also, please provide the whole Xen log. 
>>>> >>>> Best regards, >>>> >>>> Yang >>>> >>> >> >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote:
> (XEN) **Pending EOI error
> (XEN) irq 29, vector 0x21
> (XEN) s[0] irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR 00000000
> (XEN) All LAPIC state:
> (XEN) [vector] ISR TMR IRR
> (XEN) [1f:00] 00000000 00000000 00000000
> (XEN) [3f:20] 00020002 00000000 00000000

It ought to be plain impossible to receive an interrupt at vector 0x21 while the ISR bit for vector 0x31 is still set.

Intel folks - any input on this?

Jan
Jan Beulich wrote on 2013-08-20:
>>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote:
>> (XEN) **Pending EOI error
>> (XEN) irq 29, vector 0x21
>> (XEN) s[0] irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR 00000000
>> (XEN) All LAPIC state:
>> (XEN) [vector] ISR TMR IRR
>> (XEN) [1f:00] 00000000 00000000 00000000
>> (XEN) [3f:20] 00020002 00000000 00000000
>
> It ought to be plain impossible to receive an interrupt at vector
> 0x21 while the ISR bit for vector 0x31 is still set.
>
> Intel folks - any input on this?

I have no idea with this. But I will forward the information to some experts internally for help.

> Jan

Best regards,
Yang
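For readers decoding the dump by hand: each dumped word covers 32 vectors, so the ISR value 00020002 in the [3f:20] row means bits 1 and 17 are set, i.e. vectors 0x21 and 0x31 are both in-service, and 0x21 sits in a lower priority class (vector >> 4) than 0x31. A minimal standalone C sketch of that decoding follows; it is not Xen code, only an illustration of the register layout assumed from the dump above:

/* Sketch only (not Xen source): decode one 32-bit word of the dumped
 * LAPIC state into the vectors whose bits are set. */
#include <stdio.h>
#include <stdint.h>

static void decode_lapic_word(const char *name, unsigned int base, uint32_t word)
{
    for (unsigned int bit = 0; bit < 32; bit++)
        if (word & (1u << bit))
            printf("%s: vector 0x%02x (priority class %u)\n",
                   name, base + bit, (base + bit) >> 4);
}

int main(void)
{
    /* The "[3f:20]" ISR row from the crash above. */
    decode_lapic_word("ISR", 0x20, 0x00020002); /* -> 0x21 and 0x31 */
    return 0;
}

With vector 0x31 (priority class 3) still in-service, the LAPIC should hold back anything in class 2 or lower, which is exactly why the delivery of vector 0x21 looks impossible.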
Thimo Eichstädt
2013-Aug-23 07:22 UTC
Re: cpuidle and un-eoid interrupts at the local apic
Hello Yang, any update from your side ? Did your expert have any idea ? Possible Hardware problem ? Best regards Thimo Am 20.08.2013 10:50, schrieb Zhang, Yang Z:> Jan Beulich wrote on 2013-08-20: >>>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote: >>> (XEN) **Pending EOI error^M (XEN) irq 29, vector 0x21^M (XEN) s[0] >>> irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR 00000000^M >>> (XEN) All LAPIC state:^M (XEN) [vector] ISR TMR IRR^M >>> (XEN) [1f:00] 00000000 00000000 00000000^M (XEN) [3f:20] 00020002 >>> 00000000 00000000^M >> It ought to be plain impossible to receive an interrupt at vector >> 0x21 while the ISR bit for vector 0x31 is still set. >> >> Intel folks - any input on this? > I have no idea with this. But I will forward the information to some experts internally for help. > >> Jan > > Best regards, > Yang > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Thimo Eichstädt wrote on 2013-08-23:> Hello Yang, > > any update from your side ? Did your expert have any idea ? Possible > Hardware problem ?Sorry, no update on this. I am still waiting the answer from hardware team.> > Best regards > Thimo > Am 20.08.2013 10:50, schrieb Zhang, Yang Z: >> Jan Beulich wrote on 2013-08-20: >>>>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote: >>>> (XEN) **Pending EOI error^M (XEN) irq 29, vector 0x21^M (XEN) >>>> s[0] irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR >>>> 00000000^M (XEN) All LAPIC state:^M (XEN) [vector] ISR TMR >>>> IRR^M (XEN) [1f:00] 00000000 00000000 00000000^M (XEN) [3f:20] >>>> 00020002 00000000 00000000^M >>> It ought to be plain impossible to receive an interrupt at vector >>> 0x21 while the ISR bit for vector 0x31 is still set. >>> >>> Intel folks - any input on this? >> I have no idea with this. But I will forward the information to some >> experts internally for help. >> >>> Jan >> >> Best regards, >> Yang >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-develBest regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Zhang, Yang Z wrote on 2013-08-23:> Thimo Eichstädt wrote on 2013-08-23: >> Hello Yang, >> >> any update from your side ? Did your expert have any idea ? Possible >> Hardware problem ? > Sorry, no update on this. I am still waiting the answer from hardware team.Hi Thimo, I remember that the CPU always in idle state when this issue happens. So can you have a try to disable the C state in Xen to see if it helps?> >> >> Best regards >> Thimo >> Am 20.08.2013 10:50, schrieb Zhang, Yang Z: >>> Jan Beulich wrote on 2013-08-20: >>>>>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote: >>>>> (XEN) **Pending EOI error^M (XEN) irq 29, vector 0x21^M (XEN) s[0] >>>>> irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR >>>>> 00000000^M (XEN) All LAPIC state:^M (XEN) [vector] ISR TMR >>>>> IRR^M (XEN) [1f:00] 00000000 00000000 00000000^M (XEN) [3f:20] >>>>> 00020002 00000000 00000000^M >>>> It ought to be plain impossible to receive an interrupt at vector >>>> 0x21 while the ISR bit for vector 0x31 is still set. >>>> >>>> Intel folks - any input on this? >>> I have no idea with this. But I will forward the information to >>> some experts internally for help. >>> >>>> Jan >>> >>> Best regards, >>> Yang >>> >>> >>> _______________________________________________ >>> Xen-devel mailing list >>> Xen-devel@lists.xen.org >>> http://lists.xen.org/xen-devel > > > Best regards, > Yang >Best regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
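For anyone wanting to try that suggestion: limiting C states for such a test is usually done on the Xen boot command line. A sketch only - the option names below come from the Xen command-line documentation, not from this thread:

    cpuidle=0       disable Xen's cpuidle handling entirely
    max_cstate=1    keep cpuidle, but allow nothing deeper than C1

Either should be enough to rule C states in or out as a factor.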
On 04/09/13 19:32, Thimo E. wrote:
> Hello again,
>
> the last two weeks no crash with pinning dom0_vcpus_pin and
> restricting dom0 to 1 cpu. But yesterday it crashed again. So changed
> the command line again to:
>
> iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0
> console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M
> watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M
> cpuid_mask_xsave_eax=0
>
> And today the server crashed again and produced a lot of debugging
> messages, see attached. The "..." in the logfiles mean that the
> message above the dots was repeated very often.
>
> My summary so far:
> - With only 1 cpu attached to dom0 the server was stable for 2 weeks;
>   the crash there did not really show any irq problems, see
>   crash20130903.txt. You can find Andrew's ideas on this in
>   http://forums.citrix.com/thread.jspa?messageID=1760771#1760771
> - With more than 1 cpu and irqbalance the server produced the crashes
>   I've already posted before
> - Without irqbalance it crashed with some other fancy output, see
>   crash20130904.txt
>
> Next step is to change the network card.
>
> Zhang, any update from your side? Or do the others have any idea?
> Could "ioapic_ack=old" help somewhere?
>
> Best regards
> Thimo

Ok - the second attachment (crash20130903.txt) is the one I have triaged before, and the crash is impossible given the expected code flow through the function.

%r14 is calculated as the per-cpu cpu_info, which cannot possibly be -1 at the point of the fault. The only explanation is that the pagefault is a result of a spurious jump to this location.

From a quick glance at the other crash, vector 2e was the problematic one (iirc). The "Bad vmexit (reason 3)" at the top would suggest that something on the system has sent an INIT to pcpu 2, which seems antisocial.

As we have identified that the hardware is delivering invalid interrupts, I wouldn't necessarily read any more into this new crash; something is very broken in the hardware.

I would be interested in any update from Intel regarding the ISR violation.

~Andrew
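As an aside, the reading of "reason 3" as an INIT signal follows the VMX basic exit reason numbering; the small standalone C sketch below is an illustration only, with the table assumed from the Intel SDM rather than taken from Xen's sources:

/* Sketch: map the low VMX basic exit reason numbers to names, to back
 * up the reading of "Bad vmexit (reason 3)" as an INIT signal. */
#include <stdio.h>

static const char *vmexit_reason_name(unsigned int reason)
{
    static const char *const names[] = {
        "Exception or NMI",    /* 0 */
        "External interrupt",  /* 1 */
        "Triple fault",        /* 2 */
        "INIT signal",         /* 3 */
        "Start-up IPI (SIPI)", /* 4 */
    };
    return reason < 5 ? names[reason] : "(not in this sketch)";
}

int main(void)
{
    printf("reason 3 = %s\n", vmexit_reason_name(3));
    return 0;
}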
Hello Andrew, thanks for your response. At least I''ve seen the trigger of the new crash (2e) already before, so they seem so belong together. I can''t image that I am the only one on the world who is using a haswell board. And as I haven''t seen any other Xen bug/crash reports like mine (and one time you) nor bug reports from users with other operating systems, I ask myself if only my hardware is buggy or if other operating systems handle those "spurious" interrupts in another way ?!?! What does " ioapic_ack=old" change ? Best regards Thimo Am 04.09.2013 20:55, schrieb Andrew Cooper:> On 04/09/13 19:32, Thimo E. wrote: >> Hello again, >> >> the last two weeks no crash with pinning dom0_vcpus_pin and >> restricting dom0 to 1 cpu. But yesterday it crashed again. So changed >> the command line again to: >> >> iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0 >> console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M >> watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M >> cpuid_mask_xsave_eax=0 >> >> And today server crashed again and produced a lot of debugging >> messages, see attached. The "..." in the logfiles mean that the >> message above the points was repeated very often. >> >> My summary so far: >> - With only 1 cpu atteched to dom0 the server was stable for 2 weeks, >> the crash there did not really show any irq problems, see >> crash20130903.txt >> You can find Andrews ideas to this in >> http://forums.citrix.com/thread.jspa?messageID=1760771#1760771 >> - With more than 1 cpu and irqbalance the server produced the crashes >> I''ve already posted before >> - Without irqbalance crash with some other fancy output, see >> crash20130904.txt >> >> Next step is to change the network card. >> >> Zhang, any update from your side ? Or do the others have any idea ? >> Could "ioapic_ack=old" help somewhere ? >> >> Best regards >> Thimo >> > Ok - the second attachment (crash20130903.txt) is the one I have triaged > before, and the crash is impossible given the expected code flow through > the function. > > %r14 is calculated as a the per-cpu cpu_info, which cannot possibly be > -1 at the point of the fault. The only explanation is that the > pagefault is a result of a spurious jump to this location. > > From a quick glance at the other crash, vector 2e was the problematic > one (iirc). The "Bad vmexit (reason 3)" at the top would suggest that > something on the system has sent an INIT to pcpu 2, which seems antisocial. > > As we have identified that the hardware is delivering invalid > interrupts, I wouldn''t necessarily read any more into this new crash; > something is very broken in the hardware. > > I would be interested for any update from Intel regarding the ISR violation. > > ~Andrew
On 04/09/2013 20:56, Thimo E. wrote:
> Hello Andrew,
>
> thanks for your response. At least I've seen the trigger of the new
> crash (2e) already before, so they seem to belong together.
>
> I can't imagine that I am the only one in the world who is using a
> Haswell board. And as I haven't seen any other Xen bug/crash reports
> like mine (apart from yours) nor bug reports from users with other
> operating systems, I ask myself whether only my hardware is buggy or
> whether other operating systems handle those "spurious" interrupts in
> another way.
>
> What does "ioapic_ack=old" change?
>
> Best regards
> Thimo

ioapic_ack=old is already in effect - see "Enabled directed EOI with ioapic_ack_old on!" in the boot dmesg.

Originally, it was a bugfix workaround for ancient IO-APIC hardware which had a bug on one of the mask bits. Nowadays, it is used with EOI broadcast suppression, which is an APIC transaction performance improvement on recent processors. What it does is affect whether an IO-APIC interrupt gets masked when an interrupt is received.

You could certainly try "ioapic_ack=new" and see whether that makes a difference, given a lack of any other ideas. It will disable EOI broadcast suppression.

~Andrew
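Concretely, trying that only means adding one more token to the Xen command line quoted earlier in the thread; a sketch, since the exact bootloader stanza depends on how the host is configured:

    iommu=no-intremap noirqbalance ... cpuid_mask_xsave_eax=0 ioapic_ack=new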
Thimo E. wrote on 2013-09-05:> Hello again, > > the last two weeks no crash with pinning dom0_vcpus_pin and > restricting > dom0 to 1 cpu. But yesterday it crashed again. So changed the command > line again to: > > iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0 > console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M > watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M > cpuid_mask_xsave_eax=0 > > And today server crashed again and produced a lot of debugging > messages, see attached. The "..." in the logfiles mean that the > message above the points was repeated very often. > > My summary so far: > - With only 1 cpu atteched to dom0 the server was stable for 2 weeks, > the crash there did not really show any irq problems, see crash20130903.txt > You can find Andrews ideas to this in > http://forums.citrix.com/thread.jspa?messageID=1760771#1760771 - With > more than 1 cpu and irqbalance the server produced the crashes I've > already posted before - Without irqbalance crash with some other fancy > output, see crash20130904.txt > > Next step is to change the network card. > > Zhang, any update from your side ? Or do the others have any idea ?Our hardware guys said they don't aware of such issue with this CPU. We are trying to find the same platform to reproduce now.> Could "ioapic_ack=old" help somewhere ? > > Best regards > Thimo > Am 27.08.2013 03:03, schrieb Zhang, Yang Z: >> Zhang, Yang Z wrote on 2013-08-23: >>> Thimo Eichstädt wrote on 2013-08-23: >>>> Hello Yang, >>>> >>>> any update from your side ? Did your expert have any idea ? >>>> Possible Hardware problem ? >>> Sorry, no update on this. I am still waiting the answer from hardware team. >> Hi Thimo, >> >> I remember that the CPU always in idle state when this issue happens. >> So can you have a try to disable the C state in Xen to see if it helps? >> >>>> Best regards >>>> Thimo >>>> Am 20.08.2013 10:50, schrieb Zhang, Yang Z: >>>>> Jan Beulich wrote on 2013-08-20: >>>>>>>>> On 20.08.13 at 07:43, Thimo Eichstädt<thimoe@digithi.de> wrote: >>>>>>> (XEN) **Pending EOI error^M (XEN) irq 29, vector 0x21^M (XEN) s[0] >>>>>>> irq 30, vec 0x31, ready 0, ISR 00000001, TMR 00000000, IRR >>>>>>> 00000000^M (XEN) All LAPIC state:^M (XEN) [vector] ISR TMR >>>>>>> IRR^M (XEN) [1f:00] 00000000 00000000 00000000^M (XEN) [3f:20] >>>>>>> 00020002 00000000 00000000^M >>>>>> It ought to be plain impossible to receive an interrupt at vector >>>>>> 0x21 while the ISR bit for vector 0x31 is still set. >>>>>> >>>>>> Intel folks - any input on this? >>>>> I have no idea with this. But I will forward the information to >>>>> some experts internally for help. >>>>> >>>>>> Jan >>>>> Best regards, >>>>> Yang >>>>> >>>>> >>>>> _______________________________________________ >>>>> Xen-devel mailing list >>>>> Xen-devel@lists.xen.org >>>>> http://lists.xen.org/xen-devel >>> >>> Best regards, >>> Yang >>> >> >> Best regards, >> Yang >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-develBest regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper wrote on 2013-09-05:> On 04/09/2013 20:56, Thimo E. wrote: >> Hello Andrew, >> >> thanks for your response. At least I''ve seen the trigger of the new >> crash (2e) already before, so they seem so belong together. >> >> I can''t image that I am the only one on the world who is using a >> haswell board. And as I haven''t seen any other Xen bug/crash reports >> like mine (and one time you) nor bug reports from users with other >> operating systems, I ask myself if only my hardware is buggy or if >> other operating systems handle those "spurious" interrupts in >> another way ?!?! >> >> What does " ioapic_ack=old" change ? >> >> Best regards >> Thimo > > ioapic_ack=old is already in effect - see "Enabled directed EOI with > ioapic_ack_old on!" in the boot dmesg. > > Originally, it was a bugfix workaround for ancient IO-APIC hardware > which had a bug on one of the mask bits. Nowadays, it is used with > EOI broadcast suppression, which is a APIC transaction performance > improvement on recent processors. What it does is affect whether an > IO-APIC interrupt gets masked when an interrupt is received. > > You could certainly try "ioapic_ack=new" and see whether that makes a > difference, given a lack of any other ideas. It will disable EOI > broadcast suppression. > > ~AndrewHi Thimo Did you see this issue if and only if HVM guest running? If yes, can you try to isolate the dom0 VCPUs and HVM guest''s VCPUs? For example, pin all dom0''s VCPUs to some PCPUs and pin all HVM guest''s VCPUs to the remain PCPUs. BTW: you didn''t try the device pass-through when the issue occurs? Best regards, Yang
Am 05.09.2013 03:45, schrieb Zhang, Yang Z:
> Hi Thimo
>
> Did you see this issue if and only if HVM guest running? If yes, can you
> try to isolate the dom0 VCPUs and HVM guest's VCPUs? For example, pin all
> dom0's VCPUs to some PCPUs and pin all HVM guest's VCPUs to the remain PCPUs.

I have both PV guests and HVM guests. I've tried shutting down all the HVM guests and using only PV guests; the problem is still there. I did not try the other way around (only HVM guests), because this is a production system and the PV guests are needed.

> BTW: you didn't try the device pass-through when the issue occurs?

I am not using device pass-through.

Best regards
Thimo
Hello again, I''ve disabled the internal network card and used another one, problem still exists. I had two crashed during 5 minutes, frustrating. So (assuming disabling the internal card in the bios is working) the source of the problem is not the internal NIC. Every time the pending EOI error occurs I see the mysterious interrupt >>29<<. Only the vectors are changing. See below a summary of the last 5 crashes. My Questions: - How can I see to which hardware device int 29 belongs ? I can''t find int 29 in /proc/interrupts or lspci -vv nor in kernel dmesg or xen dmesg ?!?! - Andrew, what does your output "domain-list=0:276" mean and why is it alway 0:276 for interrupt 29 ? Is it the VM number ? 1) (XEN) irq 29, vector 0x21 (XEN) IRQ: 29 affinity:4 vec:21 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), 2) (XEN) irq 29, vector 0x26 (XEN) IRQ: 29 affinity:8 vec:26 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), 3) (XEN) irq 29, vector 0x31 (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), 4) (XEN) irq 29, vector 0x2e (XEN) IRQ: 29 affinity:8 vec:7e type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), 5) (XEN) irq 29, vector 0x3b (XEN) IRQ: 29 affinity:2 vec:3b type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----), Best regards Thimo Am 14.08.2013 11:52, schrieb Andrew Cooper:> On 14/08/13 03:53, Zhang, Yang Z wrote: >> Andrew Cooper wrote on 2013-08-12: >>> >>> On the XenServer hardware where we have seen this issue, the >>> problematic interrupt was from: >>> >>> 00:19.0 Ethernet controller: Intel Corporation Ethernet Connection >>> I217-LM (rev 02) Subsystem: Intel Corporation Device 0000 Control: I/O+ >>> Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- >>> FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >>>> TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin >>> A routed to IRQ 1275 Region 0: Memory at c2700000 (32-bit, >>> non-prefetchable) [size=128K] Region 1: Memory at c273e000 (32-bit, >>> non-prefetchable) [size=4K] Region 2: I/O ports at 7080 [size=32] >>> Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- >>> D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- >>> PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] MSI: Enable+ >>> Count=1/1 Maskable- 64bit+ Address: 00000000fee00318 Data: 0000 >>> Capabilities: [e0] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- >>> AFStatus: TP- Kernel driver in use: e1000e Kernel modules: e1000e >>> >>> I am still attempting to reproduce the issue, but we haven''t seen it >>> again since my email at the root of this thread. >> Did you see the issue on other HSW machine without this NIC? Also, Thimo, have you tried to pin the vcpu and stop irqbalance in dom0? > We do not have any Haswell hardware without this NIC. > > ~Andrew >
On 07/09/2013 14:27, Thimo E. wrote:
> Hello again,
>
> I've disabled the internal network card and used another one, problem
> still exists. I had two crashes during 5 minutes, frustrating.
> So (assuming disabling the internal card in the bios is working) the
> source of the problem is not the internal NIC.
>
> Every time the pending EOI error occurs I see the mysterious interrupt
> >>29<<. Only the vectors are changing. See below a summary of the last
> 5 crashes.
>
> My Questions:
> - How can I see to which hardware device int 29 belongs? I can't find
>   int 29 in /proc/interrupts or lspci -vv nor in kernel dmesg or xen dmesg.
> - Andrew, what does your output "domain-list=0:276" mean and why is it
>   always 0:276 for interrupt 29? Is it the VM number?
>
> 1)
> (XEN) irq 29, vector 0x21
> (XEN) IRQ: 29 affinity:4 vec:21 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),
>
> 2)
> (XEN) irq 29, vector 0x26
> (XEN) IRQ: 29 affinity:8 vec:26 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),
>
> 3)
> (XEN) irq 29, vector 0x31
> (XEN) IRQ: 29 affinity:2 vec:24 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),
>
> 4)
> (XEN) irq 29, vector 0x2e
> (XEN) IRQ: 29 affinity:8 vec:7e type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),
>
> 5)
> (XEN) irq 29, vector 0x3b
> (XEN) IRQ: 29 affinity:2 vec:3b type=PCI-MSI status=00000010 in-flight=0 domain-list=0:276(----),

irq 29 is just an internal Xen number for accounting all interrupts. It doesn't mean anything specific regarding hardware etc. The vector and affinity would be expected to change as dom0's vcpus are moved around by the scheduler.

domain-list=0 means that this interrupt is targeted at dom0 (it is a list because certain interrupts have to be shared by more than 1 domain). Helpfully, the keyhandler truncates the pirq field, so 276 is unlikely to be correct. As it is a dom0 MSI, I am guessing it actually matches up with interrupt 1276 in /proc/interrupts, if there is one.

Can you provide the results of `xl debug-keys iMQ`, and attach /proc/interrupts to this email (just in case the setup has changed after playing with your BIOS)?

~Andrew
Hello Andrew,

ok, thanks. This is what I assumed.

The output of "xl debug-keys iMQ" is empty.

[root@localhost ~]# dmesg |grep arcmsr
[ 8.159321] arcmsr 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 8.159413] arcmsr 0000:01:00.0: setting latency timer to 64
[ 8.170316] arcmsr 0000:01:00.0: get owner: 7ff0
[ 8.170414] arcmsr 0000:01:00.0: irq 1276 (276) for MSI/MSI-X
[ 8.170421] IRQ 1276/arcmsr: IRQF_DISABLED is not guaranteed on shared IRQs
[ 8.170654] arcmsr0: msi enabled

[root@localhost /]# cat /proc/irq/1276/spurious
count 61007
unhandled 8
last_unhandled 36736990 ms

arcmsr is the driver of the Areca Storage Raid Controller. I used it before with XenServer 6.0.2 for years, no problems.

The messages "IRQF_DISABLED is not guaranteed..." and "8 unhandled interrupts" look interesting. I am not a kernel hacker, but this is what I interpret from http://lxr.free-electrons.com/source/kernel/irq/manage.c?v=2.6.32:

1025         if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) ==
1026                         (IRQF_SHARED|IRQF_DISABLED)) {
1027                 pr_warning(
1028                   "IRQ %d/%s: IRQF_DISABLED is not guaranteed on shared IRQs\n",
1029                         irq, devname);
...
 738          * Force MSI interrupts to run with interrupts
 739          * disabled. The multi vector cards can cause stack
 740          * overflows due to nested interrupts when enough of
 741          * them are directed to a core and fire at the same
 742          * time.
 743          */
 744         if (desc->msi_desc)
 745                 new->flags |= IRQF_DISABLED;

--> The "IRQF_DISABLED is not guaranteed on shared IRQs" warning is only printed when both IRQF_SHARED and IRQF_DISABLED are set in irqflags.
--> Is the stack overflow that the comment in lines 738-742 talks about what we see in the kernel oops?!
--> IRQF_SHARED is set, so MSI interrupt 1276 is shared?! I thought that it is not possible for MSI interrupts to be shared. Attached you'll see my /proc/interrupts.

So what I do now is disabling MSI for the arcmsr driver. Could this be the source of the problem?! But why is 1276 shared?!

Best regards
Thimo

Am 07.09.2013 19:02, schrieb Andrew Cooper:
>
> irq 29 is just an internal Xen number for accounting all interrupts. It
> doesn't mean anything specific regarding hardware etc. The vector and
> affinity would be expected to change as dom0's vcpus are moved around by
> the scheduler.
>
> domain-list=0 means that this interrupt is targeted at dom0 (it is a
> list because certain interrupts have to be shared by more than 1
> domain). Helpfully, the keyhandler truncates the pirq field, so 276 is
> unlikely to be correct. As it is a dom0 MSI, I am guessing it actually
> matches up with interrupt 1276 in /proc/interrupts, if there is one.
>
> Can you provide the results of `xl debug-keys iMQ`, and attach
> /proc/interrupts to this email (just in case the setup has changed after
> playing with your BIOS)?
>
> ~Andrew
On 08/09/2013 00:37, Thimo E. wrote:> Hello Andrew, > > ok, thanks. This is what I assumed. > > The output of "xl debug-keys iMQ" is empty.Sorry - I should have been more clear. `xl debug-keys` dumps its information into the xen dmesg buffer, so `xl dmesg` will capture the results. ~Andrew> > [root@localhost ~]# dmesg |grep arcmsr > [ 8.159321] arcmsr 0000:01:00.0: PCI INT A -> GSI 16 (level, low) > -> IRQ 16 > [ 8.159413] arcmsr 0000:01:00.0: setting latency timer to 64 > [ 8.170316] arcmsr 0000:01:00.0: get owner: 7ff0 > [ 8.170414] arcmsr 0000:01:00.0: irq 1276 (276) for MSI/MSI-X > [ 8.170421] IRQ 1276/arcmsr: IRQF_DISABLED is not guaranteed on > shared IRQs > [ 8.170654] arcmsr0: msi enabled > > [root@localhost /]# cat /proc/irq/1276/spurious > count 61007 > unhandled 8 > last_unhandled 36736990 ms > > arcmsr is the driver of the Areca Storage Raid Controller. Used it > already before with Xenserver 6.0.2 for years, no problems. > > THe messages "IRQF_DISABLED is not guaranteed...." and "8 unhandled > interrupts" look interesting. I am not a kernel hacker but what I > interpret from > http://lxr.free-electrons.com/source/kernel/irq/manage.c?v=2.6.32: > > 1025 if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) => 1026 (IRQF_SHARED|IRQF_DISABLED)) { > 1027 pr_warning( > 1028 "IRQ %d/%s: IRQF_DISABLED is not guaranteed on > shared IRQs\n", > 1029 irq, devname); > ... > 738 * Force MSI interrupts to run with interrupts > 739 * disabled. The multi vector cards can cause stack > 740 * overflows due to nested interrupts when enough of > 741 * them are directed to a core and fire at the same > 742 * time. > 743 */ > 744 if (desc->msi_desc) > 745 new->flags |= IRQF_DISABLED; > > --> "IRQF_DISABLED is not guaranteed on shared IRQs" warning is only > printed when irqflags IRQF_SHARED and IRQF_DISABLED are set > --> Is that what we see in the kernel oops the stack overflow the > comment in lines 738-742 is talking about ?! > --> IRQF_SHARED is set, so MSI interrupt 1276 is shared ?! I thought > that it is not possible that MSI interrupts are shared. Attached > you''ll see my /proc/interrupts > > So what I do now is disabling MSI for the arcmsr driver. Could this be > the source of the problem ?! But why is 1276 shared ?! > > Best regards > Thimo > > Am 07.09.2013 19:02, schrieb Andrew Cooper: >> >> irq 29 is just an internal Xen number for accounting all interrupts. It >> doesn''t mean anything specific regarding hardware etc. The vector and >> affinity would expect to change as dom0s vcpus are moved around by the >> scheduler. >> >> domain-list=0 means that this interrupt is targeted at dom0 (It is a >> list because certain interrupts have to be shared my more than 1 >> domain). Helpfully, the keyhandler truncates the pirq field, so 276 is >> unlikely to be correct. As it is a dom0 MSI, I am guessing it actually >> matches up with interrupt 1276 in /proc/interrupts, if there is one. >> >> Can you provide the results of `xl debug-keys iMQ`, and attach >> /proc/interrupts to this email (just in case the setup has changed after >> playing with your BIOS) >> >> ~Andrew >> >
Ah, sorry. Output is attached. Am 08.09.2013 11:53, schrieb Andrew Cooper:> On 08/09/2013 00:37, Thimo E. wrote: >> Hello Andrew, >> >> ok, thanks. This is what I assumed. >> >> The output of "xl debug-keys iMQ" is empty. > Sorry - I should have been more clear. `xl debug-keys` dumps its > information into the xen dmesg buffer, so `xl dmesg` will capture the > results. > > ~Andrew > >> [root@localhost ~]# dmesg |grep arcmsr >> [ 8.159321] arcmsr 0000:01:00.0: PCI INT A -> GSI 16 (level, low) >> -> IRQ 16 >> [ 8.159413] arcmsr 0000:01:00.0: setting latency timer to 64 >> [ 8.170316] arcmsr 0000:01:00.0: get owner: 7ff0 >> [ 8.170414] arcmsr 0000:01:00.0: irq 1276 (276) for MSI/MSI-X >> [ 8.170421] IRQ 1276/arcmsr: IRQF_DISABLED is not guaranteed on >> shared IRQs >> [ 8.170654] arcmsr0: msi enabled >> >> [root@localhost /]# cat /proc/irq/1276/spurious >> count 61007 >> unhandled 8 >> last_unhandled 36736990 ms >> >> arcmsr is the driver of the Areca Storage Raid Controller. Used it >> already before with Xenserver 6.0.2 for years, no problems. >> >> THe messages "IRQF_DISABLED is not guaranteed...." and "8 unhandled >> interrupts" look interesting. I am not a kernel hacker but what I >> interpret from >> http://lxr.free-electrons.com/source/kernel/irq/manage.c?v=2.6.32: >> >> 1025 if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) =>> 1026 (IRQF_SHARED|IRQF_DISABLED)) { >> 1027 pr_warning( >> 1028 "IRQ %d/%s: IRQF_DISABLED is not guaranteed on >> shared IRQs\n", >> 1029 irq, devname); >> ... >> 738 * Force MSI interrupts to run with interrupts >> 739 * disabled. The multi vector cards can cause stack >> 740 * overflows due to nested interrupts when enough of >> 741 * them are directed to a core and fire at the same >> 742 * time. >> 743 */ >> 744 if (desc->msi_desc) >> 745 new->flags |= IRQF_DISABLED; >> >> --> "IRQF_DISABLED is not guaranteed on shared IRQs" warning is only >> printed when irqflags IRQF_SHARED and IRQF_DISABLED are set >> --> Is that what we see in the kernel oops the stack overflow the >> comment in lines 738-742 is talking about ?! >> --> IRQF_SHARED is set, so MSI interrupt 1276 is shared ?! I thought >> that it is not possible that MSI interrupts are shared. Attached >> you''ll see my /proc/interrupts >> >> So what I do now is disabling MSI for the arcmsr driver. Could this be >> the source of the problem ?! But why is 1276 shared ?! >> >> Best regards >> Thimo >> >> Am 07.09.2013 19:02, schrieb Andrew Cooper: >>> irq 29 is just an internal Xen number for accounting all interrupts. It >>> doesn''t mean anything specific regarding hardware etc. The vector and >>> affinity would expect to change as dom0s vcpus are moved around by the >>> scheduler. >>> >>> domain-list=0 means that this interrupt is targeted at dom0 (It is a >>> list because certain interrupts have to be shared my more than 1 >>> domain). Helpfully, the keyhandler truncates the pirq field, so 276 is >>> unlikely to be correct. As it is a dom0 MSI, I am guessing it actually >>> matches up with interrupt 1276 in /proc/interrupts, if there is one. >>> >>> Can you provide the results of `xl debug-keys iMQ`, and attach >>> /proc/interrupts to this email (just in case the setup has changed after >>> playing with your BIOS) >>> >>> ~Andrew >>>_______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 07.09.13 at 19:02, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> domain-list=0 means that this interrupt is targeted at dom0 (it is a
> list because certain interrupts have to be shared by more than 1
> domain). Helpfully, the keyhandler truncates the pirq field, so 276 is
> unlikely to be correct. As it is a dom0 MSI, I am guessing it actually
> matches up with interrupt 1276 in /proc/interrupts, if there is one.

What truncation are you seeing here? %3d merely pads the number with spaces if it ends up being less than three digits. Wider numbers still get printed in full. Whether that padding is really useful is another question...

Jan
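A quick standalone check of that behaviour (a sketch only, not the actual keyhandler code):

#include <stdio.h>

int main(void)
{
    printf("[%3d]\n", 76);   /* prints "[ 76]"  - padded to three characters  */
    printf("[%3d]\n", 1276); /* prints "[1276]" - wider values appear in full */
    return 0;
}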
On 09/09/13 08:59, Jan Beulich wrote:
>>>> On 07.09.13 at 19:02, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> domain-list=0 means that this interrupt is targeted at dom0 (it is a
>> list because certain interrupts have to be shared by more than 1
>> domain). Helpfully, the keyhandler truncates the pirq field, so 276 is
>> unlikely to be correct. As it is a dom0 MSI, I am guessing it actually
>> matches up with interrupt 1276 in /proc/interrupts, if there is one.
> What truncation are you seeing here? %3d merely pads the number
> with spaces if it ends up being less than three digits. Wider numbers
> still get printed in full. Whether that padding is really useful is another
> question...
>
> Jan

Yes - very true. Which means there is some other reason for the 1000 discrepancy between Xen's and dom0's ideas of dom0's pirqs.

~Andrew
On 08/09/13 11:24, Thimo E. wrote:> Ah, sorry. Output is attached.So in this case, irq29 is now your SATA controller. I presume you are still falling over the same basic assertion for the pending EOI stack? ~Andrew
Thimo Eichstädt
2013-Sep-09 14:48 UTC
Re: cpuidle and un-eoid interrupts at the local apic
Hello Andrew, I''ve disabled MSI on that controller, now it is running with level triggered IRQs. No crash so far with these settings. But what I see are a lot of spurious interrupts for every type of IRQ on my machine, Here an example: [root@localhost /]# cat /proc/irq/1276/spurious count 61007 unhandled 0 last_unhandled 36736990 ms I can see this for the ethernet irqs, usb, sata and so on. I''ve already written it into another mail on Sunday: >http://lxr.free-electrons.com/source/kernel/irq/manage.c?v=2.6.32: >1025 if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) = >1026 (IRQF_SHARED|IRQF_DISABLED)) { >1027 pr_warning( >1028 "IRQ %d/%s: IRQF_DISABLED is not guaranteed on >shared IRQs\n", >1029 irq, devname); >... >738 * Force MSI interrupts to run with interrupts >739 * disabled. The multi vector cards can cause stack >740 * overflows due to nested interrupts when enough of >741 * them are directed to a core and fire at the same >742 * time. >743 */ >744 if (desc->msi_desc) >745 new->flags |= IRQF_DISABLED; --> When using MSI on the SATA controller the kernel indicates me that IRQF_SHARED for that interrupt is set, so the MSI is shared ?! I thought that it is not possible that MSI interrupts are shared. --> Is that what we see in the kernel oops the stack overflow the comment in lines 738-742 is talking about ?! Espacially because the warning in 1028 tells me that IRQF_DISABLED might not be set on shared interrupts. Am 09.09.2013 15:16, schrieb Andrew Cooper:> So in this case, irq29 is now your SATA controller. > > I presume you are still falling over the same basic assertion for the > pending EOI stack? > > ~Andrew >
On 09/09/13 15:48, Thimo Eichstädt wrote:> Hello Andrew, > > I''ve disabled MSI on that controller, now it is running with level > triggered IRQs. No crash so far with these settings. > > But what I see are a lot of spurious interrupts for every type of IRQ > on my machine, Here an example:Given the nature of the problem, I am not surprised in the slightest that there are spurious interrupts.> > [root@localhost /]# cat /proc/irq/1276/spurious > count 61007 > unhandled 0 > last_unhandled 36736990 ms > > I can see this for the ethernet irqs, usb, sata and so on.Line level interrupts are shared between multiple pieces of hardware, leading to the possibility that no device driver claims the interrupt (which is when the interrupt is declared as spurious)> > I''ve already written it into another mail on Sunday: > > >http://lxr.free-electrons.com/source/kernel/irq/manage.c?v=2.6.32: > >1025 if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) => >1026 (IRQF_SHARED|IRQF_DISABLED)) { > >1027 pr_warning( > >1028 "IRQ %d/%s: IRQF_DISABLED is not guaranteed on > >shared IRQs\n", > >1029 irq, devname); > >... > >738 * Force MSI interrupts to run with interrupts > >739 * disabled. The multi vector cards can cause stack > >740 * overflows due to nested interrupts when enough of > >741 * them are directed to a core and fire at the same > >742 * time. > >743 */ > >744 if (desc->msi_desc) > >745 new->flags |= IRQF_DISABLED; > > --> When using MSI on the SATA controller the kernel indicates me that > IRQF_SHARED for that interrupt is set, so the MSI is shared ?! I > thought that it is not possible that MSI interrupts are shared. > --> Is that what we see in the kernel oops the stack overflow the > comment in lines 738-742 is talking about ?! Espacially because the > warning in 1028 tells me that IRQF_DISABLED might not be set on shared > interrupts.I suspect that this is a red herring. It looks like a generic error path for both legacy interrupts and msi interrupts. Furthermore, dom0''s interrupt handling is rather different under Xen, not least as the event channel mechanism essentially serialises the delivery of interrupts. ~Andrew
Zhang, Yang Z wrote on 2013-09-05:> Thimo E. wrote on 2013-09-05: >> Hello again, >> >> the last two weeks no crash with pinning dom0_vcpus_pin and >> restricting >> dom0 to 1 cpu. But yesterday it crashed again. So changed the >> command line again to: >> >> iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0 >> console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M >> watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M >> cpuid_mask_xsave_eax=0 >> >> And today server crashed again and produced a lot of debugging >> messages, see attached. The "..." in the logfiles mean that the >> message above the points was repeated very often. >> >> My summary so far: >> - With only 1 cpu atteched to dom0 the server was stable for 2 >> weeks, the crash there did not really show any irq problems, see crash20130903.txt >> You can find Andrews ideas to this in >> http://forums.citrix.com/thread.jspa?messageID=1760771#1760771 - >> With more than 1 cpu and irqbalance the server produced the crashes >> I've already posted before - Without irqbalance crash with some >> other fancy output, see crash20130904.txt >> >> Next step is to change the network card. >> >> Zhang, any update from your side ? Or do the others have any idea ? > Our hardware guys said they don't aware of such issue with this CPU. > We are trying to find the same platform to reproduce now.Hi, Thimo, I cannot reproduce this issue in my box after running about two weeks: I started four guests (two PV guests and two HVM guests). And each guest runs a simple workload (ping a remote machine). After two weeks, the machine still works no crash and panic happen. Are there any special workload required to reproduce this issue? Attached the cpuinfo and pci info in my box. Please compare it with yours to see whether it is same. Especially, the microcode version. Best regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Hello Yang,

The problem that we are discussing here seems to come from the interaction of my raid controller (Areca ARC-1212) and the hardware. When I enable MSI on that controller the server crashes with the known error messages after between 1 and 7 days. When I disable MSI I don't see these crashes anymore.

Andrew doesn't have this controller, but he could also observe this type of crash with another MSI-enabled card, I think the network card. But less often, I think 2 times in the last 4 months.

Neither my cpuinfo nor dmesg shows any microcode information; perhaps Xen hides that info?!

Best regards
Thimo

Am 17.09.2013 04:09, schrieb Zhang, Yang Z:
> Hi, Thimo,
>
> I cannot reproduce this issue in my box after running about two weeks:
> I started four guests (two PV guests and two HVM guests). And each guest
> runs a simple workload (ping a remote machine). After two weeks, the
> machine still works; no crash and panic happen. Are there any special
> workload required to reproduce this issue?
>
> Attached the cpuinfo and pci info in my box. Please compare it with yours
> to see whether it is same. Especially, the microcode version.
>
> Best regards,
> Yang
Thimo E. wrote on 2013-09-17:> Hello Yang, > > The problem that we are discussing here seems to come from the > interaction of my raid controller (Areca, ARC-1212) and the hardware. > When I enable MSI on that controller the server crashes with the known > error messages between 1 day and 7 days. When I disable MSI I don''t > see these crashes anymore. > > Andrew doesn''t have this controller but he could also observe these > type of crashes with another MSI enabled card, I think the networc > card. But less often, I think 2 times in the last 4 months. > > My cpuinfo nor dmesg show any microcode information, perhaps Xen hides > that info ?!''cat /proc/cpuinfo'' will show the microcode version.> > Best regards > Thimo > Am 17.09.2013 04:09, schrieb Zhang, Yang Z: >> Hi, Thimo, >> >> I cannot reproduce this issue in my box after running about two weeks: >> I started four guests (two PV guests and two HVM guests). And each >> guest > runs a simple workload (ping a remote machine). After two weeks, the > machine still works no crash and panic happen. Are there any special > workload required to reproduce this issue? > >> >> Attached the cpuinfo and pci info in my box. Please compare it with >> yours to see whether it is same. Especially, the microcode version. >> >> Best regards, >> Yang >> >>Best regards, Yang
Hello, unfortunately the Xenserver kernel seems to not support reading the microcode, at least it is not populated in /proc/cpuinfo. Andrew, are there any special tricks to get the version out of the Xenserver kernel ? Best regards Thimo Am 17.09.2013 09:43, schrieb Zhang, Yang Z:> Thimo E. wrote on 2013-09-17: >> Hello Yang, >> >> The problem that we are discussing here seems to come from the >> interaction of my raid controller (Areca, ARC-1212) and the hardware. >> When I enable MSI on that controller the server crashes with the known >> error messages between 1 day and 7 days. When I disable MSI I don''t >> see these crashes anymore. >> >> Andrew doesn''t have this controller but he could also observe these >> type of crashes with another MSI enabled card, I think the networc >> card. But less often, I think 2 times in the last 4 months. >> >> My cpuinfo nor dmesg show any microcode information, perhaps Xen hides >> that info ?! > ''cat /proc/cpuinfo'' will show the microcode version. > Best regards, > Yang >
Probably you can use a non-Xen kernel to boot up the system, and then you can get the micro-code in cpuinfo. Thanks! Xiantao -----Original Message----- From: Thimo E. [mailto:abc@digithi.de] Sent: Wednesday, September 18, 2013 5:05 AM To: Zhang, Yang Z Cc: Keir Fraser; Jan Beulich; Andrew Cooper; Dong, Eddie; Xen-develList; Nakajima, Jun; Zhang, Xiantao Subject: Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic Hello, unfortunately the Xenserver kernel seems to not support reading the microcode, at least it is not populated in /proc/cpuinfo. Andrew, are there any special tricks to get the version out of the Xenserver kernel ? Best regards Thimo Am 17.09.2013 09:43, schrieb Zhang, Yang Z:> Thimo E. wrote on 2013-09-17: >> Hello Yang, >> >> The problem that we are discussing here seems to come from the >> interaction of my raid controller (Areca, ARC-1212) and the hardware. >> When I enable MSI on that controller the server crashes with the >> known error messages between 1 day and 7 days. When I disable MSI I >> don''t see these crashes anymore. >> >> Andrew doesn''t have this controller but he could also observe these >> type of crashes with another MSI enabled card, I think the networc >> card. But less often, I think 2 times in the last 4 months. >> >> My cpuinfo nor dmesg show any microcode information, perhaps Xen >> hides that info ?! > ''cat /proc/cpuinfo'' will show the microcode version. > Best regards, > Yang >
On 17/09/2013 22:04, Thimo E. wrote:> Hello, > > unfortunately the Xenserver kernel seems to not support reading the > microcode, at least it is not populated in /proc/cpuinfo. > Andrew, are there any special tricks to get the version out of the > Xenserver kernel ? > > Best regards > ThimoSadly not. In Xen 4.1, the microcode detail printing was unconditionally compiled out in all cases. This behaviour changed in 4.2 (or possibly 4.3). Your best chance is probably to boot a recent Linux LiveCD. ~Andrew
Hello, I''ve looked it up in the bios, microcode version is 9 Best regards Thimo Am 18.09.2013 03:18, schrieb Zhang, Xiantao:> Probably you can use a non-Xen kernel to boot up the system, and then you can get the micro-code in cpuinfo. Thanks! > Xiantao > > -----Original Message----- > From: Thimo E. [mailto:abc@digithi.de] > Sent: Wednesday, September 18, 2013 5:05 AM > To: Zhang, Yang Z > Cc: Keir Fraser; Jan Beulich; Andrew Cooper; Dong, Eddie; Xen-develList; Nakajima, Jun; Zhang, Xiantao > Subject: Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic > > Hello, > > unfortunately the Xenserver kernel seems to not support reading the microcode, at least it is not populated in /proc/cpuinfo. > Andrew, are there any special tricks to get the version out of the Xenserver kernel ? > > Best regards > Thimo > > > Am 17.09.2013 09:43, schrieb Zhang, Yang Z: >> Thimo E. wrote on 2013-09-17: >>> Hello Yang, >>> >>> The problem that we are discussing here seems to come from the >>> interaction of my raid controller (Areca, ARC-1212) and the hardware. >>> When I enable MSI on that controller the server crashes with the >>> known error messages between 1 day and 7 days. When I disable MSI I >>> don''t see these crashes anymore. >>> >>> Andrew doesn''t have this controller but he could also observe these >>> type of crashes with another MSI enabled card, I think the networc >>> card. But less often, I think 2 times in the last 4 months. >>> >>> My cpuinfo nor dmesg show any microcode information, perhaps Xen >>> hides that info ?! >> ''cat /proc/cpuinfo'' will show the microcode version. >> Best regards, >> Yang >>