Dante Cinco
2010-Jun-28 18:22 UTC
[Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0, dom0 Linux 2.6.32.12 x86_64 pvops, and domU Linux kernel 2.6.30.1 x86_64. I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre Channel HBA to domU. After running I/Os for several hours, both dom0 and domU hang, and the Xen console shows the interrupt binding below, where IRQ 66 shows in-flight=1 and mask set (---M). What's the best way to debug this problem?

(XEN) IRQ: 66 affinity:00000000,00000000,00000000,00000001 vec:b9 type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 79(---M),
(XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000004 vec:d9 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----),
(XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000010 vec:22 type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 77(----),
(XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000040 vec:2a type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 76(----),

(XEN) 07:00.3 - dom 1 - MSIs < 69 >
(XEN) 07:00.2 - dom 1 - MSIs < 68 >
(XEN) 07:00.1 - dom 1 - MSIs < 67 >
(XEN) 07:00.0 - dom 1 - MSIs < 66 >

(XEN) MSI 66 vec=b9 fixed edge assert phys cpu dest=00000000 mask=0/0/-1
(XEN) MSI 67 vec=d9 fixed edge assert phys cpu dest=00000004 mask=0/0/-1
(XEN) MSI 68 vec=22 fixed edge assert phys cpu dest=00000002 mask=0/0/-1
(XEN) MSI 69 vec=2a fixed edge assert phys cpu dest=00000006 mask=0/0/-1

Thanks.

Dante
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Jan Beulich
2010-Jun-29 08:42 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
>>> On 28.06.10 at 20:22, Dante Cinco <dantecinco@gmail.com> wrote:
> I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0
> and dom0 Linux 2.6.32.12 x86_64 pvops and domU Linux kernel 2.6.30.1 x86_64.
> I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre
> Channel HBA to domU. After running I/Os for several hours, both dom0 and
> domU hangs and the Xen console shows the interrupt binding below where IRQ
> 66 shows in-flight=1 and mask set (---M). What's the best way to debug this
> problem?

There are potentially two problems here: One is that the guest may
fail to send the EOI notification. You would want to check whether
pirq_guest_eoi() got run after that last occurrence of the interrupt.

The more worrying part is that Xen should time out on a guest failing
to send the EOI notification, and ack the interrupt nevertheless.
Looking at the code I fail to see how the ack_APIC_irq() would get
sent in this case: non-maskable MSIs get this issued from
end_msi_irq(), but ->end doesn't get invoked from
irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
something?

Otoh I can't see how this can work reliably in the first place: Since
there's no other way to mask such interrupts, sending an ack to the
LAPIC could result in an interrupt storm. Disabling MSI on the
affected device isn't a good option either, as we know there are
devices that switch to legacy IRQ mode irreversibly in that case,
and hence the device becomes unusable (presumably until being
reset). But very likely this would still be better than hanging the
entire box; it probably would just need a more graceful timeout.

Jan
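The gap Jan describes can be pictured with a toy model. This is only a sketch of the reasoning in the message above, not Xen code: the struct, the field names, and every `toy_*` function are hypothetical, and the real `->enable`/`->end` hooks do considerably more. The model assumes the LAPIC ack for a non-maskable MSI is deferred to the end hook, and that the timeout path runs only the enable hook.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not Xen's. */
struct toy_irq {
    bool ack_deferred_to_end; /* true models the non-maskable MSI case */
    bool unmasked;            /* set by the enable hook */
    bool lapic_acked;         /* set by the end hook */
};

static void toy_enable(struct toy_irq *irq) { irq->unmasked = true; }
static void toy_end(struct toy_irq *irq)    { irq->lapic_acked = true; }

/* Interrupt handed to the guest: the ack is either immediate or deferred. */
static void toy_deliver(struct toy_irq *irq)
{
    irq->unmasked = false;
    irq->lapic_acked = !irq->ack_deferred_to_end;
}

/* Guest EOI path in this model: both hooks run. */
static void toy_guest_eoi(struct toy_irq *irq)
{
    toy_end(irq);
    toy_enable(irq);
}

/* Timeout path as read in the message above: only the enable hook runs. */
static void toy_timeout(struct toy_irq *irq)
{
    toy_enable(irq);
}

/* True while the deferred ack has still not been issued. */
static bool toy_ack_outstanding(const struct toy_irq *irq)
{
    assert(irq != 0);
    return !irq->lapic_acked;
}
```

In this model, a deferred-ack interrupt that only ever sees `toy_timeout()` keeps its ack outstanding, while one whose ack is immediate is unaffected by the missing end call.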
Bruce Edge
2010-Aug-17 17:28 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On Tue, Jun 29, 2010 at 1:42 AM, Jan Beulich <JBeulich@novell.com> wrote:
> There are potentially two problems here: One is that the guest may
> fail to send the EOI notification. You would want to check whether
> pirq_guest_eoi() got run after that last occurrence of the interrupt.
> [...]

This is still happening. I have 2 identical boxes that were running a stress test and both hung after a few hours.
They have identical hardware and software configs, so I'll report the config for one and attach the Xen dump for both.

dom0 info:

HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz)

# cat /proc/cmdline
root=/dev/mapper/system-dom0_0 ro earlyprintk=xen loglevel=10 debug acpi=force console=hvc0,115200n8

# uname -a
Linux dpm8800-09 2.6.32.16 #1 SMP Wed Aug 4 15:38:21 PDT 2010 x86_64 GNU/Linux

The domU is an Ubuntu 10.04 kernel, 2.6.32.15+drm33.5 in hvm mode.

# xm info
host                   : dpm8800-09
release                : 2.6.32.16
version                : #1 SMP Wed Aug 4 15:38:21 PDT 2010
machine                : x86_64
nr_cpus                : 16
nr_nodes               : 2
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 2533
hw_caps                : bfebfbff:28100800:00000000:00001b40:009ce3bd:00000000:00000001:00000000
virt_caps              : hvm hvm_directio
total_memory           : 12277
free_memory            : 11631
node_to_cpu            : node0:0,2,4,6,8,10,12,14
                         node1:1,3,5,7,9,11,13,15
node_to_memory         : node0:5601
                         node1:6029
node_to_dma32_mem      : node0:3506
                         node1:0
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1-rc4
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : dom0_mem=512M dom0_max_vcpus=1 dom0_vcpus_pin=true iommu=1,passthrough,no-intremap loglvl=all loglvl_guest=all loglevl=10 debug apic=on apic_verbosity=verbose extra_guest_irqs=80 com1=115200,8n1 console=com1 console_to_ring xen-pciback.permissive acpi=force numa=on
cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
cc_compile_by          : bedge
cc_compile_domain      : lsi.com
cc_compile_date        : Sun Aug 1 09:44:29 PDT 2010
xend_config_format     : 4

This device (as well as a few more of these) is passed through via pciback:

dpm8800-09:~# lspci | grep 10:
10:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
10:00.1 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
10:00.2 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08) <- in both cases it's this device that loses the interrupt in flight

10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
        Flags: bus master, fast devsel, latency 0, IRQ 5
        I/O ports at a800 [size=256]
        I/O ports at ac00 [size=256]
        Memory at fbdc0000 (64-bit, non-prefetchable) [size=32K]
        Capabilities: [50] Power Management version 3
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
        Capabilities: [70] Express Endpoint, MSI 01
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [100] Advanced Error Reporting <?>

From host dpm8800-10:
(XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:94 type=PCI-MSI status=00000050 in-flight=0 domain-list=2:126(----),
(XEN) IRQ: 134 affinity:00000000,00000000,00000000,00000001 vec:d4 type=PCI-MSI status=00000050 in-flight=1 domain-list=2:125(---M),
(XEN) IRQ: 135 affinity:00000000,00000000,00000000,00000004 vec:9c type=PCI-MSI status=00000010 in-flight=0 domain-list=2:124(----),

From host dpm8800-09:
(XEN) IRQ: 131 affinity:00000000,00000000,00000000,00002000 vec:7f type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 62(----),
(XEN) IRQ: 132 affinity:00000000,00000000,00000000,00000001 vec:dd type=PCI-MSI status=00000010 in-flight=1 domain-list=2:127(---M),
(XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:3e type=PCI-MSI status=00000010 in-flight=0 domain-list=2:126(----),

This time both cases correspond to 10:00.3:

(XEN) 10:00.3 - dom 2 - MSIs < 132 >
(XEN) MSI 132 vec=dc fixed edge assert phys cpu dest=00000010 mask=0/0/-1

Let me know if there's anything else I can provide to assist in diagnosing this problem.

Thanks

-Bruce
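When comparing dumps from several hosts, the stuck entry is the one with a non-zero in-flight count and a trailing M in the four-character flag field, which the thread reads as "mask set". A minimal sketch of a scanner for one such line follows. It assumes the line layout shown in the dumps above; the struct and function names are illustrative and not part of any Xen tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Fields pulled from one "(XEN) IRQ:" line; names are illustrative. */
struct irq_line {
    int  irq;
    int  in_flight;
    bool masked;    /* last flag character inside "(....)" is 'M' */
};

/* Parse one dump line; returns false if it does not match the layout. */
static bool parse_irq_line(const char *line, struct irq_line *out)
{
    const char *p;
    const char *lparen;

    assert(line != NULL && out != NULL);

    p = strstr(line, "IRQ:");
    if (p == NULL || sscanf(p, "IRQ: %d", &out->irq) != 1)
        return false;

    p = strstr(line, "in-flight=");
    if (p == NULL || sscanf(p, "in-flight=%d", &out->in_flight) != 1)
        return false;

    /* The flag field is the last parenthesised group, four characters wide. */
    lparen = strrchr(line, '(');
    if (lparen == NULL || strlen(lparen) < 6 || lparen[5] != ')')
        return false;

    out->masked = (lparen[4] == 'M');
    return true;
}

/* The pattern reported in this thread: in flight and masked at once. */
static bool looks_stuck(const struct irq_line *l)
{
    return l->in_flight > 0 && l->masked;
}
```

Feeding each line of a saved console log through `parse_irq_line()` and printing the entries for which `looks_stuck()` is true would single out IRQ 134 on dpm8800-10 and IRQ 132 on dpm8800-09 from the dumps above.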
Keir Fraser
2010-Aug-17 18:01 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On 17/08/2010 18:28, "Bruce Edge" <bruce.edge@gmail.com> wrote:
> On Tue, Jun 29, 2010 at 1:42 AM, Jan Beulich <JBeulich@novell.com> wrote:
>> The more worrying part is that Xen should time out on a guest failing
>> to send the EOI notification, and ack the interrupt nevertheless.
>> Looking at the code I fail to see how the ack_APIC_irq() would get
>> sent in this case: non-maskable MSIs get this issued from
>> end_msi_irq(), but ->end doesn't get invoked from
>> irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
>> something?

I don't think that timer logic is designed to handle non-maskable MSIs, only maskable ones. It ought to be not too hard to fix it up for non-maskable ones too by issuing the ->end() call from the timer handler?

 -- Keir
Jan Beulich
2010-Aug-18 08:47 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
>>> On 17.08.10 at 20:01, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> I don't think that timer logic is designed to handle non-maskable MSIs, only
> maskable ones. It ought to be not too hard to fix it up for non-maskable
> ones too by issuing the ->end() call from the timer handler?

Yes, that was what I was trying to hint at, but I wasn't sure whether calling ->end() here has any unintended side effects and/or requires any extra care (like preventing a subsequent guest-initiated EOI from calling ->end() again).

While looking at this I came across another thing I don't understand: __pirq_guest_eoi(), for the ACKTYPE_EOI case, calls __set_eoi_ready() in a cpu_test_and_clear() conditional, but __set_eoi_ready() bails out if it finds !cpu_test_and_clear() on the same bitmap - what's the point of calling __set_eoi_ready() here then (or what am I missing)?

Jan
Keir Fraser
2010-Aug-18 09:40 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On 18/08/2010 09:47, "Jan Beulich" <JBeulich@novell.com> wrote:
> Yes, that was what I was trying to hint at, but I wasn't sure whether
> calling ->end() here has any unintended side effects and/or requires
> any extra care (like preventing a subsequent guest initiated EOI to
> call ->end() again).

Oh you can't naively call ->end() from the time-out handler. You would need to do something like this in irq_guest_eoi_timer_fn:

    spin_lock(&desc->lock);
    if ( (desc->status & IRQ_GUEST) &&
         (action->ack_type == ACKTYPE_EOI) ) {
        cpu_eoi_map = action->cpu_eoi_map;
        spin_unlock(&desc->lock);
        on_selected_cpus(&cpu_eoi_map, set_eoi_ready, desc, 0);
        spin_lock(&desc->lock);
    }
    _irq_guest_eoi(desc);
    spin_unlock(&desc->lock);

I don't think the IRQ_GUEST_EOI_PENDING flag or any of that stuff is needed for the ACKTYPE_EOI case. I'd make the handling of that, calling of ->disable/->enable and so on, dependent on ACKTYPE_NONE.

> While looking at this I came across another thing I don't understand:
> __pirq_guest_eoi(), for the ACKTYPE_EOI case, calls __set_eoi_ready()
> in a cpu_test_and_clear() conditional, but __set_eoi_ready() bails
> out if it finds !cpu_test_and_clear() on the same bitmap - what's the
> point of calling __set_eoi_ready() here then (or what am I missing)?

__pirq_guest_eoi() acts on a private on-stack copy of cpu_eoi_map. This is because on_selected_cpus() cannot be called with desc->lock held. But as soon as desc->lock is released, the desc->action structure can be freed by another CPU, so it would be invalid to reference action->cpu_eoi_map directly after desc->lock is released.

 -- Keir
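The on-stack copy Keir describes is a general snapshot-under-lock pattern: read the shared field while the lock is held, drop the lock, then work only on the private copy. The sketch below shows that pattern in user-space C with a pthread mutex standing in for desc->lock. All types and functions are hypothetical stand-ins (a 64-bit word replaces the CPU mask, and `notify_cpus()` merely counts bits where Xen would call on_selected_cpus()); it is not the hypervisor's code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the descriptor and its action. */
struct toy_action {
    uint64_t cpu_eoi_map;       /* one bit per CPU that owes an EOI */
};

struct toy_desc {
    pthread_mutex_t    lock;
    struct toy_action *action;  /* may be freed once lock is dropped */
};

/* Stand-in for the cross-CPU call; must run WITHOUT desc->lock held.
 * Returns how many CPUs would have been notified. */
static int notify_cpus(const uint64_t *map)
{
    int n = 0;
    for (uint64_t m = *map; m != 0; m &= m - 1)
        n++;
    return n;
}

static int eoi_with_snapshot(struct toy_desc *desc)
{
    uint64_t snapshot;

    assert(desc != NULL);

    pthread_mutex_lock(&desc->lock);
    if (desc->action == NULL) {
        pthread_mutex_unlock(&desc->lock);
        return 0;
    }
    snapshot = desc->action->cpu_eoi_map;  /* private on-stack copy */
    pthread_mutex_unlock(&desc->lock);

    /* From here on desc->action must not be dereferenced: another
     * thread may already have freed it. Only the copy is used. */
    return notify_cpus(&snapshot);
}
```

The cost of this design is that the copy can be stale by the time it is used, which is why a callee working from it has to re-check the live state under the lock before acting.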
Bruce Edge
2010-Aug-19 13:42 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On Wed, Aug 18, 2010 at 2:40 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> Oh you can't naively call ->end() from the time-out handler. You would need
> to do something like this in irq_guest_eoi_timer_fn:
> [...]

Is there any more information that I can provide that would be helpful in diagnosing the direct cause and the appropriate fix? Possibly adding instrumentation or trace code to detect the trigger conditions?
This is very repeatable on our target systems after a few hours of load.

Thanks

-Bruce
Keir Fraser
2010-Aug-19 15:48 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On 19/08/2010 14:42, "Bruce Edge" <bruce.edge@gmail.com> wrote:
> Is there any more information that I can provide that would be helpful in
> diagnosing the direct cause and the appropriate fix?
> Possibly adding instrumentation or trace code to detect the trigger
> conditions?
> This is very repeatable on our target systems after a few hours of load.

You can try the attached Xen patch, which should clear out the in-flight interrupt after a short timeout. It might kick things back into action, and is probably a fix we should have in the tree anyhow.

 -- Keir
Bruce Edge
2010-Aug-20 23:25 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On Thu, Aug 19, 2010 at 8:48 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> You can try the attached Xen patch which should clear out the in-flight
> interrupt after a short timeout. It might kick things back into action, and
> is probably a fix we should have in the tree anyhow.

Thanks Keir, this patch definitely works.

It's running on 2 machines under load, well past the point where they have consistently failed in the past.

-Bruce
Keir Fraser
2010-Aug-21 06:02 UTC
Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On 21/08/2010 00:25, "Bruce Edge" <bruce.edge@gmail.com> wrote:
> Thanks Keir, this patch definitely works.
>
> It's running on 2 machines under load, well past where they consistently have
> failed in the past.

Good to know. It's too late for 4.0.1 unfortunately, as I really want to get that out next week whatever, but I'll put it in xen-unstable and it can be considered for 4.0.2 as well.

 Thanks,
 Keir