While I haven't heard whether others were able to reproduce this, I have now been able to narrow it down to a simple operation. On the system I'm testing with, handling the interrupt from the SCSI controller triggers a simultaneous interrupt from one of the USB controllers, which (judging from native execution) is known not to generate any interrupts. Hence it was possible to BUG() the box the first time such an interrupt appears. This happens when mask_and_ack_level_ioapic_irq() masks the irq from the first SCSI controller (pin 0 of IOAPIC 3). No matter how long a delay I insert before calling mask_IO_APIC_irq(), the other interrupt (pin 16 of IOAPIC 0) becomes visible (in the redirection table's irr bit) immediately after the write that sets the mask bit for the first interrupt.

Obviously I am lost here - I have no way to tell why writing one IOAPIC's redir entry affects an interrupt routed through a completely different IOAPIC. Nevertheless it is clear that the problem is unique to Xen, because native Linux doesn't try to mask the IRQ.

Besides the massive numbers of spurious interrupts that lead to IRQs getting shut off, I'm also seeing occasional ones on other interrupt lines, which must have a different cause. I wonder whether this is related to attempts to do irq balancing (which doesn't seem to work at all under Xen - all device interrupts are always seen bound to vcpu 0).

While looking at all this, I also found that CONFIG_PCI_MSI not being supported under Xen is a significant limitation, as some PCI Express devices may not work at all without it (on the box I'm working with, all bridges supporting hotplug). Are there any plans to get this working?

Further, while this may also be a native Linux problem, I wonder whether it is appropriate for assign_irq_vector to use no serialization at all even though it accesses static variables (in the xenlinux case, the static current_vector can actually be easily converted into an automatic variable).

Finally, and largely unrelated, I wonder whether xen_create_contiguous_region() (or its caller(s)) shouldn't special-case order being zero, as it seems pointless to go through numerous hypercalls in that case.

Thanks for any comments/explanations,
Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
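
The serialization concern about assign_irq_vector can be illustrated with a small user-space sketch. It is not the kernel's code and does not follow the kernel's vector allocation policy: the names (sketch_assign_irq_vector, irq_to_vector, next_vector), the table size, and the vector range are all assumptions made for illustration, and a C11 atomic_flag spin lock stands in for a kernel spinlock. The only point it tries to show is that the "look up, then allocate from a static counter" sequence happens under one lock, so two callers cannot hand out the same vector.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_IRQS      16    /* hypothetical table size */
#define SKETCH_FIRST_VECTOR 0x20  /* hypothetical vector range */
#define SKETCH_LAST_VECTOR  0x2f

/* Stand-in for a kernel spinlock protecting the static state below. */
static atomic_flag vector_lock = ATOMIC_FLAG_INIT;
static int irq_to_vector[SKETCH_NR_IRQS];   /* 0 means "not assigned yet" */
static int next_vector = SKETCH_FIRST_VECTOR;

static void sketch_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&vector_lock,
                                             memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void sketch_unlock(void)
{
    atomic_flag_clear_explicit(&vector_lock, memory_order_release);
}

/* Returns the vector for irq, allocating one on first use; -1 on failure. */
int sketch_assign_irq_vector(int irq)
{
    int vector;

    if (irq < 0 || irq >= SKETCH_NR_IRQS)
        return -1;

    sketch_lock();
    vector = irq_to_vector[irq];
    if (vector == 0) {
        if (next_vector > SKETCH_LAST_VECTOR) {
            vector = -1;                 /* range exhausted */
        } else {
            vector = next_vector++;      /* read-modify-write under the lock */
            assert(vector >= SKETCH_FIRST_VECTOR);
            irq_to_vector[irq] = vector;
        }
    }
    sketch_unlock();
    return vector;
}
```

Calling sketch_assign_irq_vector() twice for the same irq returns the same vector, and the lookup plus allocation is a single critical section. Whether a lock is actually needed in the real function depends on which contexts can call it concurrently, which this sketch does not model.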

>From: Jan Beulich
>Sent: 12 April 2006 21:37
>
>Finally, sufficiently unrelated, I wonder whether
>xen_create_contiguous_region() (or its caller(s)) shouldn't special
>case order being zero, as it seems pointless to go through numerous
>hypercalls in that case.
>

The problem is that the corresponding gmfns are not necessarily contiguous. Maybe, instead of setting order to non-zero, we can set nr_extents to a value greater than 1 by allocating a buffer that contains all the gmfns to be released.

Another abstraction could be to incorporate set_phys_to_machine into decrease/increase_reservation, and thus allow xen_create_contiguous_region to be used in auto_translated mode, if we want that translated mode to work for backends as it does for xen/ia64...

Thanks,
Kevin
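
Kevin's batching suggestion can be sketched roughly as follows, in plain user-space C. Everything here is an assumption for illustration: struct sketch_reservation is a hypothetical stand-in and not the real Xen interface, p2m is a hypothetical pfn-to-frame lookup table, and no hypercall is made. The sketch only shows the shape of the idea: gather the (possibly non-contiguous) frames for all 2^order pages into one buffer and describe them as a single request of order-0 extents, instead of issuing one request per page.

```c
#include <assert.h>

/* Hypothetical stand-in for a reservation request; not the real Xen ABI. */
struct sketch_reservation {
    unsigned long *extent_start;  /* buffer of frame numbers */
    unsigned long  nr_extents;    /* number of entries in the buffer */
    unsigned int   extent_order;  /* each extent covers 2^extent_order pages */
};

/*
 * Collect the frame behind each of the 2^order pages starting at first_pfn
 * and describe them as one batched request of single-page extents.
 * frames must have room for 2^order entries.
 */
void sketch_build_batch(const unsigned long *p2m, unsigned long first_pfn,
                        unsigned int order, unsigned long *frames,
                        struct sketch_reservation *req)
{
    unsigned long i, count;

    assert(order < sizeof(unsigned long) * 8);
    count = 1UL << order;

    /* The frames need not be contiguous, so each one is listed explicitly. */
    for (i = 0; i < count; i++)
        frames[i] = p2m[first_pfn + i];

    req->extent_start = frames;
    req->nr_extents   = count;
    req->extent_order = 0;
}
```

With order zero this degenerates to a single one-entry request, which is the case Jan suggests special-casing.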

Does this definitely happen immediately after writing to the IO-APIC? There is an ack_APIC_irq() not many instructions after that, and it would make much more sense for the IRR flag to be set after that. :-)

 -- Keir

On 12 Apr 2006, at 14:37, Jan Beulich wrote:

> While I haven't heard anything whether others were able to reproduce
> this, I was now able to nail this down to a simple operation: On the
> system I'm testing with I was able to identify that the handling of
> the interrupt from the SCSI controller triggers a simultaneous
> interrupt from one of the USB controllers, of which (by way of looking
> at the native execution) it is known that it doesn't generate any
> interrupts. Hence it was possible to BUG() the box the first time
> such an interrupt appears. This happens when
> mask_and_ack_level_ioapic_irq() masks the irq from the first SCSI
> controller (pin 0 of IOAPIC 3). No matter how large delays I insert
> before calling mask_IO_APIC_irq(), the other interrupt (pin 16 of
> IOAPIC 0) becomes visible (in the redirection table's irr bit)
> immediately after the write that sets the mask bit for the first
> interrupt.
>
> Obviously I am lost here - I have no way to tell why writing on
> IOAPIC's redir entry affects an interrupt routed through a completely
> different IOAPIC. Nevertheless it is clear that the problem is unique
> to Xen, because native Linux doesn't try to mask the IRQ.

Hi,

On Wed, 2006-04-12 at 15:37 +0200, Jan Beulich wrote:

> While I haven't heard anything whether others were able to reproduce
> this, I was now able to nail this down to a simple operation: On the
> system I'm testing with I was able to identify that the handling of
> the interrupt from the SCSI controller triggers a simultaneous
> interrupt from one of the USB controllers

I've been seeing a spurious irq16 on a USB port with no devices attached under Xen for some time now, resulting in

  irq 16: nobody cared

on boot and

  Disabling IRQ #16

ten minutes later, once it reaches 100,000 unhandled interrupts.

With today's build, including Keir's latest APIC changes, not only is serial console input fixed (thanks Keir!), but the spurious IRQ also seems to be solved. (At least, at the time of writing that box is showing only 15,000 interrupts on irq16 after an hour instead of 100,000 after 10 minutes, so I should know in another 5 hours whether or not it actually still gets disabled at 100,000.)

--Stephen

Hi,

On Wed, 2006-04-12 at 13:48 -0400, Stephen C. Tweedie wrote:

> > While I haven't heard anything whether others were able to reproduce
> > this, I was now able to nail this down to a simple operation: On the
> > system I'm testing with I was able to identify that the handling of
> > the interrupt from the SCSI controller triggers a simultaneous
> > interrupt from one of the USB controllers

Turns out that I'm seeing the same. On the box that was showing

> irq 16: nobody cared
> Disabling IRQ #16

it seems that irq16 is indeed getting dispatched every time an interrupt arrives for the disk. In this case it's SATA, ICH5 on irq17.

> (At least, at the time of writing that box is showing only
> at 15,000 interrupts on irq16 after an hour instead of 100,000 after 10
> minutes

This turned out to be related to disk activity, and a heavy disk load rapidly reproduced the problem even with the latest Xen kernel.

--Stephen

>On Wed, 2006-04-12 at 13:48 -0400, Stephen C. Tweedie wrote:
>
>> > While I haven't heard anything whether others were able to reproduce
>> > this, I was now able to nail this down to a simple operation: On the
>> > system I'm testing with I was able to identify that the handling of
>> > the interrupt from the SCSI controller triggers a simultaneous
>> > interrupt from one of the USB controllers
>
>Turns out that I'm seeing the same. On the box that was showing
>
>> irq 16: nobody cared
>> Disabling IRQ #16
>
>it seems that the irq16 is indeed getting dispatched every time an
>interrupt arrives for the disk. In this case it's sata, ICH5 on irq17.
>
>> (At least, at the time of writing that box is showing only
>> at 15,000 interrupts on irq16 after an hour instead of 100,000 after 10
>> minutes
>
>this turned out to be related to disk activity, and a heavy disk load
>rapidly reproduced the problem even with the latest Xen kernel.
>

We see similar behavior with network traffic: irq16 (attached to USB) gets dispatched every time the NIC interrupt is triggered.

Thanks,
Prafulla

>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 12.04.06 19:21:44 >>>
>This happens definitely immediately after writing to the IO-APIC? There
>is an ack_APIC_irq() not many instructions after that, and it would
>make much more sense for the IRR flag to be set after that. :-)

Yes, definitely there. I added (long) delays before masking and after masking (and also after ack-ing). The delay routine spins on reading the (expected) other redir entry, and its first iteration after the masking already sees the offending irr bit turned on.

Jan
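
A delay routine of the kind described, one that spins while re-reading a redirection entry and reports the first iteration at which the irr bit shows up, could look roughly like the sketch below. This is not Jan's actual debugging code. The bit positions are those of the standard IO-APIC redirection entry's low dword (Remote IRR at bit 14, mask at bit 16); everything else, including the callback type, the replay reader that stands in for a real register read, and all names, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low dword of an IO-APIC redirection table entry. */
#define REDIR_REMOTE_IRR (1u << 14)
#define REDIR_MASKED     (1u << 16)

static inline int redir_irr_set(uint32_t low)
{
    return (low & REDIR_REMOTE_IRR) != 0;
}

static inline int redir_masked(uint32_t low)
{
    return (low & REDIR_MASKED) != 0;
}

/* Hypothetical accessor: returns the low dword of the watched entry. */
typedef uint32_t (*redir_read_fn)(void *ctx);

/*
 * Spin for up to max_iters reads; return the index of the first read that
 * sees the irr bit set, or -1 if it never appears.
 */
long spin_watching_irr(redir_read_fn read_low, void *ctx, long max_iters)
{
    long i;

    assert(read_low);
    for (i = 0; i < max_iters; i++)
        if (redir_irr_set(read_low(ctx)))
            return i;
    return -1;
}

/* Test double: replays recorded values instead of touching hardware. */
struct replay {
    const uint32_t *vals;
    size_t n, pos;
};

uint32_t replay_read(void *ctx)
{
    struct replay *r = ctx;
    uint32_t v;

    assert(r->n > 0);
    v = r->vals[r->pos];
    if (r->pos + 1 < r->n)
        r->pos++;           /* keep returning the last value at the end */
    return v;
}
```

A return value of 0 corresponds to the observation reported above: the very first read after the masking write already sees the bit set.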