While awaiting final confirmation, all aspects of an issue with certain bnx2 driven NICs known at this point suggest that the disabling of MSI when a second interrupt gets delivered without the first having progressed far enough in its handling to at least mask the corresponding event channel would cause the NIC to stop operating. We''re being told that it is a chip limitation that it can''t recover from MSI getting disabled and then re-enabled - it would switch into INTA mode when MSI gets disabled, but not switch back to MSI mode when MSI gets re-enabled. While this arguably is a limitation that must be worked around in the driver, the logic here raises questions: Since other devices apparently have similar behavior in not disabling IRQ delivery upon IRQ assertion, wouldn''t it to be expected that disabling IRQs for a millisecond could have severe impact on throughput of the device? Also, is it correct at all to disable MSI on a device - i.e., with the affected device going into INTA mode, don''t we risk floods of spurious interrupts if the IO-APIC RTE the device is connected to is shared and happens to be unmasked? Apart from that - wouldn''t the stop_timer() call be carried out more correctly before calling desc->handler->disable(), so that we can be sure the disable actually has an effect? Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 09:42 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
A perfectly reasonable fix if you are not worried about guest-initiated irq storms (e.g. because all devices are controlled by dom0) would be to remove the eoi_timer logic. Otherwise we could relax it some (e.g., require N IRQs to get stacked up rather than just 1; or add explicit rate limiting). We only disable MSI when the device does not support masking. Perhaps we should make disable/enable no-ops in that case? -- Keir On 13/11/08 09:22, "Jan Beulich" <jbeulich@novell.com> wrote:> While awaiting final confirmation, all aspects of an issue with certain bnx2 > driven NICs known at this point suggest that the disabling of MSI when a > second interrupt gets delivered without the first having progressed far > enough in its handling to at least mask the corresponding event channel > would cause the NIC to stop operating. We''re being told that it is a chip > limitation that it can''t recover from MSI getting disabled and then > re-enabled - it would switch into INTA mode when MSI gets disabled, but > not switch back to MSI mode when MSI gets re-enabled. > > While this arguably is a limitation that must be worked around in the driver, > the logic here raises questions: Since other devices apparently have > similar behavior in not disabling IRQ delivery upon IRQ assertion, wouldn''t > it to be expected that disabling IRQs for a millisecond could have severe > impact on throughput of the device? Also, is it correct at all to disable MSI > on a device - i.e., with the affected device going into INTA mode, don''t > we risk floods of spurious interrupts if the IO-APIC RTE the device is > connected to is shared and happens to be unmasked? > > Apart from that - wouldn''t the stop_timer() call be carried out more > correctly before calling desc->handler->disable(), so that we can be sure > the disable actually has an effect? > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-13 11:07 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.11.08 10:42 >>> >A perfectly reasonable fix if you are not worried about guest-initiated irq >storms (e.g. because all devices are controlled by dom0) would be to remove >the eoi_timer logic.No, that doesn''t seem like a good idea to me.>Otherwise we could relax it some (e.g., require N IRQs >to get stacked up rather than just 1; or add explicit rate limiting).For the main problem at hand, that would just reduce the likelihood of the device refusing to work. For the performance issue, that would be an option, as would be reducing the timeout value. However, I would also consider making EVTCHNOP_unmask clear that state, and then perhaps find a way to tell the guest that it should call this even if unmask_evtchn() finds the event channel to be bound to the local CPU. The obvious thing would be to either extend shared_info or have the guest register an address with Xen where per-event-channel overflow status would be reported by Xen.>We only disable MSI when the device does not support masking. Perhaps we >should make disable/enable no-ops in that case?Yes, but don''t we need an alternative way to avoid storms then? Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 11:16 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 13/11/08 11:07, "Jan Beulich" <jbeulich@novell.com> wrote:>> Otherwise we could relax it some (e.g., require N IRQs >> to get stacked up rather than just 1; or add explicit rate limiting). > > For the main problem at hand, that would just reduce the likelihood of the > device refusing to work. For the performance issue, that would be an > option, as would be reducing the timeout value. However, I would also > consider making EVTCHNOP_unmask clear that state, and then perhaps > find a way to tell the guest that it should call this even if unmask_evtchn() > finds the event channel to be bound to the local CPU. The obvious thing > would be to either extend shared_info or have the guest register an > address with Xen where per-event-channel overflow status would be > reported by Xen.We already have PHYSDEVOP_eoi. We can force guests to always use that. It''ll be no slower than level-triggered IO-APIC IRQs, which we''ve been using since forever with no complaints. And yeah, as an extension we could have a shared-memory pirq bitmap to indicate dynamically whether eoi is required. Quite a secondary concern though, I would say.>> We only disable MSI when the device does not support masking. Perhaps we >> should make disable/enable no-ops in that case? > > Yes, but don''t we need an alternative way to avoid storms then?We should be delaying LAPIC EOI, just as we do for level-triggered IO-APIC IRQs (in that case, because masking RTEs in some cases has stupid side effects on some damn stupid chipsets). All that logic exists, just needs plumbing in for this particular class of non-maskable MSI. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 11:22 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
Oh yes, and indeed this could be stashed at the end of shared_info. We could reserve another bitmap of, say, 1024 bits quite easily. Plenty of PIRQs! -- Keir On 13/11/08 11:16, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:> We already have PHYSDEVOP_eoi. We can force guests to always use that. It''ll > be no slower than level-triggered IO-APIC IRQs, which we''ve been using since > forever with no complaints. And yeah, as an extension we could have a > shared-memory pirq bitmap to indicate dynamically whether eoi is required. > Quite a secondary concern though, I would say._______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-13 14:53 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.11.08 12:16 >>> >On 13/11/08 11:07, "Jan Beulich" <jbeulich@novell.com> wrote: > >>> Otherwise we could relax it some (e.g., require N IRQs >>> to get stacked up rather than just 1; or add explicit rate limiting). >> >> For the main problem at hand, that would just reduce the likelihood of the >> device refusing to work. For the performance issue, that would be an >> option, as would be reducing the timeout value. However, I would also >> consider making EVTCHNOP_unmask clear that state, and then perhaps >> find a way to tell the guest that it should call this even if unmask_evtchn() >> finds the event channel to be bound to the local CPU. The obvious thing >> would be to either extend shared_info or have the guest register an >> address with Xen where per-event-channel overflow status would be >> reported by Xen. > >We already have PHYSDEVOP_eoi. We can force guests to always use that. It''ll >be no slower than level-triggered IO-APIC IRQs, which we''ve been using since >forever with no complaints. And yeah, as an extension we could have a >shared-memory pirq bitmap to indicate dynamically whether eoi is required. >Quite a secondary concern though, I would say.Avoiding the EOI query is certainly a secondary issue. What I was asking was rather a means for the guest to know whether Xen started that EOI timer, so that it could indicate to Xen to terminate it and unmask the respective IRQ. This shouldn''t require always using PHYSDEVOP_eoi, and from an abstract point of view also would belong there, but rather in unmask_evtchn(). Since it would be an obvious thing that if you unmask an event channel, you also want the underlying PIRQ unmasked, this could be a compatible addition to the existing EVTCHNOP_unmask. The only thing missing is a way for the guest to know when to actually use the hypercall based unmasking - that''s what I wanted to add a vector for.>>> We only disable MSI when the device does not support masking. Perhaps we >>> should make disable/enable no-ops in that case? >> >> Yes, but don''t we need an alternative way to avoid storms then? > >We should be delaying LAPIC EOI, just as we do for level-triggered IO-APIC >IRQs (in that case, because masking RTEs in some cases has stupid side >effects on some damn stupid chipsets). All that logic exists, just needs >plumbing in for this particular class of non-maskable MSI.Like this you mean? Lightly tested it appears to work (but not tested on a problem system, yet). --- 2008-10-27.orig/xen/arch/x86/irq.c +++ 2008-10-27/xen/arch/x86/irq.c @@ -463,14 +463,19 @@ int pirq_acktype(struct domain *d, int i /* * Edge-triggered IO-APIC and LAPIC interrupts need no final * acknowledgement: we ACK early during interrupt processing. - * MSIs are treated as edge-triggered interrupts. */ if ( !strcmp(desc->handler->typename, "IO-APIC-edge") || - !strcmp(desc->handler->typename, "local-APIC-edge") || - !strcmp(desc->handler->typename, "PCI-MSI") ) + !strcmp(desc->handler->typename, "local-APIC-edge") ) return ACKTYPE_NONE; /* + * MSIs are treated as edge-triggered interrupts, except + * when there is no proper way to mask them. + */ + if ( desc->handler == &pci_msi_type ) + return msi_maskable_irq(desc->msi_desc) ? ACKTYPE_NONE : ACKTYPE_EOI; + + /* * Level-triggered IO-APIC interrupts need to be acknowledged on the CPU * on which they were received. This is because we tickle the LAPIC to EOI. */ --- 2008-10-27.orig/xen/arch/x86/msi.c +++ 2008-10-27/xen/arch/x86/msi.c @@ -303,6 +303,13 @@ static void msix_flush_writes(unsigned i } } +int msi_maskable_irq(const struct msi_desc *entry) +{ + BUG_ON(!entry); + return entry->msi_attrib.type != PCI_CAP_ID_MSI + || entry->msi_attrib.maskbit; +} + static void msi_set_mask_bit(unsigned int irq, int flag) { struct msi_desc *entry = irq_desc[irq].msi_desc; @@ -323,8 +330,6 @@ static void msi_set_mask_bit(unsigned in mask_bits &= ~(1); mask_bits |= flag; pci_conf_write32(bus, slot, func, pos, mask_bits); - } else { - msi_set_enable(entry->dev, !flag); } break; case PCI_CAP_ID_MSIX: @@ -654,7 +659,7 @@ static int __pci_enable_msix(struct msi_ pos = pci_find_cap_offset(msi->bus, slot, func, PCI_CAP_ID_MSIX); control = pci_conf_read16(msi->bus, slot, func, msi_control_reg(pos)); nr_entries = multi_msix_capable(control); - if (msi->entry_nr > nr_entries) + if (msi->entry_nr >= nr_entries) { spin_unlock(&pdev->lock); return -EINVAL; --- 2008-10-27.orig/xen/include/asm-x86/msi.h +++ 2008-10-27/xen/include/asm-x86/msi.h @@ -97,6 +97,8 @@ struct msi_desc { int remap_index; /* index in interrupt remapping table */ }; +int msi_maskable_irq(const struct msi_desc *); + /* * Assume the maximum number of hot plug slots supported by the system is about * ten. The worstcase is that each of these slots is hot-added with a device, Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 15:06 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 13/11/08 14:53, "Jan Beulich" <jbeulich@novell.com> wrote:> Avoiding the EOI query is certainly a secondary issue. What I was asking > was rather a means for the guest to know whether Xen started that EOI > timer, so that it could indicate to Xen to terminate it and unmask the > respective IRQ. This shouldn''t require always using PHYSDEVOP_eoi, and > from an abstract point of view also would belong there, but rather in > unmask_evtchn(). Since it would be an obvious thing that if you unmask > an event channel, you also want the underlying PIRQ unmasked, this > could be a compatible addition to the existing EVTCHNOP_unmask. The > only thing missing is a way for the guest to know when to actually use > the hypercall based unmasking - that''s what I wanted to add a vector > for.PHYSDEVOP_eoi and unmask happen at the same time for pirqs. The fact that we only need this new mechanism for pirqs, and that we already have a gated hypercall for pirq eoi (and can gate it further if need be) is an argument for hanging this off PHYSDEVOP_eoi imo.>> We should be delaying LAPIC EOI, just as we do for level-triggered IO-APIC >> IRQs (in that case, because masking RTEs in some cases has stupid side >> effects on some damn stupid chipsets). All that logic exists, just needs >> plumbing in for this particular class of non-maskable MSI. > > Like this you mean? Lightly tested it appears to work (but not tested on a > problem system, yet).Yes, a patch like this would be a good thing. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-13 15:28 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.11.08 16:06 >>> >On 13/11/08 14:53, "Jan Beulich" <jbeulich@novell.com> wrote: > >> Avoiding the EOI query is certainly a secondary issue. What I was asking >> was rather a means for the guest to know whether Xen started that EOI >> timer, so that it could indicate to Xen to terminate it and unmask the >> respective IRQ. This shouldn''t require always using PHYSDEVOP_eoi, and >> from an abstract point of view also would belong there, but rather in >> unmask_evtchn(). Since it would be an obvious thing that if you unmask >> an event channel, you also want the underlying PIRQ unmasked, this >> could be a compatible addition to the existing EVTCHNOP_unmask. The >> only thing missing is a way for the guest to know when to actually use >> the hypercall based unmasking - that''s what I wanted to add a vector >> for. > >PHYSDEVOP_eoi and unmask happen at the same time for pirqs. The fact that we >only need this new mechanism for pirqs, and that we already have a gated >hypercall for pirq eoi (and can gate it further if need be) is an argument >for hanging this off PHYSDEVOP_eoi imo.But then there''d be a hypercall for each MSI instance, most of the time without any real need. With a high interrupt rate I''m afraid this does matter. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 16:21 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 13/11/08 15:28, "Jan Beulich" <jbeulich@novell.com> wrote:>> PHYSDEVOP_eoi and unmask happen at the same time for pirqs. The fact that we >> only need this new mechanism for pirqs, and that we already have a gated >> hypercall for pirq eoi (and can gate it further if need be) is an argument >> for hanging this off PHYSDEVOP_eoi imo. > > But then there''d be a hypercall for each MSI instance, most of the time > without any real need. With a high interrupt rate I''m afraid this does > matter.I don''t understand what you mean. There is a one-to-one-one relationship between MSIs, PIRQs, and event channels. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 16:22 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 13/11/08 16:21, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:>> But then there''d be a hypercall for each MSI instance, most of the time >> without any real need. With a high interrupt rate I''m afraid this does >> matter. > > I don''t understand what you mean. There is a one-to-one-one relationship > between MSIs, PIRQs, and event channels.Also I''ll add we currently do a hypercall for every level-triggered IO-APIC IRQ, which was really all we supported until recently. Seemed to work well enough performance-wise in that case. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-13 16:43 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.11.08 17:22 >>> >On 13/11/08 16:21, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote: > >>> But then there''d be a hypercall for each MSI instance, most of the time >>> without any real need. With a high interrupt rate I''m afraid this does >>> matter. >> >> I don''t understand what you mean. There is a one-to-one-one relationship >> between MSIs, PIRQs, and event channels.Up to now, MSI didn''t require an EOI, and devices that support masking (in particular all MSI-X ones) wouldn''t generally require an EOI even with the patch send earlier. What you propose would make them all require an EOI all of the sudden, despite them needing hypervisor assistance only when the interrupt got masked.>Also I''ll add we currently do a hypercall for every level-triggered IO-APIC >IRQ, which was really all we supported until recently. Seemed to work well >enough performance-wise in that case.While that may be correct (I doubt anyone measured the throughput difference - really, there was nothing to measure in the IO-APIC case as there was no alternative to doing an EOI hypercall), I don''t view this as a valid argument. If we can do with less hypercalls, we should. And this especially when using a feature (MSI) the particular goal of which is to improve performance. Otherwise, the only reason for having MSI support would be for devices that don''t support INTA (which presumably aren''t that many). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-13 16:50 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 13/11/08 16:43, "Jan Beulich" <jbeulich@novell.com> wrote:> Up to now, MSI didn''t require an EOI, and devices that support masking (in > particular all MSI-X ones) wouldn''t generally require an EOI even with the > patch send earlier. What you propose would make them all require an EOI > all of the sudden, despite them needing hypervisor assistance only when > the interrupt got masked. > >> Also I''ll add we currently do a hypercall for every level-triggered IO-APIC >> IRQ, which was really all we supported until recently. Seemed to work well >> enough performance-wise in that case.So we''d add a pirq-indexed bitmap to mitigate that. Whether we use PHYSDEVOP_irq_eoi or EVTCHNOP_unmask, we need a new shared-memory bitmap, right? Might as well use irq_eoi and index by pirq, I''d say.> While that may be correct (I doubt anyone measured the throughput > difference - really, there was nothing to measure in the IO-APIC case as > there was no alternative to doing an EOI hypercall), I don''t view this as a > valid argument. If we can do with less hypercalls, we should. And this > especially when using a feature (MSI) the particular goal of which is to > improve performance. Otherwise, the only reason for having MSI support > would be for devices that don''t support INTA (which presumably aren''t > that many).More likely it''s to reduce pin counts and hence production costs. :-) Still, indeed, fewer hypercalls is better in general, I would have to agree. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-14 09:28 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 13.11.08 17:50 >>> >On 13/11/08 16:43, "Jan Beulich" <jbeulich@novell.com> wrote: > >> Up to now, MSI didn''t require an EOI, and devices that support masking (in >> particular all MSI-X ones) wouldn''t generally require an EOI even with the >> patch send earlier. What you propose would make them all require an EOI >> all of the sudden, despite them needing hypervisor assistance only when >> the interrupt got masked. >> >>> Also I''ll add we currently do a hypercall for every level-triggered IO-APIC >>> IRQ, which was really all we supported until recently. Seemed to work well >>> enough performance-wise in that case. > >So we''d add a pirq-indexed bitmap to mitigate that. Whether we use >PHYSDEVOP_irq_eoi or EVTCHNOP_unmask, we need a new shared-memory bitmap, >right? Might as well use irq_eoi and index by pirq, I''d say.Hmm, I''m still not convinced: With what you propose, it''s unclear to me who would when clear the bit in that bitmap for the ''temporarily masked'' case. Anyway, unless you get to implement your version earlier (and thus convince me that things will work out correctly), I''ll try to get implemented what I would think should be appropriate here once I find time to do so. The other concern I have (as a consequence of the NR_IRQS related discussion) is that adding a NR_IRQS (or NR_PIRQS) indexed bitmap to shared_info seems problematic wrt forward compatibility: You hinted at making the value build-time configurable, and even if it remained a manifest constant, any value (you suggested 1024) would only call for being too small at some future point in time (as will unavoidably be the case for the number of event channels). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-14 09:42 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 14/11/08 09:28, "Jan Beulich" <jbeulich@novell.com> wrote:>> So we''d add a pirq-indexed bitmap to mitigate that. Whether we use >> PHYSDEVOP_irq_eoi or EVTCHNOP_unmask, we need a new shared-memory bitmap, >> right? Might as well use irq_eoi and index by pirq, I''d say. > > Hmm, I''m still not convinced: With what you propose, it''s unclear to me who > would when clear the bit in that bitmap for the ''temporarily masked'' case. > Anyway, unless you get to implement your version earlier (and thus > convince me that things will work out correctly), I''ll try to get implemented > what I would think should be appropriate here once I find time to do so.Perhaps if we go your route we can make PHYSDEVOP_irq_eoi obsolete? It''s only really called where we also do an unmask, and it''s pointless to have two hypercalls where one will do. So a guest that detects the new bitmap could then know it only needs to unmask-by-hypercall, rather than use PHYSDEVOP_irq_eoi at all. That would at least make me happier about your approach. :-) -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-14 13:00 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 14.11.08 10:42 >>> >On 14/11/08 09:28, "Jan Beulich" <jbeulich@novell.com> wrote: > >>> So we''d add a pirq-indexed bitmap to mitigate that. Whether we use >>> PHYSDEVOP_irq_eoi or EVTCHNOP_unmask, we need a new shared-memory bitmap, >>> right? Might as well use irq_eoi and index by pirq, I''d say. >> >> Hmm, I''m still not convinced: With what you propose, it''s unclear to me who >> would when clear the bit in that bitmap for the ''temporarily masked'' case. >> Anyway, unless you get to implement your version earlier (and thus >> convince me that things will work out correctly), I''ll try to get implemented >> what I would think should be appropriate here once I find time to do so. > >Perhaps if we go your route we can make PHYSDEVOP_irq_eoi obsolete? It''s >only really called where we also do an unmask, and it''s pointless to have >two hypercalls where one will do. So a guest that detects the new bitmap >could then know it only needs to unmask-by-hypercall, rather than use >PHYSDEVOP_irq_eoi at all.Yes, folding the potentially two hypercalls into one was a parallel idea. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-24 16:34 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 14.11.08 10:42 >>> >On 14/11/08 09:28, "Jan Beulich" <jbeulich@novell.com> wrote: >Perhaps if we go your route we can make PHYSDEVOP_irq_eoi obsolete? It''s >only really called where we also do an unmask, and it''s pointless to have >two hypercalls where one will do. So a guest that detects the new bitmap >could then know it only needs to unmask-by-hypercall, rather than use >PHYSDEVOP_irq_eoi at all.Perhaps the other way around: Make PHYSDEVOP_eoi imply an unmask, as that''s what is always happening (whereas not every unmask also wants an EOI to be signaled). Below is a draft (compile-tested only) patch that, before coding the guest side, I''d appreciate to get comments on - especially if it appears reasonable to be done that way, if it meets your naming and coding preferences (I''m pretty sure it won''t), and of course whether it''s obviously broken in some respect. Thanks, Jan Index: 2008-11-20/xen/arch/x86/domain.c ==================================================================--- 2008-11-20.orig/xen/arch/x86/domain.c 2008-11-20 08:56:58.000000000 +0100 +++ 2008-11-20/xen/arch/x86/domain.c 2008-11-24 15:03:21.000000000 +0100 @@ -1823,6 +1823,12 @@ int domain_relinquish_resources(struct d unmap_vcpu_info(v); } + if ( d->arch.pirq_eoi_mfn ) + { + put_page_and_type(mfn_to_page(d->arch.pirq_eoi_mfn)); + d->arch.pirq_eoi_mfn = 0; + } + d->arch.relmem = RELMEM_xen; /* fallthrough */ Index: 2008-11-20/xen/arch/x86/irq.c ==================================================================--- 2008-11-20.orig/xen/arch/x86/irq.c 2008-11-20 10:13:47.000000000 +0100 +++ 2008-11-20/xen/arch/x86/irq.c 2008-11-24 17:25:28.000000000 +0100 @@ -18,6 +18,7 @@ #include <xen/iommu.h> #include <asm/msi.h> #include <asm/current.h> +#include <asm/flushtlb.h> #include <public/physdev.h> /* opt_noirqbalance: If true, software IRQ balancing/affinity is disabled. */ @@ -206,16 +207,63 @@ struct pending_eoi { static DEFINE_PER_CPU(struct pending_eoi, pending_eoi[NR_VECTORS]); #define pending_eoi_sp(p) ((p)[NR_VECTORS-1].vector) +#ifdef __i386__ +/* Cannot use map_domain_page(). */ +static inline unsigned long *map_pirq_eoi(struct domain *d) +{ + void *v = (void *)fix_to_virt(FIX_PIRQ_EOI); + + __set_fixmap(FIX_PIRQ_EOI, d->arch.pirq_eoi_mfn, __PAGE_HYPERVISOR); + flush_tlb_one_local(v); + return v; +} + +static inline void flush_pirq_eoi(void) +{ + __set_fixmap(FIX_PIRQ_EOI, 0, 0); + flush_tlb_one_local(fix_to_virt(FIX_PIRQ_EOI)); +} +#else +# define map_pirq_eoi(d) ((unsigned long *)mfn_to_virt((d)->arch.pirq_eoi_mfn)) +# define flush_pirq_eoi() +#endif + +static inline void set_pirq_eoi(struct domain *d, unsigned int irq) +{ + if ( d->arch.pirq_eoi_mfn ) + set_bit(irq, map_pirq_eoi(d)); +} + +static inline void clear_pirq_eoi(struct domain *d, unsigned int irq) +{ + if ( d->arch.pirq_eoi_mfn ) + clear_bit(irq, map_pirq_eoi(d)); +} + +static void _irq_guest_eoi(irq_desc_t *desc) +{ + irq_guest_action_t *action = (irq_guest_action_t *)desc->action; + unsigned int i, vector = desc - irq_desc; + + for ( i = 0; i < action->nr_guests; ++i ) + clear_pirq_eoi(action->guest[i], + domain_vector_to_irq(action->guest[i], vector)); + flush_pirq_eoi(); + + desc->status &= ~IRQ_INPROGRESS; + desc->handler->enable(vector); +} + static struct timer irq_guest_eoi_timer[NR_VECTORS]; +static DECLARE_BITMAP(irq_guest_eoi_pending, NR_VECTORS); static void irq_guest_eoi_timer_fn(void *data) { irq_desc_t *desc = data; - unsigned vector = desc - irq_desc; unsigned long flags; spin_lock_irqsave(&desc->lock, flags); - desc->status &= ~IRQ_INPROGRESS; - desc->handler->enable(vector); + if ( test_and_clear_bit(desc - irq_desc, irq_guest_eoi_pending) ) + _irq_guest_eoi(desc); spin_unlock_irqrestore(&desc->lock, flags); } @@ -272,8 +320,15 @@ static void __do_IRQ_guest(int vector) if ( already_pending == action->nr_guests ) { - desc->handler->disable(vector); stop_timer(&irq_guest_eoi_timer[vector]); + desc->handler->disable(vector); + set_bit(vector, irq_guest_eoi_pending); + for ( i = 0; i < already_pending; ++i ) + { + d = action->guest[i]; + set_pirq_eoi(d, domain_vector_to_irq(d, vector)); + } + flush_pirq_eoi(); init_timer(&irq_guest_eoi_timer[vector], irq_guest_eoi_timer_fn, desc, smp_processor_id()); set_timer(&irq_guest_eoi_timer[vector], NOW() + MILLISECS(1)); @@ -382,8 +437,15 @@ static void __pirq_guest_eoi(struct doma action = (irq_guest_action_t *)desc->action; vector = desc - irq_desc; - ASSERT(!test_bit(irq, d->pirq_mask) || - (action->ack_type != ACKTYPE_NONE)); + if ( action->ack_type == ACKTYPE_NONE ) + { + ASSERT(!test_bit(irq, d->pirq_mask)); + if ( test_and_clear_bit(vector, irq_guest_eoi_pending) ) + { + kill_timer(&irq_guest_eoi_timer[vector]); + _irq_guest_eoi(desc); + } + } if ( unlikely(!test_and_clear_bit(irq, d->pirq_mask)) || unlikely(--action->in_flight != 0) ) @@ -607,6 +669,12 @@ int pirq_guest_bind(struct vcpu *v, int action->guest[action->nr_guests++] = v->domain; + if ( action->ack_type != ACKTYPE_NONE ) + set_pirq_eoi(v->domain, irq); + else + clear_pirq_eoi(v->domain, irq); + flush_pirq_eoi(); + unlock_out: spin_unlock_irq(&desc->lock); out: Index: 2008-11-20/xen/arch/x86/physdev.c ==================================================================--- 2008-11-20.orig/xen/arch/x86/physdev.c 2008-10-13 13:36:27.000000000 +0200 +++ 2008-11-20/xen/arch/x86/physdev.c 2008-11-24 17:06:48.000000000 +0100 @@ -191,10 +191,35 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H ret = -EFAULT; if ( copy_from_guest(&eoi, arg, 1) != 0 ) break; + ret = -EINVAL; + if ( eoi.irq < 0 || eoi.irq >= NR_IRQS ) + break; + if ( v->domain->arch.pirq_eoi_mfn ) + evtchn_unmask(v->domain->pirq_to_evtchn[eoi.irq]); ret = pirq_guest_eoi(v->domain, eoi.irq); break; } + case PHYSDEVOP_pirq_eoi_mfn: { + struct physdev_pirq_eoi_mfn info; + + BUILD_BUG_ON(NR_IRQS > PAGE_SIZE * 8); + ret = -EFAULT; + if ( copy_from_guest(&info, arg, 1) != 0 ) + break; + ret = -EBUSY; + if ( v->domain->arch.pirq_eoi_mfn ) + break; + ret = -EINVAL; + if ( !mfn_valid(info.mfn) || + !get_page_and_type(mfn_to_page(info.mfn), v->domain, + PGT_writable_page) ) + break; + v->domain->arch.pirq_eoi_mfn = info.mfn; + ret = 0; + break; + } + /* Legacy since 0x00030202. */ case PHYSDEVOP_IRQ_UNMASK_NOTIFY: { ret = pirq_guest_unmask(v->domain); Index: 2008-11-20/xen/common/event_channel.c ==================================================================--- 2008-11-20.orig/xen/common/event_channel.c 2008-10-24 11:21:38.000000000 +0200 +++ 2008-11-20/xen/common/event_channel.c 2008-11-24 17:14:44.000000000 +0100 @@ -762,10 +762,9 @@ long evtchn_bind_vcpu(unsigned int port, } -static long evtchn_unmask(evtchn_unmask_t *unmask) +int evtchn_unmask(unsigned int port) { struct domain *d = current->domain; - int port = unmask->port; struct vcpu *v; spin_lock(&d->event_lock); @@ -916,7 +915,7 @@ long do_event_channel_op(int cmd, XEN_GU struct evtchn_unmask unmask; if ( copy_from_guest(&unmask, arg, 1) != 0 ) return -EFAULT; - rc = evtchn_unmask(&unmask); + rc = evtchn_unmask(unmask.port); break; } Index: 2008-11-20/xen/include/asm-x86/domain.h ==================================================================--- 2008-11-20.orig/xen/include/asm-x86/domain.h 2008-11-20 08:48:26.000000000 +0100 +++ 2008-11-20/xen/include/asm-x86/domain.h 2008-11-24 14:52:49.000000000 +0100 @@ -238,6 +238,8 @@ struct arch_domain int vector_pirq[NR_VECTORS]; s16 pirq_vector[NR_IRQS]; + unsigned long pirq_eoi_mfn; + /* Pseudophysical e820 map (XENMEM_memory_map). */ struct e820entry e820[3]; unsigned int nr_e820; Index: 2008-11-20/xen/include/asm-x86/fixmap.h ==================================================================--- 2008-11-20.orig/xen/include/asm-x86/fixmap.h 2008-11-05 16:54:22.000000000 +0100 +++ 2008-11-20/xen/include/asm-x86/fixmap.h 2008-11-24 16:43:03.000000000 +0100 @@ -33,6 +33,7 @@ enum fixed_addresses { #ifdef __i386__ FIX_PAE_HIGHMEM_0, FIX_PAE_HIGHMEM_END = FIX_PAE_HIGHMEM_0 + NR_CPUS-1, + FIX_PIRQ_EOI, #endif FIX_APIC_BASE, FIX_IO_APIC_BASE_0, Index: 2008-11-20/xen/include/public/physdev.h ==================================================================--- 2008-11-20.orig/xen/include/public/physdev.h 2008-08-15 16:18:55.000000000 +0200 +++ 2008-11-20/xen/include/public/physdev.h 2008-11-24 15:16:10.000000000 +0100 @@ -41,6 +41,21 @@ typedef struct physdev_eoi physdev_eoi_t DEFINE_XEN_GUEST_HANDLE(physdev_eoi_t); /* + * Register a shared page for the hypervisor to indicate whether the guest + * must issue PHYSDEVOP_eoi. The semantics of PHYSDEVOP_eoi change slightly + * once the guest used this function in that the associated event channel + * will automatically get unmasked. The page registered is used as a bit + * array indexed by Xen''s PIRQ value. + */ +#define PHYSDEVOP_pirq_eoi_mfn 17 +struct physdev_pirq_eoi_mfn { + /* IN */ + xen_pfn_t mfn; +}; +typedef struct physdev_pirq_eoi_mfn physdev_pirq_eoi_mfn_t; +DEFINE_XEN_GUEST_HANDLE(physdev_pirq_eoi_mfn_t); + +/* * Query the status of an IRQ line. * @arg == pointer to physdev_irq_status_query structure. */ Index: 2008-11-20/xen/include/xen/event.h ==================================================================--- 2008-11-20.orig/xen/include/xen/event.h 2008-09-19 14:28:28.000000000 +0200 +++ 2008-11-20/xen/include/xen/event.h 2008-11-24 17:12:01.000000000 +0100 @@ -44,6 +44,9 @@ int evtchn_send(struct domain *d, unsign /* Bind a local event-channel port to the specified VCPU. */ long evtchn_bind_vcpu(unsigned int port, unsigned int vcpu_id); +/* unmask the current domain''s event-channel port. */ +int evtchn_unmask(unsigned int port); + /* Allocate/free a Xen-attached event channel port. */ int alloc_unbound_xen_event_channel( struct vcpu *local_vcpu, domid_t remote_domid); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-24 17:02 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 24/11/08 16:34, "Jan Beulich" <jbeulich@novell.com> wrote:> Perhaps the other way around: Make PHYSDEVOP_eoi imply an unmask, > as that''s what is always happening (whereas not every unmask also wants > an EOI to be signaled). Below is a draft (compile-tested only) patch that, > before coding the guest side, I''d appreciate to get comments on - > especially if it appears reasonable to be done that way, if it meets your > naming and coding preferences (I''m pretty sure it won''t), and of course > whether it''s obviously broken in some respect.I don''t care which way round you do it (PHYSDEVOP_eoi implies unmask, or vice versa) although you had just about convinced me that you should do it the other way round to how you''ve chosen. I don''t really mind either way though. The fixmap stuff is a bit ugly and I would just have done a map_domain_page_global() for 32-bit Xen (good enough as far as I''m concerned). I''m not dead set against your approach if you like it very much, though. Setting the need-a-hypercall bit looks racey. Don''t you need to set the bit, then check the guest didn''t unmask meanwhile? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-25 08:15 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 24.11.08 18:02 >>> >On 24/11/08 16:34, "Jan Beulich" <jbeulich@novell.com> wrote: > >> Perhaps the other way around: Make PHYSDEVOP_eoi imply an unmask, >> as that''s what is always happening (whereas not every unmask also wants >> an EOI to be signaled). Below is a draft (compile-tested only) patch that, >> before coding the guest side, I''d appreciate to get comments on - >> especially if it appears reasonable to be done that way, if it meets your >> naming and coding preferences (I''m pretty sure it won''t), and of course >> whether it''s obviously broken in some respect. > >I don''t care which way round you do it (PHYSDEVOP_eoi implies unmask, or >vice versa) although you had just about convinced me that you should do it >the other way round to how you''ve chosen. I don''t really mind either way >though.I may have expressed myself a little imprecisely. I really just meant the two operations to be folded into one. While looking at it in more detail I realized that tying the EOI to the unmask wouldn''t be nice as ''normal'' unmasks must still be possible. My main concern was what the bitmap is to express, and indeed following your approach of simply indicating the need for an EOI notification (rather than indicating a need for unmask through hypercall) finally seemed more consistent (and apparently making for simpler guest side code).>The fixmap stuff is a bit ugly and I would just have done a >map_domain_page_global() for 32-bit Xen (good enough as far as I''m >concerned). I''m not dead set against your approach if you like it very much, >though.But just as map_domain_page(), map_domain_page_global() can''t be used out of IRQ context...>Setting the need-a-hypercall bit looks racey. Don''t you need to set the bit, >then check the guest didn''t unmask meanwhile?We could, but I don''t think it''s strictly needed: The bit geting temporarily set (as opposed to the case where it''s being set for the lifetime of the IRQ) is a performance optimization only anyway, i.e. to prevent the IRQ from remaining masked for longer than it really needs to be. But yes, I''ll see whether the unmasking case can be taken care of without making the code much more complicated. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-25 08:30 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 25/11/08 08:15, "Jan Beulich" <jbeulich@novell.com> wrote:>> The fixmap stuff is a bit ugly and I would just have done a >> map_domain_page_global() for 32-bit Xen (good enough as far as I''m >> concerned). I''m not dead set against your approach if you like it very much, >> though. > > But just as map_domain_page(), map_domain_page_global() can''t be used > out of IRQ context...I would have kept the page mapped from when it was registered until domain death. Same as we do for guest-registered vcpu_info.>> Setting the need-a-hypercall bit looks racey. Don''t you need to set the bit, >> then check the guest didn''t unmask meanwhile? > > We could, but I don''t think it''s strictly needed: The bit geting temporarily > set (as opposed to the case where it''s being set for the lifetime of the IRQ) > is a performance optimization only anyway, i.e. to prevent the IRQ from > remaining masked for longer than it really needs to be. But yes, I''ll see > whether the unmasking case can be taken care of without making the code > much more complicated.Yeah, okay. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-25 08:40 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 25.11.08 09:30 >>> >On 25/11/08 08:15, "Jan Beulich" <jbeulich@novell.com> wrote: > >>> The fixmap stuff is a bit ugly and I would just have done a >>> map_domain_page_global() for 32-bit Xen (good enough as far as I''m >>> concerned). I''m not dead set against your approach if you like it very much, >>> though. >> >> But just as map_domain_page(), map_domain_page_global() can''t be used >> out of IRQ context... > >I would have kept the page mapped from when it was registered until domain >death. Same as we do for guest-registered vcpu_info.Oh, I indeed didn''t consider that option. The question is - does it scale? 2Mb doesn''t seem all that much for a global pool, but otoh it''s perhaps unlikely that 32-bit hypervisors would have to manage too large an amount of domains/vCPU-s. So yes, I''ll go that route. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-25 08:44 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 25/11/08 08:40, "Jan Beulich" <jbeulich@novell.com> wrote:>>> But just as map_domain_page(), map_domain_page_global() can''t be used >>> out of IRQ context... >> >> I would have kept the page mapped from when it was registered until domain >> death. Same as we do for guest-registered vcpu_info. > > Oh, I indeed didn''t consider that option. The question is - does it scale? > 2Mb doesn''t seem all that much for a global pool, but otoh it''s perhaps > unlikely that 32-bit hypervisors would have to manage too large an > amount of domains/vCPU-s. So yes, I''ll go that route.Yeah, exactly. Anyone who wants a larger number of domains/VCPUs probably has 64-bit-capable hardware. And lack of 64-bit capability is the only reason not to run 64-bit Xen these days. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-25 08:52 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 25.11.08 09:44 >>> >... And lack of 64-bit capability is the only reason not to run 64-bit Xen these days.Is it, really? That is, are all the save/restore/migrate issues indeed fixed? I didn''t think so, but then again I don''t follow closely what happens on the tools side... Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-25 09:38 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 25/11/08 08:52, "Jan Beulich" <jbeulich@novell.com> wrote:>>>> Keir Fraser <keir.fraser@eu.citrix.com> 25.11.08 09:44 >>> >> ... And lack of 64-bit capability is the only reason not to run 64-bit Xen >> these days. > > Is it, really? That is, are all the save/restore/migrate issues indeed fixed? > I > didn''t think so, but then again I don''t follow closely what happens on the > tools side...That''s a dom0 vs domU issue. You can run 32-bit dom0 on 64-bit Xen, and that suffices to avoid Xen''s address-space constraints. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2008-Nov-25 10:12 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
>>> Keir Fraser <keir.fraser@eu.citrix.com> 24.11.08 18:02 >>> >Setting the need-a-hypercall bit looks racey. Don''t you need to set the bit, >then check the guest didn''t unmask meanwhile?What would the action be in that case? Try send_guest_pirq() a second time, and not arm the timer (along with clearing all the bits again) if one of the guests is now able to accept it? Would seem too complicated to me (the more that then the HVM case would need to be taken care of explicitly), given that all we try to avoid is a 1ms gap in delivering the next event. But it would be doable of course. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Nov-25 10:47 UTC
Re: [Xen-devel] irq_guest_eoi_timer interaction with MSI
On 25/11/08 10:12, "Jan Beulich" <jbeulich@novell.com> wrote:>>>> Keir Fraser <keir.fraser@eu.citrix.com> 24.11.08 18:02 >>> >> Setting the need-a-hypercall bit looks racey. Don''t you need to set the bit, >> then check the guest didn''t unmask meanwhile? > > What would the action be in that case? Try send_guest_pirq() a second > time, and not arm the timer (along with clearing all the bits again) if one > of the guests is now able to accept it? Would seem too complicated to me > (the more that then the HVM case would need to be taken care of > explicitly), given that all we try to avoid is a 1ms gap in delivering the > next event. But it would be doable of course.Yes, I think it would be something like that. As you say, the patch you propose is a strict improvement on the current state of affairs, so we can just go with that and extend it only if it proves necessary. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel