Jan Beulich
2009-Sep-22 08:18 UTC
[Xen-devel] another regression from IRQ handling changes
An issue I had fixed a little while before your changes now appears to exist again (can't actively verify it due to lack of access to a sufficiently big system): while we can now handle more than 192 interrupt sources, those are again confined to the first 256 IO-APIC pins. We know, however, that there are systems with well over 300 pins (most of which are typically unused, and hence being able to "only" handle 192 interrupt sources doesn't really present a problem on these systems).

Clearly, handling of more than 256 (non-MSI) interrupt sources cannot be done without a kernel-side change, since there needs to be a replacement for the 8-bit vector information conveyed through the kernel's writes to the IO-APIC redirection table entries. However, up to 256 interrupt sources could easily be handled without a kernel-side change, by making PHYSDEVOP_alloc_irq_vector return a fake vector (rather than the unmodified irq that got passed in).

Thoughts?

Thanks, Jan
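The 8-bit limit Jan refers to is the vector field of an IO-APIC redirection table entry: the kernel communicates its chosen vector to Xen by writing it into bits 0-7 of the RTE. A minimal C sketch of the RTE low-dword layout (field names are illustrative, not taken from any Xen header; the widths follow the IO-APIC datasheet):

/* Low 32 bits of an IO-APIC redirection table entry (RTE). */
union rte_low {
    unsigned int raw;
    struct {
        unsigned int vector          : 8;   /* the 8-bit bottleneck */
        unsigned int delivery_mode   : 3;
        unsigned int dest_mode       : 1;
        unsigned int delivery_status : 1;
        unsigned int polarity        : 1;
        unsigned int remote_irr      : 1;
        unsigned int trigger_mode    : 1;
        unsigned int mask            : 1;
        unsigned int reserved        : 15;
    } fields;
};

Whatever the kernel writes into that vector field is all Xen sees on an RTE update, hence the need for either a fake-vector indirection (for up to 256 sources) or a new channel entirely (beyond that).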
Keir Fraser
2009-Sep-22 08:39 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 09:18, "Jan Beulich" <JBeulich@novell.com> wrote:

> An issue I had fixed a little while before your changes now appears to
> exist again (can't actively verify it due to lack of access to a
> sufficiently big system): while we can now handle more than 192
> interrupt sources, those are again confined to the first 256 IO-APIC
> pins. We know, however, that there are systems with well over 300 pins
> (most of which are typically unused, and hence being able to "only"
> handle 192 interrupt sources doesn't really present a problem on these
> systems).
>
> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
> be done without a kernel-side change, since there needs to be a
> replacement for the 8-bit vector information conveyed through the
> kernel's writes to the IO-APIC redirection table entries. However, up to
> 256 interrupt sources could easily be handled without a kernel-side
> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
> (rather than the unmodified irq that got passed in).

If it wasn't broken before, it was probably broken by Xiantao's follow-up patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty regression with NetBSD). What we probably need to do is have a 256-entry dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets passed back in on ioapic writes, we index into that array rather than using the naked rte.vector.

How does that sound?

 -- Keir
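Keir's proposal amounts to an indirection table keyed by the fake vector. A minimal sketch under simplifying assumptions (a single dom0, no locking; only the dom0_vector_to_dom0_irq name comes from the mail, everything else is invented for illustration):

#define NR_FAKE_VECTORS 256
#define IRQ_UNBOUND     (-1)

static int dom0_vector_to_dom0_irq[NR_FAKE_VECTORS];

static void init_fake_vectors(void)
{
    int v;
    for (v = 0; v < NR_FAKE_VECTORS; v++)
        dom0_vector_to_dom0_irq[v] = IRQ_UNBOUND;
}

/* PHYSDEVOP_alloc_irq_vector path: hand back a "fake vector" (really
 * just a table index) for each fresh irq instead of echoing the irq,
 * so irqs above 255 still fit the 8-bit RTE vector field. */
static int alloc_fake_vector(int irq)
{
    int v;
    for (v = 0; v < NR_FAKE_VECTORS; v++) {
        if (dom0_vector_to_dom0_irq[v] == irq)
            return v;                          /* already mapped */
        if (dom0_vector_to_dom0_irq[v] == IRQ_UNBOUND) {
            dom0_vector_to_dom0_irq[v] = irq;  /* fresh mapping */
            return v;
        }
    }
    return -1;                                 /* table exhausted */
}

/* ioapic write handler: recover the real irq from the 8-bit vector
 * the kernel wrote, instead of using the naked rte.vector. */
static int fake_vector_to_irq(unsigned int vector)
{
    if (vector >= NR_FAKE_VECTORS)
        return IRQ_UNBOUND;
    return dom0_vector_to_dom0_irq[vector];
}

Note this by itself still caps the number of distinct interrupt sources at 256; it only removes the restriction that they sit on the first 256 pins.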
Jan Beulich
2009-Sep-22 08:53 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 10:39 >>>
>On 22/09/2009 09:18, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> An issue I had fixed a little while before your changes now appears to
>> exist again (can't actively verify it due to lack of access to a
>> sufficiently big system): while we can now handle more than 192
>> interrupt sources, those are again confined to the first 256 IO-APIC
>> pins. We know, however, that there are systems with well over 300 pins
>> (most of which are typically unused, and hence being able to "only"
>> handle 192 interrupt sources doesn't really present a problem on these
>> systems).
>>
>> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
>> be done without a kernel-side change, since there needs to be a
>> replacement for the 8-bit vector information conveyed through the
>> kernel's writes to the IO-APIC redirection table entries. However, up to
>> 256 interrupt sources could easily be handled without a kernel-side
>> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
>> (rather than the unmodified irq that got passed in).
>
>If it wasn't broken before, it was probably broken by Xiantao's follow-up
>patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>regression with NetBSD). What we probably need to do is have a 256-entry
>dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>passed back in on ioapic writes, we index into that array rather than using
>the naked rte.vector.
>
>How does that sound?

Yeah, that's what I would view as the solution to get the old functionality back. But my question also extended to possible solutions for getting beyond 256 here, especially ones that are also acceptable to the pv-ops Dom0, which I'm much less certain about.

Jan
Zhang, Xiantao
2009-Sep-22 08:56 UTC
[Xen-devel] RE: another regression from IRQ handling changes
Jan Beulich wrote:
> An issue I had fixed a little while before your changes now appears to
> exist again (can't actively verify it due to lack of access to a
> sufficiently big system): while we can now handle more than 192
> interrupt sources, those are again confined to the first 256 IO-APIC
> pins. We know, however, that there are systems with well over 300 pins
> (most of which are typically unused, and hence being able to "only"
> handle 192 interrupt sources doesn't really present a problem on these
> systems).
>
> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
> be done without a kernel-side change, since there needs to be a
> replacement for the 8-bit vector information conveyed through the
> kernel's writes to the IO-APIC redirection table entries. However, up to
> 256 interrupt sources could easily be handled without a kernel-side
> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
> (rather than the unmodified irq that got passed in).

Jan,
  Are you sure it worked well with more than 256 pins before my IRQ changes? You may refer to dom0's code (linux-2.6.18.8-xen.hg): the GSI irq number shouldn't be bigger than 256 at any time. You can say this is an old issue we need to address by modifying some ABIs between dom0 and the hypervisor, rather than saying the issue was introduced by the IRQ handling changes, I think. That is to say, Xen has never worked with big systems that have more than 256 IO-APIC pins.
Xiantao
Keir Fraser
2009-Sep-22 09:14 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:

>> If it wasn't broken before, it was probably broken by Xiantao's follow-up
>> patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>> regression with NetBSD). What we probably need to do is have a 256-entry
>> dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>> fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>> passed back in on ioapic writes, we index into that array rather than using
>> the naked rte.vector.
>
> Yeah, that's what I would view as the solution to get the old functionality
> back. But my question also extended to possible solutions for getting beyond
> 256 here, especially ones that are also acceptable to the pv-ops Dom0,
> which I'm much less certain about.

Well, we could assume that if we see an irq greater than 256 at PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in that case the 'vector' we return, which then gets passed to ioapic_write, is rather redundant. We can work out the GSI from the ioapic RTE that is being modified, after all. So perhaps we could just flip into a non-legacy mode of operation in that case (perhaps reserve a single magic 'vector' value to return from PHYSDEVOP_alloc_irq_vector, and if we see that value in the ioapic write handler then we know to assume that guest pirq = gsi).

The legacy case is just there to handle NetBSD, which throws non-GSIs at the PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have worked with those mega-big systems at any time. So I don't think we're dealing with a regression there.

 -- Keir
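A sketch of the mode flip Keir describes: the magic value and both helper names are invented for illustration, and a real implementation would need to keep the legacy path from ever handing out the magic value itself.

#define MAGIC_GSI_VECTOR 0xff   /* invented reserved value: "pirq = gsi" */

/* PHYSDEVOP_alloc_irq_vector path: a dom0 asking about an irq this
 * large must be talking in GSIs, so flip to the non-legacy mode. */
static int alloc_irq_vector_sketch(int irq)
{
    if (irq >= 256)
        return MAGIC_GSI_VECTOR;
    /* Legacy behaviour: echo the irq back as the vector. A real
     * implementation must ensure this path never yields 0xff. */
    return irq;
}

/* ioapic write handler: in non-legacy mode the GSI is implied by
 * which RTE is being written, so the vector field carries no data. */
static int pirq_from_rte_write(int gsi_of_rte, int rte_vector)
{
    if (rte_vector == MAGIC_GSI_VECTOR)
        return gsi_of_rte;       /* guest pirq = gsi */
    return rte_vector;           /* legacy (NetBSD) path */
}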
Jan Beulich
2009-Sep-22 09:19 UTC
[Xen-devel] RE: another regression from IRQ handling changes
>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 22.09.09 10:56 >>> >Jan Beulich wrote: >> An issue I had fixed a little while before your changes now appears to >> exist again (can''t actively verify it due to lack of access to a >> sufficiently big system): While we now can handle more than 192 >> interrupt sources, >> those again are confined to the first 256 IO-APIC pins. We know, >> however, that there are systems with well over 300 pins (most of which >> are typically unused, and hence being able to "only" handle 192 >> interrupt sources doesn''t really present a problem on these systems. >> >> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot >> be done without a kernel side change, since there needs to be a >> replacement for the 8-bit vector information conveyed through the >> kernel writes to the IO-APIC redirection table entries. However, up to >> 256 interrupt sources could easily be handled without kernel side >> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector >> (rather than the unmodified irq that got passed in). > >Jan, > Are you sure it worked well with more than 256 pins before my IRQ >changes ? You may refer to dom0''s code(linux-2.6.18.8-xen.hg) and >the gsi irq number shouldn''t be bigger than 256 in any time. You can >say this is an old issue we need to address through modifying some >ABIs between dom0 and hypervisor rather than saying this issue is >introduced by the IRQ handling changes, I think. That is to say, Xen >never works with the big system with more than 256 pins of ioapic at >any time.I can''t say anything about the 2.6.18 tree, but I''m certain it worked with newer Dom0-s, e.g. the forward ported 2.6.27-based ones. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Sep-22 09:53 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 11:14 >>>
>On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> If it wasn't broken before, it was probably broken by Xiantao's follow-up
>>> patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>>> regression with NetBSD). What we probably need to do is have a 256-entry
>>> dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>>> fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>>> passed back in on ioapic writes, we index into that array rather than using
>>> the naked rte.vector.
>>
>> Yeah, that's what I would view as the solution to get the old functionality
>> back. But my question also extended to possible solutions for getting beyond
>> 256 here, especially ones that are also acceptable to the pv-ops Dom0,
>> which I'm much less certain about.
>
>Well, we could assume that if we see an irq greater than 256 at
>PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in that
>case the 'vector' we return, which then gets passed to ioapic_write, is
>rather redundant. We can work out the GSI from the ioapic RTE that is being
>modified, after all. So perhaps we could just flip into a non-legacy mode of
>operation in that case (perhaps reserve a single magic 'vector' value to
>return from PHYSDEVOP_alloc_irq_vector, and if we see that value in the
>ioapic write handler then we know to assume that guest pirq = gsi).

I wouldn't base this on the passed-in IRQ number, but instead on the number of IRQs mapped: if the proposed array no longer has a spare slot, switch to passing back the magic vector (and assume pirq == irq in ioapic_guest_write() - we could even add a check that this relation holds for all previously enabled IRQs, and fail PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't use only GSIs before). But besides that detail your idea sounds fine, at least from a Linux perspective.

Are you planning on getting this done, or should I?

>The legacy case is just there to handle NetBSD, which throws non-GSIs at the
>PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have worked with
>those mega-big systems at any time. So I don't think we're dealing with a
>regression there.

Hmm, no, the regression is with (newer?) Linux, which should have been capable of dealing with interrupts coming in on IO-APIC pins above 256 before that change.

Jan
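For contrast, a sketch of Jan's exhaustion-based variant; all names are invented, and his "Dom0 used only GSIs before" check is reduced here to a simple identity scan of the table.

#define NR_SLOTS     255        /* one 8-bit value kept back as magic */
#define MAGIC_VECTOR NR_SLOTS   /* 0xff: "assume pirq == irq" marker */
#define IRQ_UNBOUND  (-1)

static int vector_to_irq[NR_SLOTS];

static void init_slots(void)
{
    int v;
    for (v = 0; v < NR_SLOTS; v++)
        vector_to_irq[v] = IRQ_UNBOUND;
}

static int alloc_irq_vector_sketch(int irq)
{
    int v;

    /* Prefer slot == irq, so pirq == irq keeps holding for a dom0
     * that talks in GSIs (the Linux case). */
    if (irq >= 0 && irq < NR_SLOTS && vector_to_irq[irq] == IRQ_UNBOUND)
        vector_to_irq[irq] = irq;

    for (v = 0; v < NR_SLOTS; v++) {
        if (vector_to_irq[v] == irq)
            return v;                    /* existing or fresh slot */
        if (vector_to_irq[v] == IRQ_UNBOUND) {
            vector_to_irq[v] = irq;      /* any spare slot will do */
            return v;
        }
    }

    /* No spare slot: the magic vector is only safe if every mapping
     * so far already satisfies pirq == irq, i.e. the dom0 has been
     * talking in GSIs all along; otherwise fail the call. */
    for (v = 0; v < NR_SLOTS; v++)
        if (vector_to_irq[v] != v)
            return -1;
    return MAGIC_VECTOR;
}

With this shape, a Linux dom0 that only ever passes GSIs degrades cleanly into the magic-vector mode once the table fills, while a NetBSD-style dom0 that passed non-GSIs earlier fails the final scan instead of being given a wrong assumption.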
Zhang, Xiantao
2009-Sep-22 10:10 UTC
RE: [Xen-devel] another regression from IRQ handling changes
Jan Beulich wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 11:14 >>>
>> On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>>
>>>> If it wasn't broken before, it was probably broken by Xiantao's
>>>> follow-up patch to fix NetBSD dom0 (at least as much as possible,
>>>> to avoid a nasty regression with NetBSD). What we probably need to
>>>> do is have a 256-entry dom0_vector_to_dom0_irq[] array, and
>>>> allocate an entry from it for every fresh irq we see at
>>>> PHYSDEVOP_alloc_irq_vector. Then, when the vector gets passed back
>>>> in on ioapic writes, we index into that array rather than using
>>>> the naked rte.vector.
>>>
>>> Yeah, that's what I would view as the solution to get the old
>>> functionality back. But my question also extended to possible
>>> solutions for getting beyond 256 here, especially ones that are also
>>> acceptable to the pv-ops Dom0, which I'm much less certain about.
>>
>> Well, we could assume that if we see an irq greater than 256 at
>> PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in
>> that case the 'vector' we return, which then gets passed to
>> ioapic_write, is rather redundant. We can work out the GSI from the
>> ioapic RTE that is being modified, after all. So perhaps we could
>> just flip into a non-legacy mode of operation in that case (perhaps
>> reserve a single magic 'vector' value to return from
>> PHYSDEVOP_alloc_irq_vector, and if we see that value in the ioapic
>> write handler then we know to assume that guest pirq = gsi).
>
> I wouldn't base this on the passed-in IRQ number, but instead on the
> number of IRQs mapped: if the proposed array no longer has a spare
> slot, switch to passing back the magic vector (and assume pirq == irq
> in ioapic_guest_write() - we could even add a check that this relation
> holds for all previously enabled IRQs, and fail
> PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't
> use only GSIs before). But besides that detail your idea sounds fine,
> at least from a Linux perspective.
>
> Are you planning on getting this done, or should I?

There's no need to handle it in such a complex way. I think the following patch should fix the potential issue. For a Linux dom0, nothing should depend on the value of rte.vector at all when the GSI irq > 256; just make pirq equal to irq and then build the pirq-to-irq mapping.

diff -r 3a71e070e3c5 xen/arch/x86/io_apic.c
--- a/xen/arch/x86/io_apic.c    Fri Sep 18 14:45:40 2009 +0100
+++ b/xen/arch/x86/io_apic.c    Tue Sep 22 18:03:50 2009 +0800
@@ -2195,9 +2195,13 @@ int ioapic_guest_write(unsigned long phy
     /* Since PHYSDEVOP_alloc_irq_vector is dummy, rte.vector is the pirq
        which corresponds to this ioapic pin, retrieve it for building
-       pirq and irq mapping.
-     */
-    pirq = rte.vector;
+       pirq and irq mapping, and this is only for NetBSD dom0. For Linux
+       dom0, pirq == irq at any time.
+     */
+    if (irq >= 256)
+        pirq = irq;
+    else
+        pirq = rte.vector; /* Make NetBSD dom0 work. */
     if(pirq < 0 || pirq >= dom0->nr_pirqs)
         return -EINVAL;

>
>> The legacy case is just there to handle NetBSD, which throws non-GSIs
>> at the PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have
>> worked with those mega-big systems at any time. So I don't think
>> we're dealing with a regression there.
>
> Hmm, no, the regression is with (newer?) Linux, which should have been
> capable of dealing with interrupts coming in on IO-APIC pins above 256
> before that change.
Keir Fraser
2009-Sep-22 10:30 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:

> I wouldn't base this on the passed-in IRQ number, but instead on the
> number of IRQs mapped: if the proposed array no longer has a spare
> slot, switch to passing back the magic vector (and assume pirq == irq
> in ioapic_guest_write() - we could even add a check that this relation
> holds for all previously enabled IRQs, and fail
> PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't
> use only GSIs before). But besides that detail your idea sounds fine,
> at least from a Linux perspective.
>
> Are you planning on getting this done, or should I?

Feel free to go ahead. Unfortunately I guess neither of us can properly test this case? :-(

 -- Keir
Keir Fraser
2009-Sep-22 10:32 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

> There's no need to handle it in such a complex way. I think the following
> patch should fix the potential issue. For a Linux dom0, nothing should
> depend on the value of rte.vector at all when the GSI irq > 256; just make
> pirq equal to irq and then build the pirq-to-irq mapping.

This would be acceptable to me. What do you think, Jan?

 -- Keir
Jan Beulich
2009-Sep-22 11:33 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 12:32 >>>
>On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
>
>> There's no need to handle it in such a complex way. I think the following
>> patch should fix the potential issue. For a Linux dom0, nothing should
>> depend on the value of rte.vector at all when the GSI irq > 256; just make
>> pirq equal to irq and then build the pirq-to-irq mapping.
>
>This would be acceptable to me. What do you think, Jan?

Indeed, that should do it (we'll supposedly get this tested later this year). Thanks!

Jan
Zhang, Xiantao
2009-Sep-22 11:54 UTC
RE: [Xen-devel] another regression from IRQ handling changes
Attached is the patch with the Signed-off-by and the description. Thanks for the useful discussion on this topic!
Xiantao

-----Original Message-----
From: Jan Beulich [mailto:JBeulich@novell.com]
Sent: Tuesday, September 22, 2009 7:34 PM
To: Keir Fraser; Zhang, Xiantao
Cc: xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] another regression from IRQ handling changes

>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 12:32 >>>
>On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
>
>> There's no need to handle it in such a complex way. I think the following
>> patch should fix the potential issue. For a Linux dom0, nothing should
>> depend on the value of rte.vector at all when the GSI irq > 256; just make
>> pirq equal to irq and then build the pirq-to-irq mapping.
>
>This would be acceptable to me. What do you think, Jan?

Indeed, that should do it (we'll supposedly get this tested later this year). Thanks!

Jan