Jan Beulich
2009-Sep-22 08:18 UTC
[Xen-devel] another regression from IRQ handling changes
An issue I had fixed a little while before your changes now appears to exist again (can't actively verify it due to lack of access to a sufficiently big system): while we can now handle more than 192 interrupt sources, those are again confined to the first 256 IO-APIC pins. We know, however, that there are systems with well over 300 pins (most of which are typically unused, and hence being able to "only" handle 192 interrupt sources doesn't really present a problem on these systems).

Clearly, handling of more than 256 (non-MSI) interrupt sources cannot be done without a kernel-side change, since there needs to be a replacement for the 8-bit vector information conveyed through the kernel's writes to the IO-APIC redirection table entries. However, up to 256 interrupt sources could easily be handled without a kernel-side change, by making PHYSDEVOP_alloc_irq_vector return a fake vector (rather than the unmodified irq that got passed in).

Thoughts?

Thanks, Jan
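The 8-bit limit Jan refers to is the vector field of an IO-APIC redirection table entry: the kernel communicates its chosen vector to Xen by writing it into bits 0-7 of the RTE. A minimal C sketch of the RTE low-dword layout (field names are illustrative, not taken from any Xen header; the widths follow the IO-APIC datasheet):

/* Low 32 bits of an IO-APIC redirection table entry (RTE). */
union rte_low {
    unsigned int raw;
    struct {
        unsigned int vector          : 8;   /* the 8-bit bottleneck */
        unsigned int delivery_mode   : 3;
        unsigned int dest_mode       : 1;
        unsigned int delivery_status : 1;
        unsigned int polarity        : 1;
        unsigned int remote_irr      : 1;
        unsigned int trigger_mode    : 1;
        unsigned int mask            : 1;
        unsigned int reserved        : 15;
    } fields;
};

Whatever the kernel writes into that vector field is all Xen sees on an RTE update, hence the need for either a fake-vector indirection (for up to 256 sources) or a new channel entirely (beyond that).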
Keir Fraser
2009-Sep-22 08:39 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 09:18, "Jan Beulich" <JBeulich@novell.com> wrote:

> An issue I had fixed a little while before your changes now appears to
> exist again (can't actively verify it due to lack of access to a
> sufficiently big system): while we can now handle more than 192
> interrupt sources, those are again confined to the first 256 IO-APIC
> pins. We know, however, that there are systems with well over 300 pins
> (most of which are typically unused, and hence being able to "only"
> handle 192 interrupt sources doesn't really present a problem on these
> systems).
>
> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
> be done without a kernel-side change, since there needs to be a
> replacement for the 8-bit vector information conveyed through the
> kernel's writes to the IO-APIC redirection table entries. However, up to
> 256 interrupt sources could easily be handled without a kernel-side
> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
> (rather than the unmodified irq that got passed in).

If it wasn't broken before, it was probably broken by Xiantao's follow-up patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty regression with NetBSD). What we probably need to do is have a 256-entry dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets passed back in on ioapic writes, we index into that array rather than using the naked rte.vector.

How does that sound?

 -- Keir
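Keir's proposal amounts to an indirection table keyed by the fake vector. A minimal sketch under simplifying assumptions (a single dom0, no locking; only the dom0_vector_to_dom0_irq name comes from the mail, everything else is invented for illustration):

#define NR_FAKE_VECTORS 256
#define IRQ_UNBOUND     (-1)

static int dom0_vector_to_dom0_irq[NR_FAKE_VECTORS];

static void init_fake_vectors(void)
{
    int v;
    for (v = 0; v < NR_FAKE_VECTORS; v++)
        dom0_vector_to_dom0_irq[v] = IRQ_UNBOUND;
}

/* PHYSDEVOP_alloc_irq_vector path: hand back a "fake vector" (really
 * just a table index) for each fresh irq instead of echoing the irq,
 * so irqs above 255 still fit the 8-bit RTE vector field. */
static int alloc_fake_vector(int irq)
{
    int v;
    for (v = 0; v < NR_FAKE_VECTORS; v++) {
        if (dom0_vector_to_dom0_irq[v] == irq)
            return v;                          /* already mapped */
        if (dom0_vector_to_dom0_irq[v] == IRQ_UNBOUND) {
            dom0_vector_to_dom0_irq[v] = irq;  /* fresh mapping */
            return v;
        }
    }
    return -1;                                 /* table exhausted */
}

/* ioapic write handler: recover the real irq from the 8-bit vector
 * the kernel wrote, instead of using the naked rte.vector. */
static int fake_vector_to_irq(unsigned int vector)
{
    if (vector >= NR_FAKE_VECTORS)
        return IRQ_UNBOUND;
    return dom0_vector_to_dom0_irq[vector];
}

Note this by itself still caps the number of distinct interrupt sources at 256; it only removes the restriction that they sit on the first 256 pins.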
Jan Beulich
2009-Sep-22 08:53 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 10:39 >>>
>On 22/09/2009 09:18, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> An issue I had fixed a little while before your changes now appears to
>> exist again (can't actively verify it due to lack of access to a
>> sufficiently big system): while we can now handle more than 192
>> interrupt sources, those are again confined to the first 256 IO-APIC
>> pins. We know, however, that there are systems with well over 300 pins
>> (most of which are typically unused, and hence being able to "only"
>> handle 192 interrupt sources doesn't really present a problem on these
>> systems).
>>
>> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
>> be done without a kernel-side change, since there needs to be a
>> replacement for the 8-bit vector information conveyed through the
>> kernel's writes to the IO-APIC redirection table entries. However, up to
>> 256 interrupt sources could easily be handled without a kernel-side
>> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
>> (rather than the unmodified irq that got passed in).
>
>If it wasn't broken before, it was probably broken by Xiantao's follow-up
>patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>regression with NetBSD). What we probably need to do is have a 256-entry
>dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>passed back in on ioapic writes, we index into that array rather than using
>the naked rte.vector.
>
>How does that sound?

Yeah, that's what I would view as the solution to get the old functionality back. But my question also extended to possible solutions for getting beyond 256 here, especially ones that are also acceptable to the pv-ops Dom0, which I'm much less certain about.

Jan
Zhang, Xiantao
2009-Sep-22 08:56 UTC
[Xen-devel] RE: another regression from IRQ handling changes
Jan Beulich wrote:
> An issue I had fixed a little while before your changes now appears to
> exist again (can't actively verify it due to lack of access to a
> sufficiently big system): while we can now handle more than 192
> interrupt sources, those are again confined to the first 256 IO-APIC
> pins. We know, however, that there are systems with well over 300 pins
> (most of which are typically unused, and hence being able to "only"
> handle 192 interrupt sources doesn't really present a problem on these
> systems).
>
> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot
> be done without a kernel-side change, since there needs to be a
> replacement for the 8-bit vector information conveyed through the
> kernel's writes to the IO-APIC redirection table entries. However, up to
> 256 interrupt sources could easily be handled without a kernel-side
> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector
> (rather than the unmodified irq that got passed in).

Jan,
  Are you sure it worked well with more than 256 pins before my IRQ changes? You may refer to dom0's code (linux-2.6.18.8-xen.hg): the GSI irq number shouldn't be bigger than 256 at any time. You can say this is an old issue we need to address by modifying some ABIs between dom0 and the hypervisor, rather than saying the issue was introduced by the IRQ handling changes, I think. That is to say, Xen has never worked with big systems that have more than 256 IO-APIC pins.
Xiantao
Keir Fraser
2009-Sep-22 09:14 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:

>> If it wasn't broken before, it was probably broken by Xiantao's follow-up
>> patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>> regression with NetBSD). What we probably need to do is have a 256-entry
>> dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>> fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>> passed back in on ioapic writes, we index into that array rather than using
>> the naked rte.vector.
>
> Yeah, that's what I would view as the solution to get the old functionality
> back. But my question also extended to possible solutions for getting beyond
> 256 here, especially ones that are also acceptable to the pv-ops Dom0,
> which I'm much less certain about.

Well, we could assume that if we see an irq greater than 256 at PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in that case the 'vector' we return, which then gets passed to ioapic_write, is rather redundant. We can work out the GSI from the ioapic RTE that is being modified, after all. So perhaps we could just flip into a non-legacy mode of operation in that case (perhaps reserve a single magic 'vector' value to return from PHYSDEVOP_alloc_irq_vector, and if we see that value in the ioapic write handler then we know to assume that guest pirq = gsi).

The legacy case is just there to handle NetBSD, which throws non-GSIs at the PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have worked with those mega-big systems at any time. So I don't think we're dealing with a regression there.

 -- Keir
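A sketch of the mode flip Keir describes: the magic value and both helper names are invented for illustration, and a real implementation would need to keep the legacy path from ever handing out the magic value itself.

#define MAGIC_GSI_VECTOR 0xff   /* invented reserved value: "pirq = gsi" */

/* PHYSDEVOP_alloc_irq_vector path: a dom0 asking about an irq this
 * large must be talking in GSIs, so flip to the non-legacy mode. */
static int alloc_irq_vector_sketch(int irq)
{
    if (irq >= 256)
        return MAGIC_GSI_VECTOR;
    /* Legacy behaviour: echo the irq back as the vector. A real
     * implementation must ensure this path never yields 0xff. */
    return irq;
}

/* ioapic write handler: in non-legacy mode the GSI is implied by
 * which RTE is being written, so the vector field carries no data. */
static int pirq_from_rte_write(int gsi_of_rte, int rte_vector)
{
    if (rte_vector == MAGIC_GSI_VECTOR)
        return gsi_of_rte;       /* guest pirq = gsi */
    return rte_vector;           /* legacy (NetBSD) path */
}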
Jan Beulich
2009-Sep-22 09:19 UTC
[Xen-devel] RE: another regression from IRQ handling changes
>>> "Zhang, Xiantao" <xiantao.zhang@intel.com> 22.09.09 10:56 >>> >Jan Beulich wrote: >> An issue I had fixed a little while before your changes now appears to >> exist again (can''t actively verify it due to lack of access to a >> sufficiently big system): While we now can handle more than 192 >> interrupt sources, >> those again are confined to the first 256 IO-APIC pins. We know, >> however, that there are systems with well over 300 pins (most of which >> are typically unused, and hence being able to "only" handle 192 >> interrupt sources doesn''t really present a problem on these systems. >> >> Clearly, handling of more than 256 (non-MSI) interrupt sources cannot >> be done without a kernel side change, since there needs to be a >> replacement for the 8-bit vector information conveyed through the >> kernel writes to the IO-APIC redirection table entries. However, up to >> 256 interrupt sources could easily be handled without kernel side >> change, by making PHYSDEVOP_alloc_irq_vector return a fake vector >> (rather than the unmodified irq that got passed in). > >Jan, > Are you sure it worked well with more than 256 pins before my IRQ >changes ? You may refer to dom0''s code(linux-2.6.18.8-xen.hg) and >the gsi irq number shouldn''t be bigger than 256 in any time. You can >say this is an old issue we need to address through modifying some >ABIs between dom0 and hypervisor rather than saying this issue is >introduced by the IRQ handling changes, I think. That is to say, Xen >never works with the big system with more than 256 pins of ioapic at >any time.I can''t say anything about the 2.6.18 tree, but I''m certain it worked with newer Dom0-s, e.g. the forward ported 2.6.27-based ones. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Sep-22 09:53 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 11:14 >>>
>On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> If it wasn't broken before, it was probably broken by Xiantao's follow-up
>>> patch to fix NetBSD dom0 (at least as much as possible, to avoid a nasty
>>> regression with NetBSD). What we probably need to do is have a 256-entry
>>> dom0_vector_to_dom0_irq[] array, and allocate an entry from it for every
>>> fresh irq we see at PHYSDEVOP_alloc_irq_vector. Then, when the vector gets
>>> passed back in on ioapic writes, we index into that array rather than using
>>> the naked rte.vector.
>>
>> Yeah, that's what I would view as the solution to get the old functionality
>> back. But my question also extended to possible solutions for getting beyond
>> 256 here, especially ones that are also acceptable to the pv-ops Dom0,
>> which I'm much less certain about.
>
>Well, we could assume that if we see an irq greater than 256 at
>PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in that
>case the 'vector' we return, which then gets passed to ioapic_write, is
>rather redundant. We can work out the GSI from the ioapic RTE that is being
>modified, after all. So perhaps we could just flip into a non-legacy mode of
>operation in that case (perhaps reserve a single magic 'vector' value to
>return from PHYSDEVOP_alloc_irq_vector, and if we see that value in the
>ioapic write handler then we know to assume that guest pirq = gsi).

I wouldn't base this on the passed-in IRQ number, but instead on the number of IRQs mapped: if the proposed array no longer has a spare slot, switch to passing back the magic vector (and assume pirq == irq in ioapic_guest_write() - we could even add a check that this relation holds for all previously enabled IRQs, and fail PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't use only GSIs before). But besides that detail your idea sounds fine, at least from a Linux perspective.

Are you planning on getting this done, or should I?

>The legacy case is just there to handle NetBSD, which throws non-GSIs at the
>PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have worked with
>those mega-big systems at any time. So I don't think we're dealing with a
>regression there.

Hmm, no, the regression is with (newer?) Linux, which should have been capable of dealing with interrupts coming in on IO-APIC pins above 256 before that change.

Jan
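For contrast, a sketch of Jan's exhaustion-based variant; all names are invented, and his "Dom0 used only GSIs before" check is reduced here to a simple identity scan of the table.

#define NR_SLOTS     255        /* one 8-bit value kept back as magic */
#define MAGIC_VECTOR NR_SLOTS   /* 0xff: "assume pirq == irq" marker */
#define IRQ_UNBOUND  (-1)

static int vector_to_irq[NR_SLOTS];

static void init_slots(void)
{
    int v;
    for (v = 0; v < NR_SLOTS; v++)
        vector_to_irq[v] = IRQ_UNBOUND;
}

static int alloc_irq_vector_sketch(int irq)
{
    int v;

    /* Prefer slot == irq, so pirq == irq keeps holding for a dom0
     * that talks in GSIs (the Linux case). */
    if (irq >= 0 && irq < NR_SLOTS && vector_to_irq[irq] == IRQ_UNBOUND)
        vector_to_irq[irq] = irq;

    for (v = 0; v < NR_SLOTS; v++) {
        if (vector_to_irq[v] == irq)
            return v;                    /* existing or fresh slot */
        if (vector_to_irq[v] == IRQ_UNBOUND) {
            vector_to_irq[v] = irq;      /* any spare slot will do */
            return v;
        }
    }

    /* No spare slot: the magic vector is only safe if every mapping
     * so far already satisfies pirq == irq, i.e. the dom0 has been
     * talking in GSIs all along; otherwise fail the call. */
    for (v = 0; v < NR_SLOTS; v++)
        if (vector_to_irq[v] != v)
            return -1;
    return MAGIC_VECTOR;
}

With this shape, a Linux dom0 that only ever passes GSIs degrades cleanly into the magic-vector mode once the table fills, while a NetBSD-style dom0 that passed non-GSIs earlier fails the final scan instead of being given a wrong assumption.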
Zhang, Xiantao
2009-Sep-22 10:10 UTC
RE: [Xen-devel] another regression from IRQ handling changes
Jan Beulich wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 11:14 >>>
>> On 22/09/2009 09:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>>
>>>> If it wasn't broken before, it was probably broken by Xiantao's
>>>> follow-up patch to fix NetBSD dom0 (at least as much as possible,
>>>> to avoid a nasty regression with NetBSD). What we probably need to
>>>> do is have a 256-entry dom0_vector_to_dom0_irq[] array, and
>>>> allocate an entry from it for every fresh irq we see at
>>>> PHYSDEVOP_alloc_irq_vector. Then, when the vector gets passed back
>>>> in on ioapic writes, we index into that array rather than using
>>>> the naked rte.vector.
>>>
>>> Yeah, that's what I would view as the solution to get the old
>>> functionality back. But my question also extended to possible
>>> solutions for getting beyond 256 here, especially ones that are also
>>> acceptable to the pv-ops Dom0, which I'm much less certain about.
>>
>> Well, we could assume that if we see an irq greater than 256 at
>> PHYSDEVOP_alloc_irq_vector then the dom0 is dealing in GSIs, and in
>> that case the 'vector' we return, which then gets passed to
>> ioapic_write, is rather redundant. We can work out the GSI from the
>> ioapic RTE that is being modified, after all. So perhaps we could
>> just flip into a non-legacy mode of operation in that case (perhaps
>> reserve a single magic 'vector' value to return from
>> PHYSDEVOP_alloc_irq_vector, and if we see that value in the ioapic
>> write handler then we know to assume that guest pirq = gsi).
>
> I wouldn't base this on the passed-in IRQ number, but instead on the
> number of IRQs mapped: if the proposed array no longer has a spare
> slot, switch to passing back the magic vector (and assume pirq == irq
> in ioapic_guest_write() - we could even add a check that this relation
> holds for all previously enabled IRQs, and fail
> PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't
> use only GSIs before). But besides that detail your idea sounds fine,
> at least from a Linux perspective.
>
> Are you planning on getting this done, or should I?

There's no need to handle it in such a complex way. I think the following patch should fix the potential issue. For a Linux dom0, nothing should depend on the value of rte.vector at all when the GSI irq > 256; just make pirq equal to irq and then build the pirq-to-irq mapping.

diff -r 3a71e070e3c5 xen/arch/x86/io_apic.c
--- a/xen/arch/x86/io_apic.c    Fri Sep 18 14:45:40 2009 +0100
+++ b/xen/arch/x86/io_apic.c    Tue Sep 22 18:03:50 2009 +0800
@@ -2195,9 +2195,13 @@ int ioapic_guest_write(unsigned long phy
     /* Since PHYSDEVOP_alloc_irq_vector is dummy, rte.vector is the pirq
        which corresponds to this ioapic pin, retrieve it for building
-       pirq and irq mapping.
-     */
-    pirq = rte.vector;
+       pirq and irq mapping, and this is only for NetBSD dom0. For Linux
+       dom0, pirq == irq at any time.
+     */
+    if (irq >= 256)
+        pirq = irq;
+    else
+        pirq = rte.vector; /* Make NetBSD dom0 work. */
     if(pirq < 0 || pirq >= dom0->nr_pirqs)
         return -EINVAL;

>
>> The legacy case is just there to handle NetBSD, which throws non-GSIs
>> at the PHYSDEVOP_alloc_irq_vector interface, and I doubt it will have
>> worked with those mega-big systems at any time. So I don't think
>> we're dealing with a regression there.
>
> Hmm, no, the regression is with (newer?) Linux, which should have been
> capable of dealing with interrupts coming in on IO-APIC pins above 256
> before that change.
Keir Fraser
2009-Sep-22 10:30 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:

> I wouldn't base this on the passed-in IRQ number, but instead on the
> number of IRQs mapped: if the proposed array no longer has a spare
> slot, switch to passing back the magic vector (and assume pirq == irq
> in ioapic_guest_write() - we could even add a check that this relation
> holds for all previously enabled IRQs, and fail
> PHYSDEVOP_alloc_irq_vector if the array got exhausted and Dom0 didn't
> use only GSIs before). But besides that detail your idea sounds fine,
> at least from a Linux perspective.
>
> Are you planning on getting this done, or should I?

Feel free to go ahead. Unfortunately I guess neither of us can properly test this case? :-(

 -- Keir
Keir Fraser
2009-Sep-22 10:32 UTC
Re: [Xen-devel] another regression from IRQ handling changes
On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:

> There's no need to handle it in such a complex way. I think the following
> patch should fix the potential issue. For a Linux dom0, nothing should
> depend on the value of rte.vector at all when the GSI irq > 256; just make
> pirq equal to irq and then build the pirq-to-irq mapping.

This would be acceptable to me. What do you think, Jan?

 -- Keir
Jan Beulich
2009-Sep-22 11:33 UTC
Re: [Xen-devel] another regression from IRQ handling changes
>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 12:32 >>>
>On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
>
>> There's no need to handle it in such a complex way. I think the following
>> patch should fix the potential issue. For a Linux dom0, nothing should
>> depend on the value of rte.vector at all when the GSI irq > 256; just make
>> pirq equal to irq and then build the pirq-to-irq mapping.
>
>This would be acceptable to me. What do you think, Jan?

Indeed, that should do it (we'll supposedly get this tested later this year). Thanks!

Jan
Zhang, Xiantao
2009-Sep-22 11:54 UTC
RE: [Xen-devel] another regression from IRQ handling changes
Attached is the patch with the Signed-off-by and the description. Thanks for the useful discussion on this topic!
Xiantao

-----Original Message-----
From: Jan Beulich [mailto:JBeulich@novell.com]
Sent: Tuesday, September 22, 2009 7:34 PM
To: Keir Fraser; Zhang, Xiantao
Cc: xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] another regression from IRQ handling changes

>>> Keir Fraser <keir.fraser@eu.citrix.com> 22.09.09 12:32 >>>
>On 22/09/2009 11:10, "Zhang, Xiantao" <xiantao.zhang@intel.com> wrote:
>
>> There's no need to handle it in such a complex way. I think the following
>> patch should fix the potential issue. For a Linux dom0, nothing should
>> depend on the value of rte.vector at all when the GSI irq > 256; just make
>> pirq equal to irq and then build the pirq-to-irq mapping.
>
>This would be acceptable to me. What do you think, Jan?

Indeed, that should do it (we'll supposedly get this tested later this year). Thanks!

Jan