thr3ads.net - Xen devel - [Xen-devel] Dom0 losing interrupts??? [Feb 2011]

Juergen Gross

2011-Feb-14 06:59 UTC

[Xen-devel] Dom0 losing interrupts???

Hi,

while trying to reproduce Andre's cpupool problem I ran into another
issue:

Dom0 seems to lose hardware interrupts when it has more vcpus than pcpus
available. At first I thought this could be due to my cpupool patches, but the
problem can be easily reproduced by pinning all Dom0 vcpus to a few physical
cpus and then doing a parallel build.

I used xen-unstable, kernel 2.6.32.24 from SLES11 SP1 on a 12 core Intel
Nehalem machine. I pinned all 12 Dom0 vcpus to pcpus 1-2 and started a parallel
build. After about 2 minutes the first missing interrupt was reported, the next
one a little later. No Xen messages were printed:

[230644.814834] ata1: lost interrupt (Status 0x50)
[230682.814399] ata1: lost interrupt (Status 0x50)
[230690.814467] ata1: lost interrupt (Status 0x58)
...
[230856.718437] sd 4:2:0:0: [sda] megasas: RESET -843713 cmd=2a retries=0
[230856.739457] megaraid_sas: HBA reset handler invoked without an internal reset condition.
[230856.766435] megasas: [ 0]waiting for 16 commands to complete

Has anyone observed a similar behavior?


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
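
The spacing between the reported lost interrupts can be read off the kernel timestamps in the log excerpt above. A minimal Python sketch, assuming the log keeps the `[seconds.micros] device: lost interrupt (Status 0x..)` shape shown in the message; the regular expression and the names `lost_interrupts`, `gaps` and `SAMPLE` are illustrative assumptions, not part of any Xen or kernel tool.

```python
import re

# Assumed line shape, taken from the excerpt in the message:
#   [230644.814834] ata1: lost interrupt (Status 0x50)
LOST_IRQ = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>\S+): lost interrupt "
    r"\(Status (?P<status>0x[0-9a-fA-F]+)\)"
)


def lost_interrupts(log_text):
    """Return (timestamp_seconds, device, status) for each 'lost interrupt' line."""
    events = []
    for line in log_text.splitlines():
        match = LOST_IRQ.search(line)
        if match:
            events.append(
                (
                    float(match.group("ts")),
                    match.group("dev"),
                    int(match.group("status"), 16),
                )
            )
    return events


def gaps(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    return [round(later[0] - earlier[0], 3) for earlier, later in zip(events, events[1:])]


# The three ata1 lines quoted in the original report.
SAMPLE = """\
[230644.814834] ata1: lost interrupt (Status 0x50)
[230682.814399] ata1: lost interrupt (Status 0x50)
[230690.814467] ata1: lost interrupt (Status 0x58)
"""

events = lost_interrupts(SAMPLE)
print(gaps(events))
```

For the three sample lines the gaps come out at roughly 38 s and 8 s, so the reports are tens of seconds apart rather than back to back.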

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

André Przywara

2011-Feb-14 08:58 UTC

Re: [Xen-devel] Dom0 losing interrupts???

On 14.02.2011 07:59, Juergen Gross wrote:
> Hi,
>
> while trying to reproduce Andre's cpupool problem I ran into another issue:
>
> Dom0 seems to lose hardware interrupts when it has more vcpus than pcpus
> available. First I thought this could be due to my cpupool patches, but the
> problem can be easily reproduced by pinning all Dom0 vcpus to a few physical
> cpus and doing a parallel build then.
>
> I used xen-unstable, kernel 2.6.32.24 from SLES11 SP1 on a 12 core INTEL
> nehalem machine. I pinned all 12 Dom0 vcpus to pcpu 1-2 and started a parallel
> build. After about 2 minutes the first missing interrupts were reported, a
> little bit later the next one, no xen messages are printed:
>
> [230644.814834] ata1: lost interrupt (Status 0x50)
> [230682.814399] ata1: lost interrupt (Status 0x50)
> [230690.814467] ata1: lost interrupt (Status 0x58)
> ...
> [230856.718437] sd 4:2:0:0: [sda] megasas: RESET -843713 cmd=2a retries=0
> [230856.739457] megaraid_sas: HBA reset handler invoked without an internal
> reset condition.
> [230856.766435] megasas: [ 0]waiting for 16 commands to complete
>
> Has anyone observed a similar behavior?

Yes, me again :-)

On the rare occasions where I couldn't trigger the bug (like when using
a restricted Dom0) I observed interrupt problems, which mostly killed
the network connection:

(XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)

I could solve this issue temporarily by taking the network interface down
and bringing it up again, but the box became unstable later.

Setup: hypervisor and tools at c/s 22858, Dom0 at the latest tip of PVOPS
xen/stable-2.6.32.x (2.6.32.27).

Regards,
Andre.



Jan Beulich

2011-Feb-14 09:26 UTC

Re: [Xen-devel] Dom0 losing interrupts???

>>> On 14.02.11 at 07:59, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> I used xen-unstable, kernel 2.6.32.24 from SLES11 SP1 on a 12 core INTEL
> nehalem machine. I pinned all 12 Dom0 vcpus to pcpu 1-2 and started a parallel
> build. After about 2 minutes the first missing interrupts were reported, a
> little bit later the next one, no xen messages are printed:

That's certainly not too surprising, somewhat depending on the
maximally tolerated latencies. It seems unlikely to me for a 6-fold
CPU over-commit to promise stable operation, yet certain
adjustments could probably be made to make it work better (like
temporarily boosting the priority of a hardware interrupt's target
vCPU).

Jan
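
The 6-fold figure follows from 12 vcpus sharing 2 pcpus. As a rough illustration only, the sketch below computes that ratio together with a crude upper bound on how long a runnable vcpu could wait for a pcpu under plain round-robin. The 30 ms timeslice, the even spread of vcpus over pcpus and the function names are assumptions; the model ignores any priority boost a real scheduler may apply on wake-up, so it is not a statement about what Xen actually does in this case.

```python
def overcommit_ratio(vcpus, pcpus):
    """Number of vcpus per physical cpu they are allowed to run on."""
    if pcpus <= 0:
        raise ValueError("need at least one physical cpu")
    return vcpus / pcpus


def round_robin_wait_bound_ms(vcpus, pcpus, timeslice_ms=30.0):
    """Crude upper bound on the wait for a pcpu, in milliseconds.

    Assumes every vcpu is always runnable, vcpus are spread evenly over
    the pcpus, and each one runs for a full timeslice before the next
    gets its turn (plain round-robin, no boosting, no early preemption).
    """
    if pcpus <= 0:
        raise ValueError("need at least one physical cpu")
    vcpus_per_pcpu = -(-vcpus // pcpus)  # ceiling division
    return max(vcpus_per_pcpu - 1, 0) * timeslice_ms


# The configuration from the report: 12 Dom0 vcpus pinned to 2 pcpus.
print(overcommit_ratio(12, 2))
print(round_robin_wait_bound_ms(12, 2))
```

For 12 vcpus on 2 pcpus this gives a ratio of 6.0 and, under the stated assumptions, a bound of 150 ms; with one vcpu per pcpu the bound drops to zero.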




Juergen Gross

2011-Feb-14 09:38 UTC

Re: [Xen-devel] Dom0 losing interrupts???

On 02/14/11 10:26, Jan Beulich wrote:
>>>> On 14.02.11 at 07:59, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
>> I used xen-unstable, kernel 2.6.32.24 from SLES11 SP1 on a 12 core INTEL
>> nehalem machine. I pinned all 12 Dom0 vcpus to pcpu 1-2 and started a
>> parallel build. After about 2 minutes the first missing interrupts were
>> reported, a little bit later the next one, no xen messages are printed:
>
> That's certainly not too surprising, somewhat depending on the
> maximally tolerated latencies. It seems unlikely to me for a 6-fold
> CPU over-commit to promise stable operation, yet certain
> adjustments could probably be done to make it work better (like
> temporarily boosting the priority of a hardware interrupt's target
> vCPU).

I would understand timeouts. But shouldn't the interrupt come in sooner or
later? At least the megasas driver seems not to be able to recover from this
problem; as a result my root filesystem is set to read-only...

This would mean there is a problem in the megasas driver, correct?
And Andre reports stability problems of his machine in similar cases, but
in his case the network driver seems to be the reason.

Are you planning to prepare a patch for boosting the priority of vcpus that
are the target of a hardware interrupt? I think I would have to search for
some time to find the correct places to change...


Juergen


Jan Beulich

2011-Feb-14 09:58 UTC

Re: [Xen-devel] Dom0 losing interrupts???

>>> On 14.02.11 at 10:38, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> On 02/14/11 10:26, Jan Beulich wrote:
>>>>> On 14.02.11 at 07:59, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
>>> I used xen-unstable, kernel 2.6.32.24 from SLES11 SP1 on a 12 core INTEL
>>> nehalem machine. I pinned all 12 Dom0 vcpus to pcpu 1-2 and started a
>>> parallel build. After about 2 minutes the first missing interrupts were
>>> reported, a little bit later the next one, no xen messages are printed:
>>
>> That's certainly not too surprising, somewhat depending on the
>> maximally tolerated latencies. It seems unlikely to me for a 6-fold
>> CPU over-commit to promise stable operation, yet certain
>> adjustments could probably be done to make it work better (like
>> temporarily boosting the priority of a hardware interrupt's target
>> vCPU).
>
> I would understand timeouts. But shouldn't the interrupt come in sooner or
> later? At least the megasas driver seems not to be able to recover from this
> problem, as a result my root filesystem is set to read-only...

I'm sure these interrupts arrive eventually, but the driver not
seeing them within an expected time window may still make it
report them as "lost".

> This would mean there is a problem in the megasas driver, correct?
> And Andre reports stability problems of his machine in similar cases, but
> in his case the network driver seems to be the reason.

Yes, this certainly depends on how the driver is implemented.

> Are you planning to prepare a patch for boosting the priority of vcpus being
> the target for a hardware interrupt? I think I would have to search some
> time to find the correct places to change...

So far I had no plan to do so, and I too would have to do some
looking around. Nor am I convinced everyone would appreciate
such fiddling with priorities - I was merely suggesting that might
be one route to go. George?

Jan
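
The difference between a late interrupt and one reported as "lost" can be shown with a toy model: the driver only knows whether the completion arrived before its own deadline. This is a hypothetical sketch; `classify_completions`, the latencies and the timeout are invented for illustration and are not taken from the ata or megasas drivers.

```python
def classify_completions(latencies_ms, timeout_ms):
    """Split interrupt delivery latencies by the driver's deadline.

    Every latency in the input stands for an interrupt that does arrive.
    The second list holds those that arrive after the deadline, which a
    driver with a fixed timeout would already have reported as lost.
    """
    on_time = [lat for lat in latencies_ms if lat <= timeout_ms]
    reported_lost = [lat for lat in latencies_ms if lat > timeout_ms]
    return on_time, reported_lost


# Invented numbers: four completions, driver gives up after 100 ms.
on_time, reported_lost = classify_completions([5, 40, 120, 900], timeout_ms=100)
print(on_time, reported_lost)
```

In this model nothing is ever dropped, yet two of the four completions would still be logged as lost, which is the situation described above.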



George Dunlap

2011-Feb-14 11:21 UTC

Re: [Xen-devel] Dom0 losing interrupts???

My sense is that:
* Pinning N vcpus to N-M pcpus (where M is a significant fraction of
N) is just a really bad idea; it would be better just not to do that.
It would be ideal if somehow when dom0's cpu pool shrinks, it
automatically offlines an appropriate number of vcpus; but it
shouldn't be difficult for an administrator to do that themselves.
* On average, a vcpu shouldn't have to wait more than 60ms or so for
an interrupt.  It seems like there's a non-negligible possibility that
there's some kind of bug in the interrupt delivery and handling,
either on the Xen side or the Linux side (or as Jan pointed out, a bug
in the driver).  In that case, doing something in the scheduler isn't
actually fixing the problem, it's just making it less likely to
happen.  (NB that we've had intermittent failures in the xen.org
testing infrastructure with what look like they might be missed
interrupts as well -- and those weren't on heavily loaded boxes.)
* Even if it is ultimately a scheduler bug, understanding exactly what
the scheduler is doing and why is key to making a proper fix.  It's
possible that there's just a simple quirk in the algorithm, such that
a general fix will make everything work better without needing to
introduce a special case for hardware interrupts.
* I'm not opposed in principle to a mechanism which will prioritize
vcpus awaiting hardware interrupts.  But I am wary of guessing what
the problem is and then introducing a patch without proper root-cause
analysis.  Even if it seems to fix the immediate problem, it may
simply be masking the real problem, and may also cause problems of its
own.  Behavior of the scheduler is hard enough to understand already,
and every special case makes it even harder.

So to conclude: I think the first answer to someone with this problem
should be, "Make sure that V<=P", where P is the number of physical
cpus a VM can be scheduled on and V is the number of virtual cpus.  If
there are still problems, then we need to find out how it is that
interrupts come to be missing before attempting a fix.

 -George
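
The V<=P rule is easy to check mechanically. A minimal sketch, assuming the administrator knows the domain's vcpu count and the set of pcpus it may be scheduled on; `vcpus_to_offline` is a hypothetical helper written for this example, not an existing Xen tool or API.

```python
def vcpus_to_offline(vcpus, allowed_pcpus):
    """Return how many vcpus must go offline so that V <= P holds.

    vcpus         -- V, the number of online virtual cpus of the domain
    allowed_pcpus -- iterable of physical cpu ids the domain may run on;
                     P is the number of distinct ids
    """
    pcpu_count = len(set(allowed_pcpus))
    if pcpu_count == 0:
        raise ValueError("the domain needs at least one physical cpu")
    return max(vcpus - pcpu_count, 0)


# The reported setup: 12 Dom0 vcpus, all pinned to pcpus 1 and 2.
print(vcpus_to_offline(12, [1, 2]))
```

For the reported setup the helper returns 10, i.e. Dom0 would have to drop to 2 online vcpus to satisfy the rule; it returns 0 whenever V<=P already holds.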


Juergen Gross

2011-Feb-14 11:46 UTC

Re: [Xen-devel] Dom0 losing interrupts???

On 02/14/11 12:21, George Dunlap wrote:
> My sense is that:
> * Pinning N vcpus to N-M pcpus (where M is a significant fraction of
> N) is just a really bad idea; it would be better just not to do that.

I just wanted to make sure the interrupts are not lost due to the cpupool
operation itself. So I tried with an extreme configuration and was proved
right :-)

> It would be ideal if somehow when dom0's cpu pool shrinks, it
> automatically offlines an appropriate number of vcpus; but it
> shouldn't be difficult for an administrator to do that themselves.

I've sent a patch for the cpupool-numa-split case, which will always remove a
significant number of physical cpus from dom0.

> * On average, a vcpu shouldn't have to wait more than 60ms or so for
> an interrupt.  It seems like there's a non-negligible possibility that
> there's some kind of bug in the interrupt delivery and handling,
> either on the Xen side or the Linux side (or as Jan pointed out, a bug
> in the driver).  In that case, doing something in the scheduler isn't
> actually fixing the problem, it's just making it less likely to
> happen.  (NB that we've had intermittent failures in the xen.org
> testing infrastructure with what looks like might be missed interrupts
> as well -- and those weren't on heavily loaded boxes.)

Any idea what I could do to help? Our larger test machines are not just
idling, but I could use one from time to time without much trouble.
It's rather easy for me to reproduce the problem; OTOH it should be easy for
others with a reasonably large machine, too.

> * Even if it is ultimately a scheduler bug, understanding exactly what
> the scheduler is doing and why is key to making a proper fix.  It's
> possible that there's just a simple quirk in the algorithm, such that
> a general fix will make everything work better without needing to
> introduce a special case for hardware interrupts.
> * I'm not opposed in principle to a mechanism which will prioritize
> vcpus awaiting hardware interrupts.  But I am wary of guessing what
> the problem is and then introducing a patch without proper root-cause
> analysis.  Even if it seems to fix the immediate problem, it may
> simply be masking the real problem, and may also cause problems of its
> own.  Behavior of the scheduler is hard enough to understand already,
> and every special case makes it even harder.

I absolutely agree!


Juergen

