On large systems with Dom0 booting with (significantly) more than 32 vCPU-s, we have had multiple reports that the now-by-default-enabled C-state management is causing soft lockups, usually preventing the boot from completing.

The observations are:

Reducing the number of vCPU-s (or pCPU-s) sufficiently makes the systems work.

max_cstate=0 makes the systems work.

max_cstate=1 makes the problem less severe on one (bigger) system, and eliminates it completely on another (smaller) one.

When the systems appear to hang, all vCPU-s are in Dom0's timer_interrupt(), and all (sometimes all but one) are attempting to acquire xtime_lock. However, due to our use of ticket locks we can verify that this is not a deadlock (repeatedly sending '0' shows forward progress, as the tickets [visible on the stack] continue to increase). Additionally, there is always one vCPU that has its polling event channel (used for waking the next waiting vCPU when a lock becomes available) signaled.

In one case (but not in the other) it is always the same vCPU that apparently takes very long to wake up from the polling request. This may be coincidence, but output after sending 'c' also indicates a significantly higher (about 3 times) usage value for C2 than the second highest one; the duration printed is roughly the same for all CPUs.

While I don't know this code well, it would seem that we're suffering from extremely long wakeup times. This suggests that there is likely a (performance) problem even for smaller numbers of vCPU-s. Hence, unless it can be fixed before 4.0 releases, I would suggest disabling C-state management by default again.

I can provide full logs if needed.

Jan
On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote:

> While I don't know this code well, it would seem that we're suffering
> from extremely long wakeup times. This suggests that there is likely
> a (performance) problem even for smaller numbers of vCPU-s.
> Hence, unless it can be fixed before 4.0 releases, I would suggest
> disabling C-state management by default again.

It's already enabled by default in 3.4 too, though. Are there problems there as well?

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:11 >>>
> On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> While I don't know this code well, it would seem that we're suffering
>> from extremely long wakeup times. This suggests that there is likely
>> a (performance) problem even for smaller numbers of vCPU-s.
>> Hence, unless it can be fixed before 4.0 releases, I would suggest
>> disabling C-state management by default again.
>
> It's already enabled by default in 3.4 too, though. Are there problems
> there as well?

We never shipped 3.4.x for anything that would normally be run on large systems. And 3.4 doesn't support more than 32 vCPU-s per domain, hence the problem - even if present - would be hidden.

Jan
On 21/01/2010 10:16, "Jan Beulich" <JBeulich@novell.com> wrote:

>>> While I don't know this code well, it would seem that we're suffering
>>> from extremely long wakeup times. This suggests that there is likely
>>> a (performance) problem even for smaller numbers of vCPU-s.
>>> Hence, unless it can be fixed before 4.0 releases, I would suggest
>>> disabling C-state management by default again.
>>
>> It's already enabled by default in 3.4 too, though. Are there problems
>> there as well?
>
> We never shipped 3.4.x for anything that would normally be run on
> large systems. And 3.4 doesn't support more than 32 vCPU-s per
> domain, hence the problem - even if present - would be hidden.

Hm, okay. Another thing to consider is not using ticket locks when running on Xen. Regardless of slow wakeup times (and you can't really depend on them being fast, in general, in a virtualised environment), experiments have shown that, as you might expect, contended ticket locks perform *abysmally* in virtualised environments.

Turning off our major source of power-management wins by default in Xen 4.0 is not really a goer.

 -- Keir
Jan Beulich wrote:

> Reducing the number of vCPU-s (or pCPU-s) sufficiently makes
> the systems work.

How many vCPU-s or pCPU-s at most can the systems work with, in your experience?

> max_cstate=1 makes the problem less severe on one (bigger) system,
> and eliminates it completely on another (smaller) one.

How much bigger/smaller? How many vCPU-s and pCPU-s on each?

> I can provide full logs if needed.

Please provide the full logs.

Jimmy
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:26 >>>
> Hm, okay. Another thing to consider is not using ticket locks when
> running on Xen. Regardless of slow wakeup times (and you can't really
> depend on them being fast, in general, in a virtualised environment),
> experiments have shown that, as you might expect, contended ticket
> locks perform *abysmally* in virtualised environments.

But that's exactly why we use the polling method: that way we don't need to wake up all potential waiters, but just the one that's going to get the lock next. We now even go further and detect whether the lock owner is not currently running, and go polling right away (instead of spinning on the lock a few hundred times first).

> Turning off our major source of power-management wins by default in
> Xen 4.0 is not really a goer.

I can see your point. But how can you consider shipping with something apparently severely broken? As said before, the fact that this manifests itself by hanging many-vCPU Dom0-s has the very likely implication that there are (so far unnoticed) problems with smaller Dom0-s as well. If I had a machine at hand that supports C3, I'd try to do some measurements with smaller domains...

Jan
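[For readers unfamiliar with the mechanism Jan describes: the slow path of these pv ticket locks blocks in Xen on a per-vCPU event channel via the poll hypercall, and the releasing vCPU notifies exactly the next ticket holder. The following is only a rough sketch under assumed names (spin_wait_slow, poll_evtchn, waiting_lock/waiting_ticket); it is not the actual kernel code.]

/* Rough sketch, not the actual kernel implementation: a pv ticket-lock
 * slow path that blocks on a per-vCPU event channel via SCHEDOP_poll. */
static DEFINE_PER_CPU(evtchn_port_t, poll_evtchn);
static DEFINE_PER_CPU(raw_spinlock_t *, waiting_lock);   /* assumed */
static DEFINE_PER_CPU(unsigned int, waiting_ticket);     /* assumed */

static void spin_wait_slow(raw_spinlock_t *lock, unsigned int ticket)
{
	struct sched_poll poll;
	evtchn_port_t port = __get_cpu_var(poll_evtchn);

	/* Advertise which lock/ticket this vCPU waits for, so the current
	 * owner can notify exactly the next waiter when it releases. */
	__get_cpu_var(waiting_lock) = lock;
	__get_cpu_var(waiting_ticket) = ticket;

	set_xen_guest_handle(poll.ports, &port);
	poll.nr_ports = 1;
	poll.timeout = 0;	/* block until the port is signaled */

	/* Block in Xen; returns once the releasing vCPU signals our port. */
	HYPERVISOR_sched_op(SCHEDOP_poll, &poll);
}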
On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:

> But that's exactly why we use the polling method: that way we don't
> need to wake up all potential waiters, but just the one that's going to
> get the lock next. We now even go further and detect whether the
> lock owner is not currently running, and go polling right away (instead
> of spinning on the lock a few hundred times first).

Ah, these are pv spinlocks?

> I can see your point. But how can you consider shipping with something
> apparently severely broken? As said before, the fact that this manifests
> itself by hanging many-vCPU Dom0-s has the very likely implication that
> there are (so far unnoticed) problems with smaller Dom0-s as well. If I
> had a machine at hand that supports C3, I'd try to do some measurements
> with smaller domains...

Well, it's a fallback I guess. If we can't make progress on solving it then I suppose I agree.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
> On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> But that's exactly why we use the polling method: that way we don't
>> need to wake up all potential waiters, but just the one that's going to
>> get the lock next. We now even go further and detect whether the
>> lock owner is not currently running, and go polling right away (instead
>> of spinning on the lock a few hundred times first).
>
> Ah, these are pv spinlocks?

Yes.

Jan
Hi Jan,

Could you try the following debugging patch? It should help narrow down the root cause:

diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -228,7 +228,6 @@ static void acpi_processor_idle(void)
 
     cpufreq_dbs_timer_suspend();
 
-    sched_tick_suspend();
     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();

Regards
Ke

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Jan Beulich
> Sent: Thursday, January 21, 2010 5:52 PM
> To: xen-devel@lists.xensource.com
> Subject: [Xen-devel] cpuidle causing Dom0 soft lockups
>
> On large systems with Dom0 booting with (significantly) more than
> 32 vCPU-s, we have had multiple reports that the now-by-default-enabled
> C-state management is causing soft lockups, usually preventing
> the boot from completing.
>
> The observations are:
>
> Reducing the number of vCPU-s (or pCPU-s) sufficiently makes
> the systems work.
>
> max_cstate=0 makes the systems work.
>
> max_cstate=1 makes the problem less severe on one (bigger) system,
> and eliminates it completely on another (smaller) one.
>
> When the systems appear to hang, all vCPU-s are in Dom0's
> timer_interrupt(), and all (sometimes all but one) are attempting to
> acquire xtime_lock. However, due to our use of ticket locks we can
> verify that this is not a deadlock (repeatedly sending '0' shows
> forward progress, as the tickets [visible on the stack] continue to
> increase). Additionally, there is always one vCPU that has its polling
> event channel (used for waking the next waiting vCPU when a lock
> becomes available) signaled.
>
> In one case (but not in the other) it is always the same vCPU that
> apparently takes very long to wake up from the polling request.
> This may be coincidence, but output after sending 'c' also indicates
> a significantly higher (about 3 times) usage value for C2 than the
> second highest one; the duration printed is roughly the same for
> all CPUs.
>
> While I don't know this code well, it would seem that we're suffering
> from extremely long wakeup times. This suggests that there is likely
> a (performance) problem even for smaller numbers of vCPU-s.
> Hence, unless it can be fixed before 4.0 releases, I would suggest
> disabling C-state management by default again.
>
> I can provide full logs if needed.
>
> Jan
>>> "Yu, Ke" <ke.yu@intel.com> 21.01.10 13:07 >>> >Could you try the following debugging patch. it can help to narrow down the root cause:According to our testing this did not have any effect. I added you to the Cc list of the respective bug, so you have access to the relevant logs.>diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c >--- a/xen/arch/x86/acpi/cpu_idle.c >+++ b/xen/arch/x86/acpi/cpu_idle.c >@@ -228,7 +228,6 @@ static void acpi_processor_idle(void) > > cpufreq_dbs_timer_suspend(); > >- sched_tick_suspend(); > /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */ > process_pending_softirqs();Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
> On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>> I can see your point. But how can you consider shipping with something
>> apparently severely broken? As said before, the fact that this manifests
>> itself by hanging many-vCPU Dom0-s has the very likely implication that
>> there are (so far unnoticed) problems with smaller Dom0-s as well. If I
>> had a machine at hand that supports C3, I'd try to do some measurements
>> with smaller domains...
>
> Well, it's a fallback I guess. If we can't make progress on solving it
> then I suppose I agree.

Just fyi, we have now also seen an issue on a 24-CPU system that went away with cpuidle=0 (and static analysis of the hang hinted in that direction). All I can judge so far is that this likely has something to do with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s not waking up from the call despite there being pending unmasked or polled-for events).

Jan
Jan Beulich wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
>> Well, it's a fallback I guess. If we can't make progress on solving it
>> then I suppose I agree.
>
> Just fyi, we have now also seen an issue on a 24-CPU system that went
> away with cpuidle=0 (and static analysis of the hang hinted in that
> direction). All I can judge so far is that this likely has something to do
> with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
> not waking up from the call despite there being pending unmasked or
> polled-for events).

Interesting. I see this problem on a 4-core system. Can I help investigate?

Data of my machine (Fujitsu TX300-S5):

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping        : 5
...

Juergen
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: Tuesday, February 02, 2010 3:55 PM
> To: Keir Fraser; Yu, Ke
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
>> On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>>> I can see your point. But how can you consider shipping with something
>>> apparently severely broken? As said before, the fact that this manifests
>>> itself by hanging many-vCPU Dom0-s has the very likely implication that
>>> there are (so far unnoticed) problems with smaller Dom0-s as well. If I
>>> had a machine at hand that supports C3, I'd try to do some measurements
>>> with smaller domains...
>>
>> Well, it's a fallback I guess. If we can't make progress on solving it
>> then I suppose I agree.
>
> Just fyi, we have now also seen an issue on a 24-CPU system that went
> away with cpuidle=0 (and static analysis of the hang hinted in that
> direction). All I can judge so far is that this likely has something to do
> with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
> not waking up from the call despite there being pending unmasked or
> polled-for events).
>
> Jan

Hi Jan,

We have just identified the cause of this issue, and are trying to find an appropriate way to fix it.

The issue is the result of the following sequence:

1. Every dom0 vCPU has one 250HZ timer (i.e. 4ms period). The vCPU timer_interrupt handler acquires a global ticket spin lock, xtime_lock. When xtime_lock is held by another vCPU, the vCPU will poll its event channel and become blocked. As a result, the pCPU where the vCPU runs becomes idle. Later, when the lock holder releases xtime_lock, it notifies the event channel to wake up the vCPU. As a result, the pCPU wakes up from its idle state and schedules the vCPU to run.

From the above, we can see that the latency of the vCPU timer interrupt consists of the following items. "Latency" here means the time between beginning to acquire the lock and finally acquiring it:

T1 - CPU execution time (e.g. timer interrupt lock holding time, event channel notification time)
T2 - CPU idle wakeup time, i.e. the time for the CPU to wake up from a deep C state (e.g. C3) to C0; usually this is on the order of several 10us or 100us

2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and 64 vCPUs in dom0, and assume the lock holding sequence is VCPU0 -> VCPU1 -> VCPU2 ... -> VCPU63. Then vCPU63 will spend 64*(T1 + T2) to acquire xtime_lock. If T1+T2 is 100us, the total latency would be ~6ms. As we know, the timer is 250HZ, i.e. a 4ms period, so when the event channel notification is issued and the pCPU schedules vCPU63, the hypervisor will find the timer is overdue and will send another TIMER_VIRQ for vCPU63 (see schedule()->vcpu_periodic_timer_work() for details). In this case, vCPU63 will always be busy handling timer interrupts and not be able to update the watchdog, thus causing the soft lockup.

So from the above sequence, we can see:
- The cpuidle driver adds extra latency, which makes this issue easier to trigger.
- A large number of CPUs multiplies the latency.
- Ticket spin locks lead to a fixed lock acquisition order, so the latency is repeatedly 64*(T1+T2), again making this issue easier to trigger.

And the fundamental cause of this issue is that the vCPU timer interrupt handler does not scale, due to the global xtime_lock.

From the cpuidle point of view, one thing we are trying to do is change the cpuidle driver to not enter deep C states when there is a vCPU with local irqs disabled and polling an event channel. In this case, the T2 latency would be eliminated.

Anyway, cpuidle is just one side; we can anticipate that if the CPU number is large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another way is to make dom0 scale well by not using xtime_lock, although this is pretty hard currently. Or another way is to limit the dom0 vCPU number to a certain reasonable level.

Regards
Ke
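[To spell out the arithmetic in Ke's example - the 100us handoff cost is the assumption from the mail above, not a measurement:]

  per-vCPU lock handoff:   T1 + T2 ~ 100 us            (assumed)
  full ticket cycle:       64 * (T1 + T2) ~ 6.4 ms
  dom0 timer period:       1 / 250 Hz = 4 ms

Since 6.4 ms > 4 ms, by the time the last vCPU in the ticket queue finally acquires xtime_lock its next tick is already overdue, so another TIMER_VIRQ is pending immediately and that vCPU never leaves timer_interrupt() long enough to touch the soft-lockup watchdog.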
Hi Jan,

Could you please try the attached patch? This patch tries to avoid entering deep C states when there is a vCPU with local irqs disabled and polling an event channel. When tested on my 64-CPU box, the issue is gone with this patch.

Best Regards
Ke

> -----Original Message-----
> From: Yu, Ke
> Sent: Wednesday, February 03, 2010 1:07 AM
> To: Jan Beulich; Keir Fraser
> Cc: xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
> Hi Jan,
>
> We have just identified the cause of this issue, and are trying to find
> an appropriate way to fix it.
>
> The issue is the result of the following sequence:
>
> 1. Every dom0 vCPU has one 250HZ timer (i.e. 4ms period). The vCPU
> timer_interrupt handler acquires a global ticket spin lock, xtime_lock.
> When xtime_lock is held by another vCPU, the vCPU will poll its event
> channel and become blocked. As a result, the pCPU where the vCPU runs
> becomes idle. Later, when the lock holder releases xtime_lock, it
> notifies the event channel to wake up the vCPU. As a result, the pCPU
> wakes up from its idle state and schedules the vCPU to run.
>
> From the above, we can see that the latency of the vCPU timer interrupt
> consists of the following items. "Latency" here means the time between
> beginning to acquire the lock and finally acquiring it:
>
> T1 - CPU execution time (e.g. timer interrupt lock holding time, event
> channel notification time)
> T2 - CPU idle wakeup time, i.e. the time for the CPU to wake up from a
> deep C state (e.g. C3) to C0; usually this is on the order of several
> 10us or 100us
>
> 2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and
> 64 vCPUs in dom0, and assume the lock holding sequence is VCPU0 ->
> VCPU1 -> VCPU2 ... -> VCPU63. Then vCPU63 will spend 64*(T1 + T2) to
> acquire xtime_lock. If T1+T2 is 100us, the total latency would be ~6ms.
> As we know, the timer is 250HZ, i.e. a 4ms period, so when the event
> channel notification is issued and the pCPU schedules vCPU63, the
> hypervisor will find the timer is overdue and will send another
> TIMER_VIRQ for vCPU63 (see schedule()->vcpu_periodic_timer_work() for
> details). In this case, vCPU63 will always be busy handling timer
> interrupts and not be able to update the watchdog, thus causing the
> soft lockup.
>
> So from the above sequence, we can see:
> - The cpuidle driver adds extra latency, which makes this issue easier
>   to trigger.
> - A large number of CPUs multiplies the latency.
> - Ticket spin locks lead to a fixed lock acquisition order, so the
>   latency is repeatedly 64*(T1+T2), again making this issue easier to
>   trigger.
>
> And the fundamental cause of this issue is that the vCPU timer interrupt
> handler does not scale, due to the global xtime_lock.
>
> From the cpuidle point of view, one thing we are trying to do is change
> the cpuidle driver to not enter deep C states when there is a vCPU with
> local irqs disabled and polling an event channel. In this case, the T2
> latency would be eliminated.
>
> Anyway, cpuidle is just one side; we can anticipate that if the CPU
> number is large enough that NR_CPU * T1 > 4ms, this issue will occur
> again. So another way is to make dom0 scale well by not using
> xtime_lock, although this is pretty hard currently. Or another way is to
> limit the dom0 vCPU number to a certain reasonable level.
>
> Regards
> Ke
>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>Just fyi, we now also have seen an issue on a 24-CPU system that went >>away with cpuidle=0 (and static analysis of the hang hinted in that >>direction). All I can judge so far is that this likely has something to do >>with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s >>not waking up from the call despite there being pending unmasked or >>polled for events). > >We just identified the cause of this issue, and is trying to find appropriate way to fix it.Hmm, while I agree that the scenario you describe can be a problem, I don''t think it can explain the behavior on the 24-CPU system pointed out above, nor the one Juergen Gross pointed out yesterday. Nor can it explain why this happens at boot time (when you can take it for granted that several/most of the CPUs are idle [and hence would have their periodic timer stopped]). Also I would think that the rate at which xtime_lock is being acquired may not be the highest one in the entire system, and hence problems may continue to result even if we fixed timer_interrupt().>Anyway, cpuidle is just one side, we can anticipate that if CPU number is large enough to lead NR_CPU * T1 > 4ms, this issue will occurs again. So another way is to make dom0 scaling well by not using xtime_lock, although this is pretty hard currently. Or another way is to limit dom0 vCPU number to certain reasonable level.I would not think that dealing with the xtime_lock scalability issue in timer_interrupt() should be *that* difficult. In particular it should be possibly to assign an on-duty CPU (permanent or on a round-robin basis) that deals with updating jiffies/wallclock, and all other CPUs just update their local clocks. I had thought about this before, but never found a strong need to experiment with that. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>> >Could you please try the attached patch. this patch try to avoid entering deep C state when there is vCPU local irq disabled, and polling event channel. When tested in my 64 CPU box, this issue is gone with this patch.We could try it, but I''m not convinced of the approach. Why is the urgent determination dependent upon event delivery being disabled on the respective vCPU? If at all, it should imo be polling *or* event delivery disabled, not *and*. Also, iterating over all vCPU-s in that function doesn''t seem very scalable. It would seem more reasonable for the scheduler to track how many "urgent" vCPU-s a pCPU currently has. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> From: Jan Beulich
> Sent: 3 February 2010 18:16
>
>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>> Just fyi, we have now also seen an issue on a 24-CPU system that went
>>> away with cpuidle=0 (and static analysis of the hang hinted in that
>>> direction). All I can judge so far is that this likely has something to do
>>> with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>>> not waking up from the call despite there being pending unmasked or
>>> polled-for events).
>>
>> We have just identified the cause of this issue, and are trying to find
>> an appropriate way to fix it.
>
> Hmm, while I agree that the scenario you describe can be a problem, I
> don't think it can explain the behavior on the 24-CPU system pointed
> out above, nor the one Juergen Gross pointed out yesterday.

Is the 24-CPU system observed to hang at boot time with the same likelihood as the 64-CPU system, or less frequently? Ke just did some theoretical analysis by assuming some values. There could be other factors adding to the latency, and each system may have different characteristics too. We can't draw a conclusion about whether a smaller system will face the same issue simply by changing the CPU number in Ke's formula. :-) Possibly you can provide cpuidle information on your 24-core system for further comparison.

> Nor can it explain why this happens at boot time (when you can take it
> for granted that several/most of the CPUs are idle [and hence would
> have their periodic timer stopped]).

Well, it's true that several/most of the CPUs are idle at boot time, and when they are blocked in Xen, the periodic timer is stopped. However, that doesn't mean those 'idle' CPUs block into Xen and never wake up during the whole boot process (here we are talking about dozens of seconds). There are random events to wake blocked vCPUs up: random process/thread creations, device interrupts (based on affinity), per-CPU kernel threads that wake up periodically, etc. Most importantly, an idle vCPU hooks a singleshot timer when it blocks into Xen, and that singleshot timer is calculated as the minimum of the nearest timer expiration and the soft lockup threshold.

Based on those assumptions, although a dom0 vCPU may be mostly idle during the whole boot process, it will still wake up occasionally. Once a blocked vCPU is woken up, Xen checks whether its next periodic timer interrupt should be delivered. If the vCPU has been blocked for more than one dom0 tick (4ms in this specific SLES case), a virtual timer will be pending before resuming to the guest. So every dom0 vCPU still enters the timer interrupt to acquire xtime_lock intermittently.

I'm not sure how likely it is for those idle vCPUs to wake up at the same time. If there's no special treatment to interleave vCPU timer interrupts, they may be injected close together, depending on the vCPU boot time points. Anyway, given a situation where some idle vCPUs begin to contend with each other, and several of them exceed the busy-wait threshold and block into Xen, the latency from deep Cx adds the chance for more vCPUs to contend for the same xtime_lock in the same loop. Once many vCPUs contend at the same time, and a full cycle from the first ticket to the last ticket exceeds the 4ms expected by dom0, a new timer interrupt will be injected again. Then no vCPU can escape from that contending loop.

> Also I would think that the rate at which xtime_lock is being acquired
> may not be the highest one in the entire system, and hence problems
> may continue to result even if we fixed timer_interrupt().

Although the rate of acquiring xtime_lock is unlikely to be the highest one, it's a big kernel lock acquired on every CPU. Ticket spinlock ordering plus cpuidle latency may rendezvous more and more vCPUs in the same contending loop, once the latency for a vCPU waiting on a specific ticket starts to accumulate. As Ke said, cpuidle just exaggerated the spinlock issue in virtualization.

Consider another case, where the system is very busy and the physical CPUs are overcommitted. Even when a vCPU is woken up, it may still sit in the runqueue for dozens of milliseconds. Similarly, vCPUs holding larger tickets have to wait for a long time. Applied to xtime_lock, you may still observe softlockup warnings then. Possibly the real solution is to not have dom0 with a large number of vCPUs.

Thanks,
Kevin

> Anyway, cpuidle is just one side; we can anticipate that if the CPU
> number is large enough that NR_CPU * T1 > 4ms, this issue will occur
> again. So another way is to make dom0 scale well by not using
> xtime_lock, although this is pretty hard currently. Or another way is to
> limit the dom0 vCPU number to a certain reasonable level.
>
> I would not think that dealing with the xtime_lock scalability issue in
> timer_interrupt() should be *that* difficult. In particular it should be
> possible to assign an on-duty CPU (permanent or on a round-robin
> basis) that deals with updating jiffies/wallclock, while all other CPUs
> just update their local clocks. I had thought about this before, but
> never found a strong need to experiment with it.
>
> Jan
Tian, Kevin wrote:
>> From: Jan Beulich
>> Sent: 3 February 2010 18:16
>>
>>> Hmm, while I agree that the scenario you describe can be a problem, I
>>> don't think it can explain the behavior on the 24-CPU system pointed
>>> out above, nor the one Juergen Gross pointed out yesterday.
>>
>> Is the 24-CPU system observed to hang at boot time with the same
>> likelihood as the 64-CPU system, or less frequently? Ke just did some
>> theoretical analysis by assuming some values. There could be other
>> factors adding to the latency, and each system may have different
>> characteristics too. We can't draw a conclusion about whether a smaller
>> system will face the same issue simply by changing the CPU number in
>> Ke's formula. :-) Possibly you can provide cpuidle information on your
>> 24-core system for further comparison.

My 4-core system hangs _always_. For minutes. If I press any key on the console it will resume booting with soft lockup messages (all cpus were in xen_safe_halt). Sometimes another hang occurs, sometimes the system will come up without further hangs.

Juergen
>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>> > Possibly the real solution is to not have dom0 with large virtual vCPUs.But you realize that this is a Dom0-only issue only as long as DomU-s with more vCPU-s cannot be created? I.e. it''ll become an issue affecting any kind of guest as soon as that limitation gets out of the way. So I don''t view this as a solution - it''s at best a workaround until a solution can be found. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Kevin has explained the details of your comments, so I will just pick some points for further explanation.

>> I would not think that dealing with the xtime_lock scalability issue in
>> timer_interrupt() should be *that* difficult. In particular it should be
>> possible to assign an on-duty CPU (permanent or on a round-robin
>> basis) that deals with updating jiffies/wallclock, while all other CPUs
>> just update their local clocks. I had thought about this before, but
>> never found a strong need to experiment with it.
>>
>> Jan

This is good. Eliminating a global lock is always good practice for scalability, especially as there will be more and more CPUs in the future. I would expect this to be the best solution to the softlockup issue. And if the global xtime_lock can be eliminated, the cpuidle patch may not be needed anymore.

>> Could you please try the attached patch? This patch tries to avoid
>> entering deep C states when there is a vCPU with local irqs disabled
>> and polling an event channel. When tested on my 64-CPU box, the issue
>> is gone with this patch.
>
> We could try it, but I'm not convinced of the approach. Why does the
> urgency determination depend upon event delivery being disabled
> on the respective vCPU? If at all, it should imo be polling *or* event
> delivery disabled, not *and*.

The rationale of this patch is: disabling vCPU local irqs usually means the vCPU has an urgent task to finish ASAP and doesn't want to be interrupted. As a first-step patch, I was a bit conservative and combined them with *and*. Once it is verified to work, I can extend this hint to *or*, as long as the *or* does not include unwanted cases that hurt the power saving significantly.

> Also, iterating over all vCPU-s in that function doesn't seem very
> scalable. It would seem more reasonable for the scheduler to track
> how many "urgent" vCPU-s a pCPU currently has.

Yes, we can do this optimization. The current patch is just for quick verification purposes.

Regards
Ke
> From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com]
> Sent: 3 February 2010 20:19
>
> Tian, Kevin wrote:
>>> Hmm, while I agree that the scenario you describe can be a problem, I
>>> don't think it can explain the behavior on the 24-CPU system pointed
>>> out above, nor the one Juergen Gross pointed out yesterday.
>>
>> Is the 24-CPU system observed to hang at boot time with the same
>> likelihood as the 64-CPU system, or less frequently? Ke just did some
>> theoretical analysis by assuming some values. There could be other
>> factors adding to the latency, and each system may have different
>> characteristics too. We can't draw a conclusion about whether a smaller
>> system will face the same issue simply by changing the CPU number in
>> Ke's formula. :-) Possibly you can provide cpuidle information on your
>> 24-core system for further comparison.
>
> My 4-core system hangs _always_. For minutes. If I press any key on the
> console it will resume booting with soft lockup messages (all cpus were
> in xen_safe_halt). Sometimes another hang occurs, sometimes the system
> will come up without further hangs.
>
> Juergen

Interesting. Did you also observe that the hang disappeared when disabling cpuidle? Your case really looks like some missed-event scenario, in which the key press just kicks the CPU alive...

Thanks,
Kevin
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: 3 February 2010 21:21
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>>
>> Possibly the real solution is to not have dom0 with a large number of
>> vCPUs.
>
> But you realize that this is a Dom0-only issue only as long as DomU-s
> with more vCPU-s cannot be created? I.e. it'll become an issue affecting
> any kind of guest as soon as that limitation gets out of the way. So I
> don't view this as a solution - it's at best a workaround until a solution
> can be found.

Sure, but then it's all about how spinlocks themselves should be implemented in a virtualization environment, or how xtime_lock should be used in this specific case, to allow scaling up to large vCPU numbers. There's always a limit to how far it can scale without triggering the softlockup warning. We agree that cpuidle made it worse inadvertently, and have come up with a patch to recover it back to the level of the original limitation. And we just realized that the current ticket pv spinlock may still hit such a limit once the system becomes hot and the holder of xtime_lock may stay in the runqueue for a relatively long time, even without cpuidle.

Thanks,
Kevin
Tian, Kevin wrote:
>> From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com]
>> Sent: 3 February 2010 20:19
>>
>> My 4-core system hangs _always_. For minutes. If I press any key on the
>> console it will resume booting with soft lockup messages (all cpus were
>> in xen_safe_halt). Sometimes another hang occurs, sometimes the system
>> will come up without further hangs.
>
> Interesting. Did you also observe that the hang disappeared when
> disabling cpuidle? Your case really looks like some missed-event
> scenario, in which the key press just kicks the CPU alive...

Yes, cpuidle=0 made the problem disappear.

Juergen
Yes, this patch works for us too. So a non-hacky version of it would be appreciated.

I also meanwhile tried out the idea of reducing the contention on xtime_lock (attached for reference). Things appear to work fine, but there is an obvious problem (with - thus far to me - no obvious explanation) with it: the number of timer interrupts on CPUs not on duty to run do_timer() and the like is increasing significantly, with spikes of over 100,000 per second. I'm investigating this, but of course any idea any of you might have about what could be causing this would be very welcome.

Jan

>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>
> Hi Jan,
>
> Could you please try the attached patch? This patch tries to avoid
> entering deep C states when there is a vCPU with local irqs disabled
> and polling an event channel. When tested on my 64-CPU box, the issue
> is gone with this patch.
>
> Best Regards
> Ke
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: 5 February 2010 16:49
>
> Yes, this patch works for us too. So a non-hacky version of it would be
> appreciated.
>
> I also meanwhile tried out the idea of reducing the contention on
> xtime_lock (attached for reference). Things appear to work fine, but
> there is an obvious problem (with - thus far to me - no obvious
> explanation) with it: the number of timer interrupts on CPUs not on
> duty to run do_timer() and the like is increasing significantly, with
> spikes of over 100,000 per second. I'm investigating this, but of course
> any idea any of you might have about what could be causing this would be
> very welcome.

Forgive my poor English. From your patch, only the CPU on duty will invoke do_timer to update the global timestamp. Why, in your test, is it the CPUs 'not on duty' that invoke do_timer at such high frequency? I may be reading it wrong. :-(

Thanks,
Kevin
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Sent: 2010年2月5日 16:49 >> >>Yes, this patch works for us too. So a non-hacky version of it would be >>appreciated. >> >>I also meanwhile tried out the idea to reduce the contention on >>xtime_lock (attached for reference). Things appear to work fine, but >>there is an obvious problem (with - thus far to me - no obvious >>explanation) with it: The number of timer interrupts on CPUs not on >>duty to run do_timer() and alike is increasing significantly, >>with spikes >>of over 100,000 per second. I''m investigating this, but of course any >>idea anyone of you might have what could be causing this would be >>very welcome. >> > >forgive my poor english. From your patch, only cpu on duty will invoke >do_timer to update global timestamp. Why in your test it''s CPUs ''not >on duty'' to have high frequent do_timer? I may read it wrong. :-(If you look at the patch, I added extra statistics for those timer interrupts that occur when a CPU is "on duty" (recorded as IRQ0, which is otherwise unused) and when not "on duty" (recorded as MCEs, since those hopefully(!!!) won''t occur either, and in no case at a high rate).>From that I know that the rate of interrupts (not the rate of do_timer()invocations) is much higher on not-on-duty CPUs, but is roughly as without the patch for the on-duty one. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: Friday, February 05, 2010 4:49 PM
> To: Yu, Ke
> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
> Yes, this patch works for us too. So a non-hacky version of it would be
> appreciated.

That is good. I will polish the patch.

Regards
Ke
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: 5 February 2010 17:14
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>>
>> Forgive my poor English. From your patch, only the CPU on duty will
>> invoke do_timer to update the global timestamp. Why, in your test, is it
>> the CPUs 'not on duty' that invoke do_timer at such high frequency? I may
>> be reading it wrong. :-(
>
> If you look at the patch, I added extra statistics for those timer
> interrupts that occur when a CPU is "on duty" (recorded as IRQ0,
> which is otherwise unused) and when not "on duty" (recorded as MCEs,
> since those hopefully(!!!) won't occur either, and in no case at a high
> rate).
>
> From that I know that the rate of interrupts (not the rate of do_timer()
> invocations) is much higher on not-on-duty CPUs, but is roughly as
> without the patch for the on-duty one.

Are the 100,000 per second a sum over all non-duty CPUs, or observed on just one? What is the level without the patch? Your patch is quite straightforward, and at a glance it shouldn't have such a weird issue...

Timer interrupts are caused along two paths. One is the periodic timer while the vCPU is running. As you have HZ=250, you won't get more than 250 interrupts per second from that portion, as Xen accurately drives injections at a 4ms pace. The other is the singleshot timer when the vCPU is entering idle: the periodic timer is stopped in that case, and a singleshot timer is registered with the nearest expiration in the local timer queue. That portion is unpredictable, so possibly your patch increases activity in that part. But I haven't figured out how it could. :-(

Thanks,
Kevin
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>> >Are 100,000 per second a sum-up on all non-duty CPUs or observed on >just one? How about the level without the patch?That''s 100,000 per CPU per second. Normally on an idle system the number is about 25 per CPU per second. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> From: Jan Beulich
> Sent: 5 February 2010 18:37
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>>
>> Are the 100,000 per second a sum over all non-duty CPUs, or observed on
>> just one? What is the level without the patch?
>
> That's 100,000 per CPU per second. Normally on an idle system the
> number is about 25 per CPU per second.

I think I know the reason. With your patch, only the one duty CPU updates the global jiffies, but that duty CPU may be preempted for several ticks. On the other hand, all other non-duty CPUs calculate their singleshot timer expirations based on jiffies (see stop_hz_timer), and there are conditions under which the result is jiffies+1. When the duty CPU is preempted, the jiffies value can fall behind the actual time by several ticks. That means a non-duty CPU may request an old expiry time from Xen, which has already passed and gets answered with an immediately pending timer interrupt. vcpu_block then returns immediately. Since the non-duty CPU is in the idle loop, this repeats and drives your interrupt statistics very high until the duty CPU gets re-scheduled and updates jiffies. Without your patch, jiffies is updated in a timely manner as long as there is any running CPU in dom0.

A possible option is to take jiffies, the system timestamp, and the per-CPU timestamp in stop_hz_timer and generate an up-to-date local value. Or a per-CPU jiffies could be generated in the per-CPU timer interrupt, which is then only used in a per-CPU fashion, as in stop_hz_timer.

Thanks,
Kevin
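[Roughly, the first option Kevin mentions could look like the sketch below; local_clock_ns(), last_jiffies_stamp and NS_PER_TICK are assumed names for timestamps the time code already keeps, not existing identifiers:]

/* Sketch only: derive an up-to-date local jiffies value instead of
 * trusting the global one, which may lag while the duty vCPU is
 * preempted. */
static unsigned long local_jiffies(int cpu)
{
	u64 now = local_clock_ns(cpu);			/* assumed helper */
	u64 stamp = per_cpu(last_jiffies_stamp, cpu);	/* ns at last jiffies update */

	return jiffies + div_u64(now - stamp, NS_PER_TICK);
}

/* stop_hz_timer() would then base the singleshot expiry on
 * local_jiffies(cpu), so a stale global jiffies can no longer yield an
 * already-expired timeout and an immediately re-raised timer interrupt. */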
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>> >Possible options is to take jiffies, system timestamp, and per-cpu >timestamp in stop_hz_timer, to generate a local latest value. Or >a per-cpu jiffies can be generated in per-cpu timer interrupt, which >is only used in per-cpu style like in stop_hz_timer.That was my next thought too - but it doesn''t work (unless I made a blatant mistake). Not only does it not get the interrupt rate down, it even grows it on the duty CPU, and it creates visible brief hangs as well a rushes of time. Next I''ll do is try to detect when the duty CPU is de-scheduled, and move on the duty to one that is scheduled (i.e. one that is currently executing timer_interrupt()). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >Next I''ll do is try to detect when the duty CPU is de-scheduled, and >move on the duty to one that is scheduled (i.e. one that is currently >executing timer_interrupt()).This improves the situation (the highest spike I saw so far was 2,000 interrupts per CPU per second), but doesn''t get it back the way it ought to be (apart from the spikes, as with the original version of the patch, interrupt activity is also generally too high, very erratic, and even during the more quiet periods doesn''t go down to the original level). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: 5 February 2010 22:59
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>>
>> A possible option is to take jiffies, the system timestamp, and the
>> per-CPU timestamp in stop_hz_timer and generate an up-to-date local
>> value. Or a per-CPU jiffies could be generated in the per-CPU timer
>> interrupt, which is then only used in a per-CPU fashion, as in
>> stop_hz_timer.
>
> That was my next thought too - but it doesn't work (unless I made a
> blatant mistake). Not only does it not get the interrupt rate down,
> it even grows it on the duty CPU, and it creates visible brief hangs
> as well as rushes of time.

This really twists my head. Ideally, with the above option the singleshot timer is postponed, and there's really no reason for it to instead push the interrupt rate up... Possibly you can add some tracing on the Xen side to check the singleshot timer registration/expiration pattern. From my perspective the virtual timer pending path is the only suspicious one, so it is worth some checking to understand what's going wrong from another angle.

> Next I'll try to detect when the duty CPU is de-scheduled, and
> move the duty on to one that is scheduled (i.e. one that is currently
> executing timer_interrupt()).

It's still possible that you move the duty to an active CPU while that active CPU is in stop_hz_timer with interrupts disabled. In that case, again no one is able to update jiffies, and again stale jiffies can be observed...

Thanks,
Kevin
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: 5 February 2010 23:52
>
>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>>
>> Next I'll try to detect when the duty CPU is de-scheduled, and
>> move the duty on to one that is scheduled (i.e. one that is currently
>> executing timer_interrupt()).
>
> This improves the situation (the highest spike I saw so far was 2,000
> interrupts per CPU per second), but doesn't get it back to the way it
> ought to be (apart from the spikes, as with the original version of the
> patch, interrupt activity is also generally too high and very erratic,
> and even during the quieter periods doesn't go down to the original
> level).

Could you send out your new patch? In the meantime, tweaking the singleshot timer statistics on the Xen side would be helpful, as I said earlier. :-)

Thanks,
Kevin
Jan,

The attached is the updated patch. It has two changes:
- change the logic from local irq disabled *and* poll event to local irq disabled *or* poll event
- use a per-CPU vCPU list to iterate over the vCPUs, which is more scalable. The original scheduler does not provide such a list, so this patch implements the list in the scheduler code.

Regards
Ke

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: Friday, February 05, 2010 4:49 PM
> To: Yu, Ke
> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
> Yes, this patch works for us too. So a non-hacky version of it would be
> appreciated.
>
> I also meanwhile tried out the idea of reducing the contention on
> xtime_lock (attached for reference). Things appear to work fine, but
> there is an obvious problem (with - thus far to me - no obvious
> explanation) with it: the number of timer interrupts on CPUs not on
> duty to run do_timer() and the like is increasing significantly, with
> spikes of over 100,000 per second. I'm investigating this, but of course
> any idea any of you might have about what could be causing this would be
> very welcome.
>
> Jan
>
>>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>
>> Hi Jan,
>>
>> Could you please try the attached patch? This patch tries to avoid
>> entering deep C states when there is a vCPU with local irqs disabled
>> and polling an event channel. When tested on my 64-CPU box, the issue
>> is gone with this patch.
>>
>> Best Regards
>> Ke
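[To illustrate the shape of the described check - this is only a sketch of the logic as described, with assumed names (pcpu_vcpu_list, pcpu_vcpu_elem, max_allowed), not the actual attachment:]

/* Sketch: treat a vCPU as "urgent" if it has event delivery masked or is
 * polling an event channel; if any vCPU assigned to this pCPU is urgent,
 * avoid deep C-states so its wakeup latency stays small. */
static bool_t pcpu_has_urgent_vcpu(unsigned int cpu)
{
    struct vcpu *v;

    /* per-pCPU vCPU list maintained by the scheduler (assumed name) */
    list_for_each_entry ( v, &per_cpu(pcpu_vcpu_list, cpu), pcpu_vcpu_elem )
        if ( vcpu_info(v, evtchn_upcall_mask) ||
             test_bit(v->vcpu_id, v->domain->poll_mask) )
            return 1;

    return 0;
}

/* In acpi_processor_idle(), before selecting a C state: */
    if ( pcpu_has_urgent_vcpu(smp_processor_id()) )
        max_allowed = ACPI_STATE_C1;   /* stay at C1, skip C2/C3 */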
>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:50 >>> >>Next I''ll do is try to detect when the duty CPU is de-scheduled, and >>move on the duty to one that is scheduled (i.e. one that is currently >>executing timer_interrupt()). >> > >It''s still possible that you move duty to one active CPU, while that >active CPU is in stop_hz_timer with interrupt disabled. In that case, >again no one is able to update jiffies and again stale jiffies can be >observed...No, since the first thing a CPU entering stop_hz_timer() does is relinquish the duty of updating jiffies. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:52 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Sent: 2010年2月5日 23:52 >> >>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >>>Next I''ll do is try to detect when the duty CPU is de-scheduled, and >>>move on the duty to one that is scheduled (i.e. one that is currently >>>executing timer_interrupt()). >> >>This improves the situation (the highest spike I saw so far was 2,000 >>interrupts per CPU per second), but doesn''t get it back the way it >>ought to be (apart from the spikes, as with the original version of the >>patch, interrupt activity is also generally too high, very erratic, and >>even during the more quiet periods doesn''t go down to the original >>level). >> > >could you send out your new patch? in same time, tweaking singleshotAttached. After another refinement (in stop_hz_timer()) I didn''t see spikes above 1,000 interrupts per CPU per second anymore. But it''s still far from being as quiescent as without the patch. What''s also interesting is that there''s an initial period (a minute or so) where the interrupt rate is really stable (though still not as low as previously), and only then it starts becoming erratic.>timer stat from Xen side would be helpful as I said earlier. :-)Didn''t get to do that yet. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >The attached is the updated patch, it has two changes >- change the logic from local irq disabled *and* poll event to local irq disabled *or* poll eventThanks.>- Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The original scheduler does not provide such kind of list, so this patch implement the list in scheduler code.I''m still not really happy with that solution. I''d rather say that e.g. vcpu_sleep_nosync() should set a flag in the vcpu structure indicating whether that one is "urgent", and the scheduler should just maintain a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU is put to sleep guarantees that it won''t be mis-treated if it got woken by the time acpi_processor_idle() looks at it (or at least the window would be minimal - not sure if it can be eliminated completely). Plus not having to traverse a list is certainly better for scalability, not the least since you''re traversing a list (necessarily) including sleeping vCPU-s (i.e. the ones that shouldn''t affect the performance/ responsiveness of the system). But in the end it would certainly depend much more on Keir''s view on it than on mine... Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 08/02/2010 09:08, "Jan Beulich" <JBeulich@novell.com> wrote:
> I'm still not really happy with that solution. I'd rather say that e.g. vcpu_sleep_nosync() should set a flag in the vcpu structure indicating whether that one is "urgent", and the scheduler should just maintain a counter of "urgent" vCPU-s per pCPU.
> [...]
> But in the end it would certainly depend much more on Keir's view on it than on mine...

Your suggestion makes sense to me.

 K.
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: 2010年2月8日 16:46
>
>Attached. After another refinement (in stop_hz_timer()) I didn't see spikes above 1,000 interrupts per CPU per second anymore. But it's still far from being as quiescent as without the patch.

Would you mind elaborating on what the refinement is and how it may reduce the spikes? Understanding those variants may help draw the big picture of the whole issue.

>What's also interesting is that there's an initial period (a minute or so) where the interrupt rate is really stable (though still not as low as previously), and only then it starts becoming erratic.

What is the average interrupt rate in the 'stable' and 'erratic' cases? Is it close to the spike (~1,000)?

>>timer stats on the Xen side would be helpful, as I said earlier. :-)
>
>Didn't get to do that yet.

The stats would be helpful, given that you can get an actual single-shot timer trace instead of just speculating from inside dom0. At the same time, you could possibly pin the dom0 vCPUs as a simplified case.

BTW, with your current patch there could still be the possibility of several vCPUs contending for xtime_lock at the same time: the current duty vCPU may be preempted in the ISR, and another non-duty vCPU will then note that it is not in RUNSTATE_running and designate itself to take over the duty. This may not be a big issue compared to the original always-contending style, but I raise it here; please make sure it's actually what you intended.

Thanks,
Kevin
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: 2010年2月8日 17:08
>
>I'm still not really happy with that solution. I'd rather say that e.g. vcpu_sleep_nosync() should set a flag in the vcpu structure indicating whether that one is "urgent", and the scheduler should just maintain a counter of "urgent" vCPU-s per pCPU.
>[...]
>But in the end it would certainly depend much more on Keir's view on it than on mine...

Yes, that's a good point. It was actually the first choice when Ke tried to implement it, but he gave up later because of the difficulty of maintaining the counter at the appropriate entry/exit points; introducing a new 'urgent' flag would make that easier. Another reason to use the per-CPU vcpu list was its potential reusability in other scenarios, but that argument doesn't look strong now. Anyway, a new patch along the lines of your suggestion is in progress.

Thanks,
Kevin
>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Attached. After another refinement (in stop_hz_timer()) I didn''t see >>spikes above 1,000 interrupts per CPU per second anymore. But it''s >>still far from being as quiescent as without the patch. > >Would you mind elaborating what''s refinement and how that may >reduce spikes? Understand those variants may help draw big picture >about whole issue.This was the last hunk in stop_hz_timer() (forcing timeout_abs_ns to never be in the past relative to the local accumulated time).>>What''s also interesting is that there''s an initial period (a >>minute or so) >>where the interrupt rate is really stable (though still not as low as >>previously), and only then it starts becoming erratic. > >What is average interrupt rate for ''stable'' and ''erratic'' case? Is it >close to spike (~1000)?No, during the stable period the rate is less than 50; erratic has an average rate (not counting the spikes) of about 125.>BTW, with your current patch there could be still possibility for >several vCPUs to contend for xtime_lock at same time. Current duty >vCPU may be preempted in ISR, and then other non-duty vCPU >will note it not in RUNSTATE_running and then designate itself >to take new duty. This may not be big issue, compared to original >always-contending style. But just raise it here and please make >sure it''s actually what''s desired by you.Yes, I realize that. It''s a tradeoff (and would probably need further refinement, if only the whole thing would first work as desired). And it''s getting us back to the fact that pv guests should be allowed to avoid being preempted for short periods of time. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>> >This stat would be helpful, given that you can get actual singleshot >timer trace instead of just speculating from dom0 inside.Finally got to that. While without the patch the number of single-shot events is less than that of periodic ones, it is much bigger with. Both kernel (relative to extrapolated local time) and hypervisor (relative to extrapolated system time) agree that many of them get scheduled with a time (up to about 6ms) in the past. Which means to me that the per-CPU processed_system_time isn''t being maintained accurately, and I think this is a result of how it is being updated in timer_interrupt(): Other than with the global processed_system_time, the per-CPU one may not get increased even if delta_cpu was close to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably this was not a problem so far because no code outside of timer_interrupt() read this value - Keir, since I think you wrote that logic originally, any word on this? While I think this should be changed (so that the value can be used in stop_hz_timer() for a second adjustment similar to the one done with jiffies, setting the timeout to last timer interrupt + 1 tick), I meanwhile thought of a different approach to the original problem: Instead of dedicating a CPU to have the duty of maintaining global time, just limit the number of CPUs that may concurrently try to acquire xtime_lock in timer_interrupt(). While for now I set the threshold to __fls(nr_cpu_ids), even setting it to 1 gives pretty good results - no unduly high interrupt rates anymore, and only very infrequent (one every half hour or so) single-shot events that are determined to have a timeout more than 1ms in the past. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 11/02/2010 14:44, "Jan Beulich" <JBeulich@novell.com> wrote:
> Unlike the global processed_system_time, the per-CPU one may not get increased even if delta_cpu was close to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably this was not a problem so far because no code outside of timer_interrupt() read this value - Keir, since I think you wrote that logic originally, any word on this?

What you say is true; clearly it is currently implemented that way almost by design. I'm lost in the intricacies of your current discussion though, so I'm not sure exactly why it's a problem, and how we should fix it?

Thanks,
Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 11.02.10 18:01 >>>
>What you say is true; clearly it is currently implemented that way almost by design. I'm lost in the intricacies of your current discussion though, so I'm not sure exactly why it's a problem, and how we should fix it?

First of all, I don't think anything necessarily needs to be fixed in the 2.6.18 tree, as that one will never support >32 vCPU-s, and I don't think the scalability issue we're talking about here is of concern there.

The problem we're trying to address is the contention on xtime_lock. It is clear that there generally is no need for all CPUs in the system to try to update the global time variables, so some filtering on the number of CPUs concurrently trying to acquire xtime_lock is reasonable. With any filtering done, there is however the potential for a CPU to see its local processed time ahead of the global one; and while setting a single-shot timer in the past (or very near future) would guarantee that it executes timer_interrupt() (almost) right away, it does not guarantee that it would then be among the CPUs that try to acquire xtime_lock (i.e. the situation wouldn't necessarily have improved after the interrupt was handled, and hence an interrupt storm is possible).

Consequently, along with capping the timeout to be set in stop_hz_timer() to jiffies+1, the timeout would also reasonably be capped to per_cpu(processed_system_time, cpu) + NS_PER_TICK. This in turn only makes sense if the per-CPU processed time is accurate (i.e. within NS_PER_TICK of when the last timer interrupt occurred). That however doesn't hold: due to the stolen/blocked calculations subtracting exact nanosecond values from delta_cpu, but only adding tick-granular values to the per-CPU processed_system_time, the error can accumulate up to a little less than 3*NS_PER_TICK.

The proposed change would be to do only a single adjustment to the per-CPU processed_system_time (using the originally calculated delta_cpu value). What I couldn't convince myself of so far is that this wouldn't influence the stolen/blocked accounting (since the delta_cpu calculated on the next timer interrupt would now necessarily be different from the one calculated with the current logic) - in particular, the adjustments commented with "clamp local-time progress" would appear to get used more frequently with the contemplated change.

Jan
Hi Jan,

The attached is the updated patch per your suggestion. Generally, this patch uses the per-CPU urgent vCPU count to indicate whether the CPU should enter a deep C-state. It introduces a per-vCPU urgent flag and updates the urgent vCPU count when the vCPU state changes. Could you please take a look. Thanks

Regards
Ke

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: Monday, February 08, 2010 5:08 PM
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>I'm still not really happy with that solution. I'd rather say that e.g. vcpu_sleep_nosync() should set a flag in the vcpu structure indicating whether that one is "urgent", and the scheduler should just maintain a counter of "urgent" vCPU-s per pCPU.
>[...]
I'm going to do a -rc3 today, I think. Shall I roll this one in?

 -- Keir

On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:
> Hi Jan,
>
> The attached is the updated patch per your suggestion. Generally, this patch uses the per-CPU urgent vCPU count to indicate whether the CPU should enter a deep C-state. It introduces a per-vCPU urgent flag and updates the urgent vCPU count when the vCPU state changes. Could you please take a look. Thanks
> [...]
Attached is a better version of your patch (I think). I haven't applied it because I don't see why the ASSERT() in sched_credit.c is correct. How do you know for sure that !v->is_urgent there (and can therefore avoid the urgent_count manipulation)?

 -- Keir

On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:
> Hi Jan,
>
> The attached is the updated patch per your suggestion. Generally, this patch uses the per-CPU urgent vCPU count to indicate whether the CPU should enter a deep C-state. It introduces a per-vCPU urgent flag and updates the urgent vCPU count when the vCPU state changes. Could you please take a look. Thanks
> [...]
Thanks for the refinement. As for the ASSERT, the reason is that this is a runnable vCPU, and a runnable vCPU should be non-urgent. Consider a vCPU changing from RUNSTATE_blocked/RUNSTATE_offline to RUNSTATE_runnable via vcpu_wake: vcpu_wake will call vcpu_runstate_change and in turn vcpu_urgent_count_update, and v->is_urgent will be updated accordingly. vcpu_wake is protected by the scheduler lock, so this is atomic.

Best Regards
Ke

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Tuesday, February 16, 2010 1:34 AM
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>Attached is a better version of your patch (I think). I haven't applied it because I don't see why the ASSERT() in sched_credit.c is correct. How do you know for sure that !v->is_urgent there (and can therefore avoid the urgent_count manipulation)?
>[...]
>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>Attached is a better version of your patch (I think). I haven't applied it because I don't see why the ASSERT() in sched_credit.c is correct. How do you know for sure that !v->is_urgent there (and can therefore avoid the urgent_count manipulation)?

Two remarks: For one, your patch doesn't consider vCPU-s with event delivery disabled urgent anymore. Second, here

>+    /*
>+     * Transfer urgency status to new CPU before switching CPUs, as once
>+     * the switch occurs, v->is_urgent is no longer protected by the per-CPU
>+     * scheduler lock we are holding.
>+     */
>+    if ( unlikely(v->is_urgent) )
>+    {
>+        atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
>+        atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
>+    }

I would think we should either avoid the atomic ops altogether if old_cpu == new_cpu, or switch the updating order (inc before dec).

As to your other question - yes, I'd certainly like to see this included in -rc3.

Jan
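Applied to the quoted hunk, the two remarks would give something like the following; this is only a sketch of the suggestion, not the version that was eventually committed.

/* Skip the atomics entirely when the vCPU stays on the same pCPU, and on a
 * real move raise the new pCPU's count before dropping the old one, so that
 * no pCPU transiently sees a zero count while an urgent vCPU is in flight. */
if ( unlikely(v->is_urgent) && (old_cpu != new_cpu) )
{
    atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
    atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
}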
>Two remarks: For one, your patch doesn't consider vCPU-s with event delivery disabled urgent anymore.

Oh, sorry that I made this change without giving the reason. When a vCPU is blocked with event delivery disabled, it is either a guest CPU that is offline or a guest CPU polling on an event channel. An offlined guest CPU should not be treated as an urgent vCPU, so we only need to track the event-channel polling case. This is the reason why I simplified the logic to treat only a vCPU polling on an event channel as urgent.

>I would think we should either avoid the atomic ops altogether if old_cpu == new_cpu, or switch the updating order (inc before dec).

Do you mean that when old_cpu == new_cpu and urgent_count == 1, the current approach (dec before inc) has a small time window (after the dec, before the inc) in which urgent_count == 0, and may thus mislead the cpuidle driver? If that is the case, I am fine with it, and prefer switching the updating order.

Regards
Ke
>>> "Yu, Ke" <ke.yu@intel.com> 16.02.10 14:12 >>> >>Two remarks: For one, your patch doesn''t consider vCPU-s with event >>delivery disabled urgent anymore. > >Oh, sorry that I made this change without telling the reason. When vCPU is blocked with event delivery disabled, it is either guest CPU offline or guest CPU polling on event channel. Offlined guest CPU should not be treated as urgent vCPU, so we only need to track the event channel polling case. this is the reason why I simplify the logic to only treat vCPU polling on event channel as urgent vCPU.Ah, yes, that makes sense. But it also makes me think about the concept of the change as a whole again: A vCPU polling for an event doesn''t really indicate whether it is urgent. It just so happens that polling can be used from the spin lock path. There could be other uses of polling that would not warrant keeping the underlying pCPU from entering deep C states. Perhaps this should rather be based on a guest hint (passed either with the poll hypercall or associated permanently with an event channel) then?>>I would think we should either avoid the atomic ops altogether if >>old_cpu == new_cpu, or switch the updating order (inc before dec). > >Do you mean when old_cpu == new_cpu, and if urgent_count == 1, current approach (dec before inc) has small time window (after dec, before inc) that urgent_count==0, thus may mislead couidle driver. if this is the case, I am fine with it and prefer to switching the updating order.Yes, that was my point. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>--- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000
>+++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000
>@@ -1060,6 +1060,7 @@
>             /* We got a candidate. Grab it! */
>             CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
>             CSCHED_STAT_CRANK(migrate_queued);
>+            ASSERT(!vc->is_urgent);
>             __runq_remove(speer);
>             vc->processor = cpu;
>             return speer;

What is this assertion motivated by? I.e. why shouldn't an urgent vCPU be getting here? We're seeing this (a BUG_ON() in the checked-in version) trigger...

Jan
On 22/02/2010 13:29, "Jan Beulich" <JBeulich@novell.com> wrote:
> What is this assertion motivated by? I.e. why shouldn't an urgent vCPU be getting here? We're seeing this (a BUG_ON() in the checked-in version) trigger...

The author's assertion was that vc must be runnable, and hence cannot be polling, and hence cannot be is_urgent. It sounded dodgy to me, hence I upgraded it to a BUG_ON(), but I couldn't actually eyeball a way in which vc->is_urgent could be true at that point in the code. We have the lock on the peer cpu's schedule_lock, so is_urgent cannot change under our feet, and vc is not running, so it cannot be added to the domain's poll_mask under our feet.

 -- Keir
>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
>Sent: Monday, February 22, 2010 9:45 PM
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>The author's assertion was that vc must be runnable, and hence cannot be polling, and hence cannot be is_urgent. It sounded dodgy to me, hence I upgraded it to a BUG_ON(), but I couldn't actually eyeball a way in which vc->is_urgent could be true at that point in the code. We have the lock on the peer cpu's schedule_lock, so is_urgent cannot change under our feet, and vc is not running, so it cannot be added to the domain's poll_mask under our feet.
>
> -- Keir

Right. According to the code, there should be no way to reach this BUG_ON(). If it triggers, that reveals either a bug in the code or the need to add code to migrate the urgent vCPU count. Do you have more information on how this BUG_ON() fires?

Regards
Ke
>>> "Yu, Ke" <ke.yu@intel.com> 23.02.10 10:32 >>> >>The author''s assertion was that vc must be runnable and hence cannot be >>polling and hence cannot be is_urgent. It sounded dodgy to me hence I >>upgarded it to a BUG_ON(), but couldn''t actually eyeball a way in which >>vc->is_urgent could be true at that point in the code. We have the lock on >>the peer cpu''s schedule_lock, so is_urgent cannot change under our feet, and >>vc is not running so it cannot be added to the domain''s poll_mask under our >>feet.>Right. According to the code, there should be no way to this BUG_ON. >If it happens, that reveal either bugs of code or the necessity of >adding code to migrate urgent vcpu count. Do you have more >information on how this BUG_ON happens?Obviously there are vCPU-s that get inserted on a run queue with is_urgent set (which according to my reading of Keir''s description shouldn''t happen). In particular, this --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c BUG_ON( __vcpu_on_runq(svc) ); BUG_ON( cpu != svc->vcpu->processor ); +WARN_ON(svc->vcpu->is_urgent);//temp list_for_each( iter, runq ) { triggers occasionally: (XEN) Xen call trace: (XEN) [<ffff82c48011a1e8>] csched_vcpu_wake+0x1e8/0x200 (XEN) [<ffff82c48011dde2>] vcpu_wake+0x152/0x3a0 (XEN) [<ffff82c48014b6cd>] vcpu_kick+0x1d/0x80 (XEN) [<ffff82c480108465>] evtchn_set_pending+0x145/0x1d0 (XEN) [<ffff82c4801087e8>] evtchn_send+0xa8/0x150 (XEN) [<ffff82c480108de2>] do_event_channel_op+0x552/0x10d0 (XEN) [<ffff82c4801f3b61>] do_iret+0xc1/0x1a0 (XEN) [<ffff82c4801ef169>] syscall_enter+0xa9/0xae As I understand it, a vCPU with is_urgent set being on any run queue is enough for the questionable BUG_ON() to eventually trigger. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:
> Obviously there are vCPU-s that get inserted on a run queue with is_urgent set (which, according to my reading of Keir's description, shouldn't happen).

Is it possible for a polling VCPU to become runnable without being cleared from poll_mask? I suspect maybe that is the problem, and it either needs dealing with, or the proper handling needs to be added to sched_credit.c.

 -- Keir
>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>>
>Is it possible for a polling VCPU to become runnable without being cleared from poll_mask? I suspect maybe that is the problem, and it either needs dealing with, or the proper handling needs to be added to sched_credit.c.

I don't think that's the case, at least not exclusively. Using

--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
 
     BUG_ON( __vcpu_on_runq(svc) );
     BUG_ON( cpu != svc->vcpu->processor );
+WARN_ON(svc->vcpu->is_urgent);//temp
 
     list_for_each( iter, runq )
     {
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change(
     ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
 
     vcpu_urgent_count_update(v);
+WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp
 
     trace_runstate_change(v, new_state);

I get pairs of warnings (i.e. each for the same vCPU):

(XEN) Xen WARN at schedule.c:142
(XEN) Xen call trace:
(XEN)    [<ffff82c48011c8d5>] schedule+0x375/0x510
(XEN)    [<ffff82c48011deb8>] __do_softirq+0x58/0x80
(XEN)    [<ffff82c4801e61e6>] process_softirqs+0x6/0x10

(XEN) Xen WARN at sched_credit.c:204
(XEN) Xen call trace:
(XEN)    [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0
(XEN)    [<ffff82c4801497f2>] update_runstate_area+0x102/0x110
(XEN)    [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390
(XEN)    [<ffff82c48014b1a0>] context_switch+0x760/0xed0
(XEN)    [<ffff82c48014913d>] vcpu_kick+0x1d/0x80
(XEN)    [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0
(XEN)    [<ffff82c4801083a9>] evtchn_send+0x129/0x150
(XEN)    [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50
(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360
(XEN)    [<ffff82c4801e6169>] syscall_enter+0xa9/0xae

In schedule() this is always "prev" transitioning to RUNSTATE_runnable (i.e. _VPF_blocked not set), yet the second call trace shows that _VPF_blocked must have been set at that point (otherwise vcpu_unblock(), tail-called from vcpu_kick(), would not have called vcpu_wake()). If the order wasn't always that shown, or if the two traces got intermixed, this could hint at a race - but they are always that way, which so far I cannot make sense of.

Jan
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: 2010年2月24日 0:45
>
>In schedule() this is always "prev" transitioning to RUNSTATE_runnable (i.e. _VPF_blocked not set), yet the second call trace shows that _VPF_blocked must have been set at that point (otherwise vcpu_unblock(), tail-called from vcpu_kick(), would not have called vcpu_wake()). If the order wasn't always that shown, or if the two traces got intermixed, this could hint at a race - but they are always that way, which so far I cannot make sense of.

Such a race surely exists, since the two paths are each updating multiple fields which are not all protected by the same scheduler lock. vcpu_unblock manipulates _VPF_blocked without acquiring the scheduler lock. The following sequence is then possible:

enter schedule() with _VPF_blocked set
...
<- vcpu_unblock clears _VPF_blocked; poll_mask hasn't been cleared yet
...
    vcpu_runstate_change(
        prev,
        (test_bit(_VPF_blocked, &prev->pause_flags) ? RUNSTATE_blocked :
         (vcpu_runnable(prev) ? RUNSTATE_runnable : RUNSTATE_offline)),
        now);

RUNSTATE_runnable is then chosen, and vcpu_urgent_count_update will set the is_urgent flag since the poll_mask bit is still set. vcpu_unblock then clears poll_mask and invokes vcpu_wake; vcpu_wake waits for the scheduler lock and will then see the is_urgent flag set for a runnable vCPU.

One necessary check is missed in vcpu_urgent_count_update: only a polling vCPU in the blocked state needs to be considered by cpuidle. If there is any runnable vCPU in a temporary polling state, the idle vCPU won't get scheduled anyway. So is_urgent should only be set when the new runstate is blocked and the vCPU is in poll_mask. As runstate changes always happen under the scheduler lock, this can effectively ensure correct setting/clearing of is_urgent.

Thanks
Kevin
Here is the patch according to Kevin's analysis. Jan, could you give it a try? Thanks

diff -r 55eb480d82ae xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Feb 24 13:24:54 2010 +0800
+++ b/xen/common/schedule.c	Wed Feb 24 13:55:50 2010 +0800
@@ -107,7 +107,8 @@ static inline void vcpu_urgent_count_upd
 
     if ( unlikely(v->is_urgent) )
     {
-        if ( !test_bit(v->vcpu_id, v->domain->poll_mask) )
+        if ( !test_bit(_VPF_blocked, &v->pause_flags) ||
+             !test_bit(v->vcpu_id, v->domain->poll_mask) )
         {
             v->is_urgent = 0;
             atomic_dec(&per_cpu(schedule_data,v->processor).urgent_count);
@@ -115,7 +116,8 @@ static inline void vcpu_urgent_count_upd
     }
     else
     {
-        if ( unlikely(test_bit(v->vcpu_id, v->domain->poll_mask)) )
+        if ( unlikely(test_bit(_VPF_blocked, &v->pause_flags) &&
+                      test_bit(v->vcpu_id, v->domain->poll_mask)) )
         {
             v->is_urgent = 1;
             atomic_inc(&per_cpu(schedule_data,v->processor).urgent_count);

>-----Original Message-----
>From: Tian, Kevin
>Sent: Wednesday, February 24, 2010 11:09 AM
>To: Jan Beulich; Keir Fraser; Yu, Ke
>Cc: xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>Such a race surely exists, since the two paths are each updating multiple fields which are not all protected by the same scheduler lock. vcpu_unblock manipulates _VPF_blocked without acquiring the scheduler lock.
>[...]
>So is_urgent should only be set when the new runstate is blocked and the vCPU is in poll_mask. As runstate changes always happen under the scheduler lock, this can effectively ensure correct setting/clearing of is_urgent.
>
>Thanks
>Kevin
>>> "Yu, Ke" <ke.yu@intel.com> 24.02.10 07:51 >>> >Here is the patch according to Kevin''s analysis. Jan, could you give it a try? ThanksIndeed, all of the warnings are gone with that after running for almost an hour where previously the BUG_ON() (converted to a WARN_ON()) would hit within minutes. Not sure whether Keir needs another formal submission or could push this in right away. Anyway, I believe the original BUG_ON() causing this issue should also be converted to a WARN_ON(). After all, the state checked doesn''t indicate a problem the system cannot survive - all that it indicates is that the urgent vCPU accounting is broken, and hence deep C states may get entered more or less frequently than supposed to. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel