On large systems and with Dom0 booting with (significantly) more than 32 vCPU-s we have got multiple reports that the now by default enabled C-state management is causing soft lockups, usually preventing the boot from completing. The observations are: Reducing the number of vCPU-s (or pCPU-s) sufficiently much makes the systems work. max_cstate=0 makes the systems work. max_cstate=1 makes the problem less severe on one (bigger) system, and eliminates it completely on another (smaller) one. When appearing to hang, all vCPU-s are in Dom0''s timer_interrupt(), and all (sometimes all but one) are attempting to acquire xtime_lock. However, due to our use of ticket locks we can verify that this is not a deadlock (repeatedly sending ''0'' shows forward progress, as the tickets [visible on the stack] continue to increase). Additionally, there is always one vCPU that has its polling event channel (used for waking the next waiting vCPU when a lock becomes available) signaled. In one case (but not in the other) it is always the same vCPU that is apparently taking very long to wake up from the polling request. This may be coincidence, but output after sending ''c'' also indicates a significantly higher (about 3 times) usage value for C2 than the second highest one; the duration printed is roughly the same for all CPUs. While I don''t know this code well, it would seem that we''re suffering from extremely long wakeup times. This suggests that there likely is a (performance) problem even for smaller numbers of vCPU-s. Hence, unless it can be fixed before 4.0 releases, I would suggest disabling C-state management by default again. I can provide full logs in case needed. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote:> While I don''t know this code well, it would seem that we''re suffering > from extremely long wakeup times. This suggests that there likely is > a (performance) problem even for smaller numbers of vCPU-s. > Hence, unless it can be fixed before 4.0 releases, I would suggest > disabling C-state management by default again.It''s enabled in 3.4 by feault too though, already. Are there problems there too? -- keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:11 >>> >On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote: > >> While I don''t know this code well, it would seem that we''re suffering >> from extremely long wakeup times. This suggests that there likely is >> a (performance) problem even for smaller numbers of vCPU-s. >> Hence, unless it can be fixed before 4.0 releases, I would suggest >> disabling C-state management by default again. > >It''s enabled in 3.4 by feault too though, already. Are there problems there >too?We never shipped 3.4.x for anything that would normally be run on large systems. And 3.4 doesn''t support more than 32 vCPU-s per domain, hence the problem - even if present - would be hidden. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 21/01/2010 10:16, "Jan Beulich" <JBeulich@novell.com> wrote:>>> While I don''t know this code well, it would seem that we''re suffering >>> from extremely long wakeup times. This suggests that there likely is >>> a (performance) problem even for smaller numbers of vCPU-s. >>> Hence, unless it can be fixed before 4.0 releases, I would suggest >>> disabling C-state management by default again. >> >> It''s enabled in 3.4 by feault too though, already. Are there problems there >> too? > > We never shipped 3.4.x for anything that would normally be run on > large systems. And 3.4 doesn''t support more than 32 vCPU-s per > domain, hence the problem - even if present - would be hidden.Hm, okay. Another thing to consider is not using ticket locks when running on Xen. Regardless of slow wakeup times (and you can''t really depend on them being fast, in general, in a virtualised environment) experiments have shown that, as you might expect, contended ticket locks perform *abysmally* in virtualised environments. Turning off our major source of power-management wins by default in Xen 4.0 is not really a goer. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich wrote:> Reducing the number of vCPU-s (or pCPU-s) sufficiently much makes > the systems work.How many vCPU-s or pCPU-s at most can system work with in your experience?> max_cstate=1 makes the problem less severe on one (bigger) system, > and eliminates it completely on another (smaller) one.How bigger/smaller? vCPU nr & pCPU nr.> I can provide full logs in case needed.Please provide full logs. Jimmy _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:26 >>> >Hm, okay. Another thing to consider is not using ticket locks when running >on Xen. Regardless of slow wakeup times (and you can''t really depend on them >being fast, in general, in a virtualised environment) experiments have shown >that, as you might expect, contended ticket locks perform *abysmally* in >virtualised environments.But that''s exactly why we use the polling method: That way we don''t need to wake up all potential waiters, but just the one that''s going to get the lock next. We now even go further and detect whether the lock owner is not currently running, and go polling right away (instead of spinning on the lock a few hundred times first).>Turning off our major source of power-management wins by default in >Xen 4.0 is not really a goer.I can see your point. But how can you consider shipping with something apparently severely broken. As said before - the fact that this manifests itself by hanging many-vCPU Dom0 has the very likely implication that there are (so far unnoticed) problems with smaller Dom0-s. If I had a machine at hand that supports C3, I''d try to do some measurements with smaller domains... Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:> But that''s exactly why we use the polling method: That way we don''t > need to wake up all potential waiters, but just the one that''s going to > get the lock next. We now even go further and detect whether the > lock owner is not currently running, and go polling right away (instead > of spinning on the lock a few hundred times first).Ah, these are pv spinlocks?> I can see your point. But how can you consider shipping with something > apparently severely broken. As said before - the fact that this manifests > itself by hanging many-vCPU Dom0 has the very likely implication that > there are (so far unnoticed) problems with smaller Dom0-s. If I had a > machine at hand that supports C3, I''d try to do some measurements > with smaller domains...Well it''s a fallback I guess. If we can''t make progress on solving it then I suppose I agree. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>> >On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote: > >> But that''s exactly why we use the polling method: That way we don''t >> need to wake up all potential waiters, but just the one that''s going to >> get the lock next. We now even go further and detect whether the >> lock owner is not currently running, and go polling right away (instead >> of spinning on the lock a few hundred times first). > >Ah, these are pv spinlocks?Yes. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi Jan,
Could you try the following debugging patch. it can help to narrow down the root
cause:
diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -228,7 +228,6 @@ static void acpi_processor_idle(void)
     cpufreq_dbs_timer_suspend();
-    sched_tick_suspend();
     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();
Regards
Ke>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com
>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Jan Beulich
>Sent: Thursday, January 21, 2010 5:52 PM
>To: xen-devel@lists.xensource.com
>Subject: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>On large systems and with Dom0 booting with (significantly) more than
>32 vCPU-s we have got multiple reports that the now by default
>enabled C-state management is causing soft lockups, usually preventing
>the boot from completing.
>
>The observations are:
>
>Reducing the number of vCPU-s (or pCPU-s) sufficiently much makes
>the systems work.
>
>max_cstate=0 makes the systems work.
>
>max_cstate=1 makes the problem less severe on one (bigger) system,
>and eliminates it completely on another (smaller) one.
>
>When appearing to hang, all vCPU-s are in Dom0''s timer_interrupt(),
>and all (sometimes all but one) are attempting to acquire xtime_lock.
>However, due to our use of ticket locks we can verify that this is not
>a deadlock (repeatedly sending ''0'' shows forward progress,
as the
>tickets [visible on the stack] continue to increase). Additionally, there
>is always one vCPU that has its polling event channel (used for
>waking the next waiting vCPU when a lock becomes available)
>signaled.
>
>In one case (but not in the other) it is always the same vCPU that
>is apparently taking very long to wake up from the polling request.
>This may be coincidence, but output after sending ''c'' also
indicates
>a significantly higher (about 3 times) usage value for C2 than the
>second highest one; the duration printed is roughly the same for
>all CPUs.
>
>While I don''t know this code well, it would seem that
we''re suffering
>from extremely long wakeup times. This suggests that there likely is
>a (performance) problem even for smaller numbers of vCPU-s.
>Hence, unless it can be fixed before 4.0 releases, I would suggest
>disabling C-state management by default again.
>
>I can provide full logs in case needed.
>
>Jan
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xensource.com
>http://lists.xensource.com/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 21.01.10 13:07 >>> >Could you try the following debugging patch. it can help to narrow down the root cause:According to our testing this did not have any effect. I added you to the Cc list of the respective bug, so you have access to the relevant logs.>diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c >--- a/xen/arch/x86/acpi/cpu_idle.c >+++ b/xen/arch/x86/acpi/cpu_idle.c >@@ -228,7 +228,6 @@ static void acpi_processor_idle(void) > > cpufreq_dbs_timer_suspend(); > >- sched_tick_suspend(); > /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */ > process_pending_softirqs();Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>> >On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote: >> I can see your point. But how can you consider shipping with something >> apparently severely broken. As said before - the fact that this manifests >> itself by hanging many-vCPU Dom0 has the very likely implication that >> there are (so far unnoticed) problems with smaller Dom0-s. If I had a >> machine at hand that supports C3, I''d try to do some measurements >> with smaller domains... > >Well it''s a fallback I guess. If we can''t make progress on solving it then I >suppose I agree.Just fyi, we now also have seen an issue on a 24-CPU system that went away with cpuidle=0 (and static analysis of the hang hinted in that direction). All I can judge so far is that this likely has something to do with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s not waking up from the call despite there being pending unmasked or polled for events). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich wrote:>>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>> >> On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote: >>> I can see your point. But how can you consider shipping with something >>> apparently severely broken. As said before - the fact that this manifests >>> itself by hanging many-vCPU Dom0 has the very likely implication that >>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a >>> machine at hand that supports C3, I''d try to do some measurements >>> with smaller domains... >> Well it''s a fallback I guess. If we can''t make progress on solving it then I >> suppose I agree. > > Just fyi, we now also have seen an issue on a 24-CPU system that went > away with cpuidle=0 (and static analysis of the hang hinted in that > direction). All I can judge so far is that this likely has something to do > with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s > not waking up from the call despite there being pending unmasked or > polled for events).Interesting. I see this problem on a 4-core system. Can I help investigating? Data of my machine (Fujitsu TX300-S5): # cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping : 5 ... Juergen -- Juergen Gross Principal Developer Operating Systems TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technolgy Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>-----Original Message----- >From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: Tuesday, February 02, 2010 3:55 PM >To: Keir Fraser; Yu, Ke >Cc: xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups > >>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>> >>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote: >>> I can see your point. But how can you consider shipping with something >>> apparently severely broken. As said before - the fact that this manifests >>> itself by hanging many-vCPU Dom0 has the very likely implication that >>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a >>> machine at hand that supports C3, I''d try to do some measurements >>> with smaller domains... >> >>Well it''s a fallback I guess. If we can''t make progress on solving it then I >>suppose I agree. > >Just fyi, we now also have seen an issue on a 24-CPU system that went >away with cpuidle=0 (and static analysis of the hang hinted in that >direction). All I can judge so far is that this likely has something to do >with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s >not waking up from the call despite there being pending unmasked or >polled for events). > >JanHi Jan, We just identified the cause of this issue, and is trying to find appropriate way to fix it. This issue is the result of following sequence: 1. every dom0 vCPU has one 250HZ timer (i.e. 4ms period). The vCPU timer_interrupt handler will acquire a global ticket spin lock xtime_lock. When xtime_lock is hold by other vCPU, the vCPU will poll event channel and become blocked. As a result, the pCPU where the vCPU runs will become idle. Later, when the lock holder release xtime_lock, it will notify event channel to wake up the vCPU. As a result, the pCPU will wake up from idle state, and schedule the vCPU to run.>From the above, we can see the latency of vCPU timer interrupt is consisted of the following items. The "latency" here means the time between beginning to acquire lock and finally lock acquired.T1 - CPU execution time ( e.g. timer interrupt lock holding time, event channel notification time) T2 - CPU idle wake up time, i.e. the time CPU wake up from deep C state (e.g. C3) to C0, usually it is in the order of several 10us or 100us 2. then let''s consider the case of large number of CPUs, e.g. 64 pCPU and 64 VCPU in dom0, let''s assume the lock holding sequence is VCPU0 -> VCPU1->VCPU2 ... ->VCPU63. Then vCPU63 will spend 64*(T1 + T2) to acquire the xtime_lock. if T1+T2 is 100us, then the total latency would be ~6ms. As we have known that the timer is 250HZ, or 4ms period, so when event channel notification issued, and pCPU schedule vCPU63, hypervisor will find the timer is over-due, and will send another TIMER_VIRQ for vCPU63 (see schedule()->vcpu_periodic_timer_work() for detail). In this case, vCPU63 will be always busy handling timer interrupt, and not be able to update the watch dog, thus cause the softlock up. So from the above sequence, we can see: - cpuidle driver add extra latency, thus make this issue more easy to occurs. - Large number of CPU multiply the latency - ticket spin lock lead fixed lock acquiring sequence, thus lead the latency repeatedly being 64*(T1+T2), thus make this issue more easy to occurs. and the fundamental cause of this issue is that vCPU timer interrupt handler is not good for scaling, due to the global xtime_lock.>From cpuidle point of view, one thing we are trying to do is: changing the cpuidle driver to not enter deep C state when there is vCPU with local irq disabled and event channel polling. In this case, the T2 latency will be eliminated.Anyway, cpuidle is just one side, we can anticipate that if CPU number is large enough to lead NR_CPU * T1 > 4ms, this issue will occurs again. So another way is to make dom0 scaling well by not using xtime_lock, although this is pretty hard currently. Or another way is to limit dom0 vCPU number to certain reasonable level. Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi Jan, Could you please try the attached patch. this patch try to avoid entering deep C state when there is vCPU local irq disabled, and polling event channel. When tested in my 64 CPU box, this issue is gone with this patch. Best Regards Ke>-----Original Message----- >From: Yu, Ke >Sent: Wednesday, February 03, 2010 1:07 AM >To: Jan Beulich; Keir Fraser >Cc: xen-devel@lists.xensource.com >Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups > >>-----Original Message----- >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Sent: Tuesday, February 02, 2010 3:55 PM >>To: Keir Fraser; Yu, Ke >>Cc: xen-devel@lists.xensource.com >>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups >> >>>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>> >>>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote: >>>> I can see your point. But how can you consider shipping with something >>>> apparently severely broken. As said before - the fact that this manifests >>>> itself by hanging many-vCPU Dom0 has the very likely implication that >>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a >>>> machine at hand that supports C3, I''d try to do some measurements >>>> with smaller domains... >>> >>>Well it''s a fallback I guess. If we can''t make progress on solving it then I >>>suppose I agree. >> >>Just fyi, we now also have seen an issue on a 24-CPU system that went >>away with cpuidle=0 (and static analysis of the hang hinted in that >>direction). All I can judge so far is that this likely has something to do >>with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s >>not waking up from the call despite there being pending unmasked or >>polled for events). >> >>Jan > >Hi Jan, > >We just identified the cause of this issue, and is trying to find appropriate way >to fix it. > >This issue is the result of following sequence: >1. every dom0 vCPU has one 250HZ timer (i.e. 4ms period). The vCPU >timer_interrupt handler will acquire a global ticket spin lock xtime_lock. >When xtime_lock is hold by other vCPU, the vCPU will poll event channel and >become blocked. As a result, the pCPU where the vCPU runs will become idle. >Later, when the lock holder release xtime_lock, it will notify event channel to >wake up the vCPU. As a result, the pCPU will wake up from idle state, and >schedule the vCPU to run. > >From the above, we can see the latency of vCPU timer interrupt is consisted >of the following items. The "latency" here means the time between beginning >to acquire lock and finally lock acquired. >T1 - CPU execution time ( e.g. timer interrupt lock holding time, event channel >notification time) >T2 - CPU idle wake up time, i.e. the time CPU wake up from deep C state (e.g. >C3) to C0, usually it is in the order of several 10us or 100us > >2. then let''s consider the case of large number of CPUs, e.g. 64 pCPU and 64 >VCPU in dom0, let''s assume the lock holding sequence is VCPU0 -> >VCPU1->VCPU2 ... ->VCPU63. >Then vCPU63 will spend 64*(T1 + T2) to acquire the xtime_lock. if T1+T2 is >100us, then the total latency would be ~6ms. >As we have known that the timer is 250HZ, or 4ms period, so when event >channel notification issued, and pCPU schedule vCPU63, hypervisor will find >the timer is over-due, and will send another TIMER_VIRQ for vCPU63 (see >schedule()->vcpu_periodic_timer_work() for detail). In this case, vCPU63 will >be always busy handling timer interrupt, and not be able to update the watch >dog, thus cause the softlock up. > >So from the above sequence, we can see: >- cpuidle driver add extra latency, thus make this issue more easy to occurs. >- Large number of CPU multiply the latency >- ticket spin lock lead fixed lock acquiring sequence, thus lead the latency >repeatedly being 64*(T1+T2), thus make this issue more easy to occurs. >and the fundamental cause of this issue is that vCPU timer interrupt handler >is not good for scaling, due to the global xtime_lock. > >From cpuidle point of view, one thing we are trying to do is: changing the >cpuidle driver to not enter deep C state when there is vCPU with local irq >disabled and event channel polling. In this case, the T2 latency will be >eliminated. > >Anyway, cpuidle is just one side, we can anticipate that if CPU number is large >enough to lead NR_CPU * T1 > 4ms, this issue will occurs again. So another >way is to make dom0 scaling well by not using xtime_lock, although this is >pretty hard currently. Or another way is to limit dom0 vCPU number to >certain reasonable level. > >Regards >Ke_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>Just fyi, we now also have seen an issue on a 24-CPU system that went >>away with cpuidle=0 (and static analysis of the hang hinted in that >>direction). All I can judge so far is that this likely has something to do >>with our kernel''s intensive use of the poll hypercall (i.e. we see vCPU-s >>not waking up from the call despite there being pending unmasked or >>polled for events). > >We just identified the cause of this issue, and is trying to find appropriate way to fix it.Hmm, while I agree that the scenario you describe can be a problem, I don''t think it can explain the behavior on the 24-CPU system pointed out above, nor the one Juergen Gross pointed out yesterday. Nor can it explain why this happens at boot time (when you can take it for granted that several/most of the CPUs are idle [and hence would have their periodic timer stopped]). Also I would think that the rate at which xtime_lock is being acquired may not be the highest one in the entire system, and hence problems may continue to result even if we fixed timer_interrupt().>Anyway, cpuidle is just one side, we can anticipate that if CPU number is large enough to lead NR_CPU * T1 > 4ms, this issue will occurs again. So another way is to make dom0 scaling well by not using xtime_lock, although this is pretty hard currently. Or another way is to limit dom0 vCPU number to certain reasonable level.I would not think that dealing with the xtime_lock scalability issue in timer_interrupt() should be *that* difficult. In particular it should be possibly to assign an on-duty CPU (permanent or on a round-robin basis) that deals with updating jiffies/wallclock, and all other CPUs just update their local clocks. I had thought about this before, but never found a strong need to experiment with that. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>> >Could you please try the attached patch. this patch try to avoid entering deep C state when there is vCPU local irq disabled, and polling event channel. When tested in my 64 CPU box, this issue is gone with this patch.We could try it, but I''m not convinced of the approach. Why is the urgent determination dependent upon event delivery being disabled on the respective vCPU? If at all, it should imo be polling *or* event delivery disabled, not *and*. Also, iterating over all vCPU-s in that function doesn''t seem very scalable. It would seem more reasonable for the scheduler to track how many "urgent" vCPU-s a pCPU currently has. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich >Sent: 2010年2月3日 18:16 > >>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>>Just fyi, we now also have seen an issue on a 24-CPU system that went >>>away with cpuidle=0 (and static analysis of the hang hinted in that >>>direction). All I can judge so far is that this likely has >something to do >>>with our kernel's intensive use of the poll hypercall (i.e. >we see vCPU-s >>>not waking up from the call despite there being pending unmasked or >>>polled for events). >> >>We just identified the cause of this issue, and is trying to >find appropriate way to fix it. > >Hmm, while I agree that the scenario you describe can be a problem, I >don't think it can explain the behavior on the 24-CPU system pointed >out above, nor the one Juergen Gross pointed out yesterday.Is 24-CPU system observed with same likelihood as 64-CPU system to hang at boot time, or less frequent? Ke just did some theoretical analysis by assuming some values. There could be other factors added to latency and each system may have different characteristics too. We can't draw conclusion whether smaller system will face same issue, by simply changing CPU number in Ke's formula. :-) Possibly you can provide cpuidle information on your 24-core system for further comparison.> >Nor can it explain why this happens at boot time (when you can take it >for granted that several/most of the CPUs are idle [and hence would >have their periodic timer stopped]).Well, it's true that several/most of CPUs are idle at boot time, and when they are blocked into Xen, periodic timer is stopped. However it doesn't mean that those 'idle' CPUs will block into Xen and never wake up in whole boot process (here we are talking about dozens of seconds). There're random events to wake blocked vCPU up: random process/ thread creations, device interrupts (based on affinity), per-cpu kernel thread may periodically wake up, etc. Most importantly is, that idle vCPU will hook a singleshot timer when it's blocked into xen, and that singleshot timer is calculated based on min_val of nearest timer expiration and soft lockup threshold. Based on those assumption, although one vCPU of dom0 may be mostly idle in whole boot process, it will still wake up occasionally. Once a blocked vCPU is waken up, Xen will check whether it's next periodic timer interrupt should be delivered. If vCPU has been blocked over one dom0 tick (4ms in this specific SLES case), a virtual timer will pend before resuming to guest. So every dom0 vCPU still enters timer interrupt to acquire xtime_lock intermittently. I'm not sure how much possibility to have those idle vCPUs waken up at same time. If there's no special treatment to interleave vCPU timer interrupt, they may be closely injected based on vCPU boot time point. Anyway, given a situation where some idle vCPUs begin to contend with each other, and several of them exceeds busy-wait threshod to block into xen, latency from deep Cx will add chance for more vCPUs to contend same xtime_lock in same loop. Once many vCPUs may contend at same time, and a full circle from 1st ticket to last ticket exceeds 4ms expected by dom0, a new timer interrupt will be injected again. Then no vCPU can escape from that contending loop.> >Also I would think that the rate at which xtime_lock is being acquired >may not be the highest one in the entire system, and hence problems >may continue to result even if we fixed timer_interrupt().Although the rate acquiring xtime_look is unlikely the highest one, it's a big kernel lock to be acquired on each CPU. ticket spinlock order plus cpuidle latency may rendezvous more and more vCPUs in same contending loop, once latency for a vCPU to wait for specific ticket starts to accumulate. As Ke said, cpuidle just exaggerated spinlock issue in virtualization. Considering another case when system is very busy and physical CPU is overcommitted. Even when one vCPU is waken up, it may be still in runqueue for dozens of milliseconds. Similarly vCPU holding larger tickets have to wait for long time. If applying to xtime_lock, you may still observe softlockup warning then. Possibly the real solution is to not have dom0 with large virtual vCPUs. Thanks, Kevin> >>Anyway, cpuidle is just one side, we can anticipate that if >CPU number is large enough to lead NR_CPU * T1 > 4ms, this >issue will occurs again. So another way is to make dom0 >scaling well by not using xtime_lock, although this is pretty >hard currently. Or another way is to limit dom0 vCPU number to >certain reasonable level. > >I would not think that dealing with the xtime_lock scalability issue in >timer_interrupt() should be *that* difficult. In particular it >should be >possibly to assign an on-duty CPU (permanent or on a round-robin >basis) that deals with updating jiffies/wallclock, and all other CPUs >just update their local clocks. I had thought about this before, but >never found a strong need to experiment with that. > >Jan > > >_______________________________________________ >Xen-devel mailing list >Xen-devel@lists.xensource.com >http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Tian, Kevin wrote:>> From: Jan Beulich >> Sent: 2010年2月3日 18:16 >> >>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>>> Just fyi, we now also have seen an issue on a 24-CPU system that went >>>> away with cpuidle=0 (and static analysis of the hang hinted in that >>>> direction). All I can judge so far is that this likely has >> something to do >>>> with our kernel''s intensive use of the poll hypercall (i.e. >> we see vCPU-s >>>> not waking up from the call despite there being pending unmasked or >>>> polled for events). >>> We just identified the cause of this issue, and is trying to >> find appropriate way to fix it. >> >> Hmm, while I agree that the scenario you describe can be a problem, I >> don''t think it can explain the behavior on the 24-CPU system pointed >> out above, nor the one Juergen Gross pointed out yesterday. > > Is 24-CPU system observed with same likelihood as 64-CPU system to > hang at boot time, or less frequent? Ke just did some theoretical analysis > by assuming some values. There could be other factors added to latency > and each system may have different characteristics too. We can''t > draw conclusion whether smaller system will face same issue, by simply > changing CPU number in Ke''s formula. :-) Possibly you can provide cpuidle > information on your 24-core system for further comparison.My 4-core system hangs _always_. For minutes. If I press any key on the console it will resume booting with soft lockup messages (all cpus were in xen_safe_halt). Sometimes another hang occurs, sometimes the system will come up without further hangs. Juergen -- Juergen Gross Principal Developer Operating Systems TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technolgy Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>> > Possibly the real solution is to not have dom0 with large virtual vCPUs.But you realize that this is a Dom0-only issue only as long as DomU-s with more vCPU-s cannot be created? I.e. it''ll become an issue affecting any kind of guest as soon as that limitation gets out of the way. So I don''t view this as a solution - it''s at best a workaround until a solution can be found. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Kevin has explained details on your comments, so I just pick some point for more explanation.> >I would not think that dealing with the xtime_lock scalability issue in >timer_interrupt() should be *that* difficult. In particular it should be >possibly to assign an on-duty CPU (permanent or on a round-robin >basis) that deals with updating jiffies/wallclock, and all other CPUs >just update their local clocks. I had thought about this before, but >never found a strong need to experiment with that. > >JanThis is good. Eliminating global lock is always good practice for scalability, especially that there will be more and more CPUs in the future. I would expect this to be the best solution to the softlockup issue. And If the global xtime_lock can be eliminated, the cpuidle patch may not be needed anymore.>>Could you please try the attached patch. this patch try to avoid entering >deep C state when there is vCPU local irq disabled, and polling event channel. >When tested in my 64 CPU box, this issue is gone with this patch. > >We could try it, but I''m not convinced of the approach. Why is the >urgent determination dependent upon event delivery being disabled >on the respective vCPU? If at all, it should imo be polling *or* event >delivery disabled, not *and*.Then rationale of this patch is: disabling vCPU local irq usually means vCPU have urgent task to finish ASAP, and don''t want to be interrupted. As first step patch, I am a bit conservative to combine them by *and*. Once it is verified working, I can extend this hint to *or *, as long as the *or*does not include unwanted case that hurt the power saving significantly.> >Also, iterating over all vCPU-s in that function doesn''t seem very >scalable. It would seem more reasonable for the scheduler to track >how many "urgent" vCPU-s a pCPU currently has.Yes, we can do this optimization. Current patch is just for quick verification purpose. Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com] >Sent: 2010年2月3日 20:19 > >Tian, Kevin wrote: >>> From: Jan Beulich >>> Sent: 2010年2月3日 18:16 >>> >>>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>>>> Just fyi, we now also have seen an issue on a 24-CPU >system that went >>>>> away with cpuidle=0 (and static analysis of the hang >hinted in that >>>>> direction). All I can judge so far is that this likely has >>> something to do >>>>> with our kernel's intensive use of the poll hypercall (i.e. >>> we see vCPU-s >>>>> not waking up from the call despite there being pending >unmasked or >>>>> polled for events). >>>> We just identified the cause of this issue, and is trying to >>> find appropriate way to fix it. >>> >>> Hmm, while I agree that the scenario you describe can be a >problem, I >>> don't think it can explain the behavior on the 24-CPU system pointed >>> out above, nor the one Juergen Gross pointed out yesterday. >> >> Is 24-CPU system observed with same likelihood as 64-CPU system to >> hang at boot time, or less frequent? Ke just did some >theoretical analysis >> by assuming some values. There could be other factors added >to latency >> and each system may have different characteristics too. We can't >> draw conclusion whether smaller system will face same issue, >by simply >> changing CPU number in Ke's formula. :-) Possibly you can >provide cpuidle >> information on your 24-core system for further comparison. > >My 4-core system hangs _always_. For minutes. If I press any key on the >console it will resume booting with soft lockup messages (all cpus were >in xen_safe_halt). >Sometimes another hang occurs, sometimes the system will come >up without >further hangs. > >Juergen >interesting. Then did you also observe hang disappeared by disabling cpuidle? Your case really looks like some missed event scenario, in which key press just kicks cpu alive... Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月3日 21:21 > >>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>> >> Possibly the real solution is to not have dom0 with large >virtual vCPUs. > >But you realize that this is a Dom0-only issue only as long as DomU-s >with more vCPU-s cannot be created? I.e. it'll become an issue >affecting >any kind of guest as soon as that limitation gets out of the way. So I >don't view this as a solution - it's at best a workaround >until a solution >can be found.Sure, but then it's all about how spinlock itself should be implemented in virtualization environment, or how xtime_lock should be used in this specific case, to allow scale up with large vcpu numbers. There's always limitation about how it may scale w/o triggering softlock warning. We agree that cpuidle made it worse inadvertently, and then come up patch to recover it back to the level of original limitation. And we just realize that current ticket pv spinlock may still encounter such limitation once system becomes hot when lock holder of xtime_lock may stay in runqueue for relatively long time, even w/o cpuidle. Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Tian, Kevin wrote:>> From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com] >> Sent: 2010年2月3日 20:19 >> >> Tian, Kevin wrote: >>>> From: Jan Beulich >>>> Sent: 2010年2月3日 18:16 >>>> >>>>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>> >>>>>> Just fyi, we now also have seen an issue on a 24-CPU >> system that went >>>>>> away with cpuidle=0 (and static analysis of the hang >> hinted in that >>>>>> direction). All I can judge so far is that this likely has >>>> something to do >>>>>> with our kernel''s intensive use of the poll hypercall (i.e. >>>> we see vCPU-s >>>>>> not waking up from the call despite there being pending >> unmasked or >>>>>> polled for events). >>>>> We just identified the cause of this issue, and is trying to >>>> find appropriate way to fix it. >>>> >>>> Hmm, while I agree that the scenario you describe can be a >> problem, I >>>> don''t think it can explain the behavior on the 24-CPU system pointed >>>> out above, nor the one Juergen Gross pointed out yesterday. >>> Is 24-CPU system observed with same likelihood as 64-CPU system to >>> hang at boot time, or less frequent? Ke just did some >> theoretical analysis >>> by assuming some values. There could be other factors added >> to latency >>> and each system may have different characteristics too. We can''t >>> draw conclusion whether smaller system will face same issue, >> by simply >>> changing CPU number in Ke''s formula. :-) Possibly you can >> provide cpuidle >>> information on your 24-core system for further comparison. >> My 4-core system hangs _always_. For minutes. If I press any key on the >> console it will resume booting with soft lockup messages (all cpus were >> in xen_safe_halt). >> Sometimes another hang occurs, sometimes the system will come >> up without >> further hangs. >> >> Juergen >> > > interesting. Then did you also observe hang disappeared by disabling > cpuidle? Your case really looks like some missed event scenario, in > which key press just kicks cpu alive...Yes, cpuidle=0 made the problem disappear. Juergen -- Juergen Gross Principal Developer Operating Systems TSP ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technolgy Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Yes, this patch works for us too. So a non-hacky version of it would be appreciated. I also meanwhile tried out the idea to reduce the contention on xtime_lock (attached for reference). Things appear to work fine, but there is an obvious problem (with - thus far to me - no obvious explanation) with it: The number of timer interrupts on CPUs not on duty to run do_timer() and alike is increasing significantly, with spikes of over 100,000 per second. I''m investigating this, but of course any idea anyone of you might have what could be causing this would be very welcome. Jan>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>Hi Jan, Could you please try the attached patch. this patch try to avoid entering deep C state when there is vCPU local irq disabled, and polling event channel. When tested in my 64 CPU box, this issue is gone with this patch. Best Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月5日 16:49 > >Yes, this patch works for us too. So a non-hacky version of it would be >appreciated. > >I also meanwhile tried out the idea to reduce the contention on >xtime_lock (attached for reference). Things appear to work fine, but >there is an obvious problem (with - thus far to me - no obvious >explanation) with it: The number of timer interrupts on CPUs not on >duty to run do_timer() and alike is increasing significantly, >with spikes >of over 100,000 per second. I'm investigating this, but of course any >idea anyone of you might have what could be causing this would be >very welcome. >forgive my poor english. From your patch, only cpu on duty will invoke do_timer to update global timestamp. Why in your test it's CPUs 'not on duty' to have high frequent do_timer? I may read it wrong. :-( Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Sent: 2010年2月5日 16:49 >> >>Yes, this patch works for us too. So a non-hacky version of it would be >>appreciated. >> >>I also meanwhile tried out the idea to reduce the contention on >>xtime_lock (attached for reference). Things appear to work fine, but >>there is an obvious problem (with - thus far to me - no obvious >>explanation) with it: The number of timer interrupts on CPUs not on >>duty to run do_timer() and alike is increasing significantly, >>with spikes >>of over 100,000 per second. I''m investigating this, but of course any >>idea anyone of you might have what could be causing this would be >>very welcome. >> > >forgive my poor english. From your patch, only cpu on duty will invoke >do_timer to update global timestamp. Why in your test it''s CPUs ''not >on duty'' to have high frequent do_timer? I may read it wrong. :-(If you look at the patch, I added extra statistics for those timer interrupts that occur when a CPU is "on duty" (recorded as IRQ0, which is otherwise unused) and when not "on duty" (recorded as MCEs, since those hopefully(!!!) won''t occur either, and in no case at a high rate).>From that I know that the rate of interrupts (not the rate of do_timer()invocations) is much higher on not-on-duty CPUs, but is roughly as without the patch for the on-duty one. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>-----Original Message----- >From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: Friday, February 05, 2010 4:49 PM >To: Yu, Ke >Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups > >Yes, this patch works for us too. So a non-hacky version of it would be >appreciated.That is good. I will polish the patch. Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月5日 17:14 > >>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>> >>>From: Jan Beulich [mailto:JBeulich@novell.com] >>>Sent: 2010年2月5日 16:49 >>> >>>Yes, this patch works for us too. So a non-hacky version of >it would be >>>appreciated. >>> >>>I also meanwhile tried out the idea to reduce the contention on >>>xtime_lock (attached for reference). Things appear to work fine, but >>>there is an obvious problem (with - thus far to me - no obvious >>>explanation) with it: The number of timer interrupts on CPUs not on >>>duty to run do_timer() and alike is increasing significantly, >>>with spikes >>>of over 100,000 per second. I'm investigating this, but of course any >>>idea anyone of you might have what could be causing this would be >>>very welcome. >>> >> >>forgive my poor english. From your patch, only cpu on duty will invoke >>do_timer to update global timestamp. Why in your test it's CPUs 'not >>on duty' to have high frequent do_timer? I may read it wrong. :-( > >If you look at the patch, I added extra statistics for those timer >interrupts that occur when a CPU is "on duty" (recorded as IRQ0, >which is otherwise unused) and when not "on duty" (recorded as MCEs, >since those hopefully(!!!) won't occur either, and in no case at a high >rate). > >From that I know that the rate of interrupts (not the rate of >do_timer() >invocations) is much higher on not-on-duty CPUs, but is roughly as >without the patch for the on-duty one. >Are 100,000 per second a sum-up on all non-duty CPUs or observed on just one? How about the level without the patch? Your patch is quite straightforward, and in a glimpse it shouldn't have such weird issue... Timer interrupts are caused in two paths: one is periodic timer when vcpu is running. As you have HZ=250, you won't get >250 interrupts in this portion as Xen accurately drives injections in 4ms pace. The other is singleshot timer when vcpu is entering idle. periodic timer will be stopped in that case, and a singleshot timer is registered with nearest expiration in local timer queue. That portion is unpredictable, then possibly your patch increases hotness in that part. But I haven't figured out how it could be. :-( Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>> >Are 100,000 per second a sum-up on all non-duty CPUs or observed on >just one? How about the level without the patch?That''s 100,000 per CPU per second. Normally on an idle system the number is about 25 per CPU per second. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich >Sent: 2010年2月5日 18:37 > >>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>> >>Are 100,000 per second a sum-up on all non-duty CPUs or observed on >>just one? How about the level without the patch? > >That's 100,000 per CPU per second. Normally on an idle system the >number is about 25 per CPU per second. >I think I know the reason. With your patch, now only one duty CPU will update global jiffies, however this duty CPU may be preempted over several ticks. On the other hand, all other non-duty CPU calculate their singleshot time expirations based on jiffies (see stop_hz_timer). There're some conditions to get jiffies+1 as result. When duty CPU is preempted, it's possible for jiffies value behind actual time for several ticks. That means non-duty CPU may request an old time to Xen which expires and be pended with a timer interrupt immediately by Xen. Then vcpu_block returns immediately. As non-duty CPU is in idle loop, this will loop to get your interrupt stat very high until duty CPU gets re-scheduled to update jiffies. Without your patch, jiffies can be updated timely as long as there's running CPU in dom0. Possible options is to take jiffies, system timestamp, and per-cpu timestamp in stop_hz_timer, to generate a local latest value. Or a per-cpu jiffies can be generated in per-cpu timer interrupt, which is only used in per-cpu style like in stop_hz_timer. Thanks Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>> >Possible options is to take jiffies, system timestamp, and per-cpu >timestamp in stop_hz_timer, to generate a local latest value. Or >a per-cpu jiffies can be generated in per-cpu timer interrupt, which >is only used in per-cpu style like in stop_hz_timer.That was my next thought too - but it doesn''t work (unless I made a blatant mistake). Not only does it not get the interrupt rate down, it even grows it on the duty CPU, and it creates visible brief hangs as well a rushes of time. Next I''ll do is try to detect when the duty CPU is de-scheduled, and move on the duty to one that is scheduled (i.e. one that is currently executing timer_interrupt()). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >Next I''ll do is try to detect when the duty CPU is de-scheduled, and >move on the duty to one that is scheduled (i.e. one that is currently >executing timer_interrupt()).This improves the situation (the highest spike I saw so far was 2,000 interrupts per CPU per second), but doesn''t get it back the way it ought to be (apart from the spikes, as with the original version of the patch, interrupt activity is also generally too high, very erratic, and even during the more quiet periods doesn''t go down to the original level). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月5日 22:59 > >>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>> >>Possible options is to take jiffies, system timestamp, and per-cpu >>timestamp in stop_hz_timer, to generate a local latest value. Or >>a per-cpu jiffies can be generated in per-cpu timer interrupt, which >>is only used in per-cpu style like in stop_hz_timer. > >That was my next thought too - but it doesn't work (unless I made a >blatant mistake). Not only does it not get the interrupt rate down, >it even grows it on the duty CPU, and it creates visible brief hangs >as well a rushes of time.This really twists my head. Ideally with above option singleshot timer is postponed and there's really no point to instead kick interrupt rate up... Possibly you can add some trace in Xen side to check singleshot timer registration/exipiration pattern. At least this is only suspicious virtual timer pending path from my perspective, which is worthy of some checking to understand what's going wrong from another angle.> >Next I'll do is try to detect when the duty CPU is de-scheduled, and >move on the duty to one that is scheduled (i.e. one that is currently >executing timer_interrupt()). >It's still possible that you move duty to one active CPU, while that active CPU is in stop_hz_timer with interrupt disabled. In that case, again no one is able to update jiffies and again stale jiffies can be observed... Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月5日 23:52 > >>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >>Next I'll do is try to detect when the duty CPU is de-scheduled, and >>move on the duty to one that is scheduled (i.e. one that is currently >>executing timer_interrupt()). > >This improves the situation (the highest spike I saw so far was 2,000 >interrupts per CPU per second), but doesn't get it back the way it >ought to be (apart from the spikes, as with the original version of the >patch, interrupt activity is also generally too high, very erratic, and >even during the more quiet periods doesn't go down to the original >level). >could you send out your new patch? in same time, tweaking singleshot timer stat from Xen side would be helpful as I said earlier. :-) Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan, The attached is the updated patch, it has two changes - change the logic from local irq disabled *and* poll event to local irq disabled *or* poll event - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The original scheduler does not provide such kind of list, so this patch implement the list in scheduler code. Regards Ke>-----Original Message----- >From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: Friday, February 05, 2010 4:49 PM >To: Yu, Ke >Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups > >Yes, this patch works for us too. So a non-hacky version of it would be >appreciated. > >I also meanwhile tried out the idea to reduce the contention on >xtime_lock (attached for reference). Things appear to work fine, but >there is an obvious problem (with - thus far to me - no obvious >explanation) with it: The number of timer interrupts on CPUs not on >duty to run do_timer() and alike is increasing significantly, with spikes >of over 100,000 per second. I''m investigating this, but of course any >idea anyone of you might have what could be causing this would be >very welcome. > >Jan > >>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>> >Hi Jan, > >Could you please try the attached patch. this patch try to avoid entering deep >C state when there is vCPU local irq disabled, and polling event channel. >When tested in my 64 CPU box, this issue is gone with this patch. > >Best Regards >Ke_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:50 >>> >>Next I''ll do is try to detect when the duty CPU is de-scheduled, and >>move on the duty to one that is scheduled (i.e. one that is currently >>executing timer_interrupt()). >> > >It''s still possible that you move duty to one active CPU, while that >active CPU is in stop_hz_timer with interrupt disabled. In that case, >again no one is able to update jiffies and again stale jiffies can be >observed...No, since the first thing a CPU entering stop_hz_timer() does is relinquish the duty of updating jiffies. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:52 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Sent: 2010年2月5日 23:52 >> >>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >>>Next I''ll do is try to detect when the duty CPU is de-scheduled, and >>>move on the duty to one that is scheduled (i.e. one that is currently >>>executing timer_interrupt()). >> >>This improves the situation (the highest spike I saw so far was 2,000 >>interrupts per CPU per second), but doesn''t get it back the way it >>ought to be (apart from the spikes, as with the original version of the >>patch, interrupt activity is also generally too high, very erratic, and >>even during the more quiet periods doesn''t go down to the original >>level). >> > >could you send out your new patch? in same time, tweaking singleshotAttached. After another refinement (in stop_hz_timer()) I didn''t see spikes above 1,000 interrupts per CPU per second anymore. But it''s still far from being as quiescent as without the patch. What''s also interesting is that there''s an initial period (a minute or so) where the interrupt rate is really stable (though still not as low as previously), and only then it starts becoming erratic.>timer stat from Xen side would be helpful as I said earlier. :-)Didn''t get to do that yet. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >The attached is the updated patch, it has two changes >- change the logic from local irq disabled *and* poll event to local irq disabled *or* poll eventThanks.>- Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The original scheduler does not provide such kind of list, so this patch implement the list in scheduler code.I''m still not really happy with that solution. I''d rather say that e.g. vcpu_sleep_nosync() should set a flag in the vcpu structure indicating whether that one is "urgent", and the scheduler should just maintain a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU is put to sleep guarantees that it won''t be mis-treated if it got woken by the time acpi_processor_idle() looks at it (or at least the window would be minimal - not sure if it can be eliminated completely). Plus not having to traverse a list is certainly better for scalability, not the least since you''re traversing a list (necessarily) including sleeping vCPU-s (i.e. the ones that shouldn''t affect the performance/ responsiveness of the system). But in the end it would certainly depend much more on Keir''s view on it than on mine... Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 08/02/2010 09:08, "Jan Beulich" <JBeulich@novell.com> wrote:>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The >> original scheduler does not provide such kind of list, so this patch >> implement the list in scheduler code. > > I''m still not really happy with that solution. I''d rather say that e.g. > vcpu_sleep_nosync() should set a flag in the vcpu structure indicating > whether that one is "urgent", and the scheduler should just maintain > a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU > is put to sleep guarantees that it won''t be mis-treated if it got woken > by the time acpi_processor_idle() looks at it (or at least the window > would be minimal - not sure if it can be eliminated completely). Plus > not having to traverse a list is certainly better for scalability, not the > least since you''re traversing a list (necessarily) including sleeping > vCPU-s (i.e. the ones that shouldn''t affect the performance/ > responsiveness of the system). > > But in the end it would certainly depend much more on Keir''s view on > it than on mine...Your suggestion makes sense to me. K. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月8日 16:46 >>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:52 >>> >>>From: Jan Beulich [mailto:JBeulich@novell.com] >>>Sent: 2010年2月5日 23:52 >>> >>>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>> >>>>Next I'll do is try to detect when the duty CPU is de-scheduled, and >>>>move on the duty to one that is scheduled (i.e. one that is >currently >>>>executing timer_interrupt()). >>> >>>This improves the situation (the highest spike I saw so far was 2,000 >>>interrupts per CPU per second), but doesn't get it back the way it >>>ought to be (apart from the spikes, as with the original >version of the >>>patch, interrupt activity is also generally too high, very >erratic, and >>>even during the more quiet periods doesn't go down to the original >>>level). >>> >> >>could you send out your new patch? in same time, tweaking singleshot > >Attached. After another refinement (in stop_hz_timer()) I didn't see >spikes above 1,000 interrupts per CPU per second anymore. But it's >still far from being as quiescent as without the patch.Would you mind elaborating what's refinement and how that may reduce spikes? Understand those variants may help draw big picture about whole issue.> >What's also interesting is that there's an initial period (a >minute or so) >where the interrupt rate is really stable (though still not as low as >previously), and only then it starts becoming erratic. >What is average interrupt rate for 'stable' and 'erratic' case? Is it close to spike (~1000)?>>timer stat from Xen side would be helpful as I said earlier. :-) > >Didn't get to do that yet.This stat would be helpful, given that you can get actual singleshot timer trace instead of just speculating from dom0 inside. In same time, possibly you can pin dom0 vcpu as a simplified case. BTW, with your current patch there could be still possibility for several vCPUs to contend for xtime_lock at same time. Current duty vCPU may be preempted in ISR, and then other non-duty vCPU will note it not in RUNSTATE_running and then designate itself to take new duty. This may not be big issue, compared to original always-contending style. But just raise it here and please make sure it's actually what's desired by you. Thanks, Kevin Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月8日 17:08 > >>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >>The attached is the updated patch, it has two changes >>- change the logic from local irq disabled *and* poll event >to local irq disabled *or* poll event > >Thanks. > >>- Use per-CPU vcpu list to iterate the VCPU, which is more >scalable. The original scheduler does not provide such kind of >list, so this patch implement the list in scheduler code. > >I'm still not really happy with that solution. I'd rather say that e.g. >vcpu_sleep_nosync() should set a flag in the vcpu structure indicating >whether that one is "urgent", and the scheduler should just maintain >a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU >is put to sleep guarantees that it won't be mis-treated if it got woken >by the time acpi_processor_idle() looks at it (or at least the window >would be minimal - not sure if it can be eliminated completely). Plus >not having to traverse a list is certainly better for >scalability, not the >least since you're traversing a list (necessarily) including sleeping >vCPU-s (i.e. the ones that shouldn't affect the performance/ >responsiveness of the system). > >But in the end it would certainly depend much more on Keir's view on >it than on mine... >Yes, that's good point. Actually it's the 1st choice when Ke tried to implement it, but gave up later due to failure to maintain counter at approriate entry/exit points. Introduce a new 'urgent' flag would make it easier. Another reason to use per-CPU vcpu-list is in view of reusability in other scenarios. But it looks not strong now. Anyway, a new patch per your suggesstion is in progess now. Thanks, Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>> >>From: Jan Beulich [mailto:JBeulich@novell.com] >>Attached. After another refinement (in stop_hz_timer()) I didn''t see >>spikes above 1,000 interrupts per CPU per second anymore. But it''s >>still far from being as quiescent as without the patch. > >Would you mind elaborating what''s refinement and how that may >reduce spikes? Understand those variants may help draw big picture >about whole issue.This was the last hunk in stop_hz_timer() (forcing timeout_abs_ns to never be in the past relative to the local accumulated time).>>What''s also interesting is that there''s an initial period (a >>minute or so) >>where the interrupt rate is really stable (though still not as low as >>previously), and only then it starts becoming erratic. > >What is average interrupt rate for ''stable'' and ''erratic'' case? Is it >close to spike (~1000)?No, during the stable period the rate is less than 50; erratic has an average rate (not counting the spikes) of about 125.>BTW, with your current patch there could be still possibility for >several vCPUs to contend for xtime_lock at same time. Current duty >vCPU may be preempted in ISR, and then other non-duty vCPU >will note it not in RUNSTATE_running and then designate itself >to take new duty. This may not be big issue, compared to original >always-contending style. But just raise it here and please make >sure it''s actually what''s desired by you.Yes, I realize that. It''s a tradeoff (and would probably need further refinement, if only the whole thing would first work as desired). And it''s getting us back to the fact that pv guests should be allowed to avoid being preempted for short periods of time. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>> >This stat would be helpful, given that you can get actual singleshot >timer trace instead of just speculating from dom0 inside.Finally got to that. While without the patch the number of single-shot events is less than that of periodic ones, it is much bigger with. Both kernel (relative to extrapolated local time) and hypervisor (relative to extrapolated system time) agree that many of them get scheduled with a time (up to about 6ms) in the past. Which means to me that the per-CPU processed_system_time isn''t being maintained accurately, and I think this is a result of how it is being updated in timer_interrupt(): Other than with the global processed_system_time, the per-CPU one may not get increased even if delta_cpu was close to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably this was not a problem so far because no code outside of timer_interrupt() read this value - Keir, since I think you wrote that logic originally, any word on this? While I think this should be changed (so that the value can be used in stop_hz_timer() for a second adjustment similar to the one done with jiffies, setting the timeout to last timer interrupt + 1 tick), I meanwhile thought of a different approach to the original problem: Instead of dedicating a CPU to have the duty of maintaining global time, just limit the number of CPUs that may concurrently try to acquire xtime_lock in timer_interrupt(). While for now I set the threshold to __fls(nr_cpu_ids), even setting it to 1 gives pretty good results - no unduly high interrupt rates anymore, and only very infrequent (one every half hour or so) single-shot events that are determined to have a timeout more than 1ms in the past. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 11/02/2010 14:44, "Jan Beulich" <JBeulich@novell.com> wrote:> Other than with the global processed_system_time, > the per-CPU one may not get increased even if delta_cpu was close > to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably > this was not a problem so far because no code outside of > timer_interrupt() read this value - Keir, since I think you wrote that > logic originally, any word on this?What you say is true, as clearly it is currently implemented that way almost by design. I''m lost in the intricacies of your current discussion though, so not sure exactly why it''s a problem, and how we should fix it? Thanks, Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 11.02.10 18:01 >>> >On 11/02/2010 14:44, "Jan Beulich" <JBeulich@novell.com> wrote: > >> Other than with the global processed_system_time, >> the per-CPU one may not get increased even if delta_cpu was close >> to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably >> this was not a problem so far because no code outside of >> timer_interrupt() read this value - Keir, since I think you wrote that >> logic originally, any word on this? > >What you say is true, as clearly it is currently implemented that way almost >by design. I''m lost in the intricacies of your current discussion though, so >not sure exactly why it''s a problem, and how we should fix it?First of all I don''t think anything necessarily needs to be fixed in the 2.6.18 tree, as that one will never support >32 vCPU-s, and I don''t think the scalability issue we''re talking about here is of concern there. The problem we''re trying to address is the contention on xtime_lock. It is clear that there generally is no need for all CPUs in the system to try to update to global time variables, so some filtering on the number of CPUs concurrently trying to acquire xtime_lock is reasonable. With any filtering done, there is however potential for a CPU to see its local processed time ahead of the global one, but while setting a single shot timer in the past (or very near future) would guarantee that it would execute timer_interrupt() (almost) right away, it does not guarantee that it would now be among those CPUs that would try to acquire xtime_lock (i.e. the situation wouldn''t necessarily have improved after the interrupt was handled, and hence an interrupt storm is possible). Consequently, along with capping the timeout to be set in stop_hz_timer() to jiffies+1, the timeout would also reasonably be capped to per_cpu(processed_system_time, cpu) + NS_PER_TICK. This in turn only makes sense is the per-CPU processed time is accurate (i.e. within NS_PER_TICK from when the last timer interrupt occurred). That however doesn''t hold: Due to the stolen/ blocked calculations subtracting exact nanosecond values from delta_cpu, but only adding tick granular values into per-CPU processed_system_time, the error can accumulate up to a little less than 3*NS_PER_TICK. The supposed change would be to do only a single adjustment to per-CPU processed_system_time (using the originally calculated delta_cpu value). What I couldn''t convince myself of so far was that this wouldn''t influence the stolen/blocked accounting (since the delta_cpu calculated on the next timer interrupt would now necessarily be different from the one calculated with the current logic) - in particular the adjustments commented with "clamp local-time progress" are what would appear to get used more frequently with the thought of change. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Hi Jan, The attached is the updated patch per your suggestion. generally this patch use the per-CPU urgent vCPU count to indicate if cpu should enter deep C state. it introduce per-VCPU urgent flag, and update the urgent VCPU count when vCPU state is changed. Could you please take a look. Thanks Regards Ke>-----Original Message----- >From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: Monday, February 08, 2010 5:08 PM >To: Yu, Ke >Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups > >>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >>The attached is the updated patch, it has two changes >>- change the logic from local irq disabled *and* poll event to local irq >disabled *or* poll event > >Thanks. > >>- Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The >original scheduler does not provide such kind of list, so this patch implement >the list in scheduler code. > >I''m still not really happy with that solution. I''d rather say that e.g. >vcpu_sleep_nosync() should set a flag in the vcpu structure indicating >whether that one is "urgent", and the scheduler should just maintain >a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU >is put to sleep guarantees that it won''t be mis-treated if it got woken >by the time acpi_processor_idle() looks at it (or at least the window >would be minimal - not sure if it can be eliminated completely). Plus >not having to traverse a list is certainly better for scalability, not the >least since you''re traversing a list (necessarily) including sleeping >vCPU-s (i.e. the ones that shouldn''t affect the performance/ >responsiveness of the system). > >But in the end it would certainly depend much more on Keir''s view on >it than on mine... > >Jan_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I''m going to do a -rc3 today I think. Shall I roll this one in? -- Keir On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:> Hi Jan, > > The attached is the updated patch per your suggestion. generally this patch > use the per-CPU urgent vCPU count to indicate if cpu should enter deep C > state. it introduce per-VCPU urgent flag, and update the urgent VCPU count > when vCPU state is changed. Could you please take a look. Thanks > > Regards > Ke > >> -----Original Message----- >> From: Jan Beulich [mailto:JBeulich@novell.com] >> Sent: Monday, February 08, 2010 5:08 PM >> To: Yu, Ke >> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups >> >>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >>> The attached is the updated patch, it has two changes >>> - change the logic from local irq disabled *and* poll event to local irq >> disabled *or* poll event >> >> Thanks. >> >>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The >> original scheduler does not provide such kind of list, so this patch >> implement >> the list in scheduler code. >> >> I''m still not really happy with that solution. I''d rather say that e.g. >> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating >> whether that one is "urgent", and the scheduler should just maintain >> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU >> is put to sleep guarantees that it won''t be mis-treated if it got woken >> by the time acpi_processor_idle() looks at it (or at least the window >> would be minimal - not sure if it can be eliminated completely). Plus >> not having to traverse a list is certainly better for scalability, not the >> least since you''re traversing a list (necessarily) including sleeping >> vCPU-s (i.e. the ones that shouldn''t affect the performance/ >> responsiveness of the system). >> >> But in the end it would certainly depend much more on Keir''s view on >> it than on mine... >> >> Jan >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Attached is a better version of your patch (I think). I haven''t applied it because I don''t see why the ASSERT() in sched_credit.c is correct. How do you know for sure that !v->is_urgent there (and therefore avoid urgent_count manipulation)? -- Keir On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:> Hi Jan, > > The attached is the updated patch per your suggestion. generally this patch > use the per-CPU urgent vCPU count to indicate if cpu should enter deep C > state. it introduce per-VCPU urgent flag, and update the urgent VCPU count > when vCPU state is changed. Could you please take a look. Thanks > > Regards > Ke > >> -----Original Message----- >> From: Jan Beulich [mailto:JBeulich@novell.com] >> Sent: Monday, February 08, 2010 5:08 PM >> To: Yu, Ke >> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups >> >>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >>> The attached is the updated patch, it has two changes >>> - change the logic from local irq disabled *and* poll event to local irq >> disabled *or* poll event >> >> Thanks. >> >>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The >> original scheduler does not provide such kind of list, so this patch >> implement >> the list in scheduler code. >> >> I''m still not really happy with that solution. I''d rather say that e.g. >> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating >> whether that one is "urgent", and the scheduler should just maintain >> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU >> is put to sleep guarantees that it won''t be mis-treated if it got woken >> by the time acpi_processor_idle() looks at it (or at least the window >> would be minimal - not sure if it can be eliminated completely). Plus >> not having to traverse a list is certainly better for scalability, not the >> least since you''re traversing a list (necessarily) including sleeping >> vCPU-s (i.e. the ones that shouldn''t affect the performance/ >> responsiveness of the system). >> >> But in the end it would certainly depend much more on Keir''s view on >> it than on mine... >> >> Jan >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Thanks for the refinement. For the ASSERT, the reason is that this is runnable vcpu and it should be non urgent. Think about the vCPU changed from RUNSTATE_blocked/RUNSTATE_offline to RUNSTATE_runnable via vcpu_wake. vcpu_wake will call vcpu_runstate_change and in turn vcpu_urgent_count_update, the v->is_urgent will be updated accordingly. vcpu_wake is protected by scheduler lock, so it is atomic. Best Regards Ke>-----Original Message----- >From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >Sent: Tuesday, February 16, 2010 1:34 AM >To: Yu, Ke; Jan Beulich >Cc: Tian, Kevin; xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups > >Attached is a better version of your patch (I think). I haven''t applied it >because I don''t see why the ASSERT() in sched_credit.c is correct. How do >you know for sure that !v->is_urgent there (and therefore avoid urgent_count >manipulation)? > > -- Keir > >On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote: > >> Hi Jan, >> >> The attached is the updated patch per your suggestion. generally this patch >> use the per-CPU urgent vCPU count to indicate if cpu should enter deep C >> state. it introduce per-VCPU urgent flag, and update the urgent VCPU count >> when vCPU state is changed. Could you please take a look. Thanks >> >> Regards >> Ke >> >>> -----Original Message----- >>> From: Jan Beulich [mailto:JBeulich@novell.com] >>> Sent: Monday, February 08, 2010 5:08 PM >>> To: Yu, Ke >>> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com >>> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups >>> >>>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>> >>>> The attached is the updated patch, it has two changes >>>> - change the logic from local irq disabled *and* poll event to local irq >>> disabled *or* poll event >>> >>> Thanks. >>> >>>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The >>> original scheduler does not provide such kind of list, so this patch >>> implement >>> the list in scheduler code. >>> >>> I''m still not really happy with that solution. I''d rather say that e.g. >>> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating >>> whether that one is "urgent", and the scheduler should just maintain >>> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU >>> is put to sleep guarantees that it won''t be mis-treated if it got woken >>> by the time acpi_processor_idle() looks at it (or at least the window >>> would be minimal - not sure if it can be eliminated completely). Plus >>> not having to traverse a list is certainly better for scalability, not the >>> least since you''re traversing a list (necessarily) including sleeping >>> vCPU-s (i.e. the ones that shouldn''t affect the performance/ >>> responsiveness of the system). >>> >>> But in the end it would certainly depend much more on Keir''s view on >>> it than on mine... >>> >>> Jan >>_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>> >Attached is a better version of your patch (I think). I haven''t applied it >because I don''t see why the ASSERT() in sched_credit.c is correct. How do >you know for sure that !v->is_urgent there (and therefore avoid urgent_count >manipulation)?Two remarks: For one, your patch doesn''t consider vCPU-s with event delivery disabled urgent anymore. Second, here>+ /* >+ * Transfer urgency status to new CPU before switching CPUs, as once >+ * the switch occurs, v->is_urgent is no longer protected by the per-CPU >+ * scheduler lock we are holding. >+ */ >+ if ( unlikely(v->is_urgent) ) >+ { >+ atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count); >+ atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count); >+ }I would think we should either avoid the atomic ops altogether if old_cpu == new_cpu, or switch the updating order (inc before dec). As to your other question - yes, I''d certainly like to see this included in -rc3. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>> >>Attached is a better version of your patch (I think). I haven''t applied it >>because I don''t see why the ASSERT() in sched_credit.c is correct. How do >>you know for sure that !v->is_urgent there (and therefore avoid >urgent_count >>manipulation)? > >Two remarks: For one, your patch doesn''t consider vCPU-s with event >delivery disabled urgent anymore.Oh, sorry that I made this change without telling the reason. When vCPU is blocked with event delivery disabled, it is either guest CPU offline or guest CPU polling on event channel. Offlined guest CPU should not be treated as urgent vCPU, so we only need to track the event channel polling case. this is the reason why I simplify the logic to only treat vCPU polling on event channel as urgent vCPU.>Second, here > >>+ /* >>+ * Transfer urgency status to new CPU before switching CPUs, as once >>+ * the switch occurs, v->is_urgent is no longer protected by the >per-CPU >>+ * scheduler lock we are holding. >>+ */ >>+ if ( unlikely(v->is_urgent) ) >>+ { >>+ atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count); >>+ atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count); >>+ } > >I would think we should either avoid the atomic ops altogether if >old_cpu == new_cpu, or switch the updating order (inc before dec).Do you mean when old_cpu == new_cpu, and if urgent_count == 1, current approach (dec before inc) has small time window (after dec, before inc) that urgent_count==0, thus may mislead couidle driver. if this is the case, I am fine with it and prefer to switching the updating order. Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 16.02.10 14:12 >>> >>Two remarks: For one, your patch doesn''t consider vCPU-s with event >>delivery disabled urgent anymore. > >Oh, sorry that I made this change without telling the reason. When vCPU is blocked with event delivery disabled, it is either guest CPU offline or guest CPU polling on event channel. Offlined guest CPU should not be treated as urgent vCPU, so we only need to track the event channel polling case. this is the reason why I simplify the logic to only treat vCPU polling on event channel as urgent vCPU.Ah, yes, that makes sense. But it also makes me think about the concept of the change as a whole again: A vCPU polling for an event doesn''t really indicate whether it is urgent. It just so happens that polling can be used from the spin lock path. There could be other uses of polling that would not warrant keeping the underlying pCPU from entering deep C states. Perhaps this should rather be based on a guest hint (passed either with the poll hypercall or associated permanently with an event channel) then?>>I would think we should either avoid the atomic ops altogether if >>old_cpu == new_cpu, or switch the updating order (inc before dec). > >Do you mean when old_cpu == new_cpu, and if urgent_count == 1, current approach (dec before inc) has small time window (after dec, before inc) that urgent_count==0, thus may mislead couidle driver. if this is the case, I am fine with it and prefer to switching the updating order.Yes, that was my point. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>> >--- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000 >+++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000 >@@ -1060,6 +1060,7 @@ > /* We got a candidate. Grab it! */ > CSCHED_VCPU_STAT_CRANK(speer, migrate_q); > CSCHED_STAT_CRANK(migrate_queued); >+ ASSERT(!vc->is_urgent); > __runq_remove(speer); > vc->processor = cpu; > return speer;By what is this assertion motivated? I.e. why shouldn''t an urgent vCPU be getting here? We''re seeing this (BUG_ON() in the checked in version) trigger... Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 22/02/2010 13:29, "Jan Beulich" <JBeulich@novell.com> wrote:>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>> >> --- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000 >> +++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000 >> @@ -1060,6 +1060,7 @@ >> /* We got a candidate. Grab it! */ >> CSCHED_VCPU_STAT_CRANK(speer, migrate_q); >> CSCHED_STAT_CRANK(migrate_queued); >> + ASSERT(!vc->is_urgent); >> __runq_remove(speer); >> vc->processor = cpu; >> return speer; > > By what is this assertion motivated? I.e. why shouldn''t an urgent vCPU > be getting here? We''re seeing this (BUG_ON() in the checked in version) > trigger...The author''s assertion was that vc must be runnable and hence cannot be polling and hence cannot be is_urgent. It sounded dodgy to me hence I upgarded it to a BUG_ON(), but couldn''t actually eyeball a way in which vc->is_urgent could be true at that point in the code. We have the lock on the peer cpu''s schedule_lock, so is_urgent cannot change under our feet, and vc is not running so it cannot be added to the domain''s poll_mask under our feet. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>-----Original Message----- >From: xen-devel-bounces@lists.xensource.com >[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser >Sent: Monday, February 22, 2010 9:45 PM >To: Jan Beulich; Yu, Ke >Cc: Tian, Kevin; xen-devel@lists.xensource.com >Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups > >On 22/02/2010 13:29, "Jan Beulich" <JBeulich@novell.com> wrote: > >>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>> >>> --- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000 >>> +++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000 >>> @@ -1060,6 +1060,7 @@ >>> /* We got a candidate. Grab it! */ >>> CSCHED_VCPU_STAT_CRANK(speer, migrate_q); >>> CSCHED_STAT_CRANK(migrate_queued); >>> + ASSERT(!vc->is_urgent); >>> __runq_remove(speer); >>> vc->processor = cpu; >>> return speer; >> >> By what is this assertion motivated? I.e. why shouldn''t an urgent vCPU >> be getting here? We''re seeing this (BUG_ON() in the checked in version) >> trigger... > >The author''s assertion was that vc must be runnable and hence cannot be >polling and hence cannot be is_urgent. It sounded dodgy to me hence I >upgarded it to a BUG_ON(), but couldn''t actually eyeball a way in which >vc->is_urgent could be true at that point in the code. We have the lock on >the peer cpu''s schedule_lock, so is_urgent cannot change under our feet, and >vc is not running so it cannot be added to the domain''s poll_mask under our >feet. > > -- KeirRight. According to the code, there should be no way to this BUG_ON. If it happens, that reveal either bugs of code or the necessity of adding code to migrate urgent vcpu count. Do you have more information on how this BUG_ON happens? Regards Ke _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 23.02.10 10:32 >>> >>The author''s assertion was that vc must be runnable and hence cannot be >>polling and hence cannot be is_urgent. It sounded dodgy to me hence I >>upgarded it to a BUG_ON(), but couldn''t actually eyeball a way in which >>vc->is_urgent could be true at that point in the code. We have the lock on >>the peer cpu''s schedule_lock, so is_urgent cannot change under our feet, and >>vc is not running so it cannot be added to the domain''s poll_mask under our >>feet.>Right. According to the code, there should be no way to this BUG_ON. >If it happens, that reveal either bugs of code or the necessity of >adding code to migrate urgent vcpu count. Do you have more >information on how this BUG_ON happens?Obviously there are vCPU-s that get inserted on a run queue with is_urgent set (which according to my reading of Keir''s description shouldn''t happen). In particular, this --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c BUG_ON( __vcpu_on_runq(svc) ); BUG_ON( cpu != svc->vcpu->processor ); +WARN_ON(svc->vcpu->is_urgent);//temp list_for_each( iter, runq ) { triggers occasionally: (XEN) Xen call trace: (XEN) [<ffff82c48011a1e8>] csched_vcpu_wake+0x1e8/0x200 (XEN) [<ffff82c48011dde2>] vcpu_wake+0x152/0x3a0 (XEN) [<ffff82c48014b6cd>] vcpu_kick+0x1d/0x80 (XEN) [<ffff82c480108465>] evtchn_set_pending+0x145/0x1d0 (XEN) [<ffff82c4801087e8>] evtchn_send+0xa8/0x150 (XEN) [<ffff82c480108de2>] do_event_channel_op+0x552/0x10d0 (XEN) [<ffff82c4801f3b61>] do_iret+0xc1/0x1a0 (XEN) [<ffff82c4801ef169>] syscall_enter+0xa9/0xae As I understand it, a vCPU with is_urgent set being on any run queue is enough for the questionable BUG_ON() to eventually trigger. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:>> Right. According to the code, there should be no way to this BUG_ON. >> If it happens, that reveal either bugs of code or the necessity of >> adding code to migrate urgent vcpu count. Do you have more >> information on how this BUG_ON happens? > > Obviously there are vCPU-s that get inserted on a run queue with > is_urgent set (which according to my reading of Keir''s description > shouldn''t happen). In particular, thisIs it possible for a polling VCPU to become runnable without it being cleared from poll_mask? I suspect maybe that is the problem, and that needs dealing with, or the proper handling needs to be added to sched_credit.c. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>> >On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote: > >>> Right. According to the code, there should be no way to this BUG_ON. >>> If it happens, that reveal either bugs of code or the necessity of >>> adding code to migrate urgent vcpu count. Do you have more >>> information on how this BUG_ON happens? >> >> Obviously there are vCPU-s that get inserted on a run queue with >> is_urgent set (which according to my reading of Keir''s description >> shouldn''t happen). In particular, this > >Is it possible for a polling VCPU to become runnable without it being >cleared from poll_mask? I suspect maybe that is the problem, and that needs >dealing with, or the proper handling needs to be added to sched_credit.c.I don''t think that''s the case, at least not exclusively. Using --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c BUG_ON( __vcpu_on_runq(svc) ); BUG_ON( cpu != svc->vcpu->processor ); +WARN_ON(svc->vcpu->is_urgent);//temp list_for_each( iter, runq ) { --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -139,6 +139,7 @@ static inline void vcpu_runstate_change( ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock)); vcpu_urgent_count_update(v); +WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp trace_runstate_change(v, new_state); I get pairs of warnings (i.e. each for the same vCPU): (XEN) Xen WARN at schedule.c:142 (XEN) Xen call trace: (XEN) [<ffff82c48011c8d5>] schedule+0x375/0x510 (XEN) [<ffff82c48011deb8>] __do_softirq+0x58/0x80 (XEN) [<ffff82c4801e61e6>] process_softirqs+0x6/0x10 (XEN) Xen WARN at sched_credit.c:204 (XEN) Xen call trace: (XEN) [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0 (XEN) [<ffff82c4801497f2>] update_runstate_area+0x102/0x110 (XEN) [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390 (XEN) [<ffff82c48014b1a0>] context_switch+0x760/0xed0 (XEN) [<ffff82c48014913d>] vcpu_kick+0x1d/0x80 (XEN) [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0 (XEN) [<ffff82c4801083a9>] evtchn_send+0x129/0x150 (XEN) [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50 (XEN) [<ffff82c4801461b5>] reprogram_timer+0x55/0x90 (XEN) [<ffff82c4801461b5>] reprogram_timer+0x55/0x90 (XEN) [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360 (XEN) [<ffff82c4801e6169>] syscall_enter+0xa9/0xae In schedule() this is always "prev" transitioning to RUNSTATE_runnable (i.e. _VPF_blocked not set), yet the second call trace shows that _VPF_blocked must have been set at that point (otherwise vcpu_unblock(), tail-called from vcpu_kick(), would not have called vcpu_wake()). If the order wasn''t always that shown, or if the two traces got intermixed, this could hint at a race - but they are always that way, which so far I cannot make sense of. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>From: Jan Beulich [mailto:JBeulich@novell.com] >Sent: 2010年2月24日 0:45 > >>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>> >>On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote: >> >>>> Right. According to the code, there should be no way to >this BUG_ON. >>>> If it happens, that reveal either bugs of code or the necessity of >>>> adding code to migrate urgent vcpu count. Do you have more >>>> information on how this BUG_ON happens? >>> >>> Obviously there are vCPU-s that get inserted on a run queue with >>> is_urgent set (which according to my reading of Keir's description >>> shouldn't happen). In particular, this >> >>Is it possible for a polling VCPU to become runnable without it being >>cleared from poll_mask? I suspect maybe that is the problem, >and that needs >>dealing with, or the proper handling needs to be added to >sched_credit.c. > >I don't think that's the case, at least not exclusively. Using > >--- a/xen/common/sched_credit.c >+++ b/xen/common/sched_credit.c >@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c > > BUG_ON( __vcpu_on_runq(svc) ); > BUG_ON( cpu != svc->vcpu->processor ); >+WARN_ON(svc->vcpu->is_urgent);//temp > > list_for_each( iter, runq ) > { >--- a/xen/common/schedule.c >+++ b/xen/common/schedule.c >@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change( > >ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).sche >dule_lock)); > > vcpu_urgent_count_update(v); >+WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp > > trace_runstate_change(v, new_state); > >I get pairs of warnings (i.e. each for the same vCPU): > >(XEN) Xen WARN at schedule.c:142 >(XEN) Xen call trace: >(XEN) [<ffff82c48011c8d5>] schedule+0x375/0x510 >(XEN) [<ffff82c48011deb8>] __do_softirq+0x58/0x80 >(XEN) [<ffff82c4801e61e6>] process_softirqs+0x6/0x10 > >(XEN) Xen WARN at sched_credit.c:204 >(XEN) Xen call trace: >(XEN) [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0 >(XEN) [<ffff82c4801497f2>] update_runstate_area+0x102/0x110 >(XEN) [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390 >(XEN) [<ffff82c48014b1a0>] context_switch+0x760/0xed0 >(XEN) [<ffff82c48014913d>] vcpu_kick+0x1d/0x80 >(XEN) [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0 >(XEN) [<ffff82c4801083a9>] evtchn_send+0x129/0x150 >(XEN) [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50 >(XEN) [<ffff82c4801461b5>] reprogram_timer+0x55/0x90 >(XEN) [<ffff82c4801461b5>] reprogram_timer+0x55/0x90 >(XEN) [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360 >(XEN) [<ffff82c4801e6169>] syscall_enter+0xa9/0xae > >In schedule() this is always "prev" transitioning to RUNSTATE_runnable >(i.e. _VPF_blocked not set), yet the second call trace shows that >_VPF_blocked must have been set at that point (otherwise >vcpu_unblock(), tail-called from vcpu_kick(), would not have called >vcpu_wake()). If the order wasn't always that shown, or if the two >traces got intermixed, this could hint at a race - but they are always >that way, which so far I cannot make sense of. >Such race surely exists, since two paths are each updating multiple fileds which however are not all protected with same scheduler lock. vcpu_unblock manipulates _VPF_blocked w/o acuquiring scheduler lock. Then below sequence is possible: enter schedule() with _VPF_blocked ... <-vcpu_unblock clears _VPF_blocked. poll_mask hasn't been cleared yet ... vcpu_runstate_change( prev, (test_bit(_VPF_blocked, &prev->pause_flags) ? RUNSTATE_blocked : (vcpu_runnable(prev) ? RUNSTATE_runnable : RUNSTATE_offline)), now); Then RUNSTATE_runnable is chosen. Then vcpu_urgent_count_update will set is_urgent flag since poll_mask has bit set. Then vcpu_unblock clears poll_mask and then invoke vcpu_wake. vcpu_wake wait for scheduler lock and then will see is_urgent flag set for a runnable vcpu. Here one necessary check is missed in vcpu_urgent_count_update, as only polling vcpu in blocked state should be cared in cpuidle. If there's any runnable vcpu in temporary polling state, idle vcpu won't get scheduled. is_urgent can be only set when new runstate is blocked, and when vcpu is in poll_mask. As runstate change always happen with scheduler lock, this can effectively ensure set/clear of is_urgent. Thanks Kevin _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Here is the patch according to Kevin's analysis. Jan, could you give it a
try? Thanks
diff -r 55eb480d82ae xen/common/schedule.c
--- a/xen/common/schedule.c     Wed Feb 24 13:24:54 2010 +0800
+++ b/xen/common/schedule.c     Wed Feb 24 13:55:50 2010 +0800
@@ -107,7 +107,8 @@ static inline void vcpu_urgent_count_upd
     if ( unlikely(v->is_urgent) )
     {
-        if ( !test_bit(v->vcpu_id, v->domain->poll_mask) )
+        if ( !test_bit(_VPF_blocked, &v->pause_flags) ||
+                !test_bit(v->vcpu_id, v->domain->poll_mask) )
         {
             v->is_urgent = 0;
            
atomic_dec(&per_cpu(schedule_data,v->processor).urgent_count);
@@ -115,7 +116,8 @@ static inline void vcpu_urgent_count_upd
     }
     else
     {
-        if ( unlikely(test_bit(v->vcpu_id, v->domain->poll_mask)) )
+        if ( unlikely(test_bit(_VPF_blocked, &v->pause_flags) &&
+                    test_bit(v->vcpu_id, v->domain->poll_mask)) )
         {
             v->is_urgent = 1;
            
atomic_inc(&per_cpu(schedule_data,v->processor).urgent_count);
>-----Original Message-----
>From: Tian, Kevin
>Sent: Wednesday, February 24, 2010 11:09 AM
>To: Jan Beulich; Keir Fraser; Yu, Ke
>Cc: xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>From: Jan Beulich [mailto:JBeulich@novell.com]
>>Sent: 2010年2月24日 0:45
>>
>>>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10
11:57 >>>
>>>On 23/02/2010 10:37, "Jan Beulich"
<JBeulich@novell.com> wrote:
>>>
>>>>> Right. According to the code, there should be no way to
>>this BUG_ON.
>>>>> If it happens, that reveal either bugs of code or the
necessity of
>>>>> adding code to migrate urgent vcpu count. Do you have more
>>>>> information on how this BUG_ON happens?
>>>>
>>>> Obviously there are vCPU-s that get inserted on a run queue
with
>>>> is_urgent set (which according to my reading of Keir's
description
>>>> shouldn't happen). In particular, this
>>>
>>>Is it possible for a polling VCPU to become runnable without it
being
>>>cleared from poll_mask? I suspect maybe that is the problem,
>>and that needs
>>>dealing with, or the proper handling needs to be added to
>>sched_credit.c.
>>
>>I don't think that's the case, at least not exclusively. Using
>>
>>--- a/xen/common/sched_credit.c
>>+++ b/xen/common/sched_credit.c
>>@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
>>
>>     BUG_ON( __vcpu_on_runq(svc) );
>>     BUG_ON( cpu != svc->vcpu->processor );
>>+WARN_ON(svc->vcpu->is_urgent);//temp
>>
>>     list_for_each( iter, runq )
>>     {
>>--- a/xen/common/schedule.c
>>+++ b/xen/common/schedule.c
>>@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change(
>>
>>ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).sche
>>dule_lock));
>>
>>     vcpu_urgent_count_update(v);
>>+WARN_ON(v->is_urgent && new_state <=
RUNSTATE_runnable);//temp
>>
>>     trace_runstate_change(v, new_state);
>>
>>I get pairs of warnings (i.e. each for the same vCPU):
>>
>>(XEN) Xen WARN at schedule.c:142
>>(XEN) Xen call trace:
>>(XEN)    [<ffff82c48011c8d5>] schedule+0x375/0x510
>>(XEN)    [<ffff82c48011deb8>] __do_softirq+0x58/0x80
>>(XEN)    [<ffff82c4801e61e6>] process_softirqs+0x6/0x10
>>
>>(XEN) Xen WARN at sched_credit.c:204
>>(XEN) Xen call trace:
>>(XEN)    [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0
>>(XEN)    [<ffff82c4801497f2>] update_runstate_area+0x102/0x110
>>(XEN)    [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390
>>(XEN)    [<ffff82c48014b1a0>] context_switch+0x760/0xed0
>>(XEN)    [<ffff82c48014913d>] vcpu_kick+0x1d/0x80
>>(XEN)    [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0
>>(XEN)    [<ffff82c4801083a9>] evtchn_send+0x129/0x150
>>(XEN)    [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50
>>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>>(XEN)    [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360
>>(XEN)    [<ffff82c4801e6169>] syscall_enter+0xa9/0xae
>>
>>In schedule() this is always "prev" transitioning to
RUNSTATE_runnable
>>(i.e. _VPF_blocked not set), yet the second call trace shows that
>>_VPF_blocked must have been set at that point (otherwise
>>vcpu_unblock(), tail-called from vcpu_kick(), would not have called
>>vcpu_wake()). If the order wasn't always that shown, or if the two
>>traces got intermixed, this could hint at a race - but they are always
>>that way, which so far I cannot make sense of.
>>
>
>Such race surely exists, since two paths are each updating multiple
>fileds which however are not all protected with same scheduler lock.
>vcpu_unblock manipulates _VPF_blocked w/o acuquiring scheduler
>lock. Then below sequence is possible:
>
>enter schedule() with _VPF_blocked
>...
><-vcpu_unblock clears _VPF_blocked. poll_mask hasn't been cleared yet
>...
>vcpu_runstate_change(
>        prev,
>        (test_bit(_VPF_blocked, &prev->pause_flags) ?
RUNSTATE_blocked :
>         (vcpu_runnable(prev) ? RUNSTATE_runnable : RUNSTATE_offline)),
>        now);
>
>Then RUNSTATE_runnable is chosen. Then vcpu_urgent_count_update
>will set is_urgent flag since poll_mask has bit set. Then vcpu_unblock
>clears poll_mask and then invoke vcpu_wake. vcpu_wake wait for
>scheduler lock and then will see is_urgent flag set for a runnable vcpu.
>
>Here one necessary check is missed in vcpu_urgent_count_update, as
>only polling vcpu in blocked state should be cared in cpuidle. If
there's
>any runnable vcpu in temporary polling state, idle vcpu won't get
>scheduled. is_urgent can be only set when new runstate is blocked, and
>when vcpu is in poll_mask. As runstate change always happen with
>scheduler lock, this can effectively ensure set/clear of is_urgent.
>
>Thanks
>Kevin
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
>>> "Yu, Ke" <ke.yu@intel.com> 24.02.10 07:51 >>> >Here is the patch according to Kevin''s analysis. Jan, could you give it a try? ThanksIndeed, all of the warnings are gone with that after running for almost an hour where previously the BUG_ON() (converted to a WARN_ON()) would hit within minutes. Not sure whether Keir needs another formal submission or could push this in right away. Anyway, I believe the original BUG_ON() causing this issue should also be converted to a WARN_ON(). After all, the state checked doesn''t indicate a problem the system cannot survive - all that it indicates is that the urgent vCPU accounting is broken, and hence deep C states may get entered more or less frequently than supposed to. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel