Wei, Gang
2010-Apr-20 05:39 UTC
[Xen-devel] [PATCH] CPUIDLE: shorten hpet spin_lock holding time
CPUIDLE: shorten hpet spin_lock holding time

Try to reduce spin_lock overhead for deep C state entry/exit. This will
benefit systems with a lot of cpus which need the hpet broadcast to wakeup
from deep C state.

Signed-off-by: Wei Gang <gang.wei@intel.com>

diff -r 7ee8bb40200a xen/arch/x86/hpet.c
--- a/xen/arch/x86/hpet.c	Thu Apr 15 19:11:16 2010 +0100
+++ b/xen/arch/x86/hpet.c	Fri Apr 16 15:05:28 2010 +0800
@@ -186,6 +186,9 @@ static void handle_hpet_broadcast(struct
 
 again:
     ch->next_event = STIME_MAX;
+
+    spin_unlock_irq(&ch->lock);
+
     next_event = STIME_MAX;
     mask = (cpumask_t)CPU_MASK_NONE;
     now = NOW();
@@ -204,10 +207,14 @@ again:
 
     if ( next_event != STIME_MAX )
     {
-        if ( reprogram_hpet_evt_channel(ch, next_event, now, 0) )
+        spin_lock_irq(&ch->lock);
+
+        if ( next_event < ch->next_event &&
+             reprogram_hpet_evt_channel(ch, next_event, now, 0) )
             goto again;
-    }
-    spin_unlock_irq(&ch->lock);
+
+        spin_unlock_irq(&ch->lock);
+    }
 }
 
 static void hpet_interrupt_handler(int irq, void *data,
@@ -656,10 +663,15 @@ void hpet_broadcast_enter(void)
     BUG_ON( !ch );
 
     ASSERT(!local_irq_is_enabled());
-    spin_lock(&ch->lock);
 
     if ( hpet_attach_channel )
+    {
+        spin_lock(&ch->lock);
+
         hpet_attach_channel(cpu, ch);
+
+        spin_unlock(&ch->lock);
+    }
 
     /* Cancel any outstanding LAPIC timer event and disable interrupts. */
     reprogram_timer(0);
@@ -667,6 +679,8 @@ void hpet_broadcast_enter(void)
 
     cpu_set(cpu, ch->cpumask);
 
+    spin_lock(&ch->lock);
+
     /* reprogram if current cpu expire time is nearer */
     if ( this_cpu(timer_deadline_end) < ch->next_event )
         reprogram_hpet_evt_channel(ch, this_cpu(timer_deadline_end), NOW(), 1);
@@ -683,8 +697,6 @@ void hpet_broadcast_exit(void)
         return;
 
     BUG_ON( !ch );
-
-    spin_lock_irq(&ch->lock);
 
     if ( cpu_test_and_clear(cpu, ch->cpumask) )
     {
@@ -693,14 +705,22 @@
         if ( !reprogram_timer(this_cpu(timer_deadline_start)) )
             raise_softirq(TIMER_SOFTIRQ);
 
+        spin_lock_irq(&ch->lock);
+
         if ( cpus_empty(ch->cpumask) && ch->next_event != STIME_MAX )
             reprogram_hpet_evt_channel(ch, STIME_MAX, 0, 0);
+
+        spin_unlock_irq(&ch->lock);
     }
 
     if ( hpet_detach_channel )
+    {
+        spin_lock_irq(&ch->lock);
+
         hpet_detach_channel(cpu);
 
-    spin_unlock_irq(&ch->lock);
+        spin_unlock_irq(&ch->lock);
+    }
 }
 
 int hpet_broadcast_is_available(void)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2010-Apr-20 12:49 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
Is this a measurable win? The newer locking looks like it could be dodgy on
32-bit Xen: the 64-bit reads of timer_deadline_{start,end} will be
non-atomic and unsynchronised, so you can read garbage. Even on 64-bit Xen
you can read stale values. I'll be surprised if you got a performance win
from chopping up critical regions in individual functions like that anyway.

 -- Keir

On 20/04/2010 06:39, "Wei, Gang" <gang.wei@intel.com> wrote:

> CPUIDLE: shorten hpet spin_lock holding time
>
> Try to reduce spin_lock overhead for deep C state entry/exit. This will
> benefit systems with a lot of cpus which need the hpet broadcast to wakeup
> from deep C state.
>
> Signed-off-by: Wei Gang <gang.wei@intel.com>
>
> [full patch snipped]
Wei, Gang
2010-Apr-20 14:04 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Tuesday, 2010-4-20 8:49 PM, Keir Fraser wrote:
> Is this a measurable win? The newer locking looks like it could be
> dodgy on 32-bit Xen: the 64-bit reads of timer_deadline_{start,end}
> will be non-atomic and unsynchronised so you can read garbage. Even
> on 64-bit Xen you can read stale values. I'll be surprised if you got
> a performance win from chopping up critical regions in individual
> functions like that anyway.

First of all, this is a measurable power win for the cpu-overcommitted idle
case (vcpu:pcpu > 2:1, pcpu >= 32, guests are non-tickless kernels).

As to the non-atomic access to timer_deadline_{start,end}, it was already
there before this patch: the fields are not protected by the hpet lock.
Shall we add an rw_lock for each of timer_deadline_{start,end}? That can be
done separately.

Jimmy
Keir Fraser
2010-Apr-20 14:21 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 20/04/2010 15:04, "Wei, Gang" <gang.wei@intel.com> wrote:

> On Tuesday, 2010-4-20 8:49 PM, Keir Fraser wrote:
>> Is this a measurable win? The newer locking looks like it could be
>> dodgy on 32-bit Xen: the 64-bit reads of timer_deadline_{start,end}
>> will be non-atomic and unsynchronised so you can read garbage. Even
>> on 64-bit Xen you can read stale values. I'll be surprised if you got
>> a performance win from chopping up critical regions in individual
>> functions like that anyway.
>
> First of all, this is a measurable power win for cpu overcommitment idle
> case (vcpu:pcpu > 2:1, pcpu >= 32, guests are non-tickless kernels).

So lots of short sleep periods, and possibly only a very few HPET channels
to share? How prevalent is the always-running APIC timer now, and is that
going to be supported in future processors?

> As to the non-atomic access to timer_deadline_{start,end}, it should
> already be there before this patch. It is not protected by the hpet lock.
> Shall we add rw_lock for each timer_deadline_{start,end}? This can be
> done separately.

The bug wasn't previously there, since the fields will not be read unless
the cpu is in ch->cpumask, which (was) protected by ch->lock. That was
sufficient since a CPU would not modify timer_deadline_{start,end} between
hpet_broadcast_enter and hpet_broadcast_exit. After your patch,
handle_hpet_broadcast is no longer fully synchronised against those
functions.

 -- Keir
Wei, Gang
2010-Apr-20 15:20 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Tuesday, 2010-4-20 10:22 PM, Keir Fraser wrote:
> On 20/04/2010 15:04, "Wei, Gang" <gang.wei@intel.com> wrote:
>
>> On Tuesday, 2010-4-20 8:49 PM, Keir Fraser wrote:
>>> Is this a measurable win? The newer locking looks like it could be
>>> dodgy on 32-bit Xen: the 64-bit reads of timer_deadline_{start,end}
>>> will be non-atomic and unsynchronised so you can read garbage. Even
>>> on 64-bit Xen you can read stale values. I'll be surprised if you
>>> got a performance win from chopping up critical regions in
>>> individual functions like that anyway.
>>
>> First of all, this is a measurable power win for cpu overcommitment
>> idle case (vcpu:pcpu > 2:1, pcpu >= 32, guests are non-tickless
>> kernels).
>
> So lots of short sleep periods, and possibly only a very few HPET
> channels to share?

Yes.

> How prevalent is always-running APIC timer now,
> and is that going to be supported in future processors?

The always-running APIC timer just started with Westmere, and I personally
guess it will be supported in future processors. But there are still a lot
of existing systems that need hpet broadcast wakeup.

>> As to the non-atomic access to timer_deadline_{start,end}, it should
>> already be there before this patch. It is not protected by the hpet
>> lock. Shall we add rw_lock for each timer_deadline_{start,end}? This
>> can be done separately.
>
> The bug isn't previously there, since the fields will not be read
> unless the cpu is in ch->cpumask, which (was) protected by ch->lock.
> That was sufficient since a CPU would not modify
> timer_deadline_{start,end} between hpet_broadcast_enter and
> hpet_broadcast_exit. After your patch, handle_hpet_broadcast is no
> longer fully synchronised against those functions.

Ok, you are right. So we just need to make sure the cpu is in ch->cpumask
while reading timer_deadline_{start,end}. I will make some changes to my
patch.

Jimmy
Wei, Gang
2010-Apr-20 16:05 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
Resend.

CPUIDLE: shorten hpet spin_lock holding time

Try to reduce spin_lock overhead for deep C state entry/exit. This will
benefit systems with a lot of cpus which need the hpet broadcast to wakeup
from deep C state.

Signed-off-by: Wei Gang <gang.wei@intel.com>

diff -r dbf0fd95180f xen/arch/x86/hpet.c
--- a/xen/arch/x86/hpet.c	Tue Apr 20 14:32:53 2010 +0100
+++ b/xen/arch/x86/hpet.c	Tue Apr 20 23:48:19 2010 +0800
@@ -186,6 +186,9 @@ static void handle_hpet_broadcast(struct
 
 again:
     ch->next_event = STIME_MAX;
+
+    spin_unlock_irq(&ch->lock);
+
     next_event = STIME_MAX;
     mask = (cpumask_t)CPU_MASK_NONE;
     now = NOW();
@@ -193,10 +196,17 @@ again:
     /* find all expired events */
     for_each_cpu_mask(cpu, ch->cpumask)
     {
-        if ( per_cpu(timer_deadline_start, cpu) <= now )
-            cpu_set(cpu, mask);
-        else if ( per_cpu(timer_deadline_end, cpu) < next_event )
-            next_event = per_cpu(timer_deadline_end, cpu);
+        spin_lock_irq(&ch->lock);
+
+        if ( cpumask_test_cpu(cpu, ch->cpumask) )
+        {
+            if ( per_cpu(timer_deadline_start, cpu) <= now )
+                cpu_set(cpu, mask);
+            else if ( per_cpu(timer_deadline_end, cpu) < next_event )
+                next_event = per_cpu(timer_deadline_end, cpu);
+        }
+
+        spin_unlock_irq(&ch->lock);
     }
 
     /* wakeup the cpus which have an expired event. */
@@ -204,10 +214,14 @@ again:
 
     if ( next_event != STIME_MAX )
     {
-        if ( reprogram_hpet_evt_channel(ch, next_event, now, 0) )
+        spin_lock_irq(&ch->lock);
+
+        if ( next_event < ch->next_event &&
+             reprogram_hpet_evt_channel(ch, next_event, now, 0) )
             goto again;
-    }
-    spin_unlock_irq(&ch->lock);
+
+        spin_unlock_irq(&ch->lock);
+    }
 }
 
 static void hpet_interrupt_handler(int irq, void *data,
@@ -656,17 +670,23 @@ void hpet_broadcast_enter(void)
     BUG_ON( !ch );
 
     ASSERT(!local_irq_is_enabled());
-    spin_lock(&ch->lock);
 
     if ( hpet_attach_channel )
+    {
+        spin_lock(&ch->lock);
+
         hpet_attach_channel(cpu, ch);
+
+        spin_unlock(&ch->lock);
+    }
 
     /* Cancel any outstanding LAPIC timer event and disable interrupts. */
     reprogram_timer(0);
     disable_APIC_timer();
 
+    spin_lock(&ch->lock);
+
     cpu_set(cpu, ch->cpumask);
-
     /* reprogram if current cpu expire time is nearer */
     if ( this_cpu(timer_deadline_end) < ch->next_event )
         reprogram_hpet_evt_channel(ch, this_cpu(timer_deadline_end), NOW(), 1);
@@ -684,23 +704,28 @@ void hpet_broadcast_exit(void)
 
     BUG_ON( !ch );
 
+    /* Reprogram the deadline; trigger timer work now if it has passed. */
+    enable_APIC_timer();
+    if ( !reprogram_timer(this_cpu(timer_deadline_start)) )
+        raise_softirq(TIMER_SOFTIRQ);
+
     spin_lock_irq(&ch->lock);
 
-    if ( cpu_test_and_clear(cpu, ch->cpumask) )
-    {
-        /* Reprogram the deadline; trigger timer work now if it has passed. */
-        enable_APIC_timer();
-        if ( !reprogram_timer(this_cpu(timer_deadline_start)) )
-            raise_softirq(TIMER_SOFTIRQ);
-
-        if ( cpus_empty(ch->cpumask) && ch->next_event != STIME_MAX )
-            reprogram_hpet_evt_channel(ch, STIME_MAX, 0, 0);
-    }
+    cpu_clear(cpu, ch->cpumask);
+    if ( cpus_empty(ch->cpumask) && ch->next_event != STIME_MAX )
+        reprogram_hpet_evt_channel(ch, STIME_MAX, 0, 0);
+
+    spin_unlock_irq(&ch->lock);
+
     if ( hpet_detach_channel )
+    {
+        spin_lock_irq(&ch->lock);
+
         hpet_detach_channel(cpu);
 
-    spin_unlock_irq(&ch->lock);
+        spin_unlock_irq(&ch->lock);
+    }
 }
 
 int hpet_broadcast_is_available(void)
Keir Fraser
2010-Apr-21 08:09 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 20/04/2010 17:05, "Wei, Gang" <gang.wei@intel.com> wrote:

> Resend.
>
> CPUIDLE: shorten hpet spin_lock holding time
>
> Try to reduce spin_lock overhead for deep C state entry/exit. This will
> benefit systems with a lot of cpus which need the hpet broadcast to wakeup
> from deep C state.
>
> Signed-off-by: Wei Gang <gang.wei@intel.com>

It fixes the unsafe accesses to timer_deadline_{start,end}, but I still
think this optimisation is misguided and also unsafe. There is nothing to
stop new CPUs being added to ch->cpumask after you start scanning
ch->cpumask. For example, some new CPU which has a timer_deadline_end
greater than ch->next_event, so it does not reprogram the HPET. But
handle_hpet_broadcast is already mid-scan and misses this new CPU, so it
does not reprogram the HPET either. Hence no timer fires for the new CPU
and it misses its deadline.

Really I think a better approach than something like this patch would be to
better advertise the timer_slop=xxx Xen boot parameter for power-saving
scenarios. I wonder what your numbers look like if you re-run your
benchmark with (say) timer_slop=10000000 (i.e., 10ms slop) on the Xen
command line?

 -- Keir
Wei, Gang
2010-Apr-21 09:06 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Wednesday, 2010-4-21 4:10 PM, Keir Fraser wrote:
> It fixes the unsafe accesses to timer_deadline_{start,end} but I
> still think this optimisation is misguided and also unsafe. There is
> nothing to stop new CPUs being added to ch->cpumask after you start
> scanning ch->cpumask. For example, some new CPU which has a
> timer_deadline_end greater than ch->next_event, so it does not
> reprogram the HPET. But handle_hpet_broadcast is already mid-scan and
> misses this new CPU, so it does not reprogram the HPET either. Hence
> no timer fires for the new CPU and it misses its deadline.

This will not happen. ch->next_event has already been set to STIME_MAX
before the scan of ch->cpumask starts, so the new CPU with the smallest
timer_deadline_end will reprogram the HPET successfully.

> Really I think a better approach than something like this patch would
> be to better advertise the timer_slop=xxx Xen boot parameter for
> power-saving scenarios. I wonder what your numbers look like if you
> re-run your benchmark with (say) timer_slop=10000000 (i.e., 10ms
> slop) on the Xen command line?

I think that is another story. Enlarging timer_slop is one way to align and
reduce break events; it does save power, but possibly brings larger
latencies. What I am trying to address here is how to reduce spin_lock
overhead in the idle entry/exit path. The spin_lock overhead, along with
other overheads, caused >25% cpu utilization on a system with 32 pcpus and
64 vcpus while all guests were idle.

Jimmy
Keir Fraser
2010-Apr-21 09:25 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 21/04/2010 10:06, "Wei, Gang" <gang.wei@intel.com> wrote:

>> It fixes the unsafe accesses to timer_deadline_{start,end} but I
>> still think this optimisation is misguided and also unsafe. There is
>> nothing to stop new CPUs being added to ch->cpumask after you start
>> scanning ch->cpumask. For example, some new CPU which has a
>> timer_deadline_end greater than ch->next_event, so it does not
>> reprogram the HPET. But handle_hpet_broadcast is already mid-scan and
>> misses this new CPU, so it does not reprogram the HPET either. Hence
>> no timer fires for the new CPU and it misses its deadline.
>
> This will not happen. ch->next_event has already been set as STIME_MAX
> before start scanning ch->cpumask, so the new CPU with smallest
> timer_deadline_end will reprogram the HPET successfully.

Okay, then CPU A executes hpet_broadcast_enter() and programs the HPET
channel for its timeout X. Meanwhile the concurrently executing
handle_hpet_broadcast misses CPU A but finds some other CPU B with timeout
Y much later than X, and erroneously programs the HPET channel with Y,
causing CPU A to miss its deadline by an arbitrary amount.

I dare say I can carry on finding races. :-)

> I think it is another story. Enlarging timer_slop is one way to align &
> reduce break events, it does have effects to save power and possibly
> bring larger latency. What I am trying to address here is how to reduce
> spin_lock overheads in the idle entry/exit path. The spin_lock overheads
> along with other overheads in the system with 32pcpu/64vcpu caused >25%
> cpu utilization while all guests are idle.

So far it's looked to me like a correctness/performance tradeoff. :-D

 -- Keir
Wei, Gang
2010-Apr-21 09:36 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Wednesday, 2010-4-21 5:26 PM, Keir Fraser wrote:
> On 21/04/2010 10:06, "Wei, Gang" <gang.wei@intel.com> wrote:
>
>>> It fixes the unsafe accesses to timer_deadline_{start,end} but I
>>> still think this optimisation is misguided and also unsafe. There is
>>> nothing to stop new CPUs being added to ch->cpumask after you start
>>> scanning ch->cpumask. For example, some new CPU which has a
>>> timer_deadline_end greater than ch->next_event, so it does not
>>> reprogram the HPET. But handle_hpet_broadcast is already mid-scan
>>> and misses this new CPU, so it does not reprogram the HPET either.
>>> Hence no timer fires for the new CPU and it misses its deadline.
>>
>> This will not happen. ch->next_event has already been set as
>> STIME_MAX before start scanning ch->cpumask, so the new CPU with
>> smallest timer_deadline_end will reprogram the HPET successfully.
>
> Okay, then CPU A executes hpet_broadcast_enter() and programs the HPET
> channel for its timeout X. Meanwhile concurrently executing
> handle_hpet_broadcast misses CPU A but finds some other CPU B with
> timeout Y much later than X, and erroneously programs the HPET
> channel with Y, causing CPU A to miss its deadline by an arbitrary
> amount.

That is also not possible: handle_hpet_broadcast reprograms the HPET only
while next_event < ch->next_event. Here next_event is Y, and ch->next_event
is X. :-)

> I dare say I can carry on finding races. :-)

Please go on... Both race conditions you mentioned were considered before I
resent the patch. :-D
Keir Fraser
2010-Apr-21 09:52 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 21/04/2010 10:36, "Wei, Gang" <gang.wei@intel.com> wrote:

>> Okay, then CPU A executes hpet_broadcast_enter() and programs the HPET
>> channel for its timeout X. Meanwhile concurrently executing
>> handle_hpet_broadcast misses CPU A but finds some other CPU B with
>> timeout Y much later than X, and erroneously programs the HPET
>> channel with Y, causing CPU A to miss its deadline by an arbitrary
>> amount.
>
> It is also not possible. handle_hpet_broadcast reprograms the HPET only
> while next_event < ch->next_event. Here next_event is Y, and
> ch->next_event is X. :-)

Ah yes, I spotted that just after I replied! Okay, I'm almost convinced
now, but...

>> I dare say I can carry on finding races. :-)
>
> Please go on... The two race conditions you mentioned were considered
> before I resent the patch. :-D

Okay, one concern I still have is over possible races around
cpuidle_wakeup_mwait(). It makes use of a cpumask, cpuidle_mwait_flags,
avoiding an IPI to cpus in the mask. However, there is nothing to stop the
CPU having cleared itself from that cpumask before cpuidle does the write
to softirq_pending. In that case, even assuming the CPU is now non-idle and
so the wakeup is spurious, a subsequent attempt to
raise_softirq(TIMER_SOFTIRQ) will incorrectly not IPI because the flag is
already set in softirq_pending?

This race would be benign in the current locking strategy (without your
patch) because the CPU that has left mwait_idle_with_hints() cannot get
further than hpet_broadcast_exit(): it will spin on ch->lock, and there is
a guaranteed check of softirq_pending at some point after that. However,
your patch removes that synchronisation because ch->lock is not held across
cpuidle_wakeup_mwait(). What do you think to that?

 -- Keir
Keir Fraser
2010-Apr-21 10:03 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 21/04/2010 10:52, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> Okay, one concern I still have is over possible races around
> cpuidle_wakeup_mwait(). It makes use of a cpumask cpuidle_mwait_flags,
> avoiding an IPI to cpus in the mask. However, there is nothing to stop
> the CPU having cleared itself from that cpumask before cpuidle does the
> write to softirq_pending. In that case, even assuming the CPU is now
> non-idle and so wakeup is spurious, a subsequent attempt to
> raise_softirq(TIMER_SOFTIRQ) will incorrectly not IPI because the flag
> is already set in softirq_pending?

Oh, another one, which can **even occur without your patch**: CPU A adds
itself to cpuidle_mwait_flags while cpuidle_wakeup_mwait() is running. That
function doesn't see CPU A in its first read of the cpumask, so it does not
set TIMER_SOFTIRQ for CPU A. But it then *does* see CPU A in its final read
of the cpumask, and hence clears A from the caller's mask. Hence the caller
does not pass CPU A to cpumask_raise_softirq(). Hence CPU A is erroneously
not woken.

Isn't the synchronisation around those mwait/monitor functions just
inherently broken, even without your patch, and your patch just makes it
worse? :-)

 -- Keir
Wei, Gang
2010-Apr-22 03:59 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Wednesday, 2010-4-21 6:03 PM, Keir Fraser wrote:
> On 21/04/2010 10:52, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>
>> Okay, one concern I still have is over possible races around
>> cpuidle_wakeup_mwait(). It makes use of a cpumask
>> cpuidle_mwait_flags, avoiding an IPI to cpus in the mask. However,
>> there is nothing to stop the CPU having cleared itself from that
>> cpumask before cpuidle does the write to softirq_pending. In that
>> case, even assuming the CPU is now non-idle and so wakeup is
>> spurious, a subsequent attempt to raise_softirq(TIMER_SOFTIRQ) will
>> incorrectly not IPI because the flag is already set in
>> softirq_pending?

If a CPU cleared itself from cpuidle_mwait_flags, then that CPU didn't need
an IPI to be woken, and one useless write to softirq_pending doesn't have
any side effect. So this case should be acceptable.

> Oh, another one, which can **even occur without your patch**: CPU A
> adds itself to cpuidle_mwait_flags while cpuidle_wakeup_mwait() is
> running. That function doesn't see CPU A in its first read of the
> cpumask so it does not set TIMER_SOFTIRQ for CPU A. But it then
> *does* see CPU A in its final read of the cpumask, and hence clears A
> from the caller's mask. Hence the caller does not pass CPU A to
> cpumask_raise_softirq(). Hence CPU A is erroneously not woken.

Nice shot. This issue can be resolved by keeping a snapshot of
cpuidle_mwait_flags in cpuidle_wakeup_mwait and using it.

> Isn't the synchronisation around those mwait/monitor functions just
> inherently broken, even without your patch, and your patch just makes
> it worse? :-)

How about now, after fixing the second concern you mentioned?

Jimmy
Keir Fraser
2010-Apr-22 07:22 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 22/04/2010 04:59, "Wei, Gang" <gang.wei@intel.com> wrote:

>>> Okay, one concern I still have is over possible races around
>>> cpuidle_wakeup_mwait(). It makes use of a cpumask
>>> cpuidle_mwait_flags, avoiding an IPI to cpus in the mask. However,
>>> there is nothing to stop the CPU having cleared itself from that
>>> cpumask before cpuidle does the write to softirq_pending. In that
>>> case, even assuming the CPU is now non-idle and so wakeup is
>>> spurious, a subsequent attempt to raise_softirq(TIMER_SOFTIRQ) will
>>> incorrectly not IPI because the flag is already set in
>>> softirq_pending?
>
> If a CPU cleared itself from cpuidle_mwait_flags, then this CPU didn't
> need an IPI to be woken. And one useless write to softirq_pending
> doesn't have any side effect. So this case should be acceptable.

That's not totally convincing. The write to softirq_pending has one extra
side effect: it is possible that the next time TIMER_SOFTIRQ really needs
to be raised on that CPU, it will not receive notification via IPI, because
the flag is already set in its softirq_pending mask.

Hm, let me see if I can come up with a patch for this and post it for you.

 -- Keir
Keir Fraser
2010-Apr-22 08:19 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 22/04/2010 08:22, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> If a CPU cleared itself from cpuidle_mwait_flags, then this CPU didn't
>> need an IPI to be woken. And one useless write to softirq_pending
>> doesn't have any side effect. So this case should be acceptable.
>
> That's not totally convincing. The write to softirq_pending has one
> extra side effect: it is possible that the next time TIMER_SOFTIRQ
> really needs to be raised on that CPU, it will not receive notification
> via IPI, because the flag is already set in its softirq_pending mask.
>
> Hm, let me see if I can come up with a patch for this and post it for
> you.

How about the attached patch? It MWAITs on a completely new flag field,
avoiding the IPI-avoidance semantics of softirq_pending. It also folds in
your patch, and it does wakeup-waiting checks on timer_deadline_start, that
being the field that initiates wakeup via the MONITORed memory region.

 -- Keir
Keir Fraser
2010-Apr-22 08:21 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 20/04/2010 17:05, "Wei, Gang" <gang.wei@intel.com> wrote:

> Resend.
>
> CPUIDLE: shorten hpet spin_lock holding time
>
> Try to reduce spin_lock overhead for deep C state entry/exit. This will
> benefit systems with a lot of cpus which need the hpet broadcast to wakeup
> from deep C state.
>
> Signed-off-by: Wei Gang <gang.wei@intel.com>

I was going to ask: do you still get the decent power-saving win from this
new version of the patch? You potentially acquire/release the lock a lot
more in this version, since you do so inside the loop over the cpumask in
handle_hpet_broadcast().

 -- Keir
Keir Fraser
2010-Apr-22 08:23 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On 22/04/2010 09:19, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> Hm, let me see if I can come up with a patch for this and post it for
>> you.
>
> How about the attached patch? It MWAITs on a completely new flag field,
> avoiding the IPI-avoidance semantics of softirq_pending. It also folds
> in your patch. It also does wakeup-waiting checks on
> timer_deadline_start, that being the field that initiates wakeup via the
> MONITORed memory region.

...If you do think it looks okay, could you also test it out on relevant
hardware? :-)

Thanks,
Keir
Wei, Gang
2010-Apr-29 11:08 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Thursday, 2010-4-22 4:23 PM, Keir Fraser wrote:
> On 22/04/2010 09:19, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>> How about the attached patch? It MWAITs on a completely new flag
>> field, avoiding the IPI-avoidance semantics of softirq_pending. It
>> also folds in your patch. It also does wakeup-waiting checks on
>> timer_deadline_start, that being the field that initiates wakeup via
>> the MONITORed memory region.
>
> ...If you do think it looks okay, could you also test it out on
> relevant hardware? :-)

I made some modifications to this patch: I moved the per_cpu mwait_wakeup
flag into irq_cpustat_t to make it __cacheline_aligned, and added a check
for the timer_deadline_start == 0 case (meaning no timer in the queue; it
took me quite a lot of time to find out that this is necessary). Tested ok.

Jimmy
Wei, Gang
2010-Apr-29 11:14 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: shorten hpet spin_lock holding time
On Thursday, 2010-4-22 4:21 PM, Keir Fraser wrote:
> On 20/04/2010 17:05, "Wei, Gang" <gang.wei@intel.com> wrote:
>
>> Resend.
>>
>> CPUIDLE: shorten hpet spin_lock holding time
>>
>> Try to reduce spin_lock overhead for deep C state entry/exit. This
>> will benefit systems with a lot of cpus which need the hpet
>> broadcast to wakeup from deep C state.
>>
>> Signed-off-by: Wei Gang <gang.wei@intel.com>
>
> I was going to ask: do you still get the decent power-saving win from
> this new version of the patch? You acquire/release the lock a lot more
> potentially in this version, since you do so inside the loop over
> cpumask in handle_hpet_broadcast().

Yes, I tested it and can still get the decent power-saving win from it. And
it should not have any race issues after the mwait-wakeup re-implementation
patch.

Jimmy