The existing credit scheduler is not power aware. To achieve better power savings with negligible performance impact, the following areas may be tweaked; they are listed here for comments first. The goal is not to save power at the expense of performance, e.g. we don't want to prevent migration while there are free cpus and vcpus still pending on runqueues. But when the free computing power exceeds the current requirement, a power-aware policy can step in and choose a less power-intrusive decision. Even in the latter case it is of course controllable via a scheduler parameter like csched_private.power, exposed to the user.

----
a) when there are more idle cpus than required

a.1) csched_cpu_pick
    The existing policy is to pick the cpu with the most idle neighbours, to avoid shared-resource contention among cores or threads. From a power point of view, however, a package C-state saves much more power than a per-core C-state. From that angle it might be better to keep an idle package continuously idle, and instead pick idle cores/threads whose neighbours are already busy, if csched_private.power is set. The performance/watt ratio improves, though absolute performance takes a small hit (a rough sketch is appended as a P.S. below).

a.2) csched_vcpu_wake
    Similar to the above: instead of blindly kicking all idle cpus in a rush, idle cpus can be kicked selectively with the power factor taken into account.

----
b) when a physical cpu resides in an idle C-state
    Avoid unnecessary work, to prolong C-state residency. For example, the accounting process (the tick timer, more specifically) can be stopped before C-state entry and resumed after wake-up. The point is that no accounting is required while the current cpu is idle, and any runqueue change triggered from another cpu incurs an IPI to this cpu, which brings it back to C0 with accounting resumed. Since the residency period may be longer than the accounting period (30ms), csched_tick should be aware of the resume event so it can adjust the elapsed credits.

----
c) when the cpu's frequency is scaled dynamically
    When cpufreq/Px is enabled, the cpu's frequency is adjusted across different operating points by an on-demand governor. So csched_acct may need to take the frequency differences among cpus into consideration, and the total available credit won't be a simple 300 * online cpu_number.

----
There are of course plenty of research areas for adding more power factors into the scheduler policy. But the above is the fundamental stuff which we believe would help the scheduler understand power requirements without hurting performance/watt. Comments are appreciated.

Thanks,
Kevin
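P.S. A minimal sketch of the power-biased pick in a.1. The topology model, the idle[] map, idle_neighbours(), pick_cpu() and the power_aware flag standing in for csched_private.power are all invented for illustration; this is not the existing csched_cpu_pick() code.

/*
 * Sketch only -- not the real csched_cpu_pick().  Topology is reduced to
 * "CPUS_PER_PKG consecutive cpus share a package"; idle[] is the current
 * per-cpu idle map.
 */
#include <stdbool.h>

#define NR_CPUS       8
#define CPUS_PER_PKG  4

static int idle_neighbours(const bool idle[], int cpu)
{
    int first = (cpu / CPUS_PER_PKG) * CPUS_PER_PKG, n = 0;

    for (int peer = first; peer < first + CPUS_PER_PKG; peer++)
        if (peer != cpu && idle[peer])
            n++;
    return n;
}

/*
 * Default policy: pick the idle cpu with the MOST idle neighbours (least
 * cache/bus contention).  Power policy (power_aware set): pick the idle
 * cpu with the FEWEST idle neighbours, so that completely idle packages
 * are left alone and can stay in a deep package C-state.
 */
static int pick_cpu(const bool idle[], bool power_aware)
{
    int best = -1, best_score = power_aware ? CPUS_PER_PKG : -1;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!idle[cpu])
            continue;
        int score = idle_neighbours(idle, cpu);
        if (( power_aware && score < best_score) ||
            (!power_aware && score > best_score)) {
            best = cpu;
            best_score = score;
        }
    }
    return best;   /* -1: no idle cpu at all, fall back to normal placement */
}

The only difference from the default is the direction of the comparison: the power policy consolidates work onto already-woken packages instead of spreading it across idle ones.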
>From: Tian, Kevin
>Sent: 19 June 2008 12:52
>----
>
>c) when the cpu's frequency is scaled dynamically
>    When cpufreq/Px is enabled, the cpu's frequency is adjusted
>across different operating points by an on-demand governor. So
>csched_acct may need to take the frequency differences among cpus
>into consideration, and the total available credit won't be a
>simple 300 * online cpu_number.
>

The above is not accurate. The credit scheduler cannot anticipate the cpu frequency in the next 30ms accounting phase, so it can still only assume a total credit budget of 300 * online cpu_number for allocation. The real question is whether the credit subtracted in the 10ms per-vcpu accounting tick should be multiplied by a frequency ratio. Two issues come with that approach:

a) the total budget would then be counted inconsistently with the per-vcpu accounting, which may give the credit scheduler an inaccurate picture for decisions like balancing;

b) it is not easy to get an accurate frequency ratio for the target cpu. Some cpus may not support it, the on-demand governor runs asynchronously with the credit tick timer, and querying a remote cpu normally incurs inter-cpu traffic. Perhaps the on-demand governor could be aligned with the credit tick timer at the same interval, which would solve the frequency-query issue.

It looks more complex than initially thought. Also, since the on-demand governor will scale the frequency back up immediately when there is more real work to be done, this may not have a real impact in practice. We'll keep an eye on it in future tuning to see whether it matters. :-)

Thanks,
Kevin
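P.S. For concreteness, the option weighed (and deferred) above would amount to something like the following at each 10ms per-vcpu accounting tick. The structure, the CREDITS_PER_TICK constant and the freq_ratio_pct() helper are placeholders invented for the sketch, not the actual credit scheduler code.

/* Illustrative only: frequency-scaled credit debit at the 10ms vcpu tick. */

#define CREDITS_PER_TICK  100   /* placeholder: 300 credits over 3 ticks */

struct sched_vcpu {
    int credit;
};

/* Placeholder: current P-state frequency as a percentage of max. */
static int freq_ratio_pct(int cpu)
{
    (void)cpu;
    return 100;   /* stub; a real version would have to query cpufreq */
}

static void vcpu_acct_tick(struct sched_vcpu *svc, int cpu, int freq_aware)
{
    int debit = CREDITS_PER_TICK;

    /*
     * Option under discussion: a vcpu that ran on a down-scaled cpu got
     * less real work done, so charge it proportionally less credit.
     * Issue (a): the 300 * online cpu_number budget is still handed out
     * at "100%", so debits and budget become inconsistent.
     */
    if (freq_aware)
        debit = debit * freq_ratio_pct(cpu) / 100;

    svc->credit -= debit;
}

Issue (b) is exactly why freq_ratio_pct() is hard to do well: the on-demand governor changes the operating point asynchronously, and reading a remote cpu's current frequency normally costs inter-cpu traffic.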
On 19/6/08 05:51, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> b) when a physical cpu resides in an idle C-state
>     Avoid unnecessary work, to prolong C-state residency. For
> example, the accounting process (the tick timer, more specifically)
> can be stopped before C-state entry and resumed after wake-up. The
> point is that no accounting is required while the current cpu is
> idle, and any runqueue change triggered from another cpu incurs an
> IPI to this cpu, which brings it back to C0 with accounting resumed.
> Since the residency period may be longer than the accounting period
> (30ms), csched_tick should be aware of the resume event so it can
> adjust the elapsed credits.

Yes, this should be easy low-hanging fruit to fix.

> c) when the cpu's frequency is scaled dynamically
>     When cpufreq/Px is enabled, the cpu's frequency is adjusted
> across different operating points by an on-demand governor. So
> csched_acct may need to take the frequency differences among cpus
> into consideration, and the total available credit won't be a
> simple 300 * online cpu_number.

Not sure. I think the current governor runs frequently enough to react to the scheduler (i.e., it tries to keep the CPU non-idle by downscaling frequency, and upscales frequency when the CPU gets busy, both over sub-second timescales). Does it then make sense to have the scheduler react to the governor? Sounds like it could be a weird feedback loop.

 -- Keir
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: 19 June 2008 15:32
>
>> c) when the cpu's frequency is scaled dynamically
>>     When cpufreq/Px is enabled, the cpu's frequency is adjusted
>> across different operating points by an on-demand governor. So
>> csched_acct may need to take the frequency differences among cpus
>> into consideration, and the total available credit won't be a
>> simple 300 * online cpu_number.
>
>Not sure. I think the current governor runs frequently enough to
>react to the scheduler (i.e., it tries to keep the CPU non-idle by
>downscaling frequency, and upscales frequency when the CPU gets
>busy, both over sub-second timescales).

Yes, normally it works at the 20ms level.

>Does it then make sense to have the scheduler react to the
>governor? Sounds like it could be a weird feedback loop.
>

Good suggestion. We're considering adding more inputs from key components into the on-demand governor, instead of simply polling the busy ratio at a fixed interval to decide frequency changes. For example, when one cpu pulls a vcpu from another runqueue, that is an indicator that its current frequency may not be sufficient, and it would be better to scale to the maximum immediately instead of waiting for the next 20ms check timer. Other indicators could be interrupts, events, etc. You raise a good point. :-)

Thanks,
Kevin
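P.S. A rough illustration of the kind of hook we have in mind. All names are invented for the sketch; none of this is the existing cpufreq/governor interface.

/* Sketch only: event-driven hints from the scheduler to the governor. */

enum governor_hint {
    HINT_VCPU_MIGRATED_IN,   /* work was just pulled onto this cpu       */
    HINT_IRQ_STORM,          /* other indicators: interrupts, events ... */
};

struct governor_state {
    int cur_freq_pct;        /* current operating point, % of max freq   */
};

/*
 * Instead of waiting for the next 20ms busy-ratio sample, certain
 * scheduler events request the maximum operating point right away.  The
 * periodic governor still scales the frequency back down later if the
 * expected load does not materialise.
 */
static void governor_hint(struct governor_state *g, enum governor_hint h)
{
    switch (h) {
    case HINT_VCPU_MIGRATED_IN:
    case HINT_IRQ_STORM:
        g->cur_freq_pct = 100;   /* jump to P0 / max frequency now */
        break;
    }
}

/* Called from the (hypothetical) load-balance path after stealing a vcpu. */
static void on_vcpu_pulled(struct governor_state *g)
{
    governor_hint(g, HINT_VCPU_MIGRATED_IN);
}

Whether such event-driven hints actually beat plain 20ms busy-ratio polling is what the experiments need to show.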
On 19/6/08 09:03, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> Does it then make sense to have the scheduler react to the
>> governor? Sounds like it could be a weird feedback loop.
>
> Good suggestion. We're considering adding more inputs from key
> components into the on-demand governor, instead of simply polling
> the busy ratio at a fixed interval to decide frequency changes. For
> example, when one cpu pulls a vcpu from another runqueue, that is
> an indicator that its current frequency may not be sufficient, and
> it would be better to scale to the maximum immediately instead of
> waiting for the next 20ms check timer. Other indicators could be
> interrupts, events, etc. You raise a good point. :-)

I see. This specific example doesn't sound unreasonable. I suppose experimental data will show what works and what doesn't.

 -- Keir
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: 19 June 2008 16:10
>On 19/6/08 09:03, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> Does it then make sense to have the scheduler react to the
>>> governor? Sounds like it could be a weird feedback loop.
>>
>> Good suggestion. We're considering adding more inputs from key
>> components into the on-demand governor, instead of simply polling
>> the busy ratio at a fixed interval to decide frequency changes.
>> For example, when one cpu pulls a vcpu from another runqueue, that
>> is an indicator that its current frequency may not be sufficient,
>> and it would be better to scale to the maximum immediately instead
>> of waiting for the next 20ms check timer. Other indicators could
>> be interrupts, events, etc. You raise a good point. :-)
>
>I see. This specific example doesn't sound unreasonable. I suppose
>experimental data will show what works and what doesn't.
>

Yes, and we'll start some experiments soon. We'll let you know once we have some concrete data.

Thanks,
Kevin
> c) when the cpu's frequency is scaled dynamically
>     When cpufreq/Px is enabled, the cpu's frequency is adjusted
> across different operating points by an on-demand governor. So
> csched_acct may need to take the frequency differences among cpus
> into consideration, and the total available credit won't be a
> simple 300 * online cpu_number.

We should also adjust the accounting of the credits consumed in light of hyperthreading: the credit we subtract should be scaled in proportion to how much of the period was spent competing with another VCPU running on the sibling hyperthread (we can tell this by seeing how much time the idle thread spent running on the other thread).

We can then scale the accounting according to some rough notion of the expected throughput of two hyperthreads. For example, experience on P4 CPUs suggests that a single VCPU will typically receive something like 65% of its normal throughput when competing against another thread (total throughput 130%). We would thus scale the amount of credit subtracted between 65% and 100% depending on how much time was spent competing; see the sketch below.

There's an argument that we should at least have an option to prevent VCPUs from different guests running against each other on adjacent threads. This would introduce a simple kind of gang scheduling.

Ian
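To make the 65%-100% scaling concrete, here is a minimal sketch. It assumes we already know what fraction of the accounting period the sibling hyperthread was busy; the function name, the integer-percentage style and the constant are illustrative, not actual credit-scheduler code.

/*
 * Scales the credit debited for one accounting period between 100%
 * (sibling thread idle the whole time) and 65% (sibling busy the whole
 * time), linearly with the time spent competing.
 */
#define HT_COMPETING_THROUGHPUT_PCT  65   /* rough P4-era figure */

/*
 * debit_full:    credit that would be subtracted on a dedicated core
 * competing_pct: % of the period the sibling hyperthread was non-idle
 */
static int ht_scaled_debit(int debit_full, int competing_pct)
{
    /* 100% when competing_pct == 0, 65% when competing_pct == 100. */
    int scale_pct = 100 -
        (100 - HT_COMPETING_THROUGHPUT_PCT) * competing_pct / 100;

    return debit_full * scale_pct / 100;
}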
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 19 June 2008 17:14
>
>> c) when the cpu's frequency is scaled dynamically
>>     When cpufreq/Px is enabled, the cpu's frequency is adjusted
>> across different operating points by an on-demand governor. So
>> csched_acct may need to take the frequency differences among cpus
>> into consideration, and the total available credit won't be a
>> simple 300 * online cpu_number.
>
>We should also adjust the accounting of the credits consumed in
>light of hyperthreading: the credit we subtract should be scaled in
>proportion to how much of the period was spent competing with
>another VCPU running on the sibling hyperthread (we can tell this
>by seeing how much time the idle thread spent running on the other
>thread).
>
>We can then scale the accounting according to some rough notion of
>the expected throughput of two hyperthreads. For example, experience
>on P4 CPUs suggests that a single VCPU will typically receive
>something like 65% of its normal throughput when competing against
>another thread (total throughput 130%). We would thus scale the
>amount of credit subtracted between 65% and 100% depending on how
>much time was spent competing.
>
>There's an argument that we should at least have an option to
>prevent VCPUs from different guests running against each other on
>adjacent threads. This would introduce a simple kind of gang
>scheduling.
>

Such scaling can and perhaps should be applied to other facets, but the original proposal on the frequency side doesn't hold, as I noted in my reply to myself in another mail: the credit scheduler cannot anticipate the frequency distribution in the next accounting phase unless frequency changes are fully controlled by the scheduler. We will, however, experiment with feeding scheduler input into the frequency governor, as discussed with Keir. :-)

BTW, once such scaling takes more factors into account as mentioned above, the original tick-based accounting looks even more awkward, since there is no longer a direct mapping between a tick and a credit...

Thanks,
Kevin
Hi Kevin.

I'm glad you're looking at this. There are a bunch of interesting areas to look at to improve scheduling on large hierarchical systems. The idle loop is at the center of most of them.

On Jun 19, 2008, at 6:51, Tian, Kevin wrote:

> a) when there are more idle cpus than required
>
> a.1) csched_cpu_pick
>     The existing policy is to pick the cpu with the most idle
> neighbours, to avoid shared-resource contention among cores or
> threads. From a power point of view, however, a package C-state
> saves much more power than a per-core C-state. From that angle it
> might be better to keep an idle package continuously idle, and
> instead pick idle cores/threads whose neighbours are already busy,
> if csched_private.power is set. The performance/watt ratio
> improves, though absolute performance takes a small hit.

Regardless of any new knobs, a good default behavior might be to only take a package out of its C-state when another non-idle package has had more than one VCPU active on it over some reasonable amount of time.

By default, putting multiple VCPUs on the same physical package when other packages are idle is obviously not always going to be optimal. Maybe it's not a bad default for VCPUs that are related (same VM or qemu)? I think Ian P hinted at this. But it frightens me that you would always do this by default for any set of VCPUs. Power saving is good, but so is memory bandwidth.

> a.2) csched_vcpu_wake
>     Similar to the above: instead of blindly kicking all idle cpus
> in a rush, idle cpus can be kicked selectively with the power
> factor taken into account.

Yeah, you will need to rewrite the idle kick code. This can be tricky because a CPU's idle state might change by the time it processes a "scheduling IPI", and you need to be careful that a runnable VCPU doesn't sit on a runqueue when there is at least one idle CPU in the system.
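A rough sketch of that default policy, under the simplifying assumption that we keep a small per-package activity record; every name here is invented for illustration and none of it is existing Xen code.

/*
 * An idle package is only woken when some non-idle package has been
 * running more than one VCPU for a sustained period.
 */
#include <stdbool.h>

#define NR_PKGS                4
#define OVERLOAD_THRESHOLD_MS  30   /* the "reasonable amount of time" */

struct pkg_stats {
    bool fully_idle;       /* every core/thread of the package is in C-state */
    int  active_vcpus;     /* VCPUs recently runnable on this package        */
    int  overloaded_ms;    /* how long active_vcpus has exceeded 1           */
};

/* Would waking an idle package actually relieve sustained overload? */
static bool should_wake_idle_package(const struct pkg_stats pkg[])
{
    bool have_idle_pkg = false, sustained_overload = false;

    for (int p = 0; p < NR_PKGS; p++) {
        if (pkg[p].fully_idle)
            have_idle_pkg = true;
        else if (pkg[p].active_vcpus > 1 &&
                 pkg[p].overloaded_ms >= OVERLOAD_THRESHOLD_MS)
            sustained_overload = true;
    }

    return have_idle_pkg && sustained_overload;
}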
> Such scaling can and perhaps should be applied to other facets, but
> the original proposal on the frequency side doesn't hold, as I noted
> in my reply to myself in another mail: the credit scheduler cannot
> anticipate the frequency distribution in the next accounting phase
> unless frequency changes are fully controlled by the scheduler. We
> will, however, experiment with feeding scheduler input into the
> frequency governor, as discussed with Keir. :-)

That's OK -- it's fine to account in arrears, and doing so will have the right influence on how we schedule things in the future. That's why it's important to move from tick accounting to absolute.

Ian

> BTW, once such scaling takes more factors into account as mentioned
> above, the original tick-based accounting looks even more awkward,
> since there is no longer a direct mapping between a tick and a
> credit...
>
> Thanks,
> Kevin
>From: Emmanuel Ackaouy [mailto:ackaouy@gmail.com]
>Sent: 19 June 2008 21:09
>
>Hi Kevin.
>
>I'm glad you're looking at this. There are a bunch of interesting
>areas to look at to improve scheduling on large hierarchical
>systems. The idle loop is at the center of most of them.

Agree.

>
>On Jun 19, 2008, at 6:51, Tian, Kevin wrote:
>> a) when there are more idle cpus than required
>>
>> a.1) csched_cpu_pick
>>     The existing policy is to pick the cpu with the most idle
>> neighbours, to avoid shared-resource contention among cores or
>> threads. From a power point of view, however, a package C-state
>> saves much more power than a per-core C-state. From that angle it
>> might be better to keep an idle package continuously idle, and
>> instead pick idle cores/threads whose neighbours are already busy,
>> if csched_private.power is set. The performance/watt ratio
>> improves, though absolute performance takes a small hit.
>
>Regardless of any new knobs, a good default behavior might be
>to only take a package out of its C-state when another non-idle
>package has had more than one VCPU active on it over some
>reasonable amount of time.
>
>By default, putting multiple VCPUs on the same physical package
>when other packages are idle is obviously not always going to
>be optimal. Maybe it's not a bad default for VCPUs that are
>related (same VM or qemu)? I think Ian P hinted at this. But it
>frightens me that you would always do this by default for any set
>of VCPUs. Power saving is good, but so is memory bandwidth.

Enabling this feature depends on a control command from the system administrator, who knows the trade-off. From an absolute-performance point of view I agree it is not optimal. Looking at it from the performance/watt (power-efficiency) angle, however, the power saved by package-level idling may outweigh the performance impact of concentrating activity in the other package. Memory latency should of course also be considered on NUMA systems, as you mention.

Note that we will never keep one package idle while another package already has a vcpu pending on a runqueue. Even when the power-aware feature is configured, it only takes effect when the number of cpus is larger than the number of runnable vcpus. It is just like the power profiles prevalent OSes let users choose... :-)

>
>> a.2) csched_vcpu_wake
>>     Similar to the above: instead of blindly kicking all idle cpus
>> in a rush, idle cpus can be kicked selectively with the power
>> factor taken into account.
>
>Yeah, you will need to rewrite the idle kick code. This can be
>tricky because a CPU's idle state might change by the time it
>processes a "scheduling IPI", and you need to be careful that a
>runnable VCPU doesn't sit on a runqueue when there is at least one
>idle CPU in the system.
>

I understand those caveats, but I'm not sure I see exactly how they relate to the proposed change. Could you elaborate a bit? How are those concerns handled in the current logic?

Thanks,
Kevin
>From: Ian Pratt [mailto:Ian.Pratt@eu.citrix.com]
>Sent: 19 June 2008 21:31
>
>> Such scaling can and perhaps should be applied to other facets,
>> but the original proposal on the frequency side doesn't hold, as I
>> noted in my reply to myself in another mail: the credit scheduler
>> cannot anticipate the frequency distribution in the next
>> accounting phase unless frequency changes are fully controlled by
>> the scheduler. We will, however, experiment with feeding scheduler
>> input into the frequency governor, as discussed with Keir. :-)
>
>That's OK -- it's fine to account in arrears, and doing so will have
>the right influence on how we schedule things in the future. That's
>why it's important to move from tick accounting to absolute.
>

OK, then there will be mutual inputs between the scheduler and the frequency governor...

Thanks,
Kevin
On Jun 19, 2008, at 15:32, Tian, Kevin wrote:

>> Regardless of any new knobs, a good default behavior might be
>> to only take a package out of its C-state when another non-idle
>> package has had more than one VCPU active on it over some
>> reasonable amount of time.
>>
>> By default, putting multiple VCPUs on the same physical package
>> when other packages are idle is obviously not always going to
>> be optimal. Maybe it's not a bad default for VCPUs that are
>> related (same VM or qemu)? I think Ian P hinted at this. But it
>> frightens me that you would always do this by default for any set
>> of VCPUs. Power saving is good, but so is memory bandwidth.
>
> Enabling this feature depends on a control command from the system
> administrator, who knows the trade-off. From an absolute-performance
> point of view I agree it is not optimal. Looking at it from the
> performance/watt (power-efficiency) angle, however, the power saved
> by package-level idling may outweigh the performance impact of
> concentrating activity in the other package. Memory latency should
> of course also be considered on NUMA systems, as you mention.

I'm saying something can be done to improve power saving in the current system without adding a knob. Perhaps you can give the admin even more power-saving ability with a knob, but it makes sense to save power when performance is not impacted, regardless of any knob position.

Also, note I mentioned memory BANDWIDTH and not latency. It's not the same thing. And I wasn't just thinking about NUMA systems.
On Jun 19, 2008, at 15:30, Ian Pratt wrote:

> That's OK -- it's fine to account in arrears, and doing so will
> have the right influence on how we schedule things in the future.
> That's why it's important to move from tick accounting to absolute.

I actually still don't agree that it's important to move from tick accounting to absolute. CPU wall-clock time is an approximation of service to start with. From the point of view of basic short-term fairness and load balancing, tick-based accounting works well and is simple to scale.

Accounting for the shared resources of physical CPUs makes sense, be it caches or memory buses (or the pipeline in the hyperthread case). But you can't really do that precisely: two CPUs may share a memory bus, but perhaps one of them is compute-bound out of its L1 cache. What is the point of precisely measuring wall-clock CPU time if you're then going to multiply that number by some constant that may or may not reflect the real impact of resource sharing in that case?

IMO, the more pressing problem is to approximately account for shared physical resources and to scale the cpu_pick() and cpu_kick() mechanisms to improve efficiency on medium and large hierarchical systems. It's probably OK to approximate the cost of sharing physical resources using reasonable constants (i.e. 0.65 when co-scheduled on hyperthreads).
>From: Emmanuel Ackaouy [mailto:ackaouy@gmail.com]
>Sent: 19 June 2008 22:38
>
>On Jun 19, 2008, at 15:32, Tian, Kevin wrote:
>>> Regardless of any new knobs, a good default behavior might be
>>> to only take a package out of its C-state when another non-idle
>>> package has had more than one VCPU active on it over some
>>> reasonable amount of time.
>>>
>>> By default, putting multiple VCPUs on the same physical package
>>> when other packages are idle is obviously not always going to
>>> be optimal. Maybe it's not a bad default for VCPUs that are
>>> related (same VM or qemu)? I think Ian P hinted at this. But it
>>> frightens me that you would always do this by default for any
>>> set of VCPUs. Power saving is good, but so is memory bandwidth.
>>
>> Enabling this feature depends on a control command from the
>> system administrator, who knows the trade-off. From an
>> absolute-performance point of view I agree it is not optimal.
>> Looking at it from the performance/watt (power-efficiency) angle,
>> however, the power saved by package-level idling may outweigh the
>> performance impact of concentrating activity in the other package.
>> Memory latency should of course also be considered on NUMA
>> systems, as you mention.
>
>I'm saying something can be done to improve power saving in
>the current system without adding a knob. Perhaps you can give
>the admin even more power-saving ability with a knob, but it
>makes sense to save power when performance is not impacted,
>regardless of any knob position.

Then I agree. It is always good to improve one side while leaving the other unharmed, or to fix things that hinder both first. We will also compare whether a knob can deliver a clearly better result.

>
>Also, note I mentioned memory BANDWIDTH and not latency.
>It's not the same thing. And I wasn't just thinking about NUMA
>systems.
>

Thanks for pointing that out; I read too fast. But I'm not sure how memory bandwidth is affected by vcpu scheduling. Do you mean more memory traffic on the bus due to shared-cache contention when multiple vcpus run in the same package? That would be workload specific, and other workloads may not be affected to the same extent. Still, it's a good hint: we'll include such workloads in the experiments when making the change. Considering the vcpu/domain relationship is also something we can try. The basic direction will be to go simple first and see the effect.

Thanks,
Kevin
>From: Emmanuel Ackaouy [mailto:ackaouy@gmail.com]
>Sent: 19 June 2008 23:40
>
>On Jun 19, 2008, at 15:30, Ian Pratt wrote:
>> That's OK -- it's fine to account in arrears, and doing so will
>> have the right influence on how we schedule things in the future.
>> That's why it's important to move from tick accounting to
>> absolute.
>
>I actually still don't agree that it's important to move from tick
>accounting to absolute. CPU wall-clock time is an approximation of
>service to start with. From the point of view of basic short-term
>fairness and load balancing, tick-based accounting works well and
>is simple to scale.
>
>Accounting for the shared resources of physical CPUs makes sense,
>be it caches or memory buses (or the pipeline in the hyperthread
>case). But you can't really do that precisely: two CPUs may share a
>memory bus, but perhaps one of them is compute-bound out of its L1
>cache. What is the point of precisely measuring wall-clock CPU time
>if you're then going to multiply that number by some constant that
>may or may not reflect the real impact of resource sharing in that
>case?
>

I'm not sure how fairness is ensured with tick-based accounting in the example I posted in my first mail. Perhaps long-term fairness is still achieved approximately, on average, but at the micro-accounting level it may not perform well, which hurts guests with that requirement. The effect of accounting precisely and then multiplying is hard to judge without experiments to prove it. Still, accounting absolutely, without the multiplication, seems the more natural way to go, IMO.

>IMO, the more pressing problem is to approximately account for
>shared physical resources and to scale the cpu_pick() and
>cpu_kick() mechanisms to improve efficiency on medium and large
>hierarchical systems. It's probably OK to approximate the cost of
>sharing physical resources using reasonable constants (i.e. 0.65
>when co-scheduled on hyperthreads).
>

This is a good point.

Thanks,
Kevin