Carsten Schiers
2009-Mar-31 14:52 UTC
RE: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default
Sorry for my ignorance, but I find all of this very interesting. Before reading it all in detail: I suffer from skew after enabling cpuidle together with cpufreq management (on an AMD CPU ;-), whose TSC frequency is variant across freq/voltage scaling), so a few questions:

- You mention lost ticks in some guests; does this include Dom0? That's where my messages mainly show up.
- You recommend limiting cpuidle either to C1 or to C2 (in case the APIC timer does not stop). How do I know whether it stops?
- xm debug-key c reports active C1, max_cstate C2, but only lists C1 usage. C1 Clock Ramping seems to be disabled. The platform timer is a 25MHz HPET. Excuse my ignorance again, but doesn't that mean I am not using C-states at all?

I understand you are speaking about Xen 3.4. Currently I am at 3.3.1 and have to wait for a slot to test 3.4; I am curious to see what happens. Dan told me how to use xm debug-key t and said the max cycles skew is much smaller than the max stime (Xen system time) skew, which makes him believe 3.4 will help.

BR,
Carsten.

----- Original Message -----
From: "Wei, Gang" <gang.wei@intel.com>
Sent: Tue, 31.3.2009 16:00
To: xen-devel <xen-devel@lists.xensource.com>
Cc: "Tian, Kevin" <kevin.tian@intel.com>; Keir Fraser <keir.fraser@eu.citrix.com>; "Yu, Ke" <ke.yu@intel.com>
Subject: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default

In Xen 3.4, cpuidle is enabled by default as of c/s 19421. Some side-effects may show up under particular h/w C-state implementations or configurations, so users may occasionally observe latency or system time/TSC skew. Below are the conditions causing these side-effects and the means to mitigate them:

1. Latency

Latency can come from two sources: the C-state entry/exit latency itself, and extra latency added by the broadcast mechanism.

C-state entry/exit latency is inevitable, since powering gates on/off takes time. Normally a shallower C-state incurs lower latency but saves less power, and vice versa for a deeper C-state. The cpuidle governor tries to balance this performance/power tradeoff at a high level; this is one area we will continue to tune.

Broadcast is necessary to handle the APIC timer stopping in deep C-states (>=C3) on some platforms. One platform timer source is chosen to carry the per-cpu timer deadlines and wake up CPUs sleeping in deep C-states at the expected expiry. So far Xen 3.4 supports PIT and HPET as the broadcast source. In the current implementation, PIT broadcast runs in periodic mode (10ms), which means up to 10ms of extra latency can be added to the expected expiry of a sleeping CPU. This is just an initial implementation choice and could of course be enhanced to an on-demand one-shot mode in the future; we did not take on that complexity yet, given the PIT's slow access and short wrap count. HPET broadcast is therefore always preferred where the facility is available: it wakes CPUs on time with negligible overhead. Then... the world is not always perfect, and some side-effects exist with HPET as well. Details are listed below:

1.1. For h/w supporting ACPI C1 (halt) only (as reported by the BIOS in the ACPI _CST method):

This case is immune from the side-effect, as only instruction execution is halted.

1.2. For h/w supporting an ACPI C2 in which TSC and APIC timer don't stop:

The ACPI C2 type is a bit special in that it is sometimes an alias for a deep CPU C-state, so current Xen 3.4 treats ACPI C2 the same way as ACPI C3 (i.e. broadcast is activated). If the user knows that on his platform the ACPI C2 type does not have that h/w limitation, 'lapic_timer_c2_ok' can be added to the grub command line to deactivate the software mitigation.
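To be concrete, such options go on Xen's own kernel line in grub. A minimal grub entry might look like the sketch below (the kernel and module paths are purely illustrative, not taken from any particular system); the other options mentioned further down ('max_cstate=', 'hpetbroadcast') go on the same line:

    title Xen 3.4 / Dom0
        kernel /boot/xen.gz lapic_timer_c2_ok
        module /boot/vmlinuz-2.6.18.8-xen root=/dev/sda1 ro
        module /boot/initrd-2.6.18.8-xen.img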
1.3. For the remaining implementations, supporting ACPI C2+ states in which the APIC timer stops:

1.3.1. HPET as broadcast timer source

HPET can deliver timely wakeup events to CPUs sleeping in deep C-states with negligible overhead, as stated earlier. But the HPET mode in use does make some differences worth noting:

1.3.1.1. If the h/w supports per-channel MSI delivery mode (interrupt via FSB), this is the best broadcast mechanism known so far. There is no side-effect with regard to latency, and the IPIs used to broadcast the wakeup event are reduced by a factor of the number of available channels (each channel can independently serve one or several sleeping CPUs). Whenever this feature is available, it is automatically preferred.

1.3.1.2. When MSI delivery mode is absent, we have to use legacy replacement mode, with only one HPET channel available. That is not as bad as it sounds, since this one channel can serve all sleeping CPUs by waking them with IPIs. However, another side-effect appears: the PIT/RTC interrupts (IRQ0/IRQ8) are replaced by the HPET channel, so the RTC alarm feature in dom0 is lost, unless we add RTC emulation between dom0's rtc module and Xen's HPET logic (which is not implemented so far).

Because of that side-effect, this broadcast option is disabled by default, and PIT broadcast is used instead. If the user is sure he does not need the RTC alarm, the 'hpetbroadcast' grub option forces it on.

1.3.2. PIT as broadcast timer source

If MSI-based HPET interrupt delivery is not available, or the HPET is missing, PIT broadcast is the current default. As said earlier, PIT broadcast is implemented in 10ms periodic mode and can thus incur up to 10ms of latency on each deep C-state entry/exit. One natural result is to observe 'many lost ticks' in some guests.

1.4. Suggestions

So, if the user doesn't care about power consumption while his platform does expose deep C-states, one mitigation is to add the 'max_cstate=' boot option to restrict the maximum allowed C-state (if limiting to C2, make sure to also add 'lapic_timer_c2_ok' where it applies). Runtime modification of 'max_cstate' is possible via xenpm (patch posted on 3/24/2009, not checked in yet).

If the user does care about power consumption and has no requirement for the RTC alarm, always using HPET is preferred.

Last, we could either add RTC emulation on top of HPET or enhance PIT broadcast to use one-shot mode, but we would like to hear from the community whether that is worth doing. :-)

2. System time/TSC skew

Similarly to the APIC timer stop, the TSC also stops in deep C-states on some implementations, which requires Xen to recover the lost counts by software means on exit from a deep C-state. It is easy to imagine the kinds of errors software methods can introduce. For the details of how TSC skew can occur, its side effects and possible solutions, see our Xen summit presentation:

http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf

Below is a brief overview of which algorithm is used on which implementations:

2.1. The best case is a non-stop TSC at the h/w implementation level. For example, Intel Core i7 processors support this green feature, which can be detected via CPUID. Xen does nothing once this feature is detected, so there is no extra software-caused skew beyond dozens of cycles due to crystal drift.
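As an aside, this detection amounts to a single CPUID query: leaf 0x80000007, EDX bit 8 ("invariant TSC"). A minimal user-space sketch in C (illustrative only; Xen's own detection code may differ):

    /* Report whether the CPU advertises an invariant (non-stop) TSC. */
    #include <stdio.h>
    #include <cpuid.h>  /* GCC's wrapper for the CPUID instruction */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0x80000007 is the Advanced Power Management leaf;
         * __get_cpuid() returns 0 if the leaf is not supported. */
        if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 0x80000007 not available\n");
            return 1;
        }

        /* EDX bit 8: TSC runs at a constant rate in all P-, C- and
         * T-states. */
        printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
        return 0;
    }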
2.2. If the TSC frequency is invariant across freq/voltage scaling (true for all Intel processors supporting VT-x), Xen syncs the APs' TSCs to the BSP's at a 1-second interval in the per-cpu time calibration, and meanwhile recovers in a per-cpu style: only the platform counter ticks elapsed since the last calibration point are compensated into the local TSC, using a boot-time-calculated scale factor. This global synchronization, together with the per-cpu compensation, limits TSC skew to the ns level in most cases.

2.3. If the TSC frequency is variant across freq/voltage scaling, Xen only recovers in a per-cpu style: the platform counter ticks elapsed since the last calibration point are compensated into the local TSC using a local scale factor. In this mode, TSC skew across cpus accumulates and is easy to observe after the system has been up for some time.

2.4. Solution

If you observe obvious system time/TSC skew and do not particularly care about power consumption, then, similarly to the broadcast latency case: limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and add the 'lapic_timer_c2_ok' option.

Or, better, run your workload on a newer platform with either a constant TSC frequency or the non-stop TSC feature. :-)

Jimmy
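P.S. To make 2.2/2.3 a bit more concrete, below is a rough C sketch of the per-cpu compensation described above. It is illustrative only -- the names are made up and it is not the actual Xen code:

    #include <stdint.h>

    /* Per-cpu calibration record.  At each ~1s calibration point the
     * platform counter and the local TSC are snapshotted together. */
    struct percpu_time {
        uint64_t tsc_at_calib;      /* local TSC at last calibration  */
        uint64_t plat_at_calib;     /* platform counter at same point */
        double   tsc_per_plat_tick; /* scale factor: fixed at boot in
                                       the 2.2 case, per-cpu and
                                       frequency-dependent in 2.3     */
    };

    /* Reconstruct the local TSC after a deep C-state: only the
     * platform ticks elapsed since the last calibration point are
     * scaled and added back. */
    static uint64_t compensated_tsc(const struct percpu_time *t,
                                    uint64_t plat_now)
    {
        uint64_t elapsed = plat_now - t->plat_at_calib;

        /* Any error in the scale factor shows up as skew; in the
         * variant-frequency case (2.3) there is no cross-cpu resync,
         * so that error accumulates over time. */
        return t->tsc_at_calib
               + (uint64_t)((double)elapsed * t->tsc_per_plat_tick);
    }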
Wei, Gang
2009-Apr-01 02:38 UTC
RE: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default
On Tuesday, March 31, 2009 10:52 PM, Carsten Schiers wrote:

> Sorry for my ignorance, but I find all of this very interesting. Before
> reading it all in detail: I suffer from skew after enabling cpuidle
> together with cpufreq management (on an AMD CPU ;-), whose TSC frequency
> is variant across freq/voltage scaling), so a few questions:
>
> - You mention lost ticks in some guests; does this include Dom0? That's
> where my messages mainly show up.

I haven't observed any lost-ticks warnings in Dom0 on the current Xen 3.4 tip so far.

> - You recommend limiting cpuidle either to C1 or to C2 (in case the APIC
> timer does not stop). How do I know whether it stops?

You may need to refer to the processor's spec.

> - xm debug-key c reports active C1, max_cstate C2, but only lists C1
> usage. C1 Clock Ramping seems to be disabled. The platform timer is a
> 25MHz HPET. Excuse my ignorance again, but doesn't that mean I am not
> using C-states at all?

In Xen 3.3 the C1 residency is not counted yet. max_cstate=C2 does not mean your platform supports C2; it just means that if your platform supports C-states deeper than C2, the deepest C-state used will be C2. I guess xm debug-key c didn't report any C2 information (usage, residency) on your platform, right? If so, that means your system only supports C1.
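In case it helps, the output can be captured from the dom0 shell like this (the tail length is arbitrary):

    xm debug-key c       # ask the hypervisor to dump its C-state info
    xm dmesg | tail -40  # the dump appears in Xen's console ring buffer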
> I understand you are speaking about Xen 3.4. Currently I am at 3.3.1 and
> have to wait for a slot to test 3.4; I am curious to see what happens.
> Dan told me how to use xm debug-key t and said the max cycles skew is
> much smaller than the max stime (Xen system time) skew, which makes him
> believe 3.4 will help.

Yes, I also strongly suggest you give 3.4 a try. But I don't expect much for the variant-TSC case, as I said in the original mail. BTW, I believe enabling cpuidle or not should have no impact on your case. Have you checked the result with cpufreq disabled?

Thanks,
Jimmy