Dan Magenheimer
2009-Mar-27 20:49 UTC
[Xen-devel] Time skew on HP DL785 (and possibly other boxes)
(Raising a yellow flag because this could turn into a serious issue for Xen and it may take quite a bit of work to come up with a solution.)

We recently measured Xen system time skew on an HP DL785 and found it to be horrible... nearly a quarter millisecond worst case (with only about 10000 samples, so it may get worse).

This box uses 8 quad-core AMD chips connected via hypertransport. BUT each chip is on a separate motherboard. On this system hypertransport is fast and cross-node memory accesses are fast enough so that these NUMA systems need not behave like NUMA systems from a memory-access perspective. So Xen just views the system as a 32-cpu box (other than some code in the memory allocator that tries to allocate near-memory where possible, but silently falls back to far-memory if necessary) and guest vcpus migrate freely between the nodes. (Correct?)

However, I'm told that it's not possible to route a clocksource over hypertransport, so TSCs on processors on different motherboards may be VERY different, and apparently the mechanisms for synchronizing Xen system time across motherboards may not be up to the challenge. As a result, OSes and apps sensitive to time that are running on PV domains may be in for a rough ride on systems like this. (HVM domains may run into other problems because time will apparently stop for a "long time".)

Since systems like this are targeted for consolidation and virtualization, I see this as a potentially big problem, as it may appear to real Xen customers as bizarre non-reproducible problems, such as "make" failing, leading to questions about the stability and viability of using Xen.

Comments?

Dan
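For readers who want to put rough numbers on this kind of cross-CPU skew themselves, a minimal user-space probe follows. It is not the tool used for the measurement above, just a sketch: it pins a thread to one cpu, reads the TSC, hops to another cpu (ideally on another node), reads again, and hops back, using the round trip to bound the offset between the two counters. The cpu numbers and error handling are illustrative only.

/* Rough cross-CPU TSC skew probe (illustrative sketch only). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

static void pin_to(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* error checks omitted */
}

int main(void)
{
    const int cpu_a = 0, cpu_b = 31;           /* ideally on different nodes */

    pin_to(cpu_a);
    uint64_t t0 = rdtsc();                     /* TSC on A, before the hop */
    pin_to(cpu_b);
    uint64_t t1 = rdtsc();                     /* TSC on B                 */
    pin_to(cpu_a);
    uint64_t t2 = rdtsc();                     /* TSC on A, after the hop  */

    /* If the TSCs were synchronized, t1 would fall between t0 and t2;
     * the amount by which it lands outside that window is a lower bound
     * on the skew (thread-migration latency adds noise on top). */
    printf("A->B->A round trip: %llu cycles, B sample vs midpoint: %lld cycles\n",
           (unsigned long long)(t2 - t0),
           (long long)(t1 - (t0 + (t2 - t0) / 2)));
    return 0;
}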
Jeremy Fitzhardinge
2009-Mar-27 22:36 UTC
Re: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
Dan Magenheimer wrote:
> However, I'm told that it's not possible to route a clocksource
> over hypertransport, so TSCs on processors on different
> motherboards may be VERY different and apparently the
> mechanisms for synchronizing Xen system time across
> motherboards may not be up to the challenge. As a result,
> OSes and apps sensitive to time that are running on PV
> domains may be in for a rough ride on systems like this.
> (HVM domains may run into other problems because time will
> apparently stop for a "long time".)

I don't see what the problem is. If each individual cpu has well-known tsc parameters (rate and offset), then a PV client will get those timing parameters and use them to compute its time. It doesn't matter whether they're synchronized between cpus or nodes.

Xen will need to calibrate each of them against a good reference (hpet?), but that's no different from now. I guess it's possible that this system has more variation and latency for hpet access, which may mean that the calibration algorithm needs tweaking.

Of course, if the tsc rates on each cpu are changing in some unpredictable way then that's a whole other barrel of problems. Guests rely on Xen maintaining accurate tsc timing parameters.

> Since systems like this are targeted for consolidation
> and virtualization, I see this as a potentially big problem
> as it may appear to real Xen customers as bizarre
> non-reproducible problems, such as "make" failing,
> leading to questions about the stability and viability
> of using Xen.
>
> Comments?

In Linux there's this function:

/*
 * apic_is_clustered_box() -- Check if we can expect good TSC
 *
 * Thus far, the major user of this is IBM's Summit2 series:
 *
 * Clustered boxes may have unsynced TSC problems if they are
 * multi-chassis. Use available data to take a good guess.
 * If in doubt, go HPET.
 */
__cpuinit int apic_is_clustered_box(void)
{...}

which deals with Summit2 and ScaleMP vSMP systems, which also have unsynchronized TSCs across nodes. At the moment it assumes that no non-vSMP AMD system has unsynchronized TSCs; sounds like it will need updating for this system.

J
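The per-cpu "timing parameters" Jeremy refers to are the scale/offset values Xen publishes to PV guests in the shared-info page, and the guest-side computation is the standard pvclock calculation. The sketch below shows the idea; the field names are paraphrased from Xen's public headers (see xen/include/public/xen.h for the authoritative layout), and the example values in main() are made up.

/* Sketch: how a PV guest turns a raw TSC read into Xen system time
 * (nanoseconds since boot) using per-vcpu parameters from Xen.
 * A real guest must also check the version field (even and unchanged
 * across the read) to get a consistent snapshot. */
#include <stdint.h>
#include <stdio.h>

struct vcpu_time_info {
    uint32_t version;            /* even = stable, odd = being updated    */
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* TSC value at last update              */
    uint64_t system_time;        /* Xen system time (ns) at last update   */
    uint32_t tsc_to_system_mul;  /* TSC-to-ns multiplier, 32.32 fixed pt. */
    int8_t   tsc_shift;          /* pre-multiply shift                    */
    int8_t   pad1[3];
};

static uint64_t pv_system_time(const struct vcpu_time_info *t, uint64_t tsc)
{
    uint64_t delta = tsc - t->tsc_timestamp;

    if (t->tsc_shift >= 0)
        delta <<= t->tsc_shift;
    else
        delta >>= -t->tsc_shift;

    /* scale cycles to ns: (delta * mul) >> 32, using a wide multiply
     * so large deltas do not overflow (the real code does likewise). */
    return t->system_time +
           (uint64_t)(((__uint128_t)delta * t->tsc_to_system_mul) >> 32);
}

int main(void)
{
    /* Illustrative parameters only: roughly 1 ns per cycle. */
    struct vcpu_time_info t = {
        .tsc_timestamp     = 1000000ull,
        .system_time       = 5000000000ull,
        .tsc_to_system_mul = 0xFFFFFFFFu,
        .tsc_shift         = 0,
    };
    printf("%llu ns\n", (unsigned long long)pv_system_time(&t, 2000000ull));
    return 0;
}

Note that each vcpu gets its own parameter block, which is why, as Jeremy says, the physical TSCs need not agree with each other as long as each one is individually stable and accurately calibrated.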
Tian, Kevin
2009-Mar-28 02:29 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Dan Magenheimer
>Sent: March 28, 2009 4:50
>
>(Raising a yellow flag because this could turn into
>a serious issue for Xen and it may take quite a bit
>of work to come up with a solution.)
>
>We recently measured Xen system time skew on an HP DL785
>and found it to be horrible... nearly a quarter millisecond
>worst case (with only about 10000 samples, so it may get worse).
>
>This box uses 8 quad-core AMD chips connected via
>hypertransport. BUT each chip is on a separate motherboard.
>On this system hypertransport is fast and cross-node
>memory accesses are fast enough so that these NUMA systems
>need not behave like NUMA systems from a memory-access
>perspective. So Xen just views the system as a 32-cpu box
>(other than some code in the memory allocator that tries
>to allocate near-memory where possible, but silently falls
>back to far-memory if necessary) and guest vcpus migrate
>freely between the nodes. (Correct?)

Then the user had better enable the NUMA-aware bits in Xen, which impose some affinity limitations but look like a reasonable model on large-scale systems.

Thanks,
Kevin

>
>However, I'm told that it's not possible to route a clocksource
>over hypertransport, so TSCs on processors on different
>motherboards may be VERY different, and apparently the
>mechanisms for synchronizing Xen system time across
>motherboards may not be up to the challenge. As a result,
>OSes and apps sensitive to time that are running on PV
>domains may be in for a rough ride on systems like this.
>(HVM domains may run into other problems because time will
>apparently stop for a "long time".)
>
>Since systems like this are targeted for consolidation
>and virtualization, I see this as a potentially big problem,
>as it may appear to real Xen customers as bizarre
>non-reproducible problems, such as "make" failing,
>leading to questions about the stability and viability
>of using Xen.
>
>Comments?
>
>Dan
Dan Magenheimer
2009-Mar-31 22:08 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
> >This box uses 8 quad-core AMD chips connected via
> >hypertransport. BUT each chip is on a separate motherboard.
> >On this system hypertransport is fast and cross-node
> >memory accesses are fast enough so that these NUMA systems
> >need not behave like NUMA systems from a memory-access
> >perspective. So Xen just views the system as a 32-cpu box
> >(other than some code in the memory allocator that tries
> >to allocate near-memory where possible, but silently falls
> >back to far-memory if necessary) and guest vcpus migrate
> >freely between the nodes. (Correct?)
>
> Then the user had better enable the NUMA-aware bits in Xen, which
> impose some affinity limitations but look like a reasonable model
> on large-scale systems.
>
> Thanks,
> Kevin

Hi Kevin --

Are you suggesting that only NUMA-aware guests should be run on systems like this? If not, what do you mean by "NUMA-aware bits"?

Thanks,
Dan
Tian, Kevin
2009-Mar-31 22:48 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>Sent: April 1, 2009 6:08
>
>> >This box uses 8 quad-core AMD chips connected via
>> >hypertransport. BUT each chip is on a separate motherboard.
>> >On this system hypertransport is fast and cross-node
>> >memory accesses are fast enough so that these NUMA systems
>> >need not behave like NUMA systems from a memory-access
>> >perspective. So Xen just views the system as a 32-cpu box
>> >(other than some code in the memory allocator that tries
>> >to allocate near-memory where possible, but silently falls
>> >back to far-memory if necessary) and guest vcpus migrate
>> >freely between the nodes. (Correct?)
>>
>> Then the user had better enable the NUMA-aware bits in Xen, which
>> impose some affinity limitations but look like a reasonable model
>> on large-scale systems.
>>
>> Thanks,
>> Kevin
>
>Hi Kevin --
>
>Are you suggesting that only NUMA-aware guests should be
>run on systems like this? If not, what do you mean by
>"NUMA-aware bits"?
>

No. I meant the physical NUMA features in Xen. IIRC, once NUMA support is turned on in Xen (by the "numa" boot option), a guest is limited to one node automatically, meaning both that its cpu affinity matches only that node and that its memory is allocated locally within that node.

Thanks,
Kevin
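For concreteness, the kind of configuration Kevin describes looks roughly like the lines below. The exact option spellings vary between Xen and toolstack versions, so treat this as an illustration rather than exact syntax.

# Xen boot line in grub.conf: ask Xen to honour the NUMA topology
kernel /boot/xen.gz numa=on

# Per-guest alternative: pin the domain's vcpus to one node's cpus
# in the xm domain config (node 0 = cpus 0-3 on a 4-core-per-node box)
vcpus = 4
cpus  = "0-3"

# Or at runtime with the xm toolstack:
#   xm vcpu-pin <domain> all 0-3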
Dan Magenheimer
2009-Mar-31 23:21 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
> >> Then the user had better enable the NUMA-aware bits in Xen, which
> >> impose some affinity limitations but look like a reasonable model
> >> on large-scale systems.
> >
> >Are you suggesting that only NUMA-aware guests should be
> >run on systems like this? If not, what do you mean by
> >"NUMA-aware bits"?
> >
>
> No. I meant the physical NUMA features in Xen. IIRC, once NUMA
> support is turned on in Xen (by the "numa" boot option), a guest is
> limited to one node automatically, meaning both that its cpu affinity
> matches only that node and that its memory is allocated locally
> within that node.
>
> Thanks,
> Kevin

OK, I see. That seems too restrictive when the interprocessor link is very fast, like HT or QPI. I hope there is a solution that will allow Xen system time to be fairly accurate and synchronized on this kind of system without depending on the TSC.

Dan
Dan Magenheimer
2009-Apr-03 22:23 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
I think I still have a real concern here. Let me see if I can explain.

The goal for Xen timekeeping is to ensure that if a guest could somehow magically read any of its virtual clocks (tsc, pit, hpet, pmtimer, ??) on all its virtual processors simultaneously, the values read must always obey this "virtual clock law":

  max - min < delta

We can argue how large that delta can reasonably be, and it may vary depending on the workload, but it's certainly under a millisecond, ten microseconds might not be a bad starting point, and it is getting smaller as processors get faster.

If Xen can't guarantee that, then it must turn on "numa" mode, which appears to me to be extremely restrictive, and no system vendor could honestly sell the true promise of virtualization on such a box. So we'd like to avoid that if possible.

Now HP DL785-like designs are likely to become more common, because an HT/QPI interconnect makes it possible to build a single model that is low cost but very expandable. Such boxes use multiple motherboards because it's much easier to expand by adding field-replaceable units. Unfortunately, the current Xen system time model (which I think is also used by kvm?) may not be scalable to these boxes.

If the current Xen system time algorithm is scalable, great. We are done. If it can be tweaked to be scalable, great, no problem. But if the model needs to be changed substantially -- for example, if everything needs to be built on a platform timer because we just can't guarantee the "virtual clock law" -- then we may have a real problem... and not just performance. Why? Because the "paravirtual clock" API is hard-coded in every existing PV domain... and in current and future versions of the Linux kernel (and probably in Windows too?). If the new model is unable to use the same API, every prepackaged VM is broken.

So I think we need to be very sure that we either:

A) do not need to change the Xen system time model to ensure the "virtual clock law" can be obeyed on such boxes, or
B) DO need to change the Xen system time model, but the paravirtual clock API does NOT need to change, or
C) modify/augment the paravirtual clock API and start getting the updated version into guests/kernels asap, or
D) ensure that system vendors know that Xen will never run guests reliably on such a box without restricting operation to NUMA mode.

Note that the Linux approach doesn't work here because: 1) a guest's clocks might obey the "virtual clock law" at one moment on one set of physical processors and not at the next moment; 2) a guest's access to all clocks (except the tsc) is emulated, so even if a guest decides the tsc is unreliable, that just doesn't help if the alternate clock it chooses (e.g. HPET) is silently emulated on top of Xen system time using the physical tsc.

Now does that make my concern more clear?

Thanks,
Dan

> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Friday, March 27, 2009 4:37 PM
> To: Dan Magenheimer
> Cc: Xen-Devel (E-mail); john.v.morris@hp.com
> Subject: Re: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>
> Dan Magenheimer wrote:
> > However, I'm told that it's not possible to route a clocksource
> > over hypertransport, so TSCs on processors on different
> > motherboards may be VERY different and apparently the
> > mechanisms for synchronizing Xen system time across
> > motherboards may not be up to the challenge. As a result,
> > OSes and apps sensitive to time that are running on PV
> > domains may be in for a rough ride on systems like this.
> > (HVM domains may run into other problems because time will
> > apparently stop for a "long time".)
>
> I don't see what the problem is. If each individual cpu has well-known
> tsc parameters (rate and offset), then a PV client will get those
> timing parameters and use them to compute its time. It doesn't matter
> whether they're synchronized between cpus or nodes.
>
> Xen will need to calibrate each of them against a good reference
> (hpet?), but that's no different from now. I guess it's possible that
> this system has more variation and latency for hpet access, which may
> mean that the calibration algorithm needs tweaking.
>
> Of course, if the tsc rates on each cpu are changing in some
> unpredictable way then that's a whole other barrel of problems. Guests
> rely on Xen maintaining accurate tsc timing parameters.
>
> > Since systems like this are targeted for consolidation
> > and virtualization, I see this as a potentially big problem
> > as it may appear to real Xen customers as bizarre
> > non-reproducible problems, such as "make" failing,
> > leading to questions about the stability and viability
> > of using Xen.
> >
> > Comments?
>
> In Linux there's this function:
>
> /*
>  * apic_is_clustered_box() -- Check if we can expect good TSC
>  *
>  * Thus far, the major user of this is IBM's Summit2 series:
>  *
>  * Clustered boxes may have unsynced TSC problems if they are
>  * multi-chassis. Use available data to take a good guess.
>  * If in doubt, go HPET.
>  */
> __cpuinit int apic_is_clustered_box(void)
> {...}
>
> which deals with Summit2 and ScaleMP vSMP systems, which also have
> unsynchronized TSCs across nodes. At the moment it assumes that no
> non-vSMP AMD system has unsynchronized TSCs; sounds like it will need
> updating for this system.
>
> J
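To make the "virtual clock law" a little more concrete from a guest's point of view: the reads can never be truly simultaneous, so the best a test can do is release one reader per vcpu from a barrier and treat the observed spread (which also includes scheduling and barrier-release noise) as an upper bound on the divergence. A minimal sketch follows; pinning each thread to a distinct vcpu is omitted for brevity, and clock_gettime() stands in for whichever clock is under test.

/* Sketch: sample one clock from several threads "at once" and report
 * the spread.  Upper bound only; not the measurement tool used for
 * the DL785 numbers quoted in this thread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NVCPUS 4
static pthread_barrier_t bar;
static uint64_t sample_ns[NVCPUS];

static void *sample(void *arg)
{
    long i = (long)arg;
    struct timespec ts;

    pthread_barrier_wait(&bar);              /* release all readers together */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    sample_ns[i] = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    return NULL;
}

int main(void)
{
    pthread_t th[NVCPUS];

    pthread_barrier_init(&bar, NULL, NVCPUS);
    for (long i = 0; i < NVCPUS; i++)
        pthread_create(&th[i], NULL, sample, (void *)i);
    for (int i = 0; i < NVCPUS; i++)
        pthread_join(th[i], NULL);

    uint64_t min = sample_ns[0], max = sample_ns[0];
    for (int i = 1; i < NVCPUS; i++) {
        if (sample_ns[i] < min) min = sample_ns[i];
        if (sample_ns[i] > max) max = sample_ns[i];
    }
    printf("spread (max - min): %llu ns\n", (unsigned long long)(max - min));
    return 0;
}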
Keir Fraser
2009-Apr-05 07:56 UTC
Re: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
On 03/04/2009 23:23, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I think I still have a real concern here. Let me see if
> I can explain.
>
> The goal for Xen timekeeping is to ensure that if a guest
> could somehow magically read any of its virtual clocks
> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
> simultaneously, the values read must always obey this
> "virtual clock law":

We can do this for all except the TSC for HVM guests, because their virtual TSC is hardwired onto the physical TSC (plus a configurable offset). If TSCs run at significantly different rates then that will be hard to hide from the guest. Luckily Windows is pretty robust to iffy timers, and no doubt particularly suspicious of TSCs in multiprocessor environments.

Everything else builds on Xen system time, and Xen system time should just require each CPU's TSC to be individually stable. This is true even with your 3.3 patch to rendezvous and snapshot all TSCs at the same instant in time. This doesn't rely on all TSCs running at the same rate! The approach should work just as well if they run at their own separate stable rates off separate crystals. I think the benefit of your patch was in syncing system time across all CPUs at the same time, which significantly reduced maximum divergence.

One concern I have, however, is Intel's X86_FEATURE_CONSTANT_TSC logic. This was added by them to prevent TSCs from diverging due to Cx deep sleep states, by observing that usually all TSCs will tick at the same exact rate, so all that needs to be done is to rewrite all AP TSCs to that of the BP periodically. This seems to work well on small systems, but the trigger for this mode is rather suspicious. The CONSTANT_TSC feature means that a CPU's TSC is invariant across frequency/voltage changes -- it *doesn't* mean that all TSCs across a large MP box are at matched frequency! I wonder whether this optimisation will bite us on big iron? Probably it ought to disable itself if it detects significant TSC divergence, or at the very least maybe we should add a command-line option to disable (or enable?) it.

-- Keir
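The rendezvous-and-snapshot idea Keir mentions -- sampling every CPU's TSC against a platform timer at (as near as possible) the same instant, so each CPU gets its own (tsc, system-time) calibration point -- can be illustrated with the user-space sketch below. This is not the actual Xen code; the names are invented, clock_gettime() stands in for the HPET read, and repeating the rendezvous periodically is what would yield a per-CPU rate.

/* User-space illustration of a TSC rendezvous (not the Xen implementation). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NCPUS 4

static atomic_int go;
static uint64_t tsc_stamp[NCPUS];
static uint64_t ref_ns;

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t platform_ns(void)            /* stand-in for the HPET read */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *worker(void *arg)
{
    long cpu = (long)arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (!atomic_load(&go))                /* spin until the master fires */
        ;
    tsc_stamp[cpu] = rdtsc();                /* local TSC "at" the rendezvous */
    return NULL;
}

int main(void)
{
    pthread_t th[NCPUS];

    for (long i = 0; i < NCPUS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);

    usleep(1000);                            /* let workers reach the spin loop */
    ref_ns = platform_ns();                  /* reference instant */
    atomic_store(&go, 1);

    for (int i = 0; i < NCPUS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NCPUS; i++)
        printf("cpu%d: tsc=%llu at ref=%llu ns\n", i,
               (unsigned long long)tsc_stamp[i], (unsigned long long)ref_ns);
    return 0;
}

Two such snapshots per CPU are enough to derive a per-CPU rate and offset, which is why, as Keir notes, the scheme does not require the TSCs to share a crystal -- only that each one is individually stable between snapshots.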
Tian, Kevin
2009-Apr-05 12:17 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: April 5, 2009 15:56
>
>One concern I have, however, is Intel's
>X86_FEATURE_CONSTANT_TSC logic. This
>was added by them to prevent TSCs from diverging due to Cx deep sleep
>states, by observing that usually all TSCs will tick at the
>same exact rate,

One correction here: the constant-tsc logic was introduced for P-states, not C-states, so that the TSC always steps at a constant pace on a given processor, regardless of whatever operating point is being requested by the cpufreq governor. It doesn't say anything about all TSCs ticking at the same rate, however.

>so all that needs to be done is to rewrite all AP TSCs to that
>of the BP
>periodically. This seems to work well on small systems, but
>the trigger for
>this mode is rather suspicious. The CONSTANT_TSC feature means
>that a CPU's TSC
>is invariant across frequency/voltage changes -- it *doesn't*
>mean that all
>TSCs across a large MP box are at matched frequency! I wonder

You're exactly right here. Using it does require that all cpus are driven by a single crystal, which is not true for a large system with multiple crystals. So this approach (syncing all TSCs to minimize the skew caused by the TSC stopping in deep C-states) doesn't work in all cases.

>whether this
>optimisation will bite us on big iron? Probably it ought to
>disable itself
>if it detects significant TSC divergence, or at the very least maybe we
>should add a command-line option to disable (or enable?) it.

I guess things won't be that bad in this C-state-specific area. Large systems based on Intel Core i7 processors or later always have the invariant (non-stop) TSC feature integrated, and thus no software recovery is required, while, IIRC, previous large-scale servers don't implement deep C-states (>=C3), so it's not an issue there either. :-)

Thanks,
Kevin
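The distinction Kevin draws -- a TSC that is constant across P-state changes versus one that also keeps running through deep C-states -- is visible to software via CPUID: the "invariant TSC" bit is EDX bit 8 of leaf 0x80000007. A minimal check follows (a sketch using GCC's <cpuid.h> helper); note the bit says nothing about TSCs on *different* sockets or boards being synchronized with each other.

/* Check the invariant-TSC CPUID bit (leaf 0x80000007, EDX bit 8). */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 0x80000007 not available\n");
        return 1;
    }
    printf("Invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}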
Tian, Kevin
2009-Apr-05 12:41 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: April 5, 2009 15:56
>
>On 03/04/2009 23:23, "Dan Magenheimer"
><dan.magenheimer@oracle.com> wrote:
>
>> I think I still have a real concern here. Let me see if
>> I can explain.
>>
>> The goal for Xen timekeeping is to ensure that if a guest
>> could somehow magically read any of its virtual clocks
>> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>> simultaneously, the values read must always obey this
>> "virtual clock law":
>
>We can do this for all except the TSC for HVM guests, because their
>virtual TSC is hardwired onto the physical TSC (plus a configurable
>offset). If TSCs run at significantly different rates then that will
>be hard to hide from the guest. Luckily Windows is pretty robust to
>iffy timers, and no doubt particularly suspicious of TSCs in
>multiprocessor environments.
>

In that case Xen had better figure out some hints to have the HVM guest recognize the TSC as an unreliable timer source, and then fall back to other virtual platform timers (since even keeping the tsc would then require emulation for every access, which would give a wrong illusion to the guest and also be harder to emulate accurately due to its assumed high frequency). Although extra overhead could be incurred, that's the fact if the HVM guest can be assured affinity to one node, or to several nodes with a known identical frequency...

Thanks,
Kevin
Tian, Kevin
2009-Apr-05 12:43 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Tian, Kevin
>Sent: April 5, 2009 20:41
>
>>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>>Sent: April 5, 2009 15:56
>>
>>On 03/04/2009 23:23, "Dan Magenheimer"
>><dan.magenheimer@oracle.com> wrote:
>>
>>> I think I still have a real concern here. Let me see if
>>> I can explain.
>>>
>>> The goal for Xen timekeeping is to ensure that if a guest
>>> could somehow magically read any of its virtual clocks
>>> (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>>> simultaneously, the values read must always obey this
>>> "virtual clock law":
>>
>>We can do this for all except the TSC for HVM guests, because their
>>virtual TSC is hardwired onto the physical TSC (plus a configurable
>>offset). If TSCs run at significantly different rates then that will
>>be hard to hide from the guest. Luckily Windows is pretty robust to
>>iffy timers, and no doubt particularly suspicious of TSCs in
>>multiprocessor environments.
>>
>
>In that case Xen had better figure out some hints to have the HVM
>guest recognize the TSC as an unreliable timer source, and then fall
>back to other virtual platform timers (since even keeping the tsc
>would then require emulation for every access, which would give a
>wrong illusion to the guest and also be harder to emulate accurately
>due to its assumed high frequency). Although extra overhead could be
>incurred, that's the fact if the HVM guest can be assured affinity
                                            ^^^^^^
I meant 'can't be' here.

>to one node, or to several nodes with a known identical frequency...
Tian, Kevin
2009-Apr-05 12:59 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>Sent: April 4, 2009 6:23
>
>I think I still have a real concern here. Let me see if
>I can explain.
>
>The goal for Xen timekeeping is to ensure that if a guest
>could somehow magically read any of its virtual clocks
>(tsc, pit, hpet, pmtimer, ??) on all its virtual processors
>simultaneously, the values read must always obey this
>"virtual clock law":
>
> max - min < delta
>
>We can argue how large that delta can reasonably be, and it
>may vary depending on the workload, but
>it's certainly under a millisecond, ten microseconds
>might not be a bad starting point, and it is getting
>smaller as processors get faster.
>
>If Xen can't guarantee that, then it must turn on "numa"
>mode, which appears to me to be extremely restrictive,
>and no system vendor could honestly sell the true
>promise of virtualization on such a box. So we'd like
>to avoid that if possible.

I have also heard the concern that completely random load balancing may work suboptimally on a large-scale system, because of fierce contention on shared data structures, and thus some coarse-grained soft partitioning or limitation is welcome, to ensure accurate control of the resources assigned to a given VM and to avoid cross-node traffic where possible. In such cases enabling 'numa' serves that purpose to some extent: it simply confines a given VM's activity within one node, but still allows administrative tools to move it across nodes at their disposal. I once heard that typical deployed VMs nowadays are provisioned with 1-4 vcpus, which normally fit in one node. But this may not be true in all cases.

Well, my point is a bit off topic here. Of course your concern about cross-node TSC variance still makes sense whether or not node affinity is enforced, as long as a VM can be migrated across nodes. My point is just that turning on 'numa' is really not an 'extremely restrictive' thing. :-)

>
>Note that the Linux approach doesn't work here
>because: 1) a guest's clocks might obey the "virtual clock
>law" at one moment on one set of physical processors
>and not at the next moment; 2) a guest's access to all
>clocks (except the tsc) is emulated, so even if a guest
>decides the tsc is unreliable, that just doesn't help
>if the alternate clock it chooses (e.g. HPET) is silently
>emulated on top of Xen system time using the physical tsc.

As Keir said, Xen system time itself is implemented in a stable way, so as long as HVM timer virtualization ultimately falls into the emulation path, it should be stable too, at the cost of some overhead on top of the current tsc virtualization path.

Thanks,
Kevin
Keir Fraser
2009-Apr-05 13:27 UTC
Re: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
On 05/04/2009 13:17, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> One concern I have, however, is Intel's
>> X86_FEATURE_CONSTANT_TSC logic. This
>> was added by them to prevent TSCs from diverging due to Cx deep sleep
>> states, by observing that usually all TSCs will tick at the
>> same exact rate,
>
> One correction here: the constant-tsc logic was introduced for
> P-states, not C-states, so that the TSC always steps at a constant
> pace on a given processor, regardless of whatever operating point
> is being requested by the cpufreq governor. It doesn't say anything
> about all TSCs ticking at the same rate, however.

Then changeset 18923 is indeed broken and should be reverted? The problem is that this changeset doesn't just affect the cases it is meant to 'fix' (usage of C-states on CPUs without a non-stop TSC). Apart from the fact that it can be broken for systems with that type of CPU as well, it's actually enabled for any modern CPU (anything advertising the constant-tsc feature). Probably I shouldn't have checked in that patch in the first place.

-- Keir
Tian, Kevin
2009-Apr-05 13:37 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: April 5, 2009 21:28
>
>On 05/04/2009 13:17, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> One concern I have, however, is Intel's
>>> X86_FEATURE_CONSTANT_TSC logic. This
>>> was added by them to prevent TSCs from diverging due to Cx deep sleep
>>> states, by observing that usually all TSCs will tick at the
>>> same exact rate,
>>
>> One correction here: the constant-tsc logic was introduced for
>> P-states, not C-states, so that the TSC always steps at a constant
>> pace on a given processor, regardless of whatever operating point
>> is being requested by the cpufreq governor. It doesn't say anything
>> about all TSCs ticking at the same rate, however.
>
>Then changeset 18923 is indeed broken and should be reverted?
>The problem is that this changeset doesn't just affect the cases it
>is meant to 'fix' (usage of C-states on CPUs without a non-stop TSC).
>Apart from the fact that it can be broken for systems with that type
>of CPU as well, it's actually enabled for any modern CPU (anything
>advertising the constant-tsc feature). Probably I shouldn't have
>checked in that patch in the first place.
>

How about making it a selectable option, instead of reverting it completely?

Thanks,
Kevin
Dan Magenheimer
2009-Apr-06 14:34 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
> > The goal for Xen timekeeping is to ensure that if a guest
> > could somehow magically read any of its virtual clocks
> > (tsc, pit, hpet, pmtimer, ??) on all its virtual processors
> > simultaneously, the values read must always obey this
> > "virtual clock law":
>
> We can do this for all except the TSC for HVM guests because

I understand that this is true IFF Xen system time itself obeys the virtual clock law. I am concerned that maybe it cannot on machines such as this. If not, NO HVM guest clock will obey the law, correct?

> Everything else builds on Xen system time, and Xen system
> time should just
> require each CPU's TSC to be individually stable.
> ...I think the benefit of your patch was in syncing system
> time across all CPUs at the same time, which significantly
> reduced maximum divergence.

The problem was that, in our testing on this DL785, the maximum divergence was not reduced enough! This was tested with xen-unstable (not sure which c/s).

> One concern I have, however, is Intel's
> X86_FEATURE_CONSTANT_TSC logic.

It's possible that this (or some other problem) has resulted in the divergence on the DL785. So more testing is in order.

Dan
Dan Magenheimer
2009-Apr-06 14:41 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
> Well, my point is a bit off topic here. Of course your
> concern about cross-node TSC variance still makes sense
> whether or not node affinity is enforced, as long as a VM can
> be migrated across nodes. My point is just that turning
> on 'numa' is really not an 'extremely restrictive' thing. :-)

Hi Kevin --

I think numa-mode is extremely restrictive because it makes a 32-way box work like eight 4-way blades.

I think the whole point of HT/QPI is to reduce the memory latency enough that a NUMA box does not look like a NUMA box. If time synchronization fails such that this type of box is forced to be partitioned, the value of HT/QPI is greatly diminished (at least in a virtualization environment).

Dan
Keir Fraser
2009-Apr-06 14:48 UTC
Re: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
On 06/04/2009 15:34, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> One concern I have, however, is Intel's
>> X86_FEATURE_CONSTANT_TSC logic.
>
> It's possible that this (or some other problem) has resulted
> in the divergence on the DL785. So more testing is in order.

The Intel patch is enabled only via a command-line option as of c/s 19506.

-- Keir
Tian, Kevin
2009-Apr-06 22:48 UTC
RE: [Xen-devel] Time skew on HP DL785 (and possibly other boxes)
>From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
>Sent: April 6, 2009 22:41
>
>> Well, my point is a bit off topic here. Of course your
>> concern about cross-node TSC variance still makes sense
>> whether or not node affinity is enforced, as long as a VM can
>> be migrated across nodes. My point is just that turning
>> on 'numa' is really not an 'extremely restrictive' thing. :-)
>
>Hi Kevin --
>
>I think numa-mode is extremely restrictive because
>it makes a 32-way box work like eight 4-way blades.

Virtualization is itself a form of partitioning, with each VM representing one working set. Most VMs deployed so far don't need more than a virtual 4-way blade, so the above restriction is less of an issue in practice. It is then natural to keep them within nodes rather than spanning nodes.

>
>I think the whole point of HT/QPI is to reduce the
>memory latency enough that a NUMA box does not
>look like a NUMA box. If time synchronization fails
>such that this type of box is forced to be partitioned,
>the value of HT/QPI is greatly diminished (at least
>in a virtualization environment).
>

It's orthogonal. The effort to keep reducing memory latency on NUMA boxes doesn't mean there is no observable difference in latency between local and remote memory.

Thanks,
Kevin