> On 26/3/07 19:50, "Ian Pratt" <Ian.Pratt@xxxxxxxxxxxx> wrote:
>
>> On your system it appears to be a couple of microseconds out, which is
>> on the high side of what we've observed. Normally you only see that
>> kind of mismatch on systems with TSCs running off different crystals.
>
> More likely a jittery chipset timer -- we've observed less-than-ideal
> stability from some chipset timers, which can throw us off a bit when
> independently sync'ing the TSCs (which each CPU does for its TSC
> independently every couple of seconds).
>
> -- Keir

Sorry, a little slow on responding here; it only took a year ;-)

Where is the code that does this independent TSC sync'ing? I see code in
smpboot.c that seems to do this at startup (though I admit I haven't yet
figured out exactly how... it looks like some kind of rendezvous loop
triggered by the BP?). But I don't see where/how this gets called "every
couple of seconds", nor do I see any writing to the TSC (except setting
the BP and each AP to zero at startup).

Thanks,
Dan

==================================
If Xen could save time in a bottle / then clocks wouldn't virtually skew /
It would save every tick / for VMs that aren't quick /
and Xen then would send them anew
(with apologies to the late great Jim Croce)
On 8/4/08 17:34, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Sorry, a little slow on responding here; it only took a year ;-)
>
> Where is the code that does this independent TSC sync'ing? I see
> code in smpboot.c that seems to do this at startup (though I admit
> I haven't yet figured out exactly how... it looks like some kind of
> rendezvous loop triggered by the BP?). But I don't see where/how
> this gets called "every couple of seconds", nor do I see any writing
> to the TSC (except setting the BP and each AP to zero at startup).

arch/x86/time.c:local_time_calibration()

 -- Keir
> > Where is the code that does this independent TSC sync'ing? I see
> > code in smpboot.c that seems to do this at startup (though I admit
> > I haven't yet figured out exactly how... it looks like some kind of
> > rendezvous loop triggered by the BP?). But I don't see where/how
> > this gets called "every couple of seconds", nor do I see any writing
> > to the TSC (except setting the BP and each AP to zero at startup).
>
> arch/x86/time.c:local_time_calibration()

OK, thanks. If I read the code correctly, Xen goes through this effort
to ensure that the TSCs are synchronized, but maintains this
synchronization in a data structure and doesn't actually change each
processor's physical TSC. Correct?

This is of course just fine for the hypervisor's timer needs (and thus
indirectly for paravirtualized domains).

But I also observe that all of the hvm platform timer (pit, hpet, and
pmtimer) code is built on top of the physical TSC plus the vmx/svm
tsc_offset, which doesn't seem to be affected by the Xen TSC
synchronization. True?

So, assuming the above isn't mistaken, hvm domain reads of the platform
timer on an SMP system lacking hardware-synchronized TSCs may suffer
from non-monotonicity. Correct?

Thanks,
Dan
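A minimal sketch of the mechanism, using illustrative names rather than
Xen's exact fields: each CPU keeps a calibration record -- a TSC stamp,
the system time at that stamp, and a TSC-to-nanoseconds scale refreshed
by local_time_calibration() -- and local time is interpolated from the
record; the hardware TSC itself is never rewritten after boot.

    /* Sketch only -- illustrative names, not Xen's exact structures.
     * Assumes a rdtsc() helper returning the raw 64-bit counter. */
    struct cpu_time {
        uint64_t tsc_stamp;       /* TSC at last calibration */
        uint64_t stime_stamp;     /* system time (ns) at last calibration */
        uint32_t tsc_to_ns_mul;   /* scale refreshed each calibration pass */
        int      tsc_to_ns_shift;
    };

    static uint64_t local_system_time(const struct cpu_time *t)
    {
        uint64_t delta = rdtsc() - t->tsc_stamp;
        /* Xen does this multiply with a 128-bit intermediate to avoid
         * overflow; elided here for brevity. */
        return t->stime_stamp +
               ((delta * t->tsc_to_ns_mul) >> t->tsc_to_ns_shift);
    }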
> From: Dan Magenheimer
> Sent: April 9, 2008 1:40
>
> But I also observe that all of the hvm platform timer (pit, hpet, and
> pmtimer) code is built on top of the physical TSC plus the vmx/svm
> tsc_offset, which doesn't seem to be affected by the Xen TSC
> synchronization. True?

For cpus on the same system bus driven by one crystal, the TSC drift
among cpus may be just dozens of cycles after boot-time sync, which is
negligible compared to migration overhead, and thus an HVM guest is
unlikely to observe non-monotonic behavior when it resumes after such a
migration.

The issue comes with cpus running at different frequencies, e.g. driven
by multiple crystals, or with on-demand frequency changes that affect
the TSC too. An HVM guest can be configured to avoid migrating among
cpus with different TSC frequencies, e.g. by limiting its cpu affinity
to cpus on the same system bus. Or you have to configure the HVM guest
not to trust the TSC...

Thanks,
Kevin
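For reference, the pinning Kevin describes can be done from the domain
config file; a hedged example -- the exact cpu numbers are of course
machine-specific:

    # keep all of this guest's vcpus on pcpus 0-3 (e.g. one socket/bus)
    cpus = "0-3"

or at runtime with "xm vcpu-pin <domain> <vcpu|all> <cpus>".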
> > But I also observe that all of the hvm platform timer (pit, hpet, and
> > pmtimer) code is built on top of the physical TSC plus the vmx/svm
> > tsc_offset, which doesn't seem to be affected by the Xen TSC
> > synchronization. True?
>
> For cpus on the same system bus driven by one crystal, the TSC drift
> among cpus may be just dozens of cycles after boot-time sync, which is
> negligible compared to migration overhead, and thus an HVM guest is
> unlikely to observe non-monotonic behavior when it resumes after such
> a migration.

I agree this case is not much of a problem.

> The issue comes with cpus running at different frequencies, e.g. driven
> by multiple crystals, or with on-demand frequency changes that affect
> the TSC too. An HVM guest can be configured to avoid migrating among
> cpus with different TSC frequencies, e.g. by limiting its cpu affinity
> to cpus on the same system bus.

These are the cases I am worried about. The linux kernel seems to have
a number of cases that mark the TSC as unstable, but Xen does not, nor
(I think) does Xen expose this information anywhere. So it seems SMP
guests need to be pinned to physical CPUs that are measured to have
sync'ed TSCs to guarantee that the (virtual) platform timer is
monotonic.

> Or you have to configure the HVM guest not to trust the TSC...

Yes, that's what I'm thinking... like Linux, Xen could/should build
virtual platform timers on a physical clocksource other than the tsc if
all of the potential vcpu->pcpu mappings are not on sync'd-TSC pcpus.

I assume this problem is worse with multi-socket HyperTransport and
future Intel QPI boxes? Or is the TSC (and frequency changing)
synchronized on such systems?

Thanks,
Dan
> From: Dan Magenheimer [mailto:dan.magenheimer@oracle.com]
> Sent: April 9, 2008 9:55
>
> > Or you have to configure the HVM guest not to trust the TSC...
>
> Yes, that's what I'm thinking... like Linux, Xen could/should build
> virtual platform timers on a physical clocksource other than the tsc
> if all of the potential vcpu->pcpu mappings are not on sync'd-TSC
> pcpus.

Virtual platform timers are only one area. The most important is the
TSC itself, which guests use frequently to calculate relative
offsets...

> I assume this problem is worse with multi-socket HyperTransport and
> future Intel QPI boxes? Or is the TSC (and frequency changing)
> synchronized on such systems?

For the same-crystal case, Intel processors with VT-x support all have
the constant-TSC feature, under which the TSC rate is not bound to
frequency changes; it can be detected by CPUID. But for the
multiple-crystals case, Xen may then need to tackle affinity.

Thanks,
Kevin
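A minimal detection sketch for the feature Kevin mentions -- the
invariant ("constant") TSC indication lives in CPUID leaf 0x80000007,
EDX bit 8; a robust version would first confirm that leaf is supported
via leaf 0x80000000:

    #include <stdint.h>

    /* Sketch: detect invariant TSC via CPUID.80000007H:EDX[8]. */
    static int has_invariant_tsc(void)
    {
        uint32_t eax, ebx, ecx, edx;
        asm volatile ("cpuid"
                      : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                      : "a" (0x80000007u));
        return (edx >> 8) & 1;
    }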
> > The issue comes with cpus running at different frequencies, e.g.
> > driven by multiple crystals, or with on-demand frequency changes
> > that affect the TSC too. An HVM guest can be configured to avoid
> > migrating among cpus with different TSC frequencies, e.g. by
> > limiting its cpu affinity to cpus on the same system bus.
>
> These are the cases I am worried about. The linux kernel seems to have
> a number of cases that mark the TSC as unstable, but Xen does not, nor
> (I think) does Xen expose this information anywhere. So it seems SMP
> guests need to be pinned to physical CPUs that are measured to have
> sync'ed TSCs to guarantee that the (virtual) platform timer is
> monotonic.

Xen itself copes fine with CPUs running from entirely independent clock
sources. It calibrates the TSCs' frequencies against a global clock
(e.g. the hpet).

> > Or you have to configure the HVM guest not to trust the TSC...
>
> Yes, that's what I'm thinking... like Linux, Xen could/should build
> virtual platform timers on a physical clocksource other than the tsc
> if all of the potential vcpu->pcpu mappings are not on sync'd-TSC
> pcpus.

Although Xen is fine, guests can get confused if they're relying on the
TSC. Fortunately, Windows doesn't rely on the TSC, and most folk run
Linux PV, which also works fine.

If you want to make Linux work HVM on such a system you need to either
convince it not to use the TSC, or arrange for TSC reads to trap to Xen
and then compute the result based on Xen's time base. If you're doing
the latter, better hope that TSC reads aren't frequent...

Ian
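To make the trap-and-emulate option concrete, a rough sketch of the
exit-handler side; hvm_get_guest_time() and advance_guest_rip() here
stand in for whatever helpers the real HVM code provides:

    /* Sketch: on an RDTSC vmexit, answer from Xen's time base
     * instead of the raw hardware TSC. */
    void handle_rdtsc_exit(struct vcpu *v, struct cpu_user_regs *regs)
    {
        uint64_t t = hvm_get_guest_time(v); /* derived from system time */
        regs->eax = (uint32_t)t;
        regs->edx = (uint32_t)(t >> 32);
        advance_guest_rip(v);               /* skip the 2-byte rdtsc */
    }

Every guest rdtsc then costs a full vmexit -- thousands of cycles --
which is exactly the "better hope TSC reads aren't frequent" caveat.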
> Although Xen is fine, guests can get confused if they're relying on
> the TSC. Fortunately, Windows doesn't rely on the TSC, and most folk
> run Linux PV, which also works fine.
>
> If you want to make Linux work HVM on such a system you need to either
> convince it not to use the TSC, or arrange for TSC reads to trap to
> Xen and then compute the result based on Xen's time base. If you're
> doing the latter, better hope that TSC reads aren't frequent...

Hi Ian --

Let me clarify... unless my reading of the code is wrong, ALL hvm
guests that rely on ANY (virtual) platform timer are UNKNOWINGLY
relying on the physical TSCs. Thus if the underlying physical system
has unsynchronized TSCs, different vcpus in an SMP HVM guest (or even
the SAME vcpu when rescheduled on another pcpu) may find that
consecutive reads of ANY (virtual) platform timer are unexpectedly
non-monotonic, which violates the whole purpose of using a PLATFORM
timer.

I suspect this is unintended and bad?

Thanks,
Dan
On 9/4/08 15:25, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Let me clarify... unless my reading of the code is wrong, ALL hvm
> guests that rely on ANY (virtual) platform timer are UNKNOWINGLY
> relying on the physical TSCs. Thus if the underlying physical system
> has unsynchronized TSCs, different vcpus in an SMP HVM guest (or even
> the SAME vcpu when rescheduled on another pcpu) may find that
> consecutive reads of ANY (virtual) platform timer are unexpectedly
> non-monotonic, which violates the whole purpose of using a PLATFORM
> timer.

This is all true. The logic in vpt.c should be fixed to use Xen's
concept of system time, and everything, guest TSC included, should be
derived from that.

 -- Keir
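Sketched out, that fix amounts to giving each domain an offset against
Xen system time and deriving every virtual timer from that one value;
names here are illustrative:

    /* Sketch: all virtual platform timers tick off Xen system time. */
    static uint64_t hvm_guest_time(struct vcpu *v)
    {
        /* NOW() is Xen's calibrated system time in nanoseconds. */
        return NOW() + v->domain->guest_time_offset;
    }

A virtual HPET/PIT/PM-timer read then scales this single value to its
own frequency, rather than doing rdtsc() plus the vmx/svm tsc_offset.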
> > Let me clarify... unless my reading of the code is wrong, ALL hvm
> > guests that rely on ANY (virtual) platform timer are UNKNOWINGLY
> > relying on the physical TSCs. Thus if the underlying physical
> > system has unsynchronized TSCs, different vcpus in an SMP HVM guest
> > (or even the SAME vcpu when rescheduled on another pcpu) may find
> > that consecutive reads of ANY (virtual) platform timer are
> > unexpectedly non-monotonic, which violates the whole purpose of
> > using a PLATFORM timer.
>
> This is all true. The logic in vpt.c should be fixed to use Xen's
> concept of system time, and everything, guest TSC included, should be
> derived from that.

Does Xen's concept of system time have sufficient resolution and
continuity to ensure both monotonicity and a reasonable guest timer
granularity? I'm thinking not; some form of interpolation will probably
be necessary, which will require reading a physical platform timer**
(i.e. other than the tsc).

Since a guest that is presented with a (virtual) platform timer of a
given resolution may come to rely on both the monotonicity AND
resolution of that timer, I'm beginning to understand why "that other
virtualization company" doesn't virtualize HPET.

Dan

** Lest anyone say "well then just read the d**n platform timer", be
aware that it must be done judiciously, as it can be very expensive: on
one recent-vintage box I have, I measured reading the HPET at about
10000 cycles and reading the PIT at about 50000! So if every vcpu on
every guest reads the (virtual) platform timer at 1000Hz, things can
get ugly fast.
On 9/4/08 17:33, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> This is all true. The logic in vpt.c should be fixed to use Xen's
>> concept of system time, and everything, guest TSC included, should
>> be derived from that.
>
> Does Xen's concept of system time have sufficient resolution and
> continuity to ensure both monotonicity and a reasonable guest timer
> granularity? I'm thinking not; some form of interpolation will
> probably be necessary, which will require reading a physical platform
> timer** (i.e. other than the tsc).

Xen's system time provides nanosecond precision and is intended to be
as accurate as the underlying platform timer (over long periods) and as
granular and accurate as the TSC over sub-second periods. It's quite
good enough for any guest purposes.

> Since a guest that is presented with a (virtual) platform timer of a
> given resolution may come to rely on both the monotonicity AND
> resolution of that timer, I'm beginning to understand why "that other
> virtualization company" doesn't virtualize HPET.

The HPET is a good example of the difference between precision and
accuracy. It may report its period in picoseconds, but the spec allows
drift of 100s of ppm.

 -- Keir
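To put rough numbers on that accuracy point, skew accumulates as the
frequency error times the free-running interval:

    skew = rate_error x interval
    e.g.    1 ppm x 1 s =   1 us
          100 ppm x 1 s = 100 us

So over a one-second free-running period, a good crystal wanders by
about a microsecond, while a platform timer at the "100s of ppm" end of
the spec can wander by 100us or more.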
> >> This is all true. The logic in vpt.c should be fixed to use Xen's
> >> concept of system time, and everything, guest TSC included, should
> >> be derived from that.
> >
> > Does Xen's concept of system time have sufficient resolution and
> > continuity to ensure both monotonicity and a reasonable guest timer
> > granularity? I'm thinking not; some form of interpolation will
> > probably be necessary, which will require reading a physical
> > platform timer** (i.e. other than the tsc).
>
> Xen's system time provides nanosecond precision and is intended to be
> as accurate as the underlying platform timer (over long periods) and
> as granular and accurate as the TSC over sub-second periods. It's
> quite good enough for any guest purposes.

OK, as long as the maximum uncorrected drift between physical TSCs does
not exceed the guest-expected granularity of its virtual platform
timer, I agree it's good enough.

It appears that the TSC drift for each pcpu is corrected by Xen once
per second. Any idea, for real systems out there, what the maximum
drift (per second) is? Will this be affected by existing or future
power-saving designs (e.g. is it possible for the TSCs in one socket to
be slowed down while the TSCs in another socket are not)? If so, as
Kevin points out, some kind of affinity enforcement might be necessary
for time-sensitive VMs.
On 9/4/08 19:36, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> OK, as long as the maximum uncorrected drift between physical TSCs
> does not exceed the guest-expected granularity of its virtual platform
> timer, I agree it's good enough.

Ignoring power-saving events, TSCs are crystal-driven and hence we can
expect a specified tolerance of a few ppm across temperature extremes,
and in practice over few-second periods I would expect tolerance of
better than 1ppm. *However*, I have seen platform timers (which also
should be crystal-driven) which inexplicably exhibit much worse
behaviour.

> It appears that the TSC drift for each pcpu is corrected by Xen once
> per second. Any idea, for real systems out there, what the maximum
> drift (per second) is? Will this be affected by existing or future
> power-saving designs (e.g. is it possible for the TSCs in one socket
> to be slowed down while the TSCs in another socket are not)? If so,
> as Kevin points out, some kind of affinity enforcement might be
> necessary for time-sensitive VMs.

Xen is notified of P-state changes, so we can re-sync the local TSC
immediately. The tricky ones are unannounced thermal events, because
software does not get informed about those. On some systems we can turn
them off; on others (new Intel platforms) the TSC is constant-rate
regardless. In a normal running system thermal events are rare.

 -- Keir
> > OK, as long as the maximum uncorrected drift between physical TSCs
> > does not exceed the guest-expected granularity of its virtual
> > platform timer, I agree it's good enough.
>
> Ignoring power-saving events, TSCs are crystal-driven and hence we can
> expect a specified tolerance of a few ppm across temperature extremes,
> and in practice over few-second periods I would expect tolerance of
> better than 1ppm. *However*, I have seen platform timers (which also
> should be crystal-driven) which inexplicably exhibit much worse
> behaviour.

OK... back to monotonicity for a moment: regardless of ppms and thermal
and P-state events and drifts, are you confident that the current
corrected-tsc mechanism will never see time going backwards in the
following test? (Apologies for the pseudo-code, but I hope you get the
drift... pun intended.)

    volatile uint64_t val1;
    volatile int proceed = 0;
    spinlock_t lock;

    /* Guest thread 1 */
    spin_lock(&lock);
    val1 = read_hpet();
    proceed = 1;
    spin_unlock(&lock);

    /* Guest thread 2 */
    uint64_t val2;
    while (!proceed)
        ;                        /* wait for val1 to be published */
    spin_unlock_wait(&lock);     /* let thread 1 drop the lock first */
    val2 = read_hpet();
    if (val2 < val1)
        PANIC();                 /* the (virtual) hpet went backwards */

If you are not confident that this will be OK on existing and
(within-reason) future Xen platforms, perhaps the hvm virtual platform
timers should (at least optionally) be built on physical platform
timers (Dave Winchell cc'ed), which would ensure time never goes
backwards.

> > It appears that the TSC drift for each pcpu is corrected by Xen once
> > per second. Any idea, for real systems out there, what the maximum
> > drift (per second) is? Will this be affected by existing or future
> > power-saving designs (e.g. is it possible for the TSCs in one socket
> > to be slowed down while the TSCs in another socket are not)? If so,
> > as Kevin points out, some kind of affinity enforcement might be
> > necessary for time-sensitive VMs.
>
> Xen is notified of P-state changes, so we can re-sync the local TSC
> immediately. The tricky ones are unannounced thermal events, because
> software does not get informed about those. On some systems we can
> turn them off; on others (new Intel platforms) the TSC is
> constant-rate regardless. In a normal running system thermal events
> are rare.

If it is possible to write code that can determine at boot-time (or at
hotplug cpu_online) which CPUs are guaranteed-sync'ed with which other
CPUs, it would be nice if this information were exported by Xen so that
tools can manage very-time-sensitive guests appropriately. Personally,
I think this code should be provided by the CPU vendors ;-)

Dan
On 10/4/08 22:27, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> If you are not confident that this will be OK on existing and
> (within-reason) future Xen platforms, perhaps the hvm virtual platform
> timers should (at least optionally) be built on physical platform
> timers (Dave Winchell cc'ed), which would ensure time never goes
> backwards.

If we wanted to be more certain we could maintain a last_system_time
field per VCPU and, whenever using system time to compute the current
value of a virtual timer for an HVM VCPU, actually use max(system time,
last_system_time). This would mean we were 100% sure that time didn't
go backwards, by turning small backwards deltas into very short periods
of stalled time.

As it is: no, since system time 'free runs' on each CPU over one-second
periods, there can be drift between CPUs if they are driven by
different oscillators. Also there are tolerances in our software
calibration code to consider. This is why Linux guests implement
max(curr time, last time) in their gettimeofday() code. It would be
quite reasonable to do the same, inside Xen, for HVM guests. We can at
least be pretty certain that any drifts across CPUs/VCPUs will be on
the order of less than 100us.

 -- Keir
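A sketch of that clamp, with an illustrative field name:

    /* Sketch: never let an HVM VCPU observe time moving backwards;
     * small negative deltas become brief stalls instead. */
    static uint64_t hvm_monotonic_time(struct vcpu *v)
    {
        uint64_t now = NOW();           /* Xen system time, ns */
        if ( now < v->last_system_time )
            now = v->last_system_time;  /* stall briefly */
        else
            v->last_system_time = now;
        return now;
    }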
> If we wanted to be more certain we could maintain a last_system_time
> field per VCPU and...

If you mean per VCPU *and* per guest, this seems like a good idea.

> ...backwards, by turning small backwards deltas into very short
> periods of stalled time.

The stalled time may be a problem, but only if the tsc skew between
processors is "bad". Your estimate of 100us seems like it could be
unacceptable for some applications.

Any idea how expensive arch/x86/time.c:local_time_calibration() is? If
it's not too bad, one option might be to add a xen boot parameter
"calibratehz" to calibrate more frequently. Then systems running
time-sensitive guests could be instructed to increase the parameter
accordingly to ensure the tsc skew is small enough.
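For concreteness, a hedged sketch of what such a knob might look like;
integer_param(), set_timer() and MILLISECS() are Xen's existing
helpers, while "calibratehz" itself is hypothetical:

    /* Hypothetical boot parameter: calibrate N times per second
     * instead of once per second. */
    static unsigned int opt_calibratehz = 1;
    integer_param("calibratehz", opt_calibratehz);

    /* ...then when rearming the per-cpu calibration timer at the end
     * of local_time_calibration(), use the faster rate (t being the
     * per-cpu time state): */
    set_timer(&t->calibration_timer,
              NOW() + MILLISECS(1000 / opt_calibratehz));

A sanity clamp on the parameter (say 1..1000) would be needed to avoid
a divide-by-zero or pathological timer rates.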