Dan Magenheimer
2008-Jul-02 16:03 UTC
[Xen-devel] [PATCH] strictly increasing hvm guest time
This simple one-line patch changes hvm guest time from monotonically
non-decreasing to monotonically strictly increasing. As a result, two
consecutive reads of the (virtual) hpet will never return the same
value, thus avoiding the appearance that time has stopped (which may
occur if there is skew between physical processor TSCs).

The only problem scenario I can see is if:

1) N = number of physical CPUs on the system
2) T = time in nsec of the fastest call P that an hvm guest can make
   that indirectly invokes hvm_get_guest_time()
3) N > T, i.e., the N CPUs together complete more than one call per
   nanosecond (highly unlikely)
4) guests on all N physical CPUs are continuously calling P (also
   highly unlikely)

then guest time could accelerate faster than Xen system time.

Dan

==================================
Thanks... for the memory
I really could use more / My throughput's on the floor
The balloon is flat / My swap disk's fat / I've OOMs in store
Overcommitted so much
(with apologies to the late great Bob Hope)
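[A minimal sketch of the kind of strictly-increasing clamp described
above -- not the actual one-line patch, which was attached to the
original mail. The lock and field names here are hypothetical.]

/* Return a guest time value strictly greater than any previously
 * returned one, bumping by 1ns whenever a raw read would repeat. */
static uint64_t hvm_strictly_increasing_time(struct domain *d,
                                             uint64_t now)
{
    spin_lock(&d->arch.hvm_time_lock);          /* hypothetical lock */
    if ( now <= d->arch.hvm_last_time )         /* hypothetical field */
        now = d->arch.hvm_last_time + 1;
    d->arch.hvm_last_time = now;
    spin_unlock(&d->arch.hvm_time_lock);
    return now;
}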
Keir Fraser
2008-Jul-02 16:07 UTC
[Xen-devel] Re: [PATCH] strictly increasing hvm guest time
On 2/7/08 17:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> This simple one-line patch changes hvm guest time from monotonically
> non-decreasing to monotonically strictly increasing. As a result,
> two consecutive reads of the (virtual) hpet will never return the
> same value, thus avoiding the appearance that time has stopped
> (which may occur if there is skew between physical processor TSCs).

It does seem a little hack-ish, if we don't know of any issues arising
from the current code, and we expect cross-cpu deltas to be pretty
small. Also, guests will often convert HPET reads to well-known units
(e.g., microseconds, milliseconds) before using them, in which case
even a delta of one may not result in differing converted time values.

 -- Keir
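[An illustration of Keir's unit-conversion point, as a stand-alone
user-space C snippet; 14.31818MHz is the common nominal HPET
frequency, so one tick is roughly 69.8ns.]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t freq_hz = 14318180;          /* nominal HPET frequency */
    uint64_t t1 = 123456789, t2 = t1 + 1; /* two consecutive reads */
    uint64_t us1 = t1 * 1000000 / freq_hz;
    uint64_t us2 = t2 * 1000000 / freq_hz;

    /* For most values of t1, us1 == us2: the one-tick bump the patch
     * guarantees vanishes once the guest truncates to microseconds. */
    printf("us1=%llu us2=%llu\n",
           (unsigned long long)us1, (unsigned long long)us2);
    return 0;
}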
Dan Magenheimer
2008-Jul-02 21:50 UTC
[Xen-devel] RE: [PATCH] strictly increasing hvm guest time
> > This simple one-line patch changes hvm guest time from
> > monotonically non-decreasing to monotonically strictly
> > increasing. [...]
>
> It does seem a little hack-ish, if we don't know of any issues
> arising from the current code, and we expect cross-cpu deltas to be
> pretty small.

Using "xm debug-key t; xm dmesg | tail -1" you can get an idea of the
deltas. Even on my single-socket dual-core recent-vintage Intel box,
I'm frequently seeing diffs of > 300ns. While this is still relatively
small (and part of it may be SMP cache synchronization time), this is
supposed to be a "good TSC" box. I'm spinning a small patch capturing
the maximum so that it can be output via debug-key t as well.

> Also, guests will often convert HPET reads to well-known units
> (e.g., microseconds, milliseconds) before using them, in which case
> even a delta of one may not result in differing converted time
> values.

Yes, but most newer Linux systems have a high-res timer API that
returns nanoseconds, though admittedly it is not widely used yet.

Thanks,
Dan
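[The nanosecond-resolution interface Dan alludes to is presumably the
POSIX clock_gettime() call backed by Linux hrtimers; a minimal
guest-side sketch of the "two back-to-back reads" scenario (on
2008-era glibc, link with -lrt):]

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    clock_gettime(CLOCK_MONOTONIC, &b);
    /* If both reads return the identical nanosecond value, time
     * appears to have stopped from the guest's point of view. */
    printf("a=%ld.%09ld b=%ld.%09ld identical=%d\n",
           (long)a.tv_sec, a.tv_nsec, (long)b.tv_sec, b.tv_nsec,
           a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
    return 0;
}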
Dan Magenheimer
2008-Jul-02 22:41 UTC
[Xen-devel] [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> I'm spinning a small patch capturing the maximum so that it can
> be output via debug-key t as well.

Attached is the patch. Interestingly, on my single-socket two-core
recent-vintage Intel processor, this patch reports a max skew of
>13 usec, much higher than the values I'm seeing from "xm debug-key
t". I wonder if this is due to a mistake in my patch (though I don't
see it), or if the various stime error corrections are not converging
as expected, resulting in a broader stime skew between processors than
expected?

Dan
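[The attachment itself is not preserved in this archive; a rough
sketch of the idea -- per-cpu max-skew bookkeeping hooked into the
once-per-second local_time_calibration() -- with illustrative names:]

static DEFINE_PER_CPU(int64_t, max_stime_skew);

static void record_stime_skew(int64_t local_stime, int64_t master_stime)
{
    int64_t skew = local_stime - master_stime;

    if ( skew < 0 )
        skew = -skew;
    if ( skew > this_cpu(max_stime_skew) )
        this_cpu(max_stime_skew) = skew;  /* dumped by debug-key 't' */
}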
Keir Fraser
2008-Jul-03 08:03 UTC
[Xen-devel] Re: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 2/7/08 23:41, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Attached is the patch. Interestingly, on my single-socket two-core
> recent-vintage Intel processor, this patch reports a max skew of
> >13 usec, much higher than the values I'm seeing from "xm debug-key
> t". [...]

Perhaps this relatively large skew happens at start of day, before the
periodic calibration has 'locked on'?

 -- Keir
Dan Magenheimer
2008-Jul-03 16:24 UTC
[Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > Attached is the patch. Interestingly, on my single-socket
> > two-core recent-vintage Intel processor, this patch reports a max
> > skew of >13 usec, much higher than the values I'm seeing from
> > "xm debug-key t". [...]
>
> Perhaps this relatively large skew happens at start of day, before
> the periodic calibration has 'locked on'?

Indeed you are correct. This updated patch now reports zero skew as
expected.

IMHO, it would be nice to put this patch into the tree, as it will be
good for helping to diagnose time skew problems such as the one just
reported on the list.

Thanks,
Dan
Dan Magenheimer
2008-Jul-03 16:35 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > Perhaps this relatively large skew happens at start of day,
> > before the periodic calibration has 'locked on'?
>
> Indeed you are correct. This updated patch now reports zero skew
> as expected.
>
> IMHO, it would be nice to put this patch into the tree, as it will
> be good for helping to diagnose time skew problems such as the one
> just reported on the list.

Oops! Just after I sent the above email, I checked again, and the same
machine (no reboots, no guests ever launched) now reports a max stime
skew of 4333ns!! Methinks there might be some periodic glitch in the
calibration code?

Dan
Dan Magenheimer
2008-Jul-03 20:03 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > IMHO, it would be nice to put this patch into the tree, as it
> > will be good for helping to diagnose time skew problems such as
> > the one just reported on the list.
>
> Oops! Just after I sent the above email, I checked again, and the
> same machine (no reboots, no guests ever launched) now reports a
> max stime skew of 4333ns!! Methinks there might be some periodic
> glitch in the calibration code?

OK, this version records not only the max but also a distribution of
skew. (The code is a bit ugly... I thought about doing something fancy
with log-binary but decided a few base-10 ranges were clearer for a
human to read.)

With this, I use "watch -d 'xm debug-key t; xm dmesg | tail -3'" and
can observe that (on my single-socket two-core recent-vintage Intel
box) roughly three-quarters of the skew measurements are between
10-100ns, roughly one-quarter are between 100ns-1us, a couple percent
are between 1us-10us, and a few are >10us.

This represents an approximate distribution of how long an hvm guest
might observe time to be stopped (if it is able to repeatedly read
time values quickly enough).

So on some machines, this might be substantially worse than the old
hvm-platform-timer-built-on-tsc mechanism (though we had no
monotonicity constraint built into that).

I wonder if the >1us outliers occur only when the processor has been
idle for a while, or whether they are entirely random.

Dan
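[A sketch of the base-10 bucketing Dan describes -- illustrative
only, not the actual attached patch. Buckets: <10ns, 10-100ns,
100ns-1us, 1us-10us, >=10us; counts dumped via debug-key 't'.]

#define SKEW_BUCKETS 5
static uint64_t skew_hist[NR_CPUS][SKEW_BUCKETS];

static void record_skew_sample(unsigned int cpu, uint64_t skew_ns)
{
    unsigned int bucket = 0;
    uint64_t bound = 10;   /* decade boundaries: 10, 100, 1000, ... */

    while ( bucket < SKEW_BUCKETS - 1 && skew_ns >= bound )
    {
        bucket++;
        bound *= 10;
    }
    skew_hist[cpu][bucket]++;
}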
Keir Fraser
2008-Jul-03 23:00 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
Skipping cpu0 makes no sense. It's not the 'master'. master_stime is
time calculated from the platform timer (hpet, pit, or whatever). All
cpus are equal peers. Apart from that, it looks plausible to me.

 -- Keir

On 3/7/08 21:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> OK, this version records not only the max but also a distribution
> of skew. (The code is a bit ugly... I thought about doing something
> fancy with log-binary but decided a few base-10 ranges were clearer
> for a human to read.) [...]
Dan Magenheimer
2008-Jul-04 15:11 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> Skipping cpu0 makes no sense.

Oops, I misunderstood that for some reason. Here's a fixed version. I
also now preserve the "Platform timer is" line, since that can get
flushed out of the dmesg buffer.

Any idea why the skew can get so bad?

Dan

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Thursday, July 03, 2008 5:00 PM
>
> Skipping cpu0 makes no sense. It's not the 'master'. master_stime
> is time calculated from the platform timer (hpet, pit, or whatever).
> All cpus are equal peers. Apart from that, it looks plausible to me.
> [...]
Keir Fraser
2008-Jul-04 15:22 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Oops, I misunderstood that for some reason.
>
> Here's a fixed version. I also now preserve the "Platform timer is"
> line, since that can get flushed out of the dmesg buffer.
>
> Any idea why the skew can get so bad?

Not really. We could check in this patch or similar, and perhaps
collect more information.

 -- Keir
Dan Magenheimer
2008-Jul-04 19:32 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> wrote:
>
> > Oops, I misunderstood that for some reason.
> >
> > Here's a fixed version. I also now preserve the "Platform timer
> > is" line, since that can get flushed out of the dmesg buffer.

OOPS, forgot the patch! Attached this time.

> > Any idea why the skew can get so bad?
>
> Not really. We could check in this patch or similar, and perhaps
> collect more information.
>
> -- Keir

Well, one suspicion I had was that very long hpet reads were getting
serialized, but I tried clocksource=acpi and clocksource=pit and got
similar skew-range results. In fact, pit shows a max of >17000ns vs
hpet and acpi closer to 11000ns. (OTOH, I suppose it IS possible that
this is roughly how long it takes to read each of these platform
timers.)

Dan
Keir Fraser
2008-Jul-04 19:56 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 4/7/08 20:32, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Well, one suspicion I had was that very long hpet reads were getting
> serialized, but I tried clocksource=acpi and clocksource=pit and got
> similar skew-range results. In fact, pit shows a max of >17000ns vs
> hpet and acpi closer to 11000ns. (OTOH, I suppose it IS possible
> that this is roughly how long it takes to read each of these
> platform timers.)

That ought to be easy to check. I would expect that the PIT, for
example, could take a couple of microseconds to access.

 -- Keir
Dan Magenheimer
2008-Jul-10 00:24 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> That ought to be easy to check. I would expect that the PIT, for
> example, could take a couple of microseconds to access.
>
> -- Keir

(I haven't seen the patch applied... since it just collects data, it
would be nice if it were applied so others could try it.)

To follow up on this, I tried a number of tests but wasn't able to
identify the problem, and have given up (for now). In case someone
else starts looking at this (or in case any of my tests suggest a
solution to someone), I thought I'd document what I tried.

PROBLEM: Xen system time skew between a processor's local time and
platform time is generally "small" but "sometimes" gets quite "large".
This is important because the larger the skew, the more likely an hvm
guest is to experience time stopping or (in some cases) time going
backwards. On my box, "small" is under 1 usec, "large" is 9-18 usec,
and "sometimes" is about one out of 500 measurements. Note that my box
is a recent-vintage Intel single-socket dual-core ("Conroe"). I
suspect that periodically some lock is being waited on for a long
time, or maybe an unexpected interrupt is occurring, but I didn't find
anything through code reading or experiments.

TEST METHOD: The patch I sent on this thread collects data whenever
local_time_calibration() is run (which is 1Hz on each processor), and
"xm debug-key t" prints this data so it can be seen with "xm dmesg".
To see the problem, one need only boot dom0 and run xm debug-key and
xm dmesg.

1) CONJECTURE: Related to how long it takes to read the platform
timer.

The max skew (and distribution) are definitely different depending on
whether clocksource=hpet or clocksource=pit. For hpet, I am almost
always seeing a max skew of 11000ns+, and with pit 17000ns+. ONCE
(over many hours of runs) I saw a skew with hpet of 15000ns. However,
I added code in the platform timer read routine (inside all locks but
NOT with interrupts off) to artificially lengthen a platform timer
read, and it made no difference in the measurements.

2) CONJECTURE: Max skew occurs only on some processors (e.g., not on
the one that does the platform calibration).

Nope: if you wait long enough, max skew is fairly close on all
processors (though in some cases it seems to take a long time...
perhaps because of unbalanced load?).

3) CONJECTURE: Max skew occurs on platform timer overflow.

Possibly, but there is certainly not a 1-1 correspondence. Sometimes
there are more large skews than overflows, and sometimes fewer.

4) CONJECTURE: Artifact of ntpd running.

Nope: same skews whether or not ntpd is running on dom0.

5) CONJECTURE: Related to frequency changes or suspends.

Nope: neither of these is happening on my box.

6) CONJECTURE: The "weirdness can happen" comment in time.c.

Nope: this path isn't getting executed.

7) CONJECTURE: Result of natural skew between the platform timer and
tsc, plus jitter. Unfixable.

Possible; untested; not sure how.
Keir Fraser
2008-Jul-10 07:40 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 10/7/08 01:24, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 7) CONJECTURE: Result of natural skew between the platform timer
> and tsc, plus jitter. Unfixable.
>
> Possible; untested; not sure how.

I ended up suspecting this on one of the test platforms I originally
did the Xen-system-time implementation on. It was an old AMD white
box, iirc. On that system, TSC and platform time seemed to have
significant and inexplicable jitter at around 1Hz. The jitter was
hundreds of ppm, which was totally unexpected for what should be
crystal-based oscillators. And the test code was simple enough that it
was hard to suspect that either (I think I was just dumping the
counters every second or two, after reading them as close together as
I could).

 -- Keir
Dan Magenheimer
2008-Jul-10 22:42 UTC
Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > 7) CONJECTURE: Result of natural skew between the platform timer
> > and tsc, plus jitter. Unfixable.
> >
> > Possible; untested; not sure how.
>
> I ended up suspecting this on one of the test platforms I
> originally did the Xen-system-time implementation on. It was an old
> AMD white box, iirc. On that system, TSC and platform time seemed
> to have significant and inexplicable jitter at around 1Hz. [...]

Is this the code in read_clocks() in keyhandler.c? If so, I just did
an experiment there with some interesting results:

I modified that code to record the "maxdif" and then executed it
>10000 times. The result shows a maxdif of ~11usec, which corresponds
with my earlier measurements.

Next, I replaced the calls to NOW() in read_clocks() and
read_clocks_slave() with rdtscll(). Guess what? The result is a maxdif
of 11000 "ticks", but now on a 3GHz clock, which is about 3.3usec.

Next, I disabled interrupts in read_clocks_slave() around the while
loop plus the rdtscll(), to ensure I'm not accidentally counting any
interrupts. Now I'm seeing maxdif < 330ns (>6000 measurements).

Next, I went back to NOW(), but with interrupts disabled as above. So
far maxdif is about 10.7usec (>6000 measurements).

SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!

Looks to me like there's still something algorithmically wrong, and
it's not just natural skew and jitter. Maybe some corner case in the
scale-delta code? Also, should interrupts be turned off during the
calibration part of init_pit_and_calibrate_tsc() (which might cause
different scaling factors for each CPU)?
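[A hedged sketch of the kind of measurement described above. Xen's
real code is read_clocks()/read_clocks_slave() in keyhandler.c; the
names and bookkeeping below are illustrative only.]

static unsigned long go_mask;      /* bit set => that CPU may sample */
static uint64_t sample[NR_CPUS];

static void sample_clock_slave(void *unused)
{
    unsigned int cpu = smp_processor_id();
    unsigned long flags;

    local_irq_save(flags);
    while ( !test_bit(cpu, &go_mask) )
        cpu_relax();               /* wait for the master's signal */
    rdtscll(sample[cpu]);          /* or: sample[cpu] = NOW(); */
    local_irq_restore(flags);
}

/* Master side, after all CPUs have sampled: worst cross-CPU spread. */
static uint64_t max_diff(void)
{
    uint64_t lo = ~0ULL, hi = 0;
    unsigned int cpu;

    for_each_online_cpu ( cpu )
    {
        if ( sample[cpu] < lo ) lo = sample[cpu];
        if ( sample[cpu] > hi ) hi = sample[cpu];
    }
    return hi - lo;
}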
Keir Fraser
2008-Jul-11 08:27 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 10/7/08 23:42, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
>
> Looks to me like there's still something algorithmically wrong, and
> it's not just natural skew and jitter. Maybe some corner case in
> the scale-delta code? Also, should interrupts be turned off during
> the calibration part of init_pit_and_calibrate_tsc() (which might
> cause different scaling factors for each CPU)?

I didn't measure skew across CPUs. I measured jitter between one local
TSC and the chosen platform timer for calibration (in my case I think
this was the HPET). I did this because getting a consistent tick rate
from the platform timer, and from each local TSC, is the basis of the
calibration algorithm. The more jitter there is between them, the less
well it will work.

I implemented a user-space program to collect the required stats. It
used CLI/STI to prevent getting interrupted when reading the timer
pair.

 -- Keir
Dan Magenheimer
2008-Jul-11 20:53 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> I didn't measure skew across CPUs. I measured jitter between one
> local TSC and the chosen platform timer for calibration (in my case
> I think this was the HPET). I did this because getting a consistent
> tick rate from the platform timer, and from each local TSC, is the
> basis of the calibration algorithm. The more jitter there is
> between them, the less well it will work.
>
> I implemented a user-space program to collect the required stats.
> It used CLI/STI to prevent getting interrupted when reading the
> timer pair.

Hmmm... if the TSC is known to be stable*, is there any reason to do
the calibration against the platform timer at all? If TSC is stable,
could we instead just do, essentially, a divide by cpu_ghz in
get_s_time() and be done, with no periodic local_time_calibration()
necessary? Since TSC is stable on many newer platforms, it would be
nice to use this feature to decrease skew for guests (both PV and
HVM).

* "stable" is the term used by Linux to mean that there's no skew
between the different TSCs in an SMP system

I gave this a try and it seems to work so far. (Fortunately, my CPU is
3GHz, so I just had to divide by 3... I'm not sure how to divide by a
non-integer.) Max skew for stime is holding steady at 270ns, >40x
better than periodic calibration with hpet.

If this sounds good, a design question: should this be controlled:

1) by a boot option, or
2) by the TSC_CONSTANT cpu flag, or
3) when determined dynamically to be safe, using code similar to
   arch/x86/tsc_sync.c in recent Linux kernels?

(1) is by far the easiest (perhaps not too late for 3.3?), while (3)
is clearly the best for users but adds lots of code (bloat/untested).

Thanks,
Dan
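[Dan's "divide by a non-integer" puzzle is conventionally solved with
fixed-point arithmetic; the scale-delta code mentioned earlier in the
thread works along these lines. A stand-alone illustrative sketch,
not the actual Xen routine:]

#include <stdint.h>

/* Precompute ns-per-tick as a 32.32 fixed-point multiplier.
 * E.g. for a 2.4GHz TSC: (10^9 << 32) / 2400000000 ~= 0.4166 * 2^32. */
static uint64_t make_nsec_multiplier(uint64_t tsc_hz)
{
    return (1000000000ULL << 32) / tsc_hz;
}

/* ns = (tsc * mul) >> 32, with a 128-bit intermediate so the multiply
 * cannot overflow (gcc/clang __uint128_t; Xen uses inline asm). */
static uint64_t tsc_to_nsec(uint64_t tsc, uint64_t mul)
{
    return (uint64_t)(((__uint128_t)tsc * mul) >> 32);
}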
Ian Pratt
2008-Jul-11 21:27 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Hmmm... if the TSC is known to be stable*, is there any reason to
> do the calibration against the platform timer at all? If TSC is
> stable, could we instead just do, essentially, a divide by cpu_ghz
> in get_s_time() and be done, with no periodic
> local_time_calibration() necessary? Since TSC is stable on many
> newer platforms, it would be nice to use this feature to decrease
> skew for guests (both PV and HVM).
>
> * "stable" is the term used by Linux to mean that there's no skew
> between the different TSCs in an SMP system

Some NUMA systems have different oscillators on each node, so you
can't rely on the frequency being identical. Such systems are fairly
rare (though their common use case is server virtualization). I guess
a command line option to enable independent calibration for these
systems would be OK, though it would obviously be better to start off
assuming the frequencies are identical, and then detect rate
differences.

Ian
Keir Fraser
2008-Jul-11 21:27 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 11/7/08 21:53, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 1) by a boot option, or
> 2) by the TSC_CONSTANT cpu flag, or
> 3) when determined dynamically to be safe, using code similar to
>    arch/x86/tsc_sync.c in recent Linux kernels?
>
> (1) is by far the easiest (perhaps not too late for 3.3?), while
> (3) is clearly the best for users but adds lots of code
> (bloat/untested).

(1) is perhaps fine.

How does (2) work? The individual CPUs do not know whether they are
synchronised across the mainboard. I think constant-tsc is necessary
(individual CPUs must not vary their multiplier of the input clock
rate) but may not be sufficient.

I don't know how much code is involved in (3).

 -- Keir
Dan Magenheimer
2008-Jul-12 21:05 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Some NUMA systems have different oscillators on each node, so you
> can't rely on the frequency being identical. Such systems are
> fairly rare (though their common use case is server
> virtualization). I guess a command line option to enable
> independent calibration for these systems would be OK, though it
> would obviously be better to start off assuming the frequencies are
> identical, and then detect rate differences.

Good point. This is the way that Linux does it too, I think.

Dan
Dan Magenheimer
2008-Jul-12 21:07 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > 1) by a boot option, or
> > 2) by the TSC_CONSTANT cpu flag, or
> > 3) when determined dynamically to be safe, using code similar to
> >    arch/x86/tsc_sync.c in recent Linux kernels?
>
> (1) is perhaps fine.

OK, patch to follow. I've used "clocksource=tsc".

> How does (2) work? The individual CPUs do not know whether they are
> synchronised across the mainboard. I think constant-tsc is
> necessary (individual CPUs must not vary their multiplier of the
> input clock rate) but may not be sufficient.

Good point.

> I don't know how much code is involved in (3).

It's enough that I will take the "easy way" for now (boot option) and
look at submitting a dynamically-evaluating patch later.

Dan
Dan Magenheimer
2008-Jul-19 17:51 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
> >
> > Looks to me like there's still something algorithmically wrong,
> > and it's not just natural skew and jitter. [...]
>
> I didn't measure skew across CPUs. I measured jitter between one
> local TSC and the chosen platform timer for calibration (in my case
> I think this was the HPET). [...]
>
> I implemented a user-space program to collect the required stats.
> It used CLI/STI to prevent getting interrupted when reading the
> timer pair.

Hi Keir --

I'm still looking at whether all of the inter-processor stime skew I'm
seeing is due to jitter or is algorithmic.

Would you expect system load to impact stime skew between processors
(using hpet as the system timer)? I can repeatably watch skew get
worse when I am launching an hvm domain. It is MUCH worse when the new
domain is in its early stages of booting. CPU load on domain0 has
little or no impact, but I/O load on dom0 seems to make skew get
worse.

Thanks,
Dan
Keir Fraser
2008-Jul-21 08:32 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 19/7/08 18:51, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Would you expect system load to impact stime skew between
> processors (using hpet as the system timer)? I can repeatably watch
> skew get worse when I am launching an hvm domain. It is MUCH worse
> when the new domain is in its early stages of booting. CPU load on
> domain0 has little or no impact, but I/O load on dom0 seems to make
> skew get worse.

Perhaps it makes a difference if it takes each CPU a bit longer to
execute the calibration function in softirq context? That could be
delayed by long hypercalls, for example (although long hypercalls
should mostly be preemptible).

 -- Keir
Dan Magenheimer
2008-Jul-22 22:27 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > Would you expect system load to impact stime skew between
> > processors (using hpet as the system timer)? I can repeatably
> > watch skew get worse when I am launching an hvm domain. [...]
>
> Perhaps it makes a difference if it takes each CPU a bit longer to
> execute the calibration function in softirq context? That could be
> delayed by long hypercalls, for example (although long hypercalls
> should mostly be preemptible).

I'm not positive yet, but I think I have an explanation for this. The
issue is not HOW LONG it takes to execute the calibration function,
but WHEN it executes relative to the other processors. If jitter on
the platform timer occurs, and the (e.g., two) calibration functions
are triggered "temporally maximally distant" (e.g., cpu0 at 1.0, 2.0,
3.0 and cpu1 at 1.5, 2.5, 3.5), their differing slopes during the
interim partial second could result in greater skew. Since activity on
a processor results in different locks held, interrupts on/off, etc.,
system load differences between processors are more likely to cause
the distance between the scheduled calibration functions on each
processor to vary.

(Worse, could maximal distance maybe result in harmonic resonance? The
fact that I can observe the effect seems to imply that it stays bad
for a while.)

This is all still theoretical... I still have to figure out how to
measure it. But does the theory make sense?

Perhaps some form of the proposed "deferrable timers" could be used to
ensure per-cpu calibration happens on different processors at roughly
the same moment?

Thanks,
Dan
Ian Pratt
2008-Jul-22 23:07 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > I'm not positive yet, but I think I have an explanation for this.
> > The issue is not HOW LONG it takes to execute the calibration
> > function, but WHEN it executes relative to the other processors.
> > [...] Since activity on a processor results in different locks
> > held, interrupts on/off, etc., system load differences between
> > processors are more likely to cause the distance between the
> > scheduled calibration functions on each processor to vary.

If you want to test this theory, you can easily get all the CPUs to
recalibrate at the same instant, though it's a bit expensive:

Get one CPU to issue an smp_call_function on all CPUs (including
itself). The called function should atomic_inc a variable and then
spin reading the count until all CPUs have reached this point. When
this happens, turn interrupts off, atomic_dec the same counter, spin
until it hits zero, then read the TSC, re-enable interrupts, and
finish. The TSC reads should all happen very close to each other. One
of the CPUs could read the platform timer after the TSC to tie
everything together.

The only thing that could mess this up would be NMIs or SMIs. You
could at least detect those by reading the TSC after all CPUs have
incremented the counter, and checking that only a "reasonable" amount
of time has elapsed. If not, set a flag to indicate that a
recalibration is required (you'd need to add another gather loop to
enable all CPUs to vote on whether they're happy).

Ian
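[A minimal sketch of the two-phase rendezvous Ian describes, with
illustrative names -- not a drop-in Xen patch. Every CPU checks in at
a barrier, then all count back down with interrupts off and sample
the TSC together.]

static atomic_t rendezvous_count;
static uint64_t tsc_at_rendezvous[NR_CPUS];

static void calibration_rendezvous(void *unused)
{
    unsigned int total = num_online_cpus();
    unsigned long flags;

    /* Phase 1: everyone checks in. */
    atomic_inc(&rendezvous_count);
    while ( atomic_read(&rendezvous_count) < total )
        cpu_relax();

    /* Phase 2: interrupts off, count back down, sample when zero. */
    local_irq_save(flags);
    atomic_dec(&rendezvous_count);
    while ( atomic_read(&rendezvous_count) > 0 )
        cpu_relax();
    rdtscll(tsc_at_rendezvous[smp_processor_id()]);
    local_irq_restore(flags);
}

/* One CPU kicks this off via smp_call_function() on the others and
 * then calls calibration_rendezvous(NULL) itself. */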
Dan Magenheimer
2008-Jul-23 00:40 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> If you want to test this theory, you can easily get all the CPUs to
> recalibrate at the same instant, though it's a bit expensive:
>
> Get one CPU to issue an smp_call_function on all CPUs (including
> itself). The called function should atomic_inc a variable and then
> spin reading the count until all CPUs have reached this point. When
> this happens, turn interrupts off, atomic_dec the same counter,
> spin until it hits zero, then read the TSC, re-enable interrupts,
> and finish. The TSC reads should all happen very close to each
> other.

The code invoked by "xm debug-key t" does exactly that, and I've been
using it (as one way) to measure skew.

Any idea how expensive it is? Is it too expensive to do once per
second? If it's not more expensive than the (1Hz per processor)
local_time_calibration(), perhaps we should just use it to set the TSC
on all processors once per second and dispense with the existing
(beautiful, but one additional frequency to resonate)
platform-timer-interpolated-by-tsc approach?

On the other hand, I'll bet the bigger the system, the more difficult
it is to rendezvous the CPUs... and the more natural skew there will
be between the sockets.

> The only thing that could mess this up would be NMIs or SMIs. You
> could at least detect those by reading the TSC after all CPUs have
> incremented the counter, and checking that only a "reasonable"
> amount of time has elapsed. If not, set a flag to indicate that a
> recalibration is required (you'd need to add another gather loop to
> enable all CPUs to vote on whether they're happy).

I think I've seen this code in recent Linux.

But assuming we stay with the existing approach, I'm not sure the
processors need to be calibrated at "exactly" the same time, just
"close". Something similar to "round jiffies" (see
http://lkml.org/lkml/2006/10/10/189) may be enough... though I guess
that depends on the character of the timesource jitter.

Thanks,
Dan
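[An illustrative sketch of the "round jiffies"-style alignment Dan
suggests: arm each CPU's 1Hz calibration timer at the next whole
second of system time, so the per-cpu timers fire at roughly the same
moment instead of at arbitrary relative phases. The helper is
hypothetical; Xen's NOW()/SECONDS()/set_timer() primitives are
assumed.]

static void set_calibration_timer_aligned(struct timer *t)
{
    s_time_t now  = NOW();                       /* nanoseconds */
    s_time_t next = (now / SECONDS(1) + 1) * SECONDS(1);

    /* Instead of set_timer(t, now + SECONDS(1)), which preserves
     * whatever phase each CPU happened to start with. */
    set_timer(t, next);
}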
Ian Pratt
2008-Jul-23 01:16 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Is it too expensive to do once per second? If it's not more
> expensive than the (1Hz per processor) local_time_calibration(),
> perhaps we should just use it to set the TSC on all processors once
> per second and dispense with the existing (beautiful, but one
> additional frequency to resonate)
> platform-timer-interpolated-by-tsc approach?

It doesn't need to be done very frequently, e.g. every 10-30s --
anytime before the TSC wraps should work.

> On the other hand, I'll bet the bigger the system, the more
> difficult it is to rendezvous the CPUs...

Yes, but it shouldn't be too horrendous -- we have to do stuff like
this for some (rare) synchronous TLB flushes anyhow.

> ... and the more natural skew there will be between the sockets.

This skew will still be tiny, sub-microsecond.

> > The only thing that could mess this up would be NMIs or SMIs. You
> > could at least detect those by reading the TSC after all CPUs
> > have incremented the counter, and checking that only a
> > "reasonable" amount of time has elapsed. If not, set a flag to
> > indicate that a recalibration is required (you'd need to add
> > another gather loop to enable all CPUs to vote on whether they're
> > happy).
>
> I think I've seen this code in recent Linux.

It's worth implementing this just to see how good a job we could do.

Ian
Tian, Kevin
2008-Jul-23 06:11 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> From: Dan Magenheimer
> Sent: 23 July 2008 6:27
>
> Perhaps some form of the proposed "deferrable timers" could be used
> to ensure per-cpu calibration happens on different processors at
> roughly the same moment?

It can't. A deferrable timer is a per-cpu concept, for deciding what
can be deferred on the local cpu. There's nothing to coordinate
cross-cpu activities; for that you instead have to use some form of
IPI and a self-defined sync process, as Ian suggested.

Thanks,
Kevin