Dan Magenheimer
2008-Jul-02 16:03 UTC
[Xen-devel] [PATCH] strictly increasing hvm guest time
This simple one-line patch changes hvm guest time from monotonically
non-decreasing to monotonically strictly increasing. As a result, two
consecutive reads of the (virtual) hpet will never return the same
value, thus avoiding the appearance that time has stopped (which may
occur if there is skew between physical processor TSCs).

The only problem scenario I can see is if:

1) N = number of physical CPUs on the system
2) T = time in nsec of the fastest call P that an hvm guest can make
   that indirectly invokes hvm_get_guest_time()
3) N > T, i.e., the N CPUs together complete more than one call per
   nanosecond (highly unlikely)
4) guests on all N physical CPUs are continuously calling P (also
   highly unlikely)

then guest time could accelerate faster than Xen system time.

Dan

==================================
Thanks... for the memory
I really could use more / My throughput's on the floor
The balloon is flat / My swap disk's fat / I've OOMs in store
Overcommitted so much
(with apologies to the late great Bob Hope)
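[A minimal sketch of the kind of strictly-increasing clamp described
above -- not the actual one-line patch, which was attached to the
original mail. The lock and field names here are hypothetical.]

/* Return a guest time value strictly greater than any previously
 * returned one, bumping by 1ns whenever a raw read would repeat. */
static uint64_t hvm_strictly_increasing_time(struct domain *d,
                                             uint64_t now)
{
    spin_lock(&d->arch.hvm_time_lock);          /* hypothetical lock */
    if ( now <= d->arch.hvm_last_time )         /* hypothetical field */
        now = d->arch.hvm_last_time + 1;
    d->arch.hvm_last_time = now;
    spin_unlock(&d->arch.hvm_time_lock);
    return now;
}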
Keir Fraser
2008-Jul-02 16:07 UTC
[Xen-devel] Re: [PATCH] strictly increasing hvm guest time
On 2/7/08 17:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> This simple one-line patch changes hvm guest time from monotonically
> non-decreasing to monotonically strictly increasing. As a result,
> two consecutive reads of the (virtual) hpet will never return the
> same value, thus avoiding the appearance that time has stopped
> (which may occur if there is skew between physical processor TSCs).

It does seem a little hack-ish, if we don't know of any issues arising
from the current code, and we expect cross-cpu deltas to be pretty
small. Also, guests will often convert HPET reads to well-known units
(e.g., microseconds, milliseconds) before using them, in which case
even a delta of one may not result in differing converted time values.

 -- Keir
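[An illustration of Keir's unit-conversion point, as a stand-alone
user-space C snippet; 14.31818MHz is the common nominal HPET
frequency, so one tick is roughly 69.8ns.]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t freq_hz = 14318180;          /* nominal HPET frequency */
    uint64_t t1 = 123456789, t2 = t1 + 1; /* two consecutive reads */
    uint64_t us1 = t1 * 1000000 / freq_hz;
    uint64_t us2 = t2 * 1000000 / freq_hz;

    /* For most values of t1, us1 == us2: the one-tick bump the patch
     * guarantees vanishes once the guest truncates to microseconds. */
    printf("us1=%llu us2=%llu\n",
           (unsigned long long)us1, (unsigned long long)us2);
    return 0;
}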
Dan Magenheimer
2008-Jul-02 21:50 UTC
[Xen-devel] RE: [PATCH] strictly increasing hvm guest time
> > This simple one-line patch changes hvm guest time from
> > monotonically non-decreasing to monotonically strictly
> > increasing. [...]
>
> It does seem a little hack-ish, if we don't know of any issues
> arising from the current code, and we expect cross-cpu deltas to be
> pretty small.

Using "xm debug-key t; xm dmesg | tail -1" you can get an idea of the
deltas. Even on my single-socket dual-core recent-vintage Intel box,
I'm frequently seeing diffs of > 300ns. While this is still relatively
small (and part of it may be SMP cache synchronization time), this is
supposed to be a "good TSC" box. I'm spinning a small patch capturing
the maximum so that it can be output via debug-key t as well.

> Also, guests will often convert HPET reads to well-known units
> (e.g., microseconds, milliseconds) before using them, in which case
> even a delta of one may not result in differing converted time
> values.

Yes, but most newer Linux systems have a high-res timer API that
returns nanoseconds, though admittedly it is not widely used yet.

Thanks,
Dan
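[The nanosecond-resolution interface Dan alludes to is presumably the
POSIX clock_gettime() call backed by Linux hrtimers; a minimal
guest-side sketch of the "two back-to-back reads" scenario (on
2008-era glibc, link with -lrt):]

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    clock_gettime(CLOCK_MONOTONIC, &b);
    /* If both reads return the identical nanosecond value, time
     * appears to have stopped from the guest's point of view. */
    printf("a=%ld.%09ld b=%ld.%09ld identical=%d\n",
           (long)a.tv_sec, a.tv_nsec, (long)b.tv_sec, b.tv_nsec,
           a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
    return 0;
}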
Dan Magenheimer
2008-Jul-02 22:41 UTC
[Xen-devel] [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> I'm spinning a small patch capturing the maximum so that it can
> be output via debug-key t as well.

Attached is the patch. Interestingly, on my single-socket two-core
recent-vintage Intel processor, this patch reports a max skew of
>13 usec, much higher than the values I'm seeing from "xm debug-key
t". I wonder if this is due to a mistake in my patch (though I don't
see it), or if the various stime error corrections are not converging
as expected, resulting in a broader stime skew between processors than
expected?

Dan
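[The attachment itself is not preserved in this archive; a rough
sketch of the idea -- per-cpu max-skew bookkeeping hooked into the
once-per-second local_time_calibration() -- with illustrative names:]

static DEFINE_PER_CPU(int64_t, max_stime_skew);

static void record_stime_skew(int64_t local_stime, int64_t master_stime)
{
    int64_t skew = local_stime - master_stime;

    if ( skew < 0 )
        skew = -skew;
    if ( skew > this_cpu(max_stime_skew) )
        this_cpu(max_stime_skew) = skew;  /* dumped by debug-key 't' */
}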
Keir Fraser
2008-Jul-03 08:03 UTC
[Xen-devel] Re: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 2/7/08 23:41, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Attached is the patch. Interestingly, on my single-socket two-core
> recent-vintage Intel processor, this patch reports a max skew of
> >13 usec, much higher than the values I'm seeing from "xm debug-key
> t". [...]

Perhaps this relatively large skew happens at start of day, before the
periodic calibration has 'locked on'?

 -- Keir
Dan Magenheimer
2008-Jul-03 16:24 UTC
[Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > Attached is the patch. Interestingly, on my single-socket
> > two-core recent-vintage Intel processor, this patch reports a max
> > skew of >13 usec, much higher than the values I'm seeing from
> > "xm debug-key t". [...]
>
> Perhaps this relatively large skew happens at start of day, before
> the periodic calibration has 'locked on'?

Indeed you are correct. This updated patch now reports zero skew as
expected.

IMHO, it would be nice to put this patch into the tree, as it will be
good for helping to diagnose time skew problems such as the one just
reported on the list.

Thanks,
Dan
Dan Magenheimer
2008-Jul-03 16:35 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > Perhaps this relatively large skew happens at start of day,
> > before the periodic calibration has 'locked on'?
>
> Indeed you are correct. This updated patch now reports zero skew
> as expected.
>
> IMHO, it would be nice to put this patch into the tree, as it will
> be good for helping to diagnose time skew problems such as the one
> just reported on the list.

Oops! Just after I sent the above email, I checked again, and the same
machine (no reboots, no guests ever launched) now reports a max stime
skew of 4333ns!! Methinks there might be some periodic glitch in the
calibration code?

Dan
Dan Magenheimer
2008-Jul-03 20:03 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> > IMHO, it would be nice to put this patch into the tree, as it
> > will be good for helping to diagnose time skew problems such as
> > the one just reported on the list.
>
> Oops! Just after I sent the above email, I checked again, and the
> same machine (no reboots, no guests ever launched) now reports a
> max stime skew of 4333ns!! Methinks there might be some periodic
> glitch in the calibration code?

OK, this version records not only the max but also a distribution of
skew. (The code is a bit ugly... I thought about doing something fancy
with log-binary but decided a few base-10 ranges were clearer for a
human to read.)

With this, I use "watch -d 'xm debug-key t; xm dmesg | tail -3'" and
can observe that (on my single-socket two-core recent-vintage Intel
box) roughly three-quarters of the skew measurements are between
10-100ns, roughly one-quarter are between 100ns-1us, a couple percent
are between 1us-10us, and a few are >10us.

This represents an approximate distribution of how long an hvm guest
might observe time to be stopped (if it is able to repeatedly read
time values quickly enough).

So on some machines, this might be substantially worse than the old
hvm-platform-timer-built-on-tsc mechanism (though we had no
monotonicity constraint built into that).

I wonder if the >1us outliers occur only when the processor has been
idle for a while, or whether they are entirely random.

Dan
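[A sketch of the base-10 bucketing Dan describes -- illustrative
only, not the actual attached patch. Buckets: <10ns, 10-100ns,
100ns-1us, 1us-10us, >=10us; counts dumped via debug-key 't'.]

#define SKEW_BUCKETS 5
static uint64_t skew_hist[NR_CPUS][SKEW_BUCKETS];

static void record_skew_sample(unsigned int cpu, uint64_t skew_ns)
{
    unsigned int bucket = 0;
    uint64_t bound = 10;   /* decade boundaries: 10, 100, 1000, ... */

    while ( bucket < SKEW_BUCKETS - 1 && skew_ns >= bound )
    {
        bucket++;
        bound *= 10;
    }
    skew_hist[cpu][bucket]++;
}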
Keir Fraser
2008-Jul-03 23:00 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
Skipping cpu0 makes no sense. It's not the 'master'. master_stime is
time calculated from the platform timer (hpet, pit, or whatever). All
cpus are equal peers. Apart from that, it looks plausible to me.

 -- Keir

On 3/7/08 21:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> OK, this version records not only the max but also a distribution
> of skew. (The code is a bit ugly... I thought about doing something
> fancy with log-binary but decided a few base-10 ranges were clearer
> for a human to read.) [...]
Dan Magenheimer
2008-Jul-04 15:11 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> Skipping cpu0 makes no sense.

Oops, I misunderstood that for some reason. Here's a fixed version. I
also now preserve the "Platform timer is" line, since that can get
flushed out of the dmesg buffer.

Any idea why the skew can get so bad?

Dan

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Thursday, July 03, 2008 5:00 PM
>
> Skipping cpu0 makes no sense. It's not the 'master'. master_stime
> is time calculated from the platform timer (hpet, pit, or whatever).
> All cpus are equal peers. Apart from that, it looks plausible to me.
> [...]
Keir Fraser
2008-Jul-04 15:22 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Oops, I misunderstood that for some reason.
>
> Here's a fixed version. I also now preserve the "Platform timer is"
> line, since that can get flushed out of the dmesg buffer.
>
> Any idea why the skew can get so bad?

Not really. We could check in this patch or similar, and perhaps
collect more information.

 -- Keir
Dan Magenheimer
2008-Jul-04 19:32 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> On 4/7/08 16:11, "Dan Magenheimer" <dan.magenheimer@oracle.com>
> wrote:
>
> > Oops, I misunderstood that for some reason.
> >
> > Here's a fixed version. I also now preserve the "Platform timer
> > is" line, since that can get flushed out of the dmesg buffer.

OOPS, forgot the patch! Attached this time.

> > Any idea why the skew can get so bad?
>
> Not really. We could check in this patch or similar, and perhaps
> collect more information.
>
> -- Keir

Well, one suspicion I had was that very long hpet reads were getting
serialized, but I tried clocksource=acpi and clocksource=pit and got
similar skew-range results. In fact, pit shows a max of >17000ns vs
hpet and acpi closer to 11000ns. (OTOH, I suppose it IS possible that
this is roughly how long it takes to read each of these platform
timers.)

Dan
Keir Fraser
2008-Jul-04 19:56 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 4/7/08 20:32, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Well, one suspicion I had was that very long hpet reads were getting
> serialized, but I tried clocksource=acpi and clocksource=pit and got
> similar skew-range results. In fact, pit shows a max of >17000ns vs
> hpet and acpi closer to 11000ns. (OTOH, I suppose it IS possible
> that this is roughly how long it takes to read each of these
> platform timers.)

That ought to be easy to check. I would expect that the PIT, for
example, could take a couple of microseconds to access.

 -- Keir
Dan Magenheimer
2008-Jul-10 00:24 UTC
RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
> That ought to be easy to check. I would expect that the PIT, for
> example, could take a couple of microseconds to access.
>
> -- Keir

(I haven't seen the patch applied... since it just collects data, it
would be nice if it were applied so others could try it.)

To follow up on this, I tried a number of tests but wasn't able to
identify the problem, and have given up (for now). In case someone
else starts looking at this (or in case any of my tests suggest a
solution to someone), I thought I'd document what I tried.

PROBLEM: Xen system time skew between a processor's local time and
platform time is generally "small" but "sometimes" gets quite "large".
This is important because the larger the skew, the more likely an hvm
guest is to experience time stopping or (in some cases) time going
backwards. On my box, "small" is under 1 usec, "large" is 9-18 usec,
and "sometimes" is about one out of 500 measurements. Note that my box
is a recent-vintage Intel single-socket dual-core ("Conroe"). I
suspect that periodically some lock is being waited on for a long
time, or maybe an unexpected interrupt is occurring, but I didn't find
anything through code reading or experiments.

TEST METHOD: The patch I sent on this thread collects data whenever
local_time_calibration() is run (which is 1Hz on each processor), and
"xm debug-key t" prints this data so it can be seen with "xm dmesg".
To see the problem, one need only boot dom0 and run xm debug-key and
xm dmesg.

1) CONJECTURE: Related to how long it takes to read the platform
timer.

The max skew (and distribution) are definitely different depending on
whether clocksource=hpet or clocksource=pit. For hpet, I am almost
always seeing a max skew of 11000ns+, and with pit 17000ns+. ONCE
(over many hours of runs) I saw a skew with hpet of 15000ns. However,
I added code in the platform timer read routine (inside all locks but
NOT with interrupts off) to artificially lengthen a platform timer
read, and it made no difference in the measurements.

2) CONJECTURE: Max skew occurs only on some processors (e.g., not on
the one that does the platform calibration).

Nope: if you wait long enough, max skew is fairly close on all
processors (though in some cases it seems to take a long time...
perhaps because of unbalanced load?).

3) CONJECTURE: Max skew occurs on platform timer overflow.

Possibly, but there is certainly not a 1-1 correspondence. Sometimes
there are more large skews than overflows, and sometimes fewer.

4) CONJECTURE: Artifact of ntpd running.

Nope: same skews whether or not ntpd is running on dom0.

5) CONJECTURE: Related to frequency changes or suspends.

Nope: neither of these is happening on my box.

6) CONJECTURE: The "weirdness can happen" comment in time.c.

Nope: this path isn't getting executed.

7) CONJECTURE: Result of natural skew between the platform timer and
tsc, plus jitter. Unfixable.

Possible; untested; not sure how.
Keir Fraser
2008-Jul-10 07:40 UTC
Re: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time)
On 10/7/08 01:24, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 7) CONJECTURE: Result of natural skew between the platform timer
> and tsc, plus jitter. Unfixable.
>
> Possible; untested; not sure how.

I ended up suspecting this on one of the test platforms I originally
did the Xen-system-time implementation on. It was an old AMD white
box, iirc. On that system, TSC and platform time seemed to have
significant and inexplicable jitter at around 1Hz. The jitter was
hundreds of ppm, which was totally unexpected for what should be
crystal-based oscillators. And the test code was simple enough that it
was hard to suspect that either (I think I was just dumping the
counters every second or two, after reading them as close together as
I could).

 -- Keir
Dan Magenheimer
2008-Jul-10 22:42 UTC
Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > 7) CONJECTURE: Result of natural skew between the platform timer
> > and tsc, plus jitter. Unfixable.
> >
> > Possible; untested; not sure how.
>
> I ended up suspecting this on one of the test platforms I
> originally did the Xen-system-time implementation on. It was an old
> AMD white box, iirc. On that system, TSC and platform time seemed
> to have significant and inexplicable jitter at around 1Hz. [...]

Is this the code in read_clocks() in keyhandler.c? If so, I just did
an experiment there with some interesting results:

I modified that code to record the "maxdif" and then executed it
>10000 times. The result shows a maxdif of ~11usec, which corresponds
with my earlier measurements.

Next, I replaced the calls to NOW() in read_clocks() and
read_clocks_slave() with rdtscll(). Guess what? The result is a maxdif
of 11000 "ticks", but now on a 3GHz clock, which is about 3.3usec.

Next, I disabled interrupts in read_clocks_slave() around the while
loop plus the rdtscll(), to ensure I'm not accidentally counting any
interrupts. Now I'm seeing maxdif < 330ns (>6000 measurements).

Next, I went back to NOW(), but with interrupts disabled as above. So
far maxdif is about 10.7usec (>6000 measurements).

SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!

Looks to me like there's still something algorithmically wrong, and
it's not just natural skew and jitter. Maybe some corner case in the
scale-delta code? Also, should interrupts be turned off during the
calibration part of init_pit_and_calibrate_tsc() (which might cause
different scaling factors for each CPU)?
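[A hedged sketch of the kind of measurement described above. Xen's
real code is read_clocks()/read_clocks_slave() in keyhandler.c; the
names and bookkeeping below are illustrative only.]

static unsigned long go_mask;      /* bit set => that CPU may sample */
static uint64_t sample[NR_CPUS];

static void sample_clock_slave(void *unused)
{
    unsigned int cpu = smp_processor_id();
    unsigned long flags;

    local_irq_save(flags);
    while ( !test_bit(cpu, &go_mask) )
        cpu_relax();               /* wait for the master's signal */
    rdtscll(sample[cpu]);          /* or: sample[cpu] = NOW(); */
    local_irq_restore(flags);
}

/* Master side, after all CPUs have sampled: worst cross-CPU spread. */
static uint64_t max_diff(void)
{
    uint64_t lo = ~0ULL, hi = 0;
    unsigned int cpu;

    for_each_online_cpu ( cpu )
    {
        if ( sample[cpu] < lo ) lo = sample[cpu];
        if ( sample[cpu] > hi ) hi = sample[cpu];
    }
    return hi - lo;
}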
Keir Fraser
2008-Jul-11 08:27 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 10/7/08 23:42, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
>
> Looks to me like there's still something algorithmically wrong, and
> it's not just natural skew and jitter. Maybe some corner case in
> the scale-delta code? Also, should interrupts be turned off during
> the calibration part of init_pit_and_calibrate_tsc() (which might
> cause different scaling factors for each CPU)?

I didn't measure skew across CPUs. I measured jitter between one local
TSC and the chosen platform timer for calibration (in my case I think
this was the HPET). I did this because getting a consistent tick rate
from the platform timer, and from each local TSC, is the basis of the
calibration algorithm. The more jitter there is between them, the less
well it will work.

I implemented a user-space program to collect the required stats. It
used CLI/STI to prevent getting interrupted when reading the timer
pair.

 -- Keir
Dan Magenheimer
2008-Jul-11 20:53 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> I didn't measure skew across CPUs. I measured jitter between one
> local TSC and the chosen platform timer for calibration (in my case
> I think this was the HPET). I did this because getting a consistent
> tick rate from the platform timer, and from each local TSC, is the
> basis of the calibration algorithm. The more jitter there is
> between them, the less well it will work.
>
> I implemented a user-space program to collect the required stats.
> It used CLI/STI to prevent getting interrupted when reading the
> timer pair.

Hmmm... if the TSC is known to be stable*, is there any reason to do
the calibration against the platform timer at all? If TSC is stable,
could we instead just do, essentially, a divide by cpu_ghz in
get_s_time() and be done, with no periodic local_time_calibration()
necessary? Since TSC is stable on many newer platforms, it would be
nice to use this feature to decrease skew for guests (both PV and
HVM).

* "stable" is the term used by Linux to mean that there's no skew
between the different TSCs in an SMP system

I gave this a try and it seems to work so far. (Fortunately, my CPU is
3GHz, so I just had to divide by 3... I'm not sure how to divide by a
non-integer.) Max skew for stime is holding steady at 270ns, >40x
better than periodic calibration with hpet.

If this sounds good, a design question: should this be controlled:

1) by a boot option, or
2) by the TSC_CONSTANT cpu flag, or
3) when determined dynamically to be safe, using code similar to
   arch/x86/tsc_sync.c in recent Linux kernels?

(1) is by far the easiest (perhaps not too late for 3.3?), while (3)
is clearly the best for users but adds lots of code (bloat/untested).

Thanks,
Dan
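[Dan's "divide by a non-integer" puzzle is conventionally solved with
fixed-point arithmetic; the scale-delta code mentioned earlier in the
thread works along these lines. A stand-alone illustrative sketch,
not the actual Xen routine:]

#include <stdint.h>

/* Precompute ns-per-tick as a 32.32 fixed-point multiplier.
 * E.g. for a 2.4GHz TSC: (10^9 << 32) / 2400000000 ~= 0.4166 * 2^32. */
static uint64_t make_nsec_multiplier(uint64_t tsc_hz)
{
    return (1000000000ULL << 32) / tsc_hz;
}

/* ns = (tsc * mul) >> 32, with a 128-bit intermediate so the multiply
 * cannot overflow (gcc/clang __uint128_t; Xen uses inline asm). */
static uint64_t tsc_to_nsec(uint64_t tsc, uint64_t mul)
{
    return (uint64_t)(((__uint128_t)tsc * mul) >> 32);
}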
Ian Pratt
2008-Jul-11 21:27 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Hmmm... if the TSC is known to be stable*, is there any reason to
> do the calibration against the platform timer at all? If TSC is
> stable, could we instead just do, essentially, a divide by cpu_ghz
> in get_s_time() and be done, with no periodic
> local_time_calibration() necessary? Since TSC is stable on many
> newer platforms, it would be nice to use this feature to decrease
> skew for guests (both PV and HVM).
>
> * "stable" is the term used by Linux to mean that there's no skew
> between the different TSCs in an SMP system

Some NUMA systems have different oscillators on each node, so you
can't rely on the frequency being identical. Such systems are fairly
rare (though their common use case is server virtualization). I guess
a command line option to enable independent calibration for these
systems would be OK, though it would obviously be better to start off
assuming the frequencies are identical, and then detect rate
differences.

Ian
Keir Fraser
2008-Jul-11 21:27 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 11/7/08 21:53, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 1) by a boot option, or
> 2) by the TSC_CONSTANT cpu flag, or
> 3) when determined dynamically to be safe, using code similar to
>    arch/x86/tsc_sync.c in recent Linux kernels?
>
> (1) is by far the easiest (perhaps not too late for 3.3?), while
> (3) is clearly the best for users but adds lots of code
> (bloat/untested).

(1) is perhaps fine.

How does (2) work? The individual CPUs do not know whether they are
synchronised across the mainboard. I think constant-tsc is necessary
(individual CPUs must not vary their multiplier of the input clock
rate) but may not be sufficient.

I don't know how much code is involved in (3).

 -- Keir
Dan Magenheimer
2008-Jul-12 21:05 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Some NUMA systems have different oscillators on each node, so you
> can't rely on the frequency being identical. Such systems are
> fairly rare (though their common use case is server
> virtualization). I guess a command line option to enable
> independent calibration for these systems would be OK, though it
> would obviously be better to start off assuming the frequencies are
> identical, and then detect rate differences.

Good point. This is the way that Linux does it too, I think.

Dan
Dan Magenheimer
2008-Jul-12 21:07 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > 1) by a boot option, or
> > 2) by the TSC_CONSTANT cpu flag, or
> > 3) when determined dynamically to be safe, using code similar to
> >    arch/x86/tsc_sync.c in recent Linux kernels?
>
> (1) is perhaps fine.

OK, patch to follow. I've used "clocksource=tsc".

> How does (2) work? The individual CPUs do not know whether they are
> synchronised across the mainboard. I think constant-tsc is
> necessary (individual CPUs must not vary their multiplier of the
> input clock rate) but may not be sufficient.

Good point.

> I don't know how much code is involved in (3).

It's enough that I will take the "easy way" for now (boot option) and
look at submitting a dynamically-evaluating patch later.

Dan
Dan Magenheimer
2008-Jul-19 17:51 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > SO XEN SYSTEM TIME MAX SKEW IS >30X WORSE THAN TSC MAX SKEW!
> >
> > Looks to me like there's still something algorithmically wrong,
> > and it's not just natural skew and jitter. [...]
>
> I didn't measure skew across CPUs. I measured jitter between one
> local TSC and the chosen platform timer for calibration (in my case
> I think this was the HPET). [...]
>
> I implemented a user-space program to collect the required stats.
> It used CLI/STI to prevent getting interrupted when reading the
> timer pair.

Hi Keir --

I'm still looking at whether all of the inter-processor stime skew I'm
seeing is due to jitter or is algorithmic.

Would you expect system load to impact stime skew between processors
(using hpet as the system timer)? I can repeatably watch skew get
worse when I am launching an hvm domain. It is MUCH worse when the new
domain is in its early stages of booting. CPU load on domain0 has
little or no impact, but I/O load on dom0 seems to make skew get
worse.

Thanks,
Dan
Keir Fraser
2008-Jul-21 08:32 UTC
Re: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
On 19/7/08 18:51, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Would you expect system load to impact stime skew between
> processors (using hpet as the system timer)? I can repeatably watch
> skew get worse when I am launching an hvm domain. It is MUCH worse
> when the new domain is in its early stages of booting. CPU load on
> domain0 has little or no impact, but I/O load on dom0 seems to make
> skew get worse.

Perhaps it makes a difference if it takes each CPU a bit longer to
execute the calibration function in softirq context? That could be
delayed by long hypercalls, for example (although long hypercalls
should mostly be preemptible).

 -- Keir
Dan Magenheimer
2008-Jul-22 22:27 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > Would you expect system load to impact stime skew between
> > processors (using hpet as the system timer)? I can repeatably
> > watch skew get worse when I am launching an hvm domain. [...]
>
> Perhaps it makes a difference if it takes each CPU a bit longer to
> execute the calibration function in softirq context? That could be
> delayed by long hypercalls, for example (although long hypercalls
> should mostly be preemptible).

I'm not positive yet, but I think I have an explanation for this. The
issue is not HOW LONG it takes to execute the calibration function,
but WHEN it executes relative to the other processors. If jitter on
the platform timer occurs, and the (e.g., two) calibration functions
are triggered "temporally maximally distant" (e.g., cpu0 at 1.0, 2.0,
3.0 and cpu1 at 1.5, 2.5, 3.5), their differing slopes during the
interim partial second could result in greater skew. Since activity on
a processor results in different locks held, interrupts on/off, etc.,
system load differences between processors are more likely to cause
the distance between the scheduled calibration functions on each
processor to vary.

(Worse, could maximal distance maybe result in harmonic resonance? The
fact that I can observe the effect seems to imply that it stays bad
for a while.)

This is all still theoretical... I still have to figure out how to
measure it. But does the theory make sense?

Perhaps some form of the proposed "deferrable timers" could be used to
ensure per-cpu calibration happens on different processors at roughly
the same moment?

Thanks,
Dan
Ian Pratt
2008-Jul-22 23:07 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> > I'm not positive yet, but I think I have an explanation for this.
> > The issue is not HOW LONG it takes to execute the calibration
> > function, but WHEN it executes relative to the other processors.
> > [...] Since activity on a processor results in different locks
> > held, interrupts on/off, etc., system load differences between
> > processors are more likely to cause the distance between the
> > scheduled calibration functions on each processor to vary.

If you want to test this theory, you can easily get all the CPUs to
recalibrate at the same instant, though it's a bit expensive:

Get one CPU to issue an smp_call_function on all CPUs (including
itself). The called function should atomic_inc a variable and then
spin reading the count until all CPUs have reached this point. When
this happens, turn interrupts off, atomic_dec the same counter, spin
until it hits zero, then read the TSC, re-enable interrupts, and
finish. The TSC reads should all happen very close to each other. One
of the CPUs could read the platform timer after the TSC to tie
everything together.

The only thing that could mess this up would be NMIs or SMIs. You
could at least detect those by reading the TSC after all CPUs have
incremented the counter, and checking that only a "reasonable" amount
of time has elapsed. If not, set a flag to indicate that a
recalibration is required (you'd need to add another gather loop to
enable all CPUs to vote on whether they're happy).

Ian
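[A minimal sketch of the two-phase rendezvous Ian describes, with
illustrative names -- not a drop-in Xen patch. Every CPU checks in at
a barrier, then all count back down with interrupts off and sample
the TSC together.]

static atomic_t rendezvous_count;
static uint64_t tsc_at_rendezvous[NR_CPUS];

static void calibration_rendezvous(void *unused)
{
    unsigned int total = num_online_cpus();
    unsigned long flags;

    /* Phase 1: everyone checks in. */
    atomic_inc(&rendezvous_count);
    while ( atomic_read(&rendezvous_count) < total )
        cpu_relax();

    /* Phase 2: interrupts off, count back down, sample when zero. */
    local_irq_save(flags);
    atomic_dec(&rendezvous_count);
    while ( atomic_read(&rendezvous_count) > 0 )
        cpu_relax();
    rdtscll(tsc_at_rendezvous[smp_processor_id()]);
    local_irq_restore(flags);
}

/* One CPU kicks this off via smp_call_function() on the others and
 * then calls calibration_rendezvous(NULL) itself. */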
Dan Magenheimer
2008-Jul-23 00:40 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> If you want to test this theory, you can easily get all the CPUs to
> recalibrate at the same instant, though it's a bit expensive:
>
> Get one CPU to issue an smp_call_function on all CPUs (including
> itself). The called function should atomic_inc a variable and then
> spin reading the count until all CPUs have reached this point. When
> this happens, turn interrupts off, atomic_dec the same counter,
> spin until it hits zero, then read the TSC, re-enable interrupts,
> and finish. The TSC reads should all happen very close to each
> other.

The code invoked by "xm debug-key t" does exactly that, and I've been
using it (as one way) to measure skew.

Any idea how expensive it is? Is it too expensive to do once per
second? If it's not more expensive than the (1Hz per processor)
local_time_calibration(), perhaps we should just use it to set the TSC
on all processors once per second and dispense with the existing
(beautiful, but one additional frequency to resonate)
platform-timer-interpolated-by-tsc approach?

On the other hand, I'll bet the bigger the system, the more difficult
it is to rendezvous the CPUs... and the more natural skew there will
be between the sockets.

> The only thing that could mess this up would be NMIs or SMIs. You
> could at least detect those by reading the TSC after all CPUs have
> incremented the counter, and checking that only a "reasonable"
> amount of time has elapsed. If not, set a flag to indicate that a
> recalibration is required (you'd need to add another gather loop to
> enable all CPUs to vote on whether they're happy).

I think I've seen this code in recent Linux.

But assuming we stay with the existing approach, I'm not sure the
processors need to be calibrated at "exactly" the same time, just
"close". Something similar to "round jiffies" (see
http://lkml.org/lkml/2006/10/10/189) may be enough... though I guess
that depends on the character of the timesource jitter.

Thanks,
Dan
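[An illustrative sketch of the "round jiffies"-style alignment Dan
suggests: arm each CPU's 1Hz calibration timer at the next whole
second of system time, so the per-cpu timers fire at roughly the same
moment instead of at arbitrary relative phases. The helper is
hypothetical; Xen's NOW()/SECONDS()/set_timer() primitives are
assumed.]

static void set_calibration_timer_aligned(struct timer *t)
{
    s_time_t now  = NOW();                       /* nanoseconds */
    s_time_t next = (now / SECONDS(1) + 1) * SECONDS(1);

    /* Instead of set_timer(t, now + SECONDS(1)), which preserves
     * whatever phase each CPU happened to start with. */
    set_timer(t, next);
}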
Ian Pratt
2008-Jul-23 01:16 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> Is it too expensive to do once per second? If it's not more
> expensive than the (1Hz per processor) local_time_calibration(),
> perhaps we should just use it to set the TSC on all processors once
> per second and dispense with the existing (beautiful, but one
> additional frequency to resonate)
> platform-timer-interpolated-by-tsc approach?

It doesn't need to be done very frequently, e.g. every 10-30s --
anytime before the TSC wraps should work.

> On the other hand, I'll bet the bigger the system, the more
> difficult it is to rendezvous the CPUs...

Yes, but it shouldn't be too horrendous -- we have to do stuff like
this for some (rare) synchronous TLB flushes anyhow.

> ... and the more natural skew there will be between the sockets.

This skew will still be tiny, sub-microsecond.

> > The only thing that could mess this up would be NMIs or SMIs. You
> > could at least detect those by reading the TSC after all CPUs
> > have incremented the counter, and checking that only a
> > "reasonable" amount of time has elapsed. If not, set a flag to
> > indicate that a recalibration is required (you'd need to add
> > another gather loop to enable all CPUs to vote on whether they're
> > happy).
>
> I think I've seen this code in recent Linux.

It's worth implementing this just to see how good a job we could do.

Ian
Tian, Kevin
2008-Jul-23 06:11 UTC
RE: Xen system skew MUCH worse than tsc skew (was RE: [Xen-devel] RE: [PATCH] record max stime skew (was RE: [PATCH] strictly increasing hvm guest time))
> From: Dan Magenheimer
> Sent: 23 July 2008 6:27
>
> Perhaps some form of the proposed "deferrable timers" could be used
> to ensure per-cpu calibration happens on different processors at
> roughly the same moment?

It can't. A deferrable timer is a per-cpu concept, for deciding what
can be deferred on the local cpu. There's nothing to coordinate
cross-cpu activities; for that you instead have to use some form of
IPI and a self-defined sync process, as Ian suggested.

Thanks,
Kevin