Wei, Gang
2008-Dec-05 06:22 UTC
[Xen-devel] [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
Originally, the sequence on each cpu was [tsc-save, enter deep C, break event, exit deep C, tsc-restore], so systematic error accumulated easily. Once the workloads across cpus are unbalanced, the tsc skew between cpus eventually grows bigger and bigger - more than 10 seconds of skew can be observed.

Now we simply keep an initial stamp, taken via cstate_init_stamp during boot/S3 resume and based on the platform stime. After exiting a deep C state, each cpu only needs to restore its tsc relative to that initial stamp. The base is fixed and identical for all cpus, so accumulated tsc skew is avoided.

Signed-off-by: Wei Gang <gang.wei@intel.com>
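To make the idea concrete, here is a minimal standalone sketch of the fixed-base restore described above. It is not the patch itself; the variable names and the kHz-based conversion are illustrative only (the real code converts through Xen's time_scale machinery):

    #include <stdint.h>

    /* Captured once, at boot or S3 resume, and never moved afterwards. */
    static uint64_t initial_tsc_stamp;    /* TSC value at the initial stamp      */
    static uint64_t initial_stime_stamp;  /* platform system time (ns) back then */
    static uint64_t tsc_khz = 2533423;    /* assumed-constant TSC frequency, kHz */

    /* Value to write back into the TSC on exit from a deep C state. */
    static uint64_t tsc_after_deep_c(uint64_t platform_stime_now)
    {
        uint64_t ns = platform_stime_now - initial_stime_stamp;

        /* 128-bit intermediate avoids overflowing ns * kHz on long uptimes. */
        return initial_tsc_stamp +
               (uint64_t)(((__uint128_t)ns * tsc_khz) / 1000000u);
    }

Because the base and the scale are the same on every cpu, each restore recomputes the same timeline instead of stacking a new rounding error on top of the previous one.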
Keir Fraser
2008-Dec-05 08:53 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 06:22, "Wei, Gang" <gang.wei@intel.com> wrote:

> Originally, the sequence on each cpu was [tsc-save, enter deep C, break
> event, exit deep C, tsc-restore], so systematic error accumulated easily.
> Once the workloads across cpus are unbalanced, the tsc skew between cpus
> eventually grows bigger and bigger - more than 10 seconds of skew can be
> observed.
>
> Now we simply keep an initial stamp, taken via cstate_init_stamp during
> boot/S3 resume and based on the platform stime. After exiting a deep C
> state, each cpu only needs to restore its tsc relative to that initial
> stamp. The base is fixed and identical for all cpus, so accumulated tsc
> skew is avoided.
>
> Signed-off-by: Wei Gang <gang.wei@intel.com>

Synchronising to a start-of-day timestamp and extrapolating with cpu_khz is not a good idea because of the inherent accumulating inaccuracy in cpu_khz (it only has a fixed number of bits of precision). So, for example, if certain CPUs have not deep-C-slept for a long while, they will wander from your 'true' TSC values and then you will see TSC mismatches when the CPU does eventually C-sleep, or compared with other CPUs when they do so.

More significantly, cpu_khz is no good as a fixed static estimate of CPU clock speed when we start playing with P states.

I think your new code structure is correct. That is, work out the wakeup TSC value from read_platform_stime() rather than some saved TSC value. However, you should extrapolate from that CPU's own t->stime_local_stamp, t->tsc_scale, and t->local_tsc_stamp. It's probably a pretty simple change to your new cstate_restore_tsc().

You see what I mean?

 -- Keir
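In concrete terms the suggestion amounts to roughly the following (pseudo-C only; ns_to_tsc() and reciprocal() are hypothetical helpers standing in for whatever conversion ends up being used):

    wakeup_tsc = t->local_tsc_stamp
               + ns_to_tsc(read_platform_stime() - t->stime_local_stamp,
                           reciprocal(t->tsc_scale));

i.e. each CPU extrapolates forward from its own most recent calibration point using its own measured scale, rather than from a boot-time stamp scaled by the coarse cpu_khz.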
Wei, Gang
2008-Dec-05 09:21 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Friday, December 05, 2008 4:54 PM, Keir Fraser wrote:

> On 05/12/2008 06:22, "Wei, Gang" <gang.wei@intel.com> wrote:
>> [...]
>
> Synchronising to a start-of-day timestamp and extrapolating with cpu_khz
> is not a good idea because of the inherent accumulating inaccuracy in
> cpu_khz (it only has a fixed number of bits of precision). So, for
> example, if certain CPUs have not deep-C-slept for a long while, they will
> wander from your 'true' TSC values and then you will see TSC mismatches
> when the CPU does eventually C-sleep, or compared with other CPUs when
> they do so.
>
> More significantly, cpu_khz is no good as a fixed static estimate of CPU
> clock speed when we start playing with P states.
>
> I think your new code structure is correct. That is, work out the wakeup
> TSC value from read_platform_stime() rather than some saved TSC value.
> However, you should extrapolate from that CPU's own t->stime_local_stamp,
> t->tsc_scale, and t->local_tsc_stamp. It's probably a pretty simple change
> to your new cstate_restore_tsc().
>
> You see what I mean?

I tried extrapolating from t->stime_local_stamp, cpu_khz, and t->local_tsc_stamp before I arrived at the current solution. It still produced accumulating skew, only at a slower rate. I would like to try it again with t->tsc_scale instead of cpu_khz. If it works, it would indeed be simpler. Allow me some time.

Jimmy
Tian, Kevin
2008-Dec-05 09:59 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>From: Keir Fraser
>Sent: Friday, December 05, 2008 4:54 PM
>
>Synchronising to a start-of-day timestamp and extrapolating with cpu_khz
>is not a good idea because of the inherent accumulating inaccuracy in
>cpu_khz (it only has a fixed number of bits of precision). So, for
>example, if certain CPUs have not deep-C-slept for a long while, they will
>wander from your 'true' TSC values and then you will see TSC mismatches
>when the CPU does eventually C-sleep, or compared with other CPUs when
>they do so.
>
>More significantly, cpu_khz is no good as a fixed static estimate of CPU
>clock speed when we start playing with P states.

This is not a big issue, since most processors today have a constant TSC that is immune to P-state changes. And when a frequency change really does affect the TSC count, I don't think time calibration helps much if the frequency scales at small intervals; there is still a high probability that multiple vcpus of one domain observe time weirdness, or that one vcpu migrates at the wrong moment...

>I think your new code structure is correct. That is, work out the wakeup
>TSC value from read_platform_stime() rather than some saved TSC value.
>However, you should extrapolate from that CPU's own t->stime_local_stamp,
>t->tsc_scale, and t->local_tsc_stamp. It's probably a pretty simple change
>to your new cstate_restore_tsc().

Recovering the TSC based on t->local_tsc_stamp can still accumulate errors, though at a far slower pace. The problem is that the software calculation is always inaccurate compared with what the TSC would have counted had it not stopped.

We tried to measure the calculation error of each TSC save/restore. The approach is to use C2 (where the TSC does not stop) and then run the restore logic anyway, comparing the value it would write with the real TSC count read by rdtsc. A drift of 2k-3k cycles per save/restore can be observed when chipset link ASPM is enabled, and hundreds of cycles when ASPM is disabled. Since the idle percentage differs on every processor, the net effect is not only a TSC slower than the real one (had it not stopped), but also an increasing drift among cpus. The first effect normally just makes the time of day run faster or slower, but the latter can generate severe guest problems such as softlockup warnings or DMA timeouts when a vcpu is migrated between cpus with a big TSC/NOW() skew.

Recovering the TSC from the local calibration stamp reduces the error-accumulation pace from per C-state entry/exit to per calibration (ms -> s), but after a long run multiple cpus can still observe big skews. To avoid accumulating errors, the best way is to align to an absolute platform timer without relying on the last stamp. One drawback, as you said, is an intermittent skew when one cpu does not enter idle for a long time - but that is also true for the offset-counting approach above.

Then, if we agree to always align the TSC to an absolute platform timer counter, it makes no difference whether we use cpu_khz or the local tsc_scale, since both are scale factors calculated over a small period to represent the underlying crystal frequency.

Thanks,
Kevin
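A rough, self-contained sketch of the drift measurement described above: in a C state where the TSC keeps counting (e.g. C2), compare the value the restore path would have written against the value the TSC actually reached. The helper is purely illustrative, not the code that was actually used for the measurement:

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /*
     * 'would_write' is whatever value the TSC-restore logic computed.  Since
     * the TSC never stopped in this C state, the difference from the live TSC
     * is exactly the per-exit calculation error being measured.
     */
    static void log_restore_drift(uint64_t would_write)
    {
        int64_t drift = (int64_t)(would_write - rdtsc());
        printf("restore drift: %lld cycles\n", (long long)drift);
    }

Summed over many C-state exits on cpus with different idle ratios, these per-exit errors are what produces the growing cross-cpu skew reported above.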
Tian, Kevin
2008-Dec-05 10:05 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>From: Tian, Kevin
>Sent: Friday, December 05, 2008 6:00 PM
>
>Then, if we agree to always align the TSC to an absolute platform timer
>counter, it makes no difference whether we use cpu_khz or the local
>tsc_scale, since both are scale factors calculated over a small period to
>represent the underlying crystal frequency.

Let me take back the words above. As you said, cpu_khz is less accurate because its lowest bits are truncated.

Thanks,
Kevin
Keir Fraser
2008-Dec-05 10:13 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 10:05, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> From: Tian, Kevin
>> Sent: Friday, December 05, 2008 6:00 PM
>>
>> Then, if we agree to always align the TSC to an absolute platform timer
>> counter, it makes no difference whether we use cpu_khz or the local
>> tsc_scale, since both are scale factors calculated over a small period
>> to represent the underlying crystal frequency.
>
> Let me take back the words above. As you said, cpu_khz is less accurate
> because its lowest bits are truncated.

Yes. Also bear in mind that absolute ongoing synchronisation between TSCs *does not matter*. Xen will happily synchronise system time on top of (slowly enough, constantly enough) diverging TSCs, and of course HVM VCPUs re-set their guest TSC offset when moving between host CPUs.

What *does* matter is the possibility of warping a host TSC value on wake from deep sleep, compared with its value if the sleep had never happened. In this case, system time will be wrong (since we haven't been through a calibration step since waking up) and HVM timers will be wrong. And using a start-of-day timestamp plus cpu_khz makes this more likely. The correct thing to do is obey the most recent set of local calibration values.

 -- Keir
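For illustration, here is a hedged sketch of the offset mechanism Keir alludes to for HVM vcpus (the struct and helper names are stand-ins, not Xen's actual functions): when a vcpu lands on a new physical CPU, the hardware TSC offset is recomputed so the guest-visible TSC continues from where it left off, which is why slowly diverging host TSCs are tolerable on their own.

    #include <stdint.h>

    struct vcpu_sketch {
        uint64_t guest_tsc_at_pause;   /* guest TSC captured when descheduled */
    };

    /* Offset to program so that guest_tsc = host_tsc + offset is continuous. */
    static uint64_t new_tsc_offset(const struct vcpu_sketch *v,
                                   uint64_t host_tsc_now)
    {
        return v->guest_tsc_at_pause - host_tsc_now;
    }

The problem case Keir describes is different: a host TSC that warps on wake-up makes system time itself wrong until the next calibration, and no per-vcpu offset fixes that.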
Wei, Gang
2008-Dec-05 11:30 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
> I tried extrapolating from t->stime_local_stamp, cpu_khz, and
> t->local_tsc_stamp before I arrived at the current solution. It still
> produced accumulating skew, only at a slower rate. I would like to try it
> again with t->tsc_scale instead of cpu_khz. If it works, it would indeed
> be simpler. Allow me some time.

The patch below should be what you expected. It still produces a continuously increasing tsc skew. If I pin all domains' vcpus on one pcpu, the skew increases faster.

diff -r 1b173394f815 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c	Thu Dec 04 16:36:43 2008 +0000
+++ b/xen/arch/x86/acpi/cpu_idle.c	Fri Dec 05 19:06:06 2008 +0800
@@ -317,8 +317,6 @@ static void acpi_processor_idle(void)
      * stopped by H/W. Without carefully handling of TSC/APIC stop issues,
      * deep C state can't work correctly.
      */
-    /* preparing TSC stop */
-    cstate_save_tsc();
     /* preparing APIC stop */
     lapic_timer_off();
 
diff -r 1b173394f815 xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Thu Dec 04 16:36:43 2008 +0000
+++ b/xen/arch/x86/time.c	Fri Dec 05 19:06:06 2008 +0800
@@ -48,11 +48,9 @@ struct time_scale {
 
 struct cpu_time {
     u64 local_tsc_stamp;
-    u64 cstate_tsc_stamp;
     s_time_t stime_local_stamp;
     s_time_t stime_master_stamp;
     struct time_scale tsc_scale;
-    u64 cstate_plt_count_stamp;
 };
 
 struct platform_timesource {
@@ -149,6 +147,32 @@ static inline u64 scale_delta(u64 delta,
 #endif
 
     return product;
+}
+
+/*
+ * Scale a 32-bit time delta to a 64-bit tsc delta by dividing by a 32-bit
+ * fraction and shifting, yielding a 64-bit result.
+ */
+static inline u64 scale_time(u32 time_delta, struct time_scale *scale)
+{
+    u64 td64 = time_delta;
+    u64 quotient, remainder;
+
+    td64 <<= 32;
+
+    quotient = td64 / scale->mul_frac;
+    remainder = td64 % scale->mul_frac;
+
+    if ( scale->shift < 0 )
+    {
+        quotient <<= -scale->shift;
+        remainder <<= -scale->shift;
+        quotient += remainder / scale->mul_frac;
+    }
+    else
+        quotient >>= scale->shift;
+
+    return quotient;
 }
 
 /*
@@ -644,29 +668,19 @@ static void init_platform_timer(void)
            freq_string(pts->frequency), pts->name);
 }
 
-void cstate_save_tsc(void)
+void cstate_restore_tsc(void)
 {
     struct cpu_time *t = &this_cpu(cpu_time);
+    u64 tsc_delta;
+    s_time_t stime_delta;
 
     if ( tsc_invariant )
         return;
 
-    t->cstate_plt_count_stamp = plt_src.read_counter();
-    rdtscll(t->cstate_tsc_stamp);
-}
-
-void cstate_restore_tsc(void)
-{
-    struct cpu_time *t = &this_cpu(cpu_time);
-    u64 plt_count_delta, tsc_delta;
-
-    if ( tsc_invariant )
-        return;
-
-    plt_count_delta = (plt_src.read_counter() -
-                       t->cstate_plt_count_stamp) & plt_mask;
-    tsc_delta = scale_delta(plt_count_delta, &plt_scale) * cpu_khz/1000000UL;
-    wrmsrl(MSR_IA32_TSC, t->cstate_tsc_stamp + tsc_delta);
+    stime_delta = read_platform_stime() - t->stime_local_stamp;
+    ASSERT(stime_delta < 0x100000000UL);
+    tsc_delta = scale_time((u32)stime_delta, &t->tsc_scale);
+    wrmsrl(MSR_IA32_TSC, t->local_tsc_stamp + tsc_delta);
 }
 
 /***************************************************************************
diff -r 1b173394f815 xen/include/xen/time.h
--- a/xen/include/xen/time.h	Thu Dec 04 16:36:43 2008 +0000
+++ b/xen/include/xen/time.h	Fri Dec 05 19:06:06 2008 +0800
@@ -13,7 +13,6 @@
 #include <asm/time.h>
 
 extern int init_xen_time(void);
-extern void cstate_save_tsc(void);
 extern void cstate_restore_tsc(void);
 
 extern unsigned long cpu_khz;

Jimmy
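For reference, here is a hedged, self-contained sketch (portable C, not the asm versions in xen/arch/x86/time.c) of how the patch's scale_time() is meant to invert scale_delta(). The mul_frac/shift values are the calibrated figures quoted later in this thread; the ns -> ticks -> ns round trip should agree to within a couple of nanoseconds:

    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors the semantics of Xen's struct time_scale (sketch only). */
    struct time_scale { int shift; uint32_t mul_frac; };

    /* ticks -> ns:  ns = ((ticks << shift) * mul_frac) >> 32, negative shift
     * meaning a right shift. */
    static uint64_t scale_delta(uint64_t delta, const struct time_scale *s)
    {
        if ( s->shift < 0 )
            delta >>= -s->shift;
        else
            delta <<= s->shift;
        return (uint64_t)(((__uint128_t)delta * s->mul_frac) >> 32);
    }

    /* ns -> ticks: the inverse conversion, as in the patch's scale_time(). */
    static uint64_t scale_time(uint32_t ns, const struct time_scale *s)
    {
        uint64_t td64 = (uint64_t)ns << 32;
        uint64_t q = td64 / s->mul_frac, r = td64 % s->mul_frac;

        if ( s->shift < 0 )
        {
            q <<= -s->shift;
            r <<= -s->shift;
            q += r / s->mul_frac;
        }
        else
            q >>= s->shift;
        return q;
    }

    int main(void)
    {
        /* TSC->ns scale for a ~2.533 GHz TSC (values quoted in this thread). */
        struct time_scale tsc_to_ns = { .shift = -1, .mul_frac = 0xca191f43 };
        uint64_t ns = 1000000;                          /* 1 ms */
        uint64_t ticks = scale_time((uint32_t)ns, &tsc_to_ns);

        /* The round trip should come back within a couple of ns of the input. */
        printf("ns=%llu ticks=%llu back=%llu\n",
               (unsigned long long)ns, (unsigned long long)ticks,
               (unsigned long long)scale_delta(ticks, &tsc_to_ns));
        return 0;
    }

Note the ASSERT in the patch: scale_time() only takes a 32-bit nanosecond delta, so the calibration stamp it extrapolates from must be no more than about 4.3 seconds old.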
Keir Fraser
2008-Dec-05 11:47 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 11:30, "Wei, Gang" <gang.wei@intel.com> wrote:

>> I tried extrapolating from t->stime_local_stamp, cpu_khz, and
>> t->local_tsc_stamp before I arrived at the current solution. It still
>> produced accumulating skew, only at a slower rate. I would like to try it
>> again with t->tsc_scale instead of cpu_khz. If it works, it would indeed
>> be simpler. Allow me some time.
>
> The patch below should be what you expected. It still produces a
> continuously increasing tsc skew. If I pin all domains' vcpus on one pcpu,
> the skew increases faster.

It looks about right to me, and better than using cpu_khz as in the current code and the original patch. How much skew does it introduce? Is the skew a problem? Bearing in mind that from the point of view of HVM guests, TSC rates will be all over the place anyway if we are using P states (and the host TSC is not invariant, which is actually the only time your code is enabled anyway).

 -- Keir
Tian, Kevin
2008-Dec-05 11:50 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Friday, December 05, 2008 6:13 PM
>
>On 05/12/2008 10:05, "Tian, Kevin" <kevin.tian@intel.com> wrote:
>> [...]
>> Let me take back the words above. As you said, cpu_khz is less accurate
>> because its lowest bits are truncated.
>
>Yes. Also bear in mind that absolute ongoing synchronisation between TSCs
>*does not matter*. Xen will happily synchronise system time on top of
>(slowly enough, constantly enough) diverging TSCs, and of course HVM VCPUs
>re-set their guest TSC offset when moving between host CPUs.

We measured the following cases (4 idle UP HVM RHEL5 guests, 2 cores):

a) deep C-states disabled
b) deep C-states enabled, with the original TSC save/restore at each
   C-state entry/exit
c) deep C-states enabled, TSC restored from the local calibration stamp
   and tsc_scale
d) deep C-states enabled, TSC restored from the monotonic platform stime
   and cpu_khz

        system time skew       TSC skew
  a)    hundreds of ns         several us
  b)    keeps accumulating     keeps accumulating
  c)    dozens of us           keeps accumulating
  d)    hundreds of ns         several us

A large system time skew impacts both PV and HVM domains. A PV domain complains that time went backward when a vcpu migrates to a cpu with a slower NOW(); an HVM domain gets delayed vpt expiry when migrating to a slower cpu, or, the other way round, missed ticks accounted by Xen for some timer modes. Both c) and d) keep that skew within a stable range.

A large TSC skew is normally OK for a PV domain, since Xen time stamps are re-synced at gettimeofday and at timer interrupts inside the PV guest; only user processes that use rdtsc directly may be affected. However, a large TSC skew is really bad for an HVM guest, especially when the guest TSC offset is never adjusted at vcpu migration: the guest then catches up the missing ticks in a batch, which results in softlockup warnings or DMA timeouts. So with c) we can still observe guest complaints after a long enough run. I am not sure the guest TSC offset can be adjusted accurately either, since you would first have to measure the TSC skew between cores, which may require IPIs and adds extra overhead. It simply gets messy to handle an accumulating TSC skew for HVM guests.

That is why we went with option d), which exposes the same level of constraints as the disabled case. It is not a perfect solution, but it shows more stable results than the others.

>What *does* matter is the possibility of warping a host TSC value on wake
>from deep sleep, compared with its value if the sleep had never happened.
>In this case, system time will be wrong (since we haven't been through a
>calibration step since waking up) and HVM timers will be wrong. And using
>a start-of-day timestamp plus cpu_khz makes this more likely. The correct
>thing to do is obey the most recent set of local calibration values.

I assume you mean S3 by "deep sleep"? If so, I don't think it is an issue. A sane dom0 S3 flow only happens after the other domains have been notified with a virtual S3 event, so after waking up from dom0 S3 every domain resumes its timekeeping sub-system anyway.

Thanks,
Kevin
Tian, Kevin
2008-Dec-05 11:53 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>From: Keir Fraser
>Sent: Friday, December 05, 2008 7:48 PM
>
>It looks about right to me, and better than using cpu_khz as in the
>current code and the original patch. How much skew does it introduce?
>Is the skew a problem? Bearing in mind that from the point of view of HVM
>guests, TSC rates will be all over the place anyway if we are using P
>states (and the host TSC is not invariant, which is actually the only time
>your code is enabled anyway).

'Invariant' here refers to a newer feature whereby the TSC keeps running even when the clock is stopped on deep C-state entry; the latest Intel Core i7, for example, supports it. For P-states there is a different term, "constant TSC", which has nothing to do with C-states. And, as I wrote earlier, constant TSC is supported by many processors nowadays.

Thanks,
Kevin
Keir Fraser
2008-Dec-05 12:36 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 11:50, "Tian, Kevin" <kevin.tian@intel.com> wrote:

> That is why we went with option d), which exposes the same level of
> constraints as the disabled case. It is not a perfect solution, but it
> shows more stable results than the others.

It does depend on constant_tsc though. That's fine, but I think then it should be implemented as an alternative method selected only when you detect constant_tsc at runtime. The fallback should be something like Gang Wei posted most recently.

Also don't use cpu_khz as it's way less accurate than you need to put up with. Create a 'struct time_scale' that will convert from nanoseconds to TSC ticks. I think a reciprocal function for time_scale's would be useful here (and in Gang Wei's last patch as well) as then you can take the existing initial TSC->ns scale and flip it.

The other concern I have is that, even with a much more accurate scale factor, CPUs which do not enter deep-C (because they are running busy loops for example) will *slowly* wander from the time line determined by your scale factor (due to unavoidable imprecision in the scale factor). This could build up to be noticeable over days. It might be an idea therefore to occasionally force a TSC resync on any CPU that hasn't had it resynced for other reasons, just to zap any accumulating imprecision that has crept in. It'd only need to be done maybe once a minute or maybe even much less than that.

 -- Keir
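As an illustration of the kind of reciprocal being described (flipping a TSC->ns time_scale into an ns->TSC one), here is a hedged, portable sketch. It mirrors the time_scale semantics used in this thread but is not necessarily the code that later landed as c/s 18873/18875:

    #include <stdint.h>

    /* scaled = ((delta << shift) * mul_frac) >> 32; negative shift = right shift */
    struct time_scale {
        int shift;
        uint32_t mul_frac;
    };

    /* Build a scale that approximately inverts 'scale', so applying the two
     * in sequence is (nearly) the identity. */
    static struct time_scale scale_reciprocal(struct time_scale scale)
    {
        struct time_scale r;
        uint64_t dividend = 1ull << 63;

        /* Invariant: the target mul_frac is dividend / scale.mul_frac, with
         * dividend == 2^(64 - scale.shift - r.shift). */
        r.shift = 1 - scale.shift;
        while ( (dividend / scale.mul_frac) > 0xffffffffull )
        {
            dividend >>= 1;
            r.shift++;
        }
        r.mul_frac = (uint32_t)(dividend / scale.mul_frac);
        return r;
    }

With the forward scale normalised so mul_frac has its top bit set, the loop runs at most once and the reciprocal keeps close to 32 bits of fraction, which is exactly the extra precision over cpu_khz that is being asked for here.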
Keir Fraser
2008-Dec-05 13:06 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 11:30, "Wei, Gang" <gang.wei@intel.com> wrote:

>> I tried extrapolating from t->stime_local_stamp, cpu_khz, and
>> t->local_tsc_stamp before I arrived at the current solution. It still
>> produced accumulating skew, only at a slower rate. I would like to try it
>> again with t->tsc_scale instead of cpu_khz. If it works, it would indeed
>> be simpler. Allow me some time.
>
> The patch below should be what you expected. It still produces a
> continuously increasing tsc skew. If I pin all domains' vcpus on one pcpu,
> the skew increases faster.

I applied a modified version of this as c/s 18873. Primarily I removed scale_time() and replaced it with a simpler reciprocal function which is applied to scale factors to invert them. Please take a look.

As I replied to Kevin, you can still have a scheme based on exploiting constant_tsc if you like, but it needs to be enabled only when you really have constant_tsc, and also subject to other comments I made in that email.

 -- Keir
Keir Fraser
2008-Dec-05 14:47 UTC
[Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 05/12/2008 13:06, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> The patch below should be what you expected. It still produces a
>> continuously increasing tsc skew. If I pin all domains' vcpus on one
>> pcpu, the skew increases faster.
>
> I applied a modified version of this as c/s 18873. Primarily I removed
> scale_time() and replaced it with a simpler reciprocal function which is
> applied to scale factors to invert them. Please take a look.

I should have tested the reciprocal function before checkin. A fixed (and now rather less simple) version is in c/s 18875.

 -- Keir
Wei, Gang
2008-Dec-06 07:36 UTC
[Xen-devel] RE: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Friday, December 05, 2008 10:48 PM, Keir Fraser wrote:
> On 05/12/2008 13:06, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>> I applied a modified version of this as c/s 18873. Primarily I removed
>> scale_time() and replaced it with a simpler reciprocal function which is
>> applied to scale factors to invert them. Please take a look.
>
> I should have tested the reciprocal function before checkin. A fixed (and
> now rather less simple) version is in c/s 18875.

Your fix is absolutely right. Now the tsc skew at least grows at a far slower rate. Thanks, and I will consider how to make my original patch work for the constant_tsc case only.

Jimmy
Wei, Gang
2008-Dec-12 03:36 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Friday, December 05, 2008 8:36 PM, Keir Fraser wrote:
> It does depend on constant_tsc though. That's fine, but I think then it
> should be implemented as an alternative method selected only when you
> detect constant_tsc at runtime. The fallback should be something like Gang
> Wei posted most recently.
>
> Also don't use cpu_khz as it's way less accurate than you need to put up
> with. Create a 'struct time_scale' that will convert from nanoseconds to
> TSC ticks. I think a reciprocal function for time_scale's would be useful
> here (and in Gang Wei's last patch as well) as then you can take the
> existing initial TSC->ns scale and flip it.
>
> The other concern I have is that, even with a much more accurate scale
> factor, CPUs which do not enter deep-C (because they are running busy
> loops for example) will *slowly* wander from the time line determined by
> your scale factor (due to unavoidable imprecision in the scale factor).
> This could build up to be noticeable over days. It might be an idea
> therefore to occasionally force a TSC resync on any CPU that hasn't had it
> resynced for other reasons, just to zap any accumulating imprecision that
> has crept in. It'd only need to be done maybe once a minute or maybe even
> much less than that.

I am updating the original patch for the constant_tsc case. There are some questions.

The first question is: can we just replace the per-second tsc_scale calibration with a constant tsc_scale plus per-second tsc restoring? With constant_tsc, the tsc_scale calibration work seems pointless. And to keep the tsc skew as small as possible, it is better to resync the tsc once per second rather than once a minute or less often.

The second question is: how can we get a more accurate tsc_scale and use it for all local NOW() calculations and tsc restoring? The current initial tsc->ns scale is based on the PIT, but the PIT is not necessarily the platform timer source. Is it practical to run the tsc_scale calibration for a few seconds and then take the calibrated tsc_scale as the global, fixed one?

Jimmy
Keir Fraser
2008-Dec-12 09:31 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 12/12/2008 03:36, "Wei, Gang" <gang.wei@intel.com> wrote:

> I am updating the original patch for the constant_tsc case. There are some
> questions.
>
> The first question is: can we just replace the per-second tsc_scale
> calibration with a constant tsc_scale plus per-second tsc restoring? With
> constant_tsc, the tsc_scale calibration work seems pointless. And to keep
> the tsc skew as small as possible, it is better to resync the tsc once per
> second rather than once a minute or less often.

If you mean have the same tsc_scale in all per_cpu(cpu_time).tsc_scale, then yes. And you can do your once-a-second work in time_calibration() -- you can put an if-else in there.

I thought our tsc_skew was looking quite good now with the updated scale_reciprocal(), by the way. How much more drift is there for you to wring out?

> The second question is: how can we get a more accurate tsc_scale and use
> it for all local NOW() calculations and tsc restoring? The current initial
> tsc->ns scale is based on the PIT, but the PIT is not necessarily the
> platform timer source. Is it practical to run the tsc_scale calibration
> for a few seconds and then take the calibrated tsc_scale as the global,
> fixed one?

You don't want a platform source, right? So what does it matter? The more important question is: how much inaccuracy is built in to the existing tsc->ns scale factor compared with real wallclock progress of time? And how easily can that be improved? The fact that PIT may wander relative to HPET, for example.... Who cares?

 -- Keir
Wei, Gang
2008-Dec-13 04:01 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Friday, December 12, 2008 5:32 PM, Keir Fraser wrote:

> If you mean have the same tsc_scale in all per_cpu(cpu_time).tsc_scale,
> then yes. And you can do your once-a-second work in time_calibration() --
> you can put an if-else in there.

Yes, I will do it this way.

> I thought our tsc_skew was looking quite good now with the updated
> scale_reciprocal(), by the way. How much more drift is there for you to
> wring out?

My tsc skew testing method is:

1. specify "cpuidle hpetbroadcast cpufreq=xen" on the grub xen line.
2. start 4 UP rhel5u1 HVM guests.
3. 'xm vcpu-pin ...' to pin all vcpus of dom0 and all guests to cpu1
   (2 pcpus total in my box).
4. run 'while true; do find . /; done' in several guests.
5. trigger the 't' keyhandler from time to time to watch the time skew and
   cycle skew.

I observed the cycle skew easily exceeding ten seconds' worth within hours.

> You don't want a platform source, right? So what does it matter? The more
> important question is: how much inaccuracy is built in to the existing
> tsc->ns scale factor compared with real wallclock progress of time? And
> how easily can that be improved? The fact that PIT may wander relative to
> HPET, for example.... Who cares?

On my box, the existing initial tsc->ns scale is:
  .mul_frac=ca1761ff, .shift=-1  (2533507879 ticks/sec)
and the stable calibrated tsc->ns scale in the no-PM case is:
  .mul_frac=ca191f43, .shift=-1  (2533422531 ticks/sec)

Don't you think the inaccuracy is a little big (85348 ticks/sec)? But it is not really easy to improve. I will keep using the existing scale for the moment; we can keep looking for a better way to improve it.

Jimmy
Keir Fraser
2008-Dec-13 09:51 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 13/12/2008 04:01, "Wei, Gang" <gang.wei@intel.com> wrote:

>> You don't want a platform source, right? So what does it matter? The more
>> important question is: how much inaccuracy is built in to the existing
>> tsc->ns scale factor compared with real wallclock progress of time? And
>> how easily can that be improved? The fact that PIT may wander relative to
>> HPET, for example.... Who cares?
>
> On my box, the existing initial tsc->ns scale is:
>   .mul_frac=ca1761ff, .shift=-1  (2533507879 ticks/sec)
> and the stable calibrated tsc->ns scale in the no-PM case is:
>   .mul_frac=ca191f43, .shift=-1  (2533422531 ticks/sec)
> Don't you think the inaccuracy is a little big (85348 ticks/sec)? But it
> is not really easy to improve. I will keep using the existing scale for
> the moment; we can keep looking for a better way to improve it.

The difference is 33 parts per million. The actual frequency of the timer crystal will likely deviate from its stated frequency by more than that. I don't think there's anything you can do here (beyond allowing tsc_scale to be adjusted periodically by ntp algorithms in dom0).

By the way, c/s 18102 (subsequently reverted) may be interesting for you in implementing the TSC-and-no-platform-timer mode. I'm not sure how much of it will really be applicable, but it might be worth a look at least.

 -- Keir
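For reference, the 33 ppm figure follows directly from the two frequencies quoted above:

    (2533507879 - 2533422531) / 2533507879 = 85348 / 2533507879
                                           ~ 3.37e-5
                                           ~ 33.7 parts per million

which corresponds to roughly 34 us of drift per second of extrapolation if the coarser scale were used on its own.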
Wei, Gang
2008-Dec-13 15:27 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Saturday, December 13, 2008 5:51 PM, Keir Fraser wrote:
> By the way, c/s 18102 (subsequently reverted) may be interesting for you
> in implementing the TSC-and-no-platform-timer mode. I'm not sure how much
> of it will really be applicable, but it might be worth a look at least.

c/s 18102 looks like an incomplete patch that wants to reintroduce the TSC as the platform timer only when the TSC is always running and at a constant rate. I guess you still want that because TSC access is cheaper, right? Anyway, it has little to do with my current goal: handling a TSC that runs at a constant rate but stops in deep C states.

Jimmy
Keir Fraser
2008-Dec-13 15:41 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 13/12/2008 15:27, "Wei, Gang" <gang.wei@intel.com> wrote:

> On Saturday, December 13, 2008 5:51 PM, Keir Fraser wrote:
>> By the way, c/s 18102 (subsequently reverted) may be interesting for you
>> in implementing the TSC-and-no-platform-timer mode. I'm not sure how much
>> of it will really be applicable, but it might be worth a look at least.
>
> c/s 18102 looks like an incomplete patch that wants to reintroduce the TSC
> as the platform timer only when the TSC is always running and at a
> constant rate. I guess you still want that because TSC access is cheaper,
> right? Anyway, it has little to do with my current goal: handling a TSC
> that runs at a constant rate but stops in deep C states.

You don't need to take anything from that c/s if it's not useful.

 -- Keir
Wei, Gang
2008-Dec-15 03:06 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
Here is the updated patch for the constant-tsc case. -Jimmy

CPUIDLE: revise tsc-restore to avoid increasing tsc skew between cpus

Originally, the sequence on each cpu was [tsc-save, enter deep C, break event, exit deep C, tsc-restore], so systematic error accumulated easily. Once the workloads across cpus are unbalanced, the tsc skew between cpus eventually grows bigger and bigger - more than 10 seconds of skew can be observed.

We then removed the tsc-save step and restored the TSC after exiting a deep C state based only on the per-cpu t->stime_master_stamp, t->tsc_scale and t->local_tsc_stamp. That slows the accumulation down but cannot eliminate it.

Now, for the constant-tsc case, we simply keep an initial stamp, taken via cstate_init_stamp during boot/S3 resume and based on the platform stime. After exiting a deep C state, each cpu only needs to restore its TSC relative to that initial stamp. The base and the tsc->ns scale are fixed and identical for all cpus, so accumulated tsc skew is avoided.

BTW, bypass the per-cpu tsc scale calibration for the constant-tsc case.

Signed-off-by: Wei Gang <gang.wei@intel.com>

diff -r 045f70d1acdb xen/arch/x86/time.c
--- a/xen/arch/x86/time.c	Sat Dec 13 17:44:20 2008 +0000
+++ b/xen/arch/x86/time.c	Mon Dec 15 10:35:11 2008 +0800
@@ -69,8 +69,11 @@ static DEFINE_PER_CPU(struct cpu_time, c
 #define EPOCH MILLISECS(1000)
 static struct timer calibration_timer;
 
-/* TSC is invariant on C state entry? */
-static bool_t tsc_invariant;
+/* TSC will not stop during deep C state? */
+static bool_t tsc_nostop;
+/* TSC will be constant rate, independent with P/T state? */
+static int constant_tsc = 0;
+boolean_param("const_tsc", constant_tsc);
 
 /*
  * We simulate a 32-bit platform timer from the 16-bit PIT ch2 counter.
@@ -551,6 +554,10 @@ static u64 plt_stamp;            /* hard
 static u64 plt_stamp;            /* hardware-width platform counter stamp */
 static struct timer plt_overflow_timer;
 
+/* following 2 variables are for deep C state TSC restore usage */
+static u64 initial_tsc_stamp;       /* initial tsc stamp while plt starting */
+static s_time_t initial_stime_platform_stamp;      /* initial stime stamp */
+
 static void plt_overflow(void *unused)
 {
     u64 count;
@@ -664,25 +671,41 @@ static void init_platform_timer(void)
            freq_string(pts->frequency), pts->name);
 }
 
-void cstate_restore_tsc(void)
+static void cstate_init_stamp(void)
+{
+    if ( tsc_nostop || !constant_tsc )
+        return;
+
+    initial_stime_platform_stamp = read_platform_stime();
+    rdtscll(initial_tsc_stamp);
+}
+
+static inline void __restore_tsc(s_time_t plt_stime)
 {
     struct cpu_time *t = &this_cpu(cpu_time);
     struct time_scale sys_to_tsc = scale_reciprocal(t->tsc_scale);
     s_time_t stime_delta;
     u64 tsc_delta;
 
-    if ( tsc_invariant )
+    if ( tsc_nostop )
         return;
 
-    stime_delta = read_platform_stime() - t->stime_master_stamp;
+    stime_delta = plt_stime -
+        (constant_tsc ? initial_stime_platform_stamp : t->stime_master_stamp);
+
     if ( stime_delta < 0 )
         stime_delta = 0;
 
     tsc_delta = scale_delta(stime_delta, &sys_to_tsc);
-    wrmsrl(MSR_IA32_TSC, t->local_tsc_stamp + tsc_delta);
+    wrmsrl(MSR_IA32_TSC,
+           (constant_tsc ? initial_tsc_stamp : t->local_tsc_stamp) + tsc_delta);
 }
+
+void cstate_restore_tsc(void)
+{
+    __restore_tsc(read_platform_stime());
+}
 
 /***************************************************************************
  * CMOS Timer functions
 ***************************************************************************/
@@ -960,6 +983,18 @@ static void local_time_calibration(void)
            curr_master_stime - curr_local_stime);
 #endif
 
+    if ( constant_tsc )
+    {
+        local_irq_disable();
+        t->local_tsc_stamp = curr_tsc;
+        t->stime_local_stamp = curr_master_stime;
+        t->stime_master_stamp = curr_master_stime;
+        local_irq_enable();
+
+        update_vcpu_system_time(current);
+        goto out;
+    }
+
     /* Local time warps forward if it lags behind master time. */
     if ( curr_local_stime < curr_master_stime )
         curr_local_stime = curr_master_stime;
@@ -1082,6 +1117,8 @@ static void time_calibration_rendezvous(
         mb(); /* receive signal /then/ read r->master_stime */
     }
 
+    __restore_tsc(r->master_stime);
+
     rdtscll(c->local_tsc_stamp);
     c->stime_local_stamp = get_s_time();
     c->stime_master_stamp = r->master_stime;
@@ -1125,9 +1162,23 @@ void init_percpu_time(void)
 /* Late init function (after all CPUs are booted). */
 int __init init_xen_time(void)
 {
-    /* Is TSC invariant during deep C state? */
+    /* for recent intel x86 models, the tsc increments at a constant rate */
+    if ( (current_cpu_data.x86 == 0xf && current_cpu_data.x86_model >= 0x03) ||
+         (current_cpu_data.x86 == 0x6 && current_cpu_data.x86_model >= 0x0e) )
+    {
+        int cpu;
+
+        constant_tsc = 1;
+
+        for_each_cpu(cpu)
+        {
+            per_cpu(cpu_time, cpu).tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
+        }
+    }
+
+    /* Does the TSC keep running during deep C states? */
     if ( cpuid_edx(0x80000007) & (1u<<8) )
-        tsc_invariant = 1;
+        tsc_nostop = 1;
 
     open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
 
@@ -1139,6 +1190,8 @@ int __init init_xen_time(void)
     stime_platform_stamp = NOW();
     init_platform_timer();
+
+    cstate_init_stamp();
 
     init_percpu_time();
 
@@ -1260,6 +1313,8 @@ int time_resume(void)
     disable_pit_irq();
 
     init_percpu_time();
+
+    cstate_init_stamp();
 
     do_settime(get_cmos_time() + cmos_utc_offset, 0, NOW());
Wei, Gang
2008-Dec-15 09:40 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
Redo the constant_tsc & tsc_nostop check part and post it again.

Jimmy

On Monday, December 15, 2008 11:07 AM, Wei, Gang wrote:
> Here is the updated patch for the constant-tsc case. -Jimmy
Keir Fraser
2008-Dec-15 12:03 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 15/12/2008 09:40, "Wei, Gang" <gang.wei@intel.com> wrote:

> Redo the constant_tsc & tsc_nostop check part and post it again.
>
> Jimmy
>
> On Monday, December 15, 2008 11:07 AM, Wei, Gang wrote:
>> Here is the updated patch for the constant-tsc case. -Jimmy

I applied the bits outside time.c. For time.c itself, how about the simpler attached alternative? Does it work well? :-)

Notice I don't reset TSCs at all (except when !NOSTOP_TSC and after deep sleep).

 -- Keir
Wei, Gang
2008-Dec-15 13:28 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Monday, December 15, 2008 8:03 PM, Keir Fraser wrote:
> On 15/12/2008 09:40, "Wei, Gang" <gang.wei@intel.com> wrote:
>> Redo the constant_tsc & tsc_nostop check part and post it again.
>
> I applied the bits outside time.c. For time.c itself, how about the
> simpler attached alternative? Does it work well? :-)

Although it looks simpler and workable, in practice it does not work. The symptoms:

A) xm vcpu-list shows weird numbers, and cpu 1 looks dead-locked.

[root@lke_linux_mv ~]# xm vcpu-list
Name               ID  VCPU  CPU  State       Time(s)  CPU Affinity
Domain-0            0     0    0  r--           124.8  any cpu
Domain-0            0     1    0  ---    7281029792.3  any cpu
LinuxHVMDomain_1    1     0    0  ---            37.5  any cpu
LinuxHVMDomain_2    3     0    0  ---            45.5  any cpu
LinuxHVMDomain_3    4     0    1  r--             4.1  any cpu
LinuxHVMDomain_4    2     0    0  ---            41.9  any cpu

The dump info of cpu1:

(XEN) Xen call trace:
(XEN)    [<ffff828c8010ca71>] __dump_execstate+0x1/0x60
(XEN)    [<ffff828c8014f66d>] smp_call_function_interrupt+0x7d/0xd0
(XEN)    [<ffff828c8013b41a>] call_function_interrupt+0x2a/0x30
(XEN)    [<ffff828c80151538>] get_s_time+0x28/0x70
(XEN)    [<ffff828c8017fa0a>] hvm_get_guest_time+0x3a/0x90
(XEN)    [<ffff828c8017d620>] pmt_timer_callback+0x0/0x80
(XEN)    [<ffff828c8017d4fc>] pmt_update_time+0x1c/0x70
(XEN)    [<ffff828c8017d620>] pmt_timer_callback+0x0/0x80
(XEN)    [<ffff828c8017d649>] pmt_timer_callback+0x29/0x80
(XEN)    [<ffff828c80118ebc>] execute_timer+0x2c/0x50
(XEN)    [<ffff828c80118ffa>] timer_softirq_action+0x11a/0x2d0
(XEN)    [<ffff828c801175b0>] do_softirq+0x70/0x80
(XEN)    [<ffff828c80188915>] vmx_asm_do_vmentry+0xd2/0xdd
(XEN)
(XEN) *** Dumping CPU1 guest state: ***
(XEN) ----[ Xen-3.4-unstable  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    1
(XEN) RIP:    f000:[<000000000000054f>]
(XEN) RFLAGS: 0000000000000202   CONTEXT: hvm guest
(XEN) rax: 00000000000001f7   rbx: 0000000000000000   rcx: 0000000000000000
(XEN) rdx: 00000000000001f7   rsi: 00000000000058b0   rdi: 0000000000003400
(XEN) rbp: 0000000000001f7a   rsp: 0000000000001f78   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000000000010   cr4: 0000000000000000
(XEN) cr3: 0000000000000000   cr2: 0000000000000000
(XEN) ds: 0000   es: 7000   fs: 0000   gs: 0000   ss: 0000   cs: f000
(XEN)

B) The tsc skew becomes larger and larger:

(XEN) Synced stime skew: max=54879663ns avg=54529587ns samples=93 current=54879663ns
(XEN) Synced cycles skew: max=113678958 avg=112788173 samples=93 current=113678958
(XEN) Synced stime skew: max=54881161ns avg=54533327ns samples=94 current=54881161ns
(XEN) Synced cycles skew: max=113682743 avg=112797689 samples=94 current=113682743
(XEN) Synced stime skew: max=54882749ns avg=54537006ns samples=95 current=54882749ns
(XEN) Synced cycles skew: max=113686776 avg=112807048 samples=95 current=113686776

> Notice I don't reset TSCs at all (except when !NOSTOP_TSC and after deep
> sleep).
>
> -- Keir
Keir Fraser
2008-Dec-15 16:02 UTC
Re: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On 15/12/2008 13:28, "Wei, Gang" <gang.wei@intel.com> wrote:

>>> Redo the constant_tsc & tsc_nostop check part and post it again.
>>
>> I applied the bits outside time.c. For time.c itself, how about the
>> simpler attached alternative? Does it work well? :-)
>
> Although it looks simpler and workable, in practice it does not work.

Weird. I wonder if CPU TSCs aren't as synced as we'd like, and we're getting a -ve TSC delta in get_s_time(). Perhaps setting the TSC MSR to r->master_tsc_stamp in time_calibration_rendezvous() would avoid that.

 -- Keir
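A minimal sketch of what is being suggested here; the rendezvous structure and helper names below are simplified stand-ins for the real time_calibration_rendezvous() code, so treat this as an illustration rather than the actual change:

    #include <stdint.h>

    /* Assumed shape of the per-rendezvous shared data (illustrative only). */
    struct calibration_rendezvous {
        uint64_t master_stime;       /* platform system time at the rendezvous */
        uint64_t master_tsc_stamp;   /* TSC value every CPU should jump to     */
    };

    /* Stand-in for Xen's wrmsrl(MSR_IA32_TSC, val); a plain stub here. */
    static void write_tsc(uint64_t val) { (void)val; }

    static int tsc_is_reliable;      /* e.g. set when the TSC never stops */

    static void rendezvous_resync_tsc(const struct calibration_rendezvous *r)
    {
        /* Only rewrite the TSC when software resync is actually needed;
         * a TSC that never stops is left alone. */
        if ( !tsc_is_reliable )
            write_tsc(r->master_tsc_stamp);
    }

The point is simply that every CPU writes the *same* stamp at the *same* rendezvous, so a CPU whose TSC stopped in a deep C state can no longer feed a negative delta into get_s_time(); the follow-up below reports that adding exactly this kind of conditional wrmsrl makes the simpler scheme work.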
Tian, Kevin
2008-Dec-16 00:29 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Tuesday, December 16, 2008 12:02 AM
>
>On 15/12/2008 13:28, "Wei, Gang" <gang.wei@intel.com> wrote:
>> Although it looks simpler and workable, in practice it does not work.
>
>Weird. I wonder if CPU TSCs aren't as synced as we'd like, and we're
>getting a -ve TSC delta in get_s_time(). Perhaps setting the TSC MSR to
>r->master_tsc_stamp in time_calibration_rendezvous() would avoid that.

I am not sure why you do not want to adjust the TSC. Whether the cpu TSCs are synced or not hardly matters if the TSC stops whenever an individual core enters a deep C-state. As long as different cpus get different amounts of deep C-state residency, you always end up with increasing TSC skew. Whatever you do on the Xen side without adjusting the TSC only helps consumers that are aware of Xen system time, e.g. PV guests; for HVM guests, TSC skew always causes problems.

I think no software approach can cleanly solve this TSC skew without hardware assistance such as an always-running TSC. Since Jimmy's approach works far better than the previous one (yes, we know it has a weakness when one cpu does not enter a deep C-state for a long time), is it still worth slipping in? In the meantime, IMO your change is also needed, since there is little need for periodic time calibration when the TSC is constant - but that is a separate issue from the one we aim to solve. :-)

Thanks,
Kevin
Tian, Kevin
2008-Dec-16 00:44 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
Forgot the comment below; I read your patch too quickly.

It is supposed to work, and maybe some ordering issue defeats the effect you want to achieve. For example, at least the lines within init_xen_time look useless, since calibrate_tsc_ap happens later and will update the AP tsc_scale. Anyway, this looks like the right direction, and we will do some debugging to find the exact reason.

Thanks,
Kevin

>From: Tian, Kevin
>Sent: Tuesday, December 16, 2008 8:29 AM
>
>I am not sure why you do not want to adjust the TSC. [...]
Tian, Kevin
2008-Dec-16 00:58 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
Again, another wrong comment from me; the sequence is in fact correct. Not sure why my mind is so fuzzy on such a good morning. :-(

Jimmy is now trying your suggestion to sync the TSC MSR in the time calibration softirq.

Thanks,
Kevin

>From: Tian, Kevin
>Sent: Tuesday, December 16, 2008 8:45 AM
>
>It is supposed to work, and maybe some ordering issue defeats the effect
>you want to achieve. [...]
Wei, Gang
2008-Dec-16 02:26 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Tuesday, December 16, 2008 12:02 AM, Keir Fraser wrote:
> Weird. I wonder if CPU TSCs aren't as synced as we'd like, and we're
> getting a -ve TSC delta in get_s_time(). Perhaps setting the TSC MSR to
> r->master_tsc_stamp in time_calibration_rendezvous() would avoid that.

I added a conditional wrmsrl as you said, and meanwhile removed the now-unnecessary c->master_tsc_stamp. It works fine now.

The 't' key outputs below were taken in the extreme case: pin all dom0 and guest vcpus on cpu1 and run 'while true; do a=1; done' inside one guest. The largest stime skew is ~40us, the largest cycles skew is ~100,000 ticks (~40us). In the normal idle case the skew is quite small (~xxxns).

(XEN) Synced stime skew: max=39662ns avg=8038ns samples=850 current=326ns
(XEN) Synced cycles skew: max=100475 avg=20368 samples=850 current=825
...
(XEN) Synced stime skew: max=39750ns avg=16958ns samples=3954 current=30667ns
(XEN) Synced cycles skew: max=100708 avg=42967 samples=3954 current=77696
...
(XEN) Synced stime skew: max=39750ns avg=17318ns samples=4544 current=22981ns
(XEN) Synced cycles skew: max=100708 avg=43880 samples=4544 current=58225

Jimmy
Jan Beulich
2008-Dec-16 08:00 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
>>> "Wei, Gang" <gang.wei@intel.com> 16.12.08 03:26 >>> >Below ''t'' key outputs are gotten in the extreme case: pin all dom0 & guest vcpus on cpu1 & execute cmd ''while true; do a=1; done'' >within one guest. >The largest stime skew is ~40us, largest cycles skew is ~100,000 ticks(~40us). The normal idle case skew is quite small (~xxxns). > >(XEN) Synced stime skew: max=39662ns avg=8038ns samples=850 current=326ns >(XEN) Synced cycles skew: max=100475 avg=20368 samples=850 current=825 >... >(XEN) Synced stime skew: max=39750ns avg=16958ns samples=3954 current=30667ns >(XEN) Synced cycles skew: max=100708 avg=42967 samples=3954 current=77696 >... >(XEN) Synced stime skew: max=39750ns avg=17318ns samples=4544 current=22981ns >(XEN) Synced cycles skew: max=100708 avg=43880 samples=4544 current=58225Is the average continuing to grow over larger periods of time? I would have expected it to converge rather than increase as a proof of it being a long term solution. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Wei, Gang
2008-Dec-16 08:09 UTC
RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
On Tuesday, December 16, 2008 4:00 PM, Jan Beulich wrote:
>>>> "Wei, Gang" <gang.wei@intel.com> 16.12.08 03:26 >>>
>> The largest stime skew is ~40us, the largest cycles skew is ~100,000
>> ticks (~40us). In the normal idle case the skew is quite small (~xxxns).
>
> Is the average continuing to grow over larger periods of time? I would
> have expected it to converge rather than increase as proof of this being a
> long-term solution.

The average appears to converge on my box. We can do more tests after it goes in.

Jimmy