thr3ads.net - Xen devel - [PATCH] xen: always set the sched clock as unstable [Apr 2012]

If this information is useful, please help other people find it:
Share via:

David Vrabel

2012-Apr-13 18:20 UTC

[PATCH] xen: always set the sched clock as unstable

From: David Vrabel <david.vrabel@citrix.com>

The sched clock was considered stable based on the capabilities of the
underlying hardware.  This does not make sense for Xen PV guests as:
a) the hardware TSC is not used directly as the clock source; and b)
guests may migrate to hosts with different hardware capabilities.

It is not clear to me whether the Xen clock source is supposed to be
stable and whether it should be stable across migration.  For a clock
source to be stable it must be: a) monotonic; c) synchronized across
CPUs; and c) constant rate.

There have also been reports of systems with apparently unstable
clocks where clearing sched_clock_stable has fixed problems with
migrated VMs hanging.

So, always set the sched clock as unstable when using the Xen clock
source.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/xen/time.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 0296a95..8469b5a 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -473,6 +473,7 @@ static void __init xen_time_init(void)
 	do_settimeofday(&tp);
 
 	setup_force_cpu_cap(X86_FEATURE_TSC);
+	sched_clock_stable = 0;
 
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
-- 
1.7.2.5

Sheng Yang

2012-Apr-13 18:31 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On Fri, Apr 13, 2012 at 11:20 AM, David Vrabel
<david.vrabel@citrix.com>wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> The sched clock was considered stable based on the capabilities of the
> underlying hardware.  This does not make sense for Xen PV guests as:
> a) the hardware TSC is not used directly as the clock source; and b)
> guests may migrate to hosts with different hardware capabilities.
>
> It is not clear to me whether the Xen clock source is supposed to be
> stable and whether it should be stable across migration.  For a clock
> source to be stable it must be: a) monotonic; c) synchronized across
> CPUs; and c) constant rate.
>
> There have also been reports of systems with apparently unstable
> clocks where clearing sched_clock_stable has fixed problems with
> migrated VMs hanging.
>
> So, always set the sched clock as unstable when using the Xen clock
> source.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  arch/x86/xen/time.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0296a95..8469b5a 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>        do_settimeofday(&tp);
>
>        setup_force_cpu_cap(X86_FEATURE_TSC);
> +       sched_clock_stable = 0;
>
>        xen_setup_runstate_info(cpu);
>        xen_setup_timer(cpu);
> --
> 1.7.2.5
>
>I really prefer to hide nonstop tsc feature from Xen side, rather than let
Linux kernel to have a condition that "Nonstop TSC feature existed, but
sched_clock_stable=0".

For the other context, please refer to the discussion at xen-devel.

http://lists.xen.org/archives/html/xen-devel/2012-04/msg00888.html
http://lists.xen.org/archives/html/xen-devel/2012-04/msg00969.html

-- 
regards
Yang, Sheng


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Sheng Yang

2012-Apr-13 18:33 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On Fri, Apr 13, 2012 at 11:20 AM, David Vrabel <david.vrabel@citrix.com>
wrote:>
> From: David Vrabel <david.vrabel@citrix.com>
>
> The sched clock was considered stable based on the capabilities of the
> underlying hardware.  This does not make sense for Xen PV guests as:
> a) the hardware TSC is not used directly as the clock source; and b)
> guests may migrate to hosts with different hardware capabilities.
>
> It is not clear to me whether the Xen clock source is supposed to be
> stable and whether it should be stable across migration.  For a clock
> source to be stable it must be: a) monotonic; c) synchronized across
> CPUs; and c) constant rate.
>
> There have also been reports of systems with apparently unstable
> clocks where clearing sched_clock_stable has fixed problems with
> migrated VMs hanging.
>
> So, always set the sched clock as unstable when using the Xen clock
> source.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  arch/x86/xen/time.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0296a95..8469b5a 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>        do_settimeofday(&tp);
>
>        setup_force_cpu_cap(X86_FEATURE_TSC);
> +       sched_clock_stable = 0;
>
>        xen_setup_runstate_info(cpu);
>        xen_setup_timer(cpu);
> --
> 1.7.2.5
>
(Sorry for duplicate posts, gmail is really not a ideal client for
lkml - though exchange is worse...)

I really prefer to hide nonstop tsc feature from Xen side, rather than
let Linux kernel to have a condition that "Nonstop TSC feature
existed, but sched_clock_stable=0".

For the other context, please refer to the discussion at xen-devel.

http://lists.xen.org/archives/html/xen-devel/2012-04/msg00888.html
http://lists.xen.org/archives/html/xen-devel/2012-04/msg00969.html

--
regards
Yang, Sheng

David Vrabel

2012-Apr-13 18:39 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On 13/04/12 19:31, Sheng Yang wrote:> On Fri, Apr 13, 2012 at 11:20 AM, David Vrabel <david.vrabel@citrix.com
> <mailto:david.vrabel@citrix.com>> wrote:
> 
>     From: David Vrabel <david.vrabel@citrix.com
>     <mailto:david.vrabel@citrix.com>>
> 
>     The sched clock was considered stable based on the capabilities of the
>     underlying hardware.  This does not make sense for Xen PV guests as:
>     a) the hardware TSC is not used directly as the clock source; and b)
>     guests may migrate to hosts with different hardware capabilities.
> 
>     It is not clear to me whether the Xen clock source is supposed to be
>     stable and whether it should be stable across migration.  For a clock
>     source to be stable it must be: a) monotonic; c) synchronized across
>     CPUs; and c) constant rate.
> 
>     There have also been reports of systems with apparently unstable
>     clocks where clearing sched_clock_stable has fixed problems with
>     migrated VMs hanging.
> 
>     So, always set the sched clock as unstable when using the Xen clock
>     source.
> 
>     Signed-off-by: David Vrabel <david.vrabel@citrix.com
>     <mailto:david.vrabel@citrix.com>>
>     ---
>      arch/x86/xen/time.c |    1 +
>      1 files changed, 1 insertions(+), 0 deletions(-)
> 
>     diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
>     index 0296a95..8469b5a 100644
>     --- a/arch/x86/xen/time.c
>     +++ b/arch/x86/xen/time.c
>     @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>            do_settimeofday(&tp);
> 
>            setup_force_cpu_cap(X86_FEATURE_TSC);
>     +       sched_clock_stable = 0;
> 
>            xen_setup_runstate_info(cpu);
>            xen_setup_timer(cpu);
>     --
>     1.7.2.5
> 
> 
> I really prefer to hide nonstop tsc feature from Xen side, rather than
> let Linux kernel to have a condition that "Nonstop TSC feature
existed,
> but sched_clock_stable=0".
I disagree.  The decision on whether the clock is stable or not should
be in the Xen clock source code.

David

Jan Beulich

2012-Apr-16 11:32 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com>
wrote:
> From: David Vrabel <david.vrabel@citrix.com>
> 
> The sched clock was considered stable based on the capabilities of the
> underlying hardware.  This does not make sense for Xen PV guests as:
> a) the hardware TSC is not used directly as the clock source; and b)
> guests may migrate to hosts with different hardware capabilities.
> 
> It is not clear to me whether the Xen clock source is supposed to be
> stable and whether it should be stable across migration.  For a clock
> source to be stable it must be: a) monotonic; c) synchronized across
> CPUs; and c) constant rate.
> 
> There have also been reports of systems with apparently unstable
> clocks where clearing sched_clock_stable has fixed problems with
> migrated VMs hanging.
> 
> So, always set the sched clock as unstable when using the Xen clock
> source.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  arch/x86/xen/time.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0296a95..8469b5a 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>  	do_settimeofday(&tp);
>  
>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> +	sched_clock_stable = 0;
This, unfortunately, is not sufficient afaict: If a CPU gets brought up
post-boot, the variable may need to be cleared again. Instead you
ought to call mark_tsc_unstable().

Jan
>  
>  	xen_setup_runstate_info(cpu);
>  	xen_setup_timer(cpu);
> -- 
> 1.7.2.5

David Vrabel

2012-Apr-16 14:59 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On 16/04/12 12:32, Jan Beulich wrote:>>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> The sched clock was considered stable based on the capabilities of the
>> underlying hardware.  This does not make sense for Xen PV guests as:
>> a) the hardware TSC is not used directly as the clock source; and b)
>> guests may migrate to hosts with different hardware capabilities.
>>
>> It is not clear to me whether the Xen clock source is supposed to be
>> stable and whether it should be stable across migration.  For a clock
>> source to be stable it must be: a) monotonic; c) synchronized across
>> CPUs; and c) constant rate.
Tim, Thomas, can you comment on the above paragraph?  Is it correct?
>> There have also been reports of systems with apparently unstable
>> clocks where clearing sched_clock_stable has fixed problems with
>> migrated VMs hanging.
>>
>> So, always set the sched clock as unstable when using the Xen clock
>> source.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  arch/x86/xen/time.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
>> index 0296a95..8469b5a 100644
>> --- a/arch/x86/xen/time.c
>> +++ b/arch/x86/xen/time.c
>> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>>  	do_settimeofday(&tp);
>>  
>>  	setup_force_cpu_cap(X86_FEATURE_TSC);
>> +	sched_clock_stable = 0;
> 
> This, unfortunately, is not sufficient afaict: If a CPU gets brought up
> post-boot, the variable may need to be cleared again. Instead you
> ought to call mark_tsc_unstable().
Yeah, mark_tsc_unstable() is the right thing to do.

David

Tim Deegan

2012-Apr-16 15:16 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

At 15:59 +0100 on 16 Apr (1334591984), David Vrabel
wrote:> On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of
the
> >> underlying hardware.  This does not make sense for Xen PV guests
as:
> >> a) the hardware TSC is not used directly as the clock source; and
b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to
be
> >> stable and whether it should be stable across migration.  For a
clock
> >> source to be stable it must be: a) monotonic; c) synchronized
across
> >> CPUs; and c) constant rate.
> 
> Tim, Thomas, can you comment on the above paragraph?  Is it correct?
How synchronized does it need to be?  The adjustment of the rate happens
independently on different CPUs so you might be able to see small
differences if once CPU has made the adjustment but another hasn''t yet.

That aside, on platforms where the real TSC is stable, AFAIK the xen PV
time will be too, _except_ across migration.  I have no idea whether Xen
exports sufficient information to the guest for it to be able to tell
whether this is the case, though -- it needs to know not only that the
hardware is stable, but thet _Xen_ thinks it''s stable.

Across migration, the PV time might not be monotonic (because it''s
always
the wallclock time on the current host, not a VM-specific attribute).
Which is not to say that the linux-side code couldn''t make it
monotonic,
at least for small differences between hosts. 

Tim.

Konrad Rzeszutek Wilk

2012-Apr-16 15:17 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On Mon, Apr 16, 2012 at 03:59:44PM +0100, David Vrabel
wrote:> On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of
the
> >> underlying hardware.  This does not make sense for Xen PV guests
as:
In regards to PV dom0 is this still the case? Asking b/c your
patch will make dom0 be in the same category.
> >> a) the hardware TSC is not used directly as the clock source; and
b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to
be
> >> stable and whether it should be stable across migration.  For a
clock
I thought it was dependent on XEN_DOMCTL_settscinfo:
 - whether it gets emulated, or the guest can do rdtsc or pvrdtsc?

Which I think is determined by some ''timer=X'' option in the
guest config?

> >> source to be stable it must be: a) monotonic; c) synchronized
across
> >> CPUs; and c) constant rate.
> 
> Tim, Thomas, can you comment on the above paragraph?  Is it correct?
> 
> >> There have also been reports of systems with apparently unstable
> >> clocks where clearing sched_clock_stable has fixed problems with
> >> migrated VMs hanging.
> >>
> >> So, always set the sched clock as unstable when using the Xen
clock
> >> source.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  arch/x86/xen/time.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> >> index 0296a95..8469b5a 100644
> >> --- a/arch/x86/xen/time.c
> >> +++ b/arch/x86/xen/time.c
> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >>  	do_settimeofday(&tp);
> >>  
> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> +	sched_clock_stable = 0;
> > 
> > This, unfortunately, is not sufficient afaict: If a CPU gets brought
up
> > post-boot, the variable may need to be cleared again. Instead you
> > ought to call mark_tsc_unstable().
> 
> Yeah, mark_tsc_unstable() is the right thing to do.
> 
> David

Dan Magenheimer

2012-Apr-16 16:05 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

(Apologies for missing the original post... our Oracle mail server
has gone bonkers again... classifying nearly all (but not all) xen-devel
email as spam.  This problem started when xen.org moved to a different
ISP last year, was supposedly fixed by Oracle IT, and has just
started being a problem again. Argh!)
 > On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of
the
> >> underlying hardware.  This does not make sense for Xen PV guests
as:
> >> a) the hardware TSC is not used directly as the clock source; and
b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to
be
> >> stable and whether it should be stable across migration.  For a
clock
> >> source to be stable it must be: a) monotonic; c) synchronized
across
> >> CPUs; and c) constant rate.
> 
> Tim, Thomas, can you comment on the above paragraph?  Is it correct?
(Sigh... I keep seeing clock-related things, wish I had more time
to spend on them, cursing, and going back to other things.  But,
I need to comment further here...)

Hmmm... I spent a great deal of time on TSC support in the hypervisor
2-3 years ago.  I worked primarily on PV, but Intel supposedly was tracking
everything on HVM as well.  There''s most likely a bug or two still
lurking
but, for all guests, with the default tsc_mode, TSC is provided by Xen
as an absolutely stable clock source.  If Xen determines that the underlying
hardware declares that TSC is stable, guest rdtsc instructions are not trapped.
If it is not, Xen emulates all guest rdtsc instructions.  After a migration or
save/restore, TSC is always emulated.  The result is (ignoring possible
bugs) that TSC as provided by Xen is a) monotonic; b) synchronized across
CPUs; and c) constant rate.  Even across migration/save/restore.

This should be true for Xen 4.0+ (but not for pre-Xen-4.0).

Please see docs/misc/tscmode.txt in the xen tree.  Though
it may appear at first to be targeted at a different audience,
all the relevant info is in there if you read it all the way through.

(If you have any questions or disagreements on that doc, please start
a new thread and cc me directly since my list access is unreliable.)
 > >> There have also been reports of systems with apparently unstable
> >> clocks where clearing sched_clock_stable has fixed problems with
> >> migrated VMs hanging.
> >>
> >> So, always set the sched clock as unstable when using the Xen
clock
> >> source.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  arch/x86/xen/time.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> >> index 0296a95..8469b5a 100644
> >> --- a/arch/x86/xen/time.c
> >> +++ b/arch/x86/xen/time.c
> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >>  	do_settimeofday(&tp);
> >>
> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> +	sched_clock_stable = 0;
> >
> > This, unfortunately, is not sufficient afaict: If a CPU gets brought
up
> > post-boot, the variable may need to be cleared again. Instead you
> > ought to call mark_tsc_unstable().
> 
> Yeah, mark_tsc_unstable() is the right thing to do.
NACK!

No, no, no.  The exact opposite is true.  Like VMware, TSC is
stable.  The issue is that Linux trusts other clock hardware more
completely than TSC so whenever there is a problem with another
clocksource, Linux blames TSC and marks TSC unstable.  But TSC
on Xen 4.0+ is innocent.  In fact, TSC is a better clocksource
choice than clocksource=xen (aka pvclock) because pvclock
indirectly depends on TSC.

For upstream kernels, the answer is to set clocksource=tsc
and tsc=reliable, like VMware enforces. See:

https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html 

In fact, it might be wise for a Xen-savvy kernel to check to see
if it is running on Xen-4.0+ and, if so, force clocksource=tsc
and tsc=reliable.

There have been very odd rare problems reported in Xen time
handling for a very long time.  These usually manifest as some
kind of "TSC is not stable" message from a guest Linux kernel,
but the symptoms always point away from TSC as the culprit.
Forcing Xen-savvy guests to use TSC will either make these problems
go away (if they haven''t already been fixed) or allow us to find
the obscure underlying hypervisor bugs rather than paper over them.

Thanks,
Dan

P.S.  For anyone new to this areas, see VMware''s classic document:
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf

P.P.S. note this recent kernel issue which is related, but
likely not seen in Xen... it pre-requires cpu overcommitment
at boot time when TSC is being calibrated by the kernel.

https://lkml.org/lkml/2012/2/21/518

Jan Beulich

2012-Apr-16 16:14 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

>>> On 16.04.12 at 18:05, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>>  From: David Vrabel [mailto:david.vrabel@citrix.com]
>> On 16/04/12 12:32, Jan Beulich wrote:
>> >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
>> >> --- a/arch/x86/xen/time.c
>> >> +++ b/arch/x86/xen/time.c
>> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>> >>  	do_settimeofday(&tp);
>> >>
>> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
>> >> +	sched_clock_stable = 0;
>> >
>> > This, unfortunately, is not sufficient afaict: If a CPU gets
brought up
>> > post-boot, the variable may need to be cleared again. Instead you
>> > ought to call mark_tsc_unstable().
>> 
>> Yeah, mark_tsc_unstable() is the right thing to do.
> 
> NACK!
> 
> No, no, no.  The exact opposite is true.  Like VMware, TSC is
> stable.  The issue is that Linux trusts other clock hardware more
> completely than TSC so whenever there is a problem with another
> clocksource, Linux blames TSC and marks TSC unstable.  But TSC
> on Xen 4.0+ is innocent.  In fact, TSC is a better clocksource
> choice than clocksource=xen (aka pvclock) because pvclock
> indirectly depends on TSC.
> 
> For upstream kernels, the answer is to set clocksource=tsc
> and tsc=reliable, like VMware enforces. See:
> 
> https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html 
> 
> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.
Are you possibly mixing up PV and HVM cases? sched_clock_stable
getting set _is_ a problem in PV kernels - we had bug reports long
ago when this first appeared in arch/x86/kernel/cpu/intel.c. I''m
suspecting this because there''s not supposed to be (and in non-
pv-ops there is no; in pv-ops I assume it simply has no effect)
clocksource=tsc in PV kernels.

Jan

Dan Magenheimer

2012-Apr-16 16:20 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Konrad Rzeszutek Wilk
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> On Mon, Apr 16, 2012 at 03:59:44PM +0100, David Vrabel wrote:
> > On 16/04/12 12:32, Jan Beulich wrote:
> > >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
> > >> From: David Vrabel <david.vrabel@citrix.com>
> > >>
> > >> The sched clock was considered stable based on the
capabilities of the
> > >> underlying hardware.  This does not make sense for Xen PV
guests as:
> 
> In regards to PV dom0 is this still the case? Asking b/c your
> patch will make dom0 be in the same category.
> 
> > >> a) the hardware TSC is not used directly as the clock source;
and b)
> > >> guests may migrate to hosts with different hardware
capabilities.
> > >>
> > >> It is not clear to me whether the Xen clock source is
supposed to be
> > >> stable and whether it should be stable across migration.  For
a clock
Xen is -- and has always been -- responsible for emulating sufficient
clock hardware devices and presenting them to the guest AND ensuring
that this emulation works properly across migration/save/restore (which
is required because these transitions may be completely transparent
to the guest).

Prior to Xen 4.0, TSC was not considered to be a clocksource worthy
of emulating and was passed through to PV guests unchanged (and not
fully handled for HVM either IIRC).  At Xen 4.0+, it is handled properly.
 > I thought it was dependent on XEN_DOMCTL_settscinfo:
>  - whether it gets emulated, or the guest can do rdtsc or pvrdtsc?
>
> Which I think is determined by some ''timer=X'' option in
the guest config?
I think you may be thinking of tsc_mode.  See docs/misc/tscmode.txt
in the Xen source.  The default should work correctly.

David Vrabel

2012-Apr-16 16:26 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On 16/04/12 17:05, Dan Magenheimer wrote:>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Fair enough,
> [A stable clock] should be true for Xen 4.0+ (but not for pre-Xen-4.0).
The original customer problem is on a host with Xen 3.4.  What do you
recommend for Linux guests running such hosts?
> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.
So, should the xen clocksource do:

if Xen 4.0+
    clock is stable, use rdtsc only.
else
    clock is unstable, use existing pvclock implementation.

David

Tim Deegan

2012-Apr-16 17:08 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer
wrote:> Hmmm... I spent a great deal of time on TSC support in the hypervisor
> 2-3 years ago.  I worked primarily on PV, but Intel supposedly was tracking
> everything on HVM as well.  There''s most likely a bug or two still
lurking
> but, for all guests, with the default tsc_mode, TSC is provided by Xen
> as an absolutely stable clock source.  If Xen determines that the
underlying
> hardware declares that TSC is stable, guest rdtsc instructions are not
trapped.
> If it is not, Xen emulates all guest rdtsc instructions.  After a migration
or
> save/restore, TSC is always emulated.  The result is (ignoring possible
> bugs) that TSC as provided by Xen is a) monotonic; b) synchronized across
> CPUs; and c) constant rate.  Even across migration/save/restore.
AIUI, this thread is about the PV-time clock source, not about the TSC
itself.  Even if the TSC is emulated (or in some other way made
"stable") the PV wallclock is not necessarily stable across migration.
But since migration is controlled by the kernel, presumably the kernel
can DTRT about it.
> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.
That seems like overdoing it.  Certainly it''s not OK unless it can also
check that Xen is providing a stable TSC (i.e. that tscmode==1).

In the case where the PV clock has been selected, can it not be marked
unstable without also marking the TSC unstable?

Tim.

Dan Magenheimer

2012-Apr-16 17:22 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> >>> On 16.04.12 at 18:05, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> >>  From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> On 16/04/12 12:32, Jan Beulich wrote:
> >> >>>> On 13.04.12 at 20:20, David Vrabel
<david.vrabel@citrix.com> wrote:
> >> >> --- a/arch/x86/xen/time.c
> >> >> +++ b/arch/x86/xen/time.c
> >> >> @@ -473,6 +473,7 @@ static void __init
xen_time_init(void)
> >> >>  	do_settimeofday(&tp);
> >> >>
> >> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> >> +	sched_clock_stable = 0;
> >> >
> >> > This, unfortunately, is not sufficient afaict: If a CPU gets
brought up
> >> > post-boot, the variable may need to be cleared again. Instead
you
> >> > ought to call mark_tsc_unstable().
> >>
> >> Yeah, mark_tsc_unstable() is the right thing to do.
> >
> > NACK!
> >
> > No, no, no.  The exact opposite is true.  Like VMware, TSC is
> > stable.  The issue is that Linux trusts other clock hardware more
> > completely than TSC so whenever there is a problem with another
> > clocksource, Linux blames TSC and marks TSC unstable.  But TSC
> > on Xen 4.0+ is innocent.  In fact, TSC is a better clocksource
> > choice than clocksource=xen (aka pvclock) because pvclock
> > indirectly depends on TSC.
> >
> > For upstream kernels, the answer is to set clocksource=tsc
> > and tsc=reliable, like VMware enforces. See:
> >
> > https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html
> >
> > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
> 
> Are you possibly mixing up PV and HVM cases? sched_clock_stable
> getting set _is_ a problem in PV kernels - we had bug reports long
> ago when this first appeared in arch/x86/kernel/cpu/intel.c. I''m
> suspecting this because there''s not supposed to be (and in non-
> pv-ops there is no; in pv-ops I assume it simply has no effect)
> clocksource=tsc in PV kernels.
Hi Jan --

In upstream (and recent pv-ops) kernels, is there any need for there
to be a difference between HVM and PV in the clocksource chosen?  The
pvclock algorithm was necessary for PV when non-TSC hardware clocks
were privileged and the only non-privileged hardware clock (TSC)
was badly broken in hardware and for migration/save/restore.
With TSC now working and stable, and now that we are making changes
in the upstream kernel that work for both PV and HVM, is it
time to drop pvclock (at least as the default for PV)?

Certainly if an old (non-pv-ops) kernel is broken, something like
David''s patch might be an acceptable workaround.  I''m just
arguing
against perpetuating pvclock-as-the-only-xen-clock upstream.

Does that make sense?
Dan

Dan Magenheimer

2012-Apr-16 17:30 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> On 16/04/12 17:05, Dan Magenheimer wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock
as unstable
> >
> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
> 
> Fair enough,
> 
> > [A stable clock] should be true for Xen 4.0+ (but not for
pre-Xen-4.0).
> 
> The original customer problem is on a host with Xen 3.4.  What do you
> recommend for Linux guests running such hosts?
For pre-Xen-4.0 and an unchanged PV guest, I don''t know.  If you can
back-patch the guest kernel with a workaround such as your patch, great!
I''m only arguing against the patch getting perpetuated upstream.
 > > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
> 
> So, should the xen clocksource do:
> 
> if Xen 4.0+
>     clock is stable, use rdtsc only.
> else
>     clock is unstable, use existing pvclock implementation.
Yes, that''s what I propose.  To clarify:

if the guest can and does determine it is running on Xen 4.0+
	TSC is guaranteed by Xen to be stable, use clocksource=tsc tsc=reliable
else
	Xen only guarantees that pvclock is stable, use pvclock

Dan Magenheimer

2012-Apr-16 17:52 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> > Hmmm... I spent a great deal of time on TSC support in the hypervisor
> > 2-3 years ago.  I worked primarily on PV, but Intel supposedly was
tracking
> > everything on HVM as well.  There''s most likely a bug or two
still lurking
> > but, for all guests, with the default tsc_mode, TSC is provided by Xen
> > as an absolutely stable clock source.  If Xen determines that the
underlying
> > hardware declares that TSC is stable, guest rdtsc instructions are not
trapped.
> > If it is not, Xen emulates all guest rdtsc instructions.  After a
migration or
> > save/restore, TSC is always emulated.  The result is (ignoring
possible
> > bugs) that TSC as provided by Xen is a) monotonic; b) synchronized
across
> > CPUs; and c) constant rate.  Even across migration/save/restore.
> 
> AIUI, this thread is about the PV-time clock source, not about the TSC
> itself.  Even if the TSC is emulated (or in some other way made
> "stable") the PV wallclock is not necessarily stable across
migration.
> But since migration is controlled by the kernel, presumably the kernel
> can DTRT about it.
Under what circumstances is PV wallclock not stable across migration?
> > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
> 
> That seems like overdoing it.  Certainly it''s not OK unless it can
also
> check that Xen is providing a stable TSC (i.e. that tscmode==1).
Xen guarantees a stable TSC for the default (tsc_mode==0) also.

If the vm.cfg file explicitly sets a guest tsc_mode==2, you are correct
that pvclock is still necessary.  But as the documentation says:
tsc_mode==2 should be set if "it is certain that all apps running in this
VM are TSC-resilient and highest performance is required".  In
the case we are talking about, the PV guest kernel itself isn''t TSC-
resilient!

In any case, IIRC, there is a pvcpuid instruction to determine the
tsc_mode, so when the upstream kernel checks for Xen 4.0+, it could
also check to ensure the tsc_mode wasn''t overridden and set to 2.
If it is set to 2, TSC should not be an available clocksource,
as the guest kernel would break on migration/save/restore.
> In the case where the PV clock has been selected, can it not be marked
> unstable without also marking the TSC unstable?
I''m not sure I understand...

Are you talking about the HVM case of an upstream kernel, maybe
when the clocksource is manually overridden on the kernel command
line or after boot with sysfs?

If pvclock is necessary (e.g. old Xen), how would it be
marked unstable? (I didn''t know there was code to do that.)

Dan

Tim Deegan

2012-Apr-16 18:17 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

At 10:52 -0700 on 16 Apr (1334573568), Dan Magenheimer
wrote:> > From: Tim Deegan [mailto:tim@xen.org]
> > Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> > 
> > At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> > > Hmmm... I spent a great deal of time on TSC support in the
hypervisor
> > > 2-3 years ago.  I worked primarily on PV, but Intel supposedly
was tracking
> > > everything on HVM as well.  There''s most likely a bug or
two still lurking
> > > but, for all guests, with the default tsc_mode, TSC is provided
by Xen
> > > as an absolutely stable clock source.  If Xen determines that the
underlying
> > > hardware declares that TSC is stable, guest rdtsc instructions
are not trapped.
> > > If it is not, Xen emulates all guest rdtsc instructions.  After a
migration or
> > > save/restore, TSC is always emulated.  The result is (ignoring
possible
> > > bugs) that TSC as provided by Xen is a) monotonic; b)
synchronized across
> > > CPUs; and c) constant rate.  Even across migration/save/restore.
> > 
> > AIUI, this thread is about the PV-time clock source, not about the TSC
> > itself.  Even if the TSC is emulated (or in some other way made
> > "stable") the PV wallclock is not necessarily stable across
migration.
> > But since migration is controlled by the kernel, presumably the kernel
> > can DTRT about it.
> 
> Under what circumstances is PV wallclock not stable across migration?
The wallclock is host-local, so I don''t think it can be guaranteed to
be
strictly monotonic across migration.  But as I said that''s OK because
the Xen code in the kernel is in control during migration.
> > > In fact, it might be wise for a Xen-savvy kernel to check to see
> > > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > > and tsc=reliable.
> > 
> > That seems like overdoing it.  Certainly it''s not OK unless
it can also
> > check that Xen is providing a stable TSC (i.e. that tscmode==1).
> 
> Xen guarantees a stable TSC for the default (tsc_mode==0) also.
> 
> If the vm.cfg file explicitly sets a guest tsc_mode==2, you are correct
> that pvclock is still necessary.  But as the documentation says:
> tsc_mode==2 should be set if "it is certain that all apps running in
this
> VM are TSC-resilient and highest performance is required".  In
> the case we are talking about, the PV guest kernel itself isn''t
TSC-
> resilient!
Only if we deliberately break it! :) 
> In any case, IIRC, there is a pvcpuid instruction to determine the
> tsc_mode, so when the upstream kernel checks for Xen 4.0+, it could
> also check to ensure the tsc_mode wasn''t overridden and set to 2.
Yes, that''s what I was suggesting.
> > In the case where the PV clock has been selected, can it not be marked
> > unstable without also marking the TSC unstable?
> 
> I''m not sure I understand...
> 
> Are you talking about the HVM case of an upstream kernel, maybe
> when the clocksource is manually overridden on the kernel command
> line or after boot with sysfs?
I''m talking about any case where the clocksource == xen.  
> If pvclock is necessary (e.g. old Xen), how would it be
> marked unstable? (I didn''t know there was code to do that.)
I think I''m confused by terminology.  Maybe David can correct me.  My
understanding was that there is some concept inside linux of a time
source being ''stable'', which requires it to be synchronized,
monotonic
and constant-rate.  The PV clock is two of those things (within a
reasonable tolerance) but may not be monotonic over migration.  I was
suggesting that, however linux deals with that, it can probably do it
without changing its opinion of whether the TSC is stable.

If the PV clocksource works, and works in more configurations than TSC,
I don''t see much advantage of deprecating it in favour of TSC.  But I
don''t have any huge objection to it either, I guess, as long as it only
happens when it''s safe.

And on older Xens, or for tsc_mode==2, the kernel probably ought to mark
the TSC as unstable, because it is.

Cheers,

Tim.

Sheng Yang

2012-Apr-16 23:01 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

On Mon, Apr 16, 2012 at 11:17 AM, Tim Deegan <tim@xen.org>
wrote:> At 10:52 -0700 on 16 Apr (1334573568), Dan Magenheimer wrote:
>> > From: Tim Deegan [mailto:tim@xen.org]
>> > Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock
as unstable
>> >
>> > At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
>> > > Hmmm... I spent a great deal of time on TSC support in the
hypervisor
>> > > 2-3 years ago.  I worked primarily on PV, but Intel
supposedly was tracking
>> > > everything on HVM as well.  There''s most likely a
bug or two still lurking
>> > > but, for all guests, with the default tsc_mode, TSC is
provided by Xen
>> > > as an absolutely stable clock source.  If Xen determines that
the underlying
>> > > hardware declares that TSC is stable, guest rdtsc
instructions are not trapped.
>> > > If it is not, Xen emulates all guest rdtsc instructions.
 After a migration or
>> > > save/restore, TSC is always emulated.  The result is
(ignoring possible
>> > > bugs) that TSC as provided by Xen is a) monotonic; b)
synchronized across
>> > > CPUs; and c) constant rate.  Even across
migration/save/restore.
>> >
>> > AIUI, this thread is about the PV-time clock source, not about the
TSC
>> > itself.  Even if the TSC is emulated (or in some other way made
>> > "stable") the PV wallclock is not necessarily stable
across migration.
>> > But since migration is controlled by the kernel, presumably the
kernel
>> > can DTRT about it.
>>
>> Under what circumstances is PV wallclock not stable across migration?
>
> The wallclock is host-local, so I don''t think it can be guaranteed
to be
> strictly monotonic across migration.  But as I said that''s OK
because
> the Xen code in the kernel is in control during migration.
>
>> > > In fact, it might be wise for a Xen-savvy kernel to check to
see
>> > > if it is running on Xen-4.0+ and, if so, force
clocksource=tsc
>> > > and tsc=reliable.
>> >
>> > That seems like overdoing it.  Certainly it''s not OK
unless it can also
>> > check that Xen is providing a stable TSC (i.e. that tscmode==1).
>>
>> Xen guarantees a stable TSC for the default (tsc_mode==0) also.
>>
>> If the vm.cfg file explicitly sets a guest tsc_mode==2, you are correct
>> that pvclock is still necessary.  But as the documentation says:
>> tsc_mode==2 should be set if "it is certain that all apps running
in this
>> VM are TSC-resilient and highest performance is required".  In
>> the case we are talking about, the PV guest kernel itself
isn''t TSC-
>> resilient!
>
> Only if we deliberately break it! :)
>
>> In any case, IIRC, there is a pvcpuid instruction to determine the
>> tsc_mode, so when the upstream kernel checks for Xen 4.0+, it could
>> also check to ensure the tsc_mode wasn''t overridden and set to
2.
>
> Yes, that''s what I was suggesting.
>
>> > In the case where the PV clock has been selected, can it not be
marked
>> > unstable without also marking the TSC unstable?
>>
>> I''m not sure I understand...
>>
>> Are you talking about the HVM case of an upstream kernel, maybe
>> when the clocksource is manually overridden on the kernel command
>> line or after boot with sysfs?
>
> I''m talking about any case where the clocksource == xen.
>
>> If pvclock is necessary (e.g. old Xen), how would it be
>> marked unstable? (I didn''t know there was code to do that.)
>
> I think I''m confused by terminology.  Maybe David can correct me.
 My
> understanding was that there is some concept inside linux of a time
> source being ''stable'', which requires it to be
synchronized, monotonic
> and constant-rate.  The PV clock is two of those things (within a
> reasonable tolerance) but may not be monotonic over migration.  I was
> suggesting that, however linux deals with that, it can probably do it
> without changing its opinion of whether the TSC is stable.
In fact the sched_clock_stable is only regarding one Intel processor
feature named "Invarient TSC"(a.k.a Non-stop TSC).

I''ve reported the original issue to xen-devel, and purpose one patch
to fix CPUID filter in the libxc of Xen.

I think mask CPUID bit in the hypervisor is better than make this
change in the kernel, since Xen controlled what to present to the
guest, it doesn''t make sense if we present a feature to the guest, and
hack the kernel to disable this feature at the same time.

I haven''t dug much into the code, but here is the background(most
copied from my xen-devel post):

Recently we got some reports of migration hang on latest
Debian(2.6.32-41kernel package) kernel with some certain machines(but
it''s hard to debug on them since they''re customer''s
machine).

Booting dmesg snippet below:

[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 3.4.2 (preserve-AD)
[    0.000000] NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1
nr_node_ids:1
[    0.000000] PERCPU: Embedded 15 pages/cpu @c1608000 s37656 r0
d23784 u65536
[    0.000000] pcpu-alloc: s37656 r0 d23784 u65536 alloc=16*4096
[    0.000000] pcpu-alloc: [0] 0
[508119.807590] trying to map vcpu_info 0 at c1609010, mfn 992cac,
offset 16
[508119.807593] cpu 0 using vcpu_info at c1609010
[508119.807594] Xen: using vcpu_info placement
[508119.807598] Built 1 zonelists in Zone order, mobility grouping on.
 Total pages: 32416

Dmesg show that when booting, timestamp of printk jumped from 0 to a
big number([508119.807590] in this case) immediately.

And when migrating:

[509508.914333] suspending xenstore...
[516212.055921] trying to map vcpu_info 0 at c1609010, mfn 895fd7,
offset 16
[516212.055930] cpu 0 using vcpu_info at c1609010

Timestamp jumped again. We can reproduce above issues on our Sandy
Bridge machines.

After this, call trace and guest hang *maybe* observed on some machines:

[516383.019499] INFO: task xenwatch:12 blocked for more than 120
seconds.
[516383.019566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[516383.019578] xenwatch      D c1610e20     0    12      2 0x00000000
[516383.019591]  c781eec0 00000246 c1610e58 c1610e20 c781f300 c1441e20
c1441e20 001cf000
[516383.019605]  c781f07c c1610e20 00000000 00000001 c1441e20 c62e01c0
c1610e20 c62e01c0
[516383.019617]  c127e18e c781f07c c7830020 c7830020 c1441e20 c1441e20
c127f2f1 c781f080
[516383.019629] Call Trace:
[516383.019640]  [<c127e18e>] ? schedule+0x78f/0x7dc
[516383.019645]  [<c127f2f1>] ? _spin_unlock_irqrestore+0xd/0xf
[516383.019649]  [<c127e4a1>] ? schedule_timeout+0x20/0xb0
[516383.019656]  [<c100573c>] ? xen_force_evtchn_callback+0xc/0x10
[516383.019660]  [<c127e3aa>] ? wait_for_common+0xa4/0x100
[516383.019665]  [<c1033315>] ? default_wake_function+0x0/0x8
[516383.019671]  [<c104a144>] ? kthread_stop+0x4f/0x8e
[516383.019675]  [<c1047883>] ? cleanup_workqueue_thread+0x3a/0x45
[516383.019679]  [<c1047903>] ? destroy_workqueue+0x56/0x85
[516383.019684]  [<c106a395>] ? stop_machine_destroy+0x23/0x37
[516383.019690]  [<c11962d8>] ? shutdown_handler+0x200/0x22f
[516383.019694]  [<c1197439>] ? xenwatch_thread+0xdc/0x103
[516383.019698]  [<c104a322>] ? autoremove_wake_function+0x0/0x2d
[516383.019701]  [<c119735d>] ? xenwatch_thread+0x0/0x103
[516383.019705]  [<c104a0f0>] ? kthread+0x61/0x66
[516383.019709]  [<c104a08f>] ? kthread+0x0/0x66
[516383.019714]  [<c1008d87>] ? kernel_thread_helper+0x7/0x10

But I _cannot_ reproduce the call trace and hang on our Sandy Bridge.

So I think there are maybe *two* bugs in this issue, one caused time
jump(detail below), the other in the kernel triggered by the first bug
sometime, thus result in migration fail.

I''ve spent some time to identify the timestamp jump issue, and finally
found it''s due to Invarient TSC (CPUID Leaf 0x80000007 EDX:8, also
called non-stop TSC). The present of the feature would enable a
parameter in the kernel named: sched_clock_stable. Seems this
parameter is unable to work with Xen''s pvclock. If
sched_clock_stable() is set, value returned by xen_clocksource_read()
would be returned as sched_clock_cpu() directly(rather than calculated
through sched_clock_local()), but CMIIW the value returned by
xen_clocksource_read() is based on host(vcpu) uptime rather than this
VM''s uptime, then result in the timestamp jump.

I''ve compiled a kernel, force sched_clock_stable=0, then it solved the
timestamp jump issue as expected. Luckily, seems it also solved the
call trace and guest hang issue as well.

I''ve posted a patch to mask the CPUID leaf 0x80000007 in Xen. I think
the issue can be easily reproduced using a Westmere or SandyBridge
machine(my old colleagues at Intel said the feature likely existed
after Nehalem) running newer version of PV guest, check the guest
cpuinfo you would see nonstop_tsc, and you would notice the abnormal
timestamp of printk.

-- 
regards
Yang, Sheng
>
> If the PV clocksource works, and works in more configurations than TSC,
> I don''t see much advantage of deprecating it in favour of TSC.
 But I
> don''t have any huge objection to it either, I guess, as long as it
only
> happens when it''s safe.
>
> And on older Xens, or for tsc_mode==2, the kernel probably ought to mark
> the TSC as unstable, because it is.
>
> Cheers,
>
> Tim.

Dan Magenheimer

2012-Apr-17 00:29 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Sheng Yang [mailto:sheng@yasker.org]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
Hi Sheng --

See reply at the very end...
> On Mon, Apr 16, 2012 at 11:17 AM, Tim Deegan <tim@xen.org> wrote:
> > At 10:52 -0700 on 16 Apr (1334573568), Dan Magenheimer wrote:
> >> > From: Tim Deegan [mailto:tim@xen.org]
> >> > Subject: Re: [Xen-devel] [PATCH] xen: always set the sched
clock as unstable
> >> >
> >> > At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> >> > > Hmmm... I spent a great deal of time on TSC support in
the hypervisor
> >> > > 2-3 years ago.  I worked primarily on PV, but Intel
supposedly was tracking
> >> > > everything on HVM as well.  There''s most likely
a bug or two still lurking
> >> > > but, for all guests, with the default tsc_mode, TSC is
provided by Xen
> >> > > as an absolutely stable clock source.  If Xen determines
that the underlying
> >> > > hardware declares that TSC is stable, guest rdtsc
instructions are not trapped.
> >> > > If it is not, Xen emulates all guest rdtsc instructions.
 After a migration or
> >> > > save/restore, TSC is always emulated.  The result is
(ignoring possible
> >> > > bugs) that TSC as provided by Xen is a) monotonic; b)
synchronized across
> >> > > CPUs; and c) constant rate.  Even across
migration/save/restore.
> >> >
> >> > AIUI, this thread is about the PV-time clock source, not
about the TSC
> >> > itself.  Even if the TSC is emulated (or in some other way
made
> >> > "stable") the PV wallclock is not necessarily
stable across migration.
> >> > But since migration is controlled by the kernel, presumably
the kernel
> >> > can DTRT about it.
> >>
> >> Under what circumstances is PV wallclock not stable across
migration?
> >
> > The wallclock is host-local, so I don''t think it can be
guaranteed to be
> > strictly monotonic across migration.  But as I said that''s OK
because
> > the Xen code in the kernel is in control during migration.
> >
> >> > > In fact, it might be wise for a Xen-savvy kernel to
check to see
> >> > > if it is running on Xen-4.0+ and, if so, force
clocksource=tsc
> >> > > and tsc=reliable.
> >> >
> >> > That seems like overdoing it.  Certainly it''s not OK
unless it can also
> >> > check that Xen is providing a stable TSC (i.e. that
tscmode==1).
> >>
> >> Xen guarantees a stable TSC for the default (tsc_mode==0) also.
> >>
> >> If the vm.cfg file explicitly sets a guest tsc_mode==2, you are
correct
> >> that pvclock is still necessary.  But as the documentation says:
> >> tsc_mode==2 should be set if "it is certain that all apps
running in this
> >> VM are TSC-resilient and highest performance is required".
 In
> >> the case we are talking about, the PV guest kernel itself
isn''t TSC-
> >> resilient!
> >
> > Only if we deliberately break it! :)
> >
> >> In any case, IIRC, there is a pvcpuid instruction to determine the
> >> tsc_mode, so when the upstream kernel checks for Xen 4.0+, it
could
> >> also check to ensure the tsc_mode wasn''t overridden and
set to 2.
> >
> > Yes, that''s what I was suggesting.
> >
> >> > In the case where the PV clock has been selected, can it not
be marked
> >> > unstable without also marking the TSC unstable?
> >>
> >> I''m not sure I understand...
> >>
> >> Are you talking about the HVM case of an upstream kernel, maybe
> >> when the clocksource is manually overridden on the kernel command
> >> line or after boot with sysfs?
> >
> > I''m talking about any case where the clocksource == xen.
> >
> >> If pvclock is necessary (e.g. old Xen), how would it be
> >> marked unstable? (I didn''t know there was code to do
that.)
> >
> > I think I''m confused by terminology.  Maybe David can correct
me.  My
> > understanding was that there is some concept inside linux of a time
> > source being ''stable'', which requires it to be
synchronized, monotonic
> > and constant-rate.  The PV clock is two of those things (within a
> > reasonable tolerance) but may not be monotonic over migration.  I was
> > suggesting that, however linux deals with that, it can probably do it
> > without changing its opinion of whether the TSC is stable.
> 
> In fact the sched_clock_stable is only regarding one Intel processor
> feature named "Invarient TSC"(a.k.a Non-stop TSC).
> 
> I''ve reported the original issue to xen-devel, and purpose one
patch
> to fix CPUID filter in the libxc of Xen.
> 
> I think mask CPUID bit in the hypervisor is better than make this
> change in the kernel, since Xen controlled what to present to the
> guest, it doesn''t make sense if we present a feature to the guest,
and
> hack the kernel to disable this feature at the same time.
> 
> I haven''t dug much into the code, but here is the background(most
> copied from my xen-devel post):
> 
> Recently we got some reports of migration hang on latest
> Debian(2.6.32-41kernel package) kernel with some certain machines(but
> it''s hard to debug on them since they''re
customer''s machine).
> 
> Booting dmesg snippet below:
> 
> [    0.000000] Booting paravirtualized kernel on Xen
> [    0.000000] Xen version: 3.4.2 (preserve-AD)
> [    0.000000] NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1
> nr_node_ids:1
> [    0.000000] PERCPU: Embedded 15 pages/cpu @c1608000 s37656 r0
> d23784 u65536
> [    0.000000] pcpu-alloc: s37656 r0 d23784 u65536 alloc=16*4096
> [    0.000000] pcpu-alloc: [0] 0
> [508119.807590] trying to map vcpu_info 0 at c1609010, mfn 992cac,
> offset 16
> [508119.807593] cpu 0 using vcpu_info at c1609010
> [508119.807594] Xen: using vcpu_info placement
> [508119.807598] Built 1 zonelists in Zone order, mobility grouping on.
>  Total pages: 32416
> 
> Dmesg show that when booting, timestamp of printk jumped from 0 to a
> big number([508119.807590] in this case) immediately.
> 
> And when migrating:
> 
> [509508.914333] suspending xenstore...
> [516212.055921] trying to map vcpu_info 0 at c1609010, mfn 895fd7,
> offset 16
> [516212.055930] cpu 0 using vcpu_info at c1609010
> 
> Timestamp jumped again. We can reproduce above issues on our Sandy
> Bridge machines.
> 
> After this, call trace and guest hang *maybe* observed on some machines:
> 
> [516383.019499] INFO: task xenwatch:12 blocked for more than 120
> seconds.
> [516383.019566] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [516383.019578] xenwatch      D c1610e20     0    12      2 0x00000000
> [516383.019591]  c781eec0 00000246 c1610e58 c1610e20 c781f300 c1441e20
> c1441e20 001cf000
> [516383.019605]  c781f07c c1610e20 00000000 00000001 c1441e20 c62e01c0
> c1610e20 c62e01c0
> [516383.019617]  c127e18e c781f07c c7830020 c7830020 c1441e20 c1441e20
> c127f2f1 c781f080
> [516383.019629] Call Trace:
> [516383.019640]  [<c127e18e>] ? schedule+0x78f/0x7dc
> [516383.019645]  [<c127f2f1>] ? _spin_unlock_irqrestore+0xd/0xf
> [516383.019649]  [<c127e4a1>] ? schedule_timeout+0x20/0xb0
> [516383.019656]  [<c100573c>] ? xen_force_evtchn_callback+0xc/0x10
> [516383.019660]  [<c127e3aa>] ? wait_for_common+0xa4/0x100
> [516383.019665]  [<c1033315>] ? default_wake_function+0x0/0x8
> [516383.019671]  [<c104a144>] ? kthread_stop+0x4f/0x8e
> [516383.019675]  [<c1047883>] ? cleanup_workqueue_thread+0x3a/0x45
> [516383.019679]  [<c1047903>] ? destroy_workqueue+0x56/0x85
> [516383.019684]  [<c106a395>] ? stop_machine_destroy+0x23/0x37
> [516383.019690]  [<c11962d8>] ? shutdown_handler+0x200/0x22f
> [516383.019694]  [<c1197439>] ? xenwatch_thread+0xdc/0x103
> [516383.019698]  [<c104a322>] ? autoremove_wake_function+0x0/0x2d
> [516383.019701]  [<c119735d>] ? xenwatch_thread+0x0/0x103
> [516383.019705]  [<c104a0f0>] ? kthread+0x61/0x66
> [516383.019709]  [<c104a08f>] ? kthread+0x0/0x66
> [516383.019714]  [<c1008d87>] ? kernel_thread_helper+0x7/0x10
> 
> But I _cannot_ reproduce the call trace and hang on our Sandy Bridge.
> 
> So I think there are maybe *two* bugs in this issue, one caused time
> jump(detail below), the other in the kernel triggered by the first bug
> sometime, thus result in migration fail.
> 
> I''ve spent some time to identify the timestamp jump issue, and
finally
> found it''s due to Invarient TSC (CPUID Leaf 0x80000007 EDX:8, also
> called non-stop TSC). The present of the feature would enable a
> parameter in the kernel named: sched_clock_stable. Seems this
> parameter is unable to work with Xen''s pvclock. If
> sched_clock_stable() is set, value returned by xen_clocksource_read()
> would be returned as sched_clock_cpu() directly(rather than calculated
> through sched_clock_local()), but CMIIW the value returned by
> xen_clocksource_read() is based on host(vcpu) uptime rather than this
> VM''s uptime, then result in the timestamp jump.
> 
> I''ve compiled a kernel, force sched_clock_stable=0, then it solved
the
> timestamp jump issue as expected. Luckily, seems it also solved the
> call trace and guest hang issue as well.
> 
> I''ve posted a patch to mask the CPUID leaf 0x80000007 in Xen. I
think
> the issue can be easily reproduced using a Westmere or SandyBridge
> machine(my old colleagues at Intel said the feature likely existed
> after Nehalem) running newer version of PV guest, check the guest
> cpuinfo you would see nonstop_tsc, and you would notice the abnormal
> timestamp of printk.
Yes definitely.  I thought that I implemented this properly for
PV but I think maybe it never got implemented for HVM?  See the section
titled "TSC INVARIANT BIT and NO_MIGRATE" in docs/misc/tscmode.txt in
the Xen source.

However, if "clocksource=tsc tsc=reliable" is selected for a HVM
domain, I think the results may be the same as if Invariant TSC
bit is checked by the Linux kernel?  So maybe the code for
readjusting the TSC to adjust to migration was also never implemented
in HVM, just in PV?  (I remember discussing this problem with Jun Nakajima
on an Oracle/Intel call a couple of years ago.  Maybe it was
discussed but never implemented... at the time, I was primarily
concerned with and tested only for PV as that was Oracle''s
customer at the time.)

Anyway, please force "clocksource=tsc tsc=reliable" on your HVM
guest to see if it fails the same way as when the guest "sees"
the Invariant TSC bit is set.

Thanks,
Dan

P.S. The Invariant TSC bit *did* exist on Nehalem, however there
definitely exists old firmware that did not properly align the
TSCs across all cores on boot, so the bit was present but "lied".
Maybe you are seeing the problems on a Nehalem system with broken
firmware?  I know some Sun x86 systems shipped with broken
firmware, so it is very likely other system vendors did also.

Jan Beulich

2012-Apr-17 07:27 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

>>> On 16.04.12 at 19:22, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> In upstream (and recent pv-ops) kernels, is there any need for there
> to be a difference between HVM and PV in the clocksource chosen?  The
Yes, because RDTSC interception for PV guests is slow (using #GP
and requiring instruction decode).
> pvclock algorithm was necessary for PV when non-TSC hardware clocks
> were privileged and the only non-privileged hardware clock (TSC)
> was badly broken in hardware and for migration/save/restore.
> With TSC now working and stable, and now that we are making changes
> in the upstream kernel that work for both PV and HVM, is it
> time to drop pvclock (at least as the default for PV)?
> 
> Certainly if an old (non-pv-ops) kernel is broken, something like
> David''s patch might be an acceptable workaround.  I''m
just arguing
> against perpetuating pvclock-as-the-only-xen-clock upstream.
Afaict, the only uniformly reliable clocksource for PV guests is the
virtual one which pvclock builds upon. Raw TSC is definitely not an
option on NUMA systems (and PV guests aren''t aware of the
NUMAness of the underlying system).

Jan

Jan Beulich

2012-Apr-17 07:47 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

>>> On 16.04.12 at 19:30, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>>  From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
>> 
>> On 16/04/12 17:05, Dan Magenheimer wrote:
>> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched
clock as unstable
>> >
>> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
>> 
>> Fair enough,
>> 
>> > [A stable clock] should be true for Xen 4.0+ (but not for
pre-Xen-4.0).
>> 
>> The original customer problem is on a host with Xen 3.4.  What do you
>> recommend for Linux guests running such hosts?
> 
> For pre-Xen-4.0 and an unchanged PV guest, I don''t know.  If you
can
> back-patch the guest kernel with a workaround such as your patch, great!
> I''m only arguing against the patch getting perpetuated upstream.
>  
>> > In fact, it might be wise for a Xen-savvy kernel to check to see
>> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
>> > and tsc=reliable.
>> 
>> So, should the xen clocksource do:
>> 
>> if Xen 4.0+
>>     clock is stable, use rdtsc only.
>> else
>>     clock is unstable, use existing pvclock implementation.
> 
> Yes, that''s what I propose.  To clarify:
> 
> if the guest can and does determine it is running on Xen 4.0+
_and_ TSC reads are emulated (which I don''t think they are by
default).

Jan
> 	TSC is guaranteed by Xen to be stable, use clocksource=tsc tsc=reliable
> else
> 	Xen only guarantees that pvclock is stable, use pvclock

Tim Deegan

2012-Apr-17 08:19 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

At 16:01 -0700 on 16 Apr (1334592096), Sheng Yang wrote:> So I think there are maybe *two* bugs in this issue, one caused time
> jump(detail below), the other in the kernel triggered by the first bug
> sometime, thus result in migration fail.
> 
> I''ve spent some time to identify the timestamp jump issue, and
finally
> found it''s due to Invarient TSC (CPUID Leaf 0x80000007 EDX:8, also
> called non-stop TSC). The present of the feature would enable a
> parameter in the kernel named: sched_clock_stable. Seems this
> parameter is unable to work with Xen''s pvclock. If
> sched_clock_stable() is set, value returned by xen_clocksource_read()
> would be returned as sched_clock_cpu() directly(rather than calculated
> through sched_clock_local()), but CMIIW the value returned by
> xen_clocksource_read() is based on host(vcpu) uptime rather than this
> VM''s uptime, then result in the timestamp jump.
OK - that seems like a kernel bug.  Linux should not be modifying how it
treats the PV clocksource based on the ''Invariant TSC'' bit.
(Conversely, the patch to pretend the TSC is not invariant just because
the PV clocksource is present also seems wrong, and the earlier patch
that just enforces sched_clock_stable=0 would be better.)
> I''ve compiled a kernel, force sched_clock_stable=0, then it solved
the
> timestamp jump issue as expected. Luckily, seems it also solved the
> call trace and guest hang issue as well.
> 
> I''ve posted a patch to mask the CPUID leaf 0x80000007 in Xen.
Well, as Dan says, if Xen is emulating RDTSC to provide a
''stable'' TSC,
we shouldn''t _also_ tell the guest that it''s not stable. :)  

OTOH, grepping for CONSTANT_TSC, NONSTOP_TSC, and TSC_RELIABLE, I don''t
see anywhere even in xen-unstable where these bits are ever hidden from
the guest.  I think it would be reasonable to mask this from PV guests
at least for tsc_mode == 2, and on older Xens.

Tim.

Dan Magenheimer

2012-Apr-17 15:36 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> >>> On 16.04.12 at 19:22, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> > In upstream (and recent pv-ops) kernels, is there any need for there
> > to be a difference between HVM and PV in the clocksource chosen?  The
> 
> Yes, because RDTSC interception for PV guests is slow (using #GP
> and requiring instruction decode).
"Slow" is relative.  I showed (somewhere on xen-devel years ago) that
the emulation performance hit is much smaller than the original developers
expected and is detectable only with certain applications that
execute rdtsc ~100K/second.  Furthermore, the cycle count of an rdtsc
has gone up on modern systems, so the cost ratio of emulating
rdtsc vs executing the raw instruction is going down.
 > > pvclock algorithm was necessary for PV when non-TSC hardware clocks
> > were privileged and the only non-privileged hardware clock (TSC)
> > was badly broken in hardware and for migration/save/restore.
> > With TSC now working and stable, and now that we are making changes
> > in the upstream kernel that work for both PV and HVM, is it
> > time to drop pvclock (at least as the default for PV)?
> >
> > Certainly if an old (non-pv-ops) kernel is broken, something like
> > David''s patch might be an acceptable workaround. 
I''m just arguing
> > against perpetuating pvclock-as-the-only-xen-clock upstream.
> 
> Afaict, the only uniformly reliable clocksource for PV guests is the
> virtual one which pvclock builds upon. Raw TSC is definitely not an
> option on NUMA systems (and PV guests aren''t aware of the
> NUMAness of the underlying system).
You''ll have to define NUMA.  On "old" NUMA systems, where
there are
multiple motherboards, your statement is true.  On newer systems
where NUMA simply means there are multiple memory controllers and
all of them are cache-coherent, even when there are multiple
"motherboards" joined by HT or QPI, processor and system vendors
take great pains to ensure that the clock signal (and thus TSC) is
synchronized and "stable" across all cpus.

But I agree there ARE exceptions... for those, I proposed a Xen boot
option that said "don''t trust TSC even if all the evidence implies
that you can", but Keir shot it down (also years ago).

Dan

Dan Magenheimer

2012-Apr-17 15:42 UTC

head link

Re: [PATCH] xen: always set the sched clock as unstable

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as
unstable
> 
> >>> On 16.04.12 at 19:30, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
> >>  From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock
as unstable
> >>
> >> On 16/04/12 17:05, Dan Magenheimer wrote:
> >> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the
sched clock as unstable
> >> >
> >> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
> >>
> >> Fair enough,
> >>
> >> > [A stable clock] should be true for Xen 4.0+ (but not for
pre-Xen-4.0).
> >>
> >> The original customer problem is on a host with Xen 3.4.  What do
you
> >> recommend for Linux guests running such hosts?
> >
> > For pre-Xen-4.0 and an unchanged PV guest, I don''t know.  If
you can
> > back-patch the guest kernel with a workaround such as your patch,
great!
> > I''m only arguing against the patch getting perpetuated
upstream.
> >
> >> > In fact, it might be wise for a Xen-savvy kernel to check to
see
> >> > if it is running on Xen-4.0+ and, if so, force
clocksource=tsc
> >> > and tsc=reliable.
> >>
> >> So, should the xen clocksource do:
> >>
> >> if Xen 4.0+
> >>     clock is stable, use rdtsc only.
> >> else
> >>     clock is unstable, use existing pvclock implementation.
> >
> > Yes, that''s what I propose.  To clarify:
> >
> > if the guest can and does determine it is running on Xen 4.0+
> 
> _and_ TSC reads are emulated (which I don''t think they are by
> default
They are emulated by default on any machine where Xen has
determined that TSC is untrustworthy AND always after migration.
So by definition (if not always in fact, see previous email),
and ignoring Xen bugs, Xen 4.0+ guarantees to guests that TSC
is a stable clock across all vcpus.

Xen devel - Apr 2012 - [PATCH] xen: always set the sched clock as unstable

[PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable

Re: [PATCH] xen: always set the sched clock as unstable