From: David Vrabel <david.vrabel@citrix.com>

The sched clock was considered stable based on the capabilities of the
underlying hardware. This does not make sense for Xen PV guests as:
a) the hardware TSC is not used directly as the clock source; and b)
guests may migrate to hosts with different hardware capabilities.

It is not clear to me whether the Xen clock source is supposed to be
stable and whether it should be stable across migration. For a clock
source to be stable it must be: a) monotonic; b) synchronized across
CPUs; and c) constant rate.

There have also been reports of systems with apparently unstable
clocks where clearing sched_clock_stable has fixed problems with
migrated VMs hanging.

So, always set the sched clock as unstable when using the Xen clock
source.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/xen/time.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 0296a95..8469b5a 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -473,6 +473,7 @@ static void __init xen_time_init(void)
 	do_settimeofday(&tp);
 
 	setup_force_cpu_cap(X86_FEATURE_TSC);
+	sched_clock_stable = 0;
 
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
-- 
1.7.2.5
On Fri, Apr 13, 2012 at 11:20 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> The sched clock was considered stable based on the capabilities of the
> underlying hardware. This does not make sense for Xen PV guests as:
> a) the hardware TSC is not used directly as the clock source; and b)
> guests may migrate to hosts with different hardware capabilities.
>
> It is not clear to me whether the Xen clock source is supposed to be
> stable and whether it should be stable across migration. For a clock
> source to be stable it must be: a) monotonic; b) synchronized across
> CPUs; and c) constant rate.
>
> There have also been reports of systems with apparently unstable
> clocks where clearing sched_clock_stable has fixed problems with
> migrated VMs hanging.
>
> So, always set the sched clock as unstable when using the Xen clock
> source.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  arch/x86/xen/time.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0296a95..8469b5a 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>  	do_settimeofday(&tp);
>
>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> +	sched_clock_stable = 0;
>
>  	xen_setup_runstate_info(cpu);
>  	xen_setup_timer(cpu);
> --
> 1.7.2.5

(Sorry for the duplicate posts; Gmail is really not an ideal client for
LKML, though Exchange is worse...)

I would really prefer to hide the nonstop TSC feature on the Xen side,
rather than let the Linux kernel end up in a state where "Nonstop TSC
feature exists, but sched_clock_stable=0".

For further context, please refer to the discussion on xen-devel:

http://lists.xen.org/archives/html/xen-devel/2012-04/msg00888.html
http://lists.xen.org/archives/html/xen-devel/2012-04/msg00969.html

-- 
regards
Yang, Sheng
David Vrabel
2012-Apr-13 18:39 UTC
Re: [PATCH] xen: always set the sched clock as unstable
On 13/04/12 19:31, Sheng Yang wrote:
> On Fri, Apr 13, 2012 at 11:20 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> > From: David Vrabel <david.vrabel@citrix.com>
> >
> > The sched clock was considered stable based on the capabilities of the
> > underlying hardware. This does not make sense for Xen PV guests as:
> > a) the hardware TSC is not used directly as the clock source; and b)
> > guests may migrate to hosts with different hardware capabilities.
> >
> > It is not clear to me whether the Xen clock source is supposed to be
> > stable and whether it should be stable across migration. For a clock
> > source to be stable it must be: a) monotonic; b) synchronized across
> > CPUs; and c) constant rate.
> >
> > There have also been reports of systems with apparently unstable
> > clocks where clearing sched_clock_stable has fixed problems with
> > migrated VMs hanging.
> >
> > So, always set the sched clock as unstable when using the Xen clock
> > source.
> >
> > Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> > ---
> >  arch/x86/xen/time.c |    1 +
> >  1 files changed, 1 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 0296a95..8469b5a 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >  	do_settimeofday(&tp);
> >
> >  	setup_force_cpu_cap(X86_FEATURE_TSC);
> > +	sched_clock_stable = 0;
> >
> >  	xen_setup_runstate_info(cpu);
> >  	xen_setup_timer(cpu);
> > --
> > 1.7.2.5
>
> I would really prefer to hide the nonstop TSC feature on the Xen side,
> rather than let the Linux kernel end up in a state where "Nonstop TSC
> feature exists, but sched_clock_stable=0".

I disagree. The decision on whether the clock is stable or not should
be in the Xen clock source code.

David
>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> The sched clock was considered stable based on the capabilities of the
> underlying hardware. This does not make sense for Xen PV guests as:
> a) the hardware TSC is not used directly as the clock source; and b)
> guests may migrate to hosts with different hardware capabilities.
>
> It is not clear to me whether the Xen clock source is supposed to be
> stable and whether it should be stable across migration. For a clock
> source to be stable it must be: a) monotonic; b) synchronized across
> CPUs; and c) constant rate.
>
> There have also been reports of systems with apparently unstable
> clocks where clearing sched_clock_stable has fixed problems with
> migrated VMs hanging.
>
> So, always set the sched clock as unstable when using the Xen clock
> source.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> ---
>  arch/x86/xen/time.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 0296a95..8469b5a 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>  	do_settimeofday(&tp);
>
>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> +	sched_clock_stable = 0;

This, unfortunately, is not sufficient afaict: If a CPU gets brought up
post-boot, the variable may need to be cleared again. Instead you
ought to call mark_tsc_unstable().

Jan

>
>  	xen_setup_runstate_info(cpu);
>  	xen_setup_timer(cpu);
> --
> 1.7.2.5
David Vrabel
2012-Apr-16 14:59 UTC
Re: [PATCH] xen: always set the sched clock as unstable
On 16/04/12 12:32, Jan Beulich wrote:
>>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> The sched clock was considered stable based on the capabilities of the
>> underlying hardware. This does not make sense for Xen PV guests as:
>> a) the hardware TSC is not used directly as the clock source; and b)
>> guests may migrate to hosts with different hardware capabilities.
>>
>> It is not clear to me whether the Xen clock source is supposed to be
>> stable and whether it should be stable across migration. For a clock
>> source to be stable it must be: a) monotonic; b) synchronized across
>> CPUs; and c) constant rate.

Tim, Thomas, can you comment on the above paragraph? Is it correct?

>> There have also been reports of systems with apparently unstable
>> clocks where clearing sched_clock_stable has fixed problems with
>> migrated VMs hanging.
>>
>> So, always set the sched clock as unstable when using the Xen clock
>> source.
>>
>> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
>> ---
>>  arch/x86/xen/time.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
>> index 0296a95..8469b5a 100644
>> --- a/arch/x86/xen/time.c
>> +++ b/arch/x86/xen/time.c
>> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>>  	do_settimeofday(&tp);
>>
>>  	setup_force_cpu_cap(X86_FEATURE_TSC);
>> +	sched_clock_stable = 0;
>
> This, unfortunately, is not sufficient afaict: If a CPU gets brought up
> post-boot, the variable may need to be cleared again. Instead you
> ought to call mark_tsc_unstable().

Yeah, mark_tsc_unstable() is the right thing to do.

David
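Jan's objection can be illustrated with a small userspace model. This is only a sketch, not kernel code: the struct and the model_* functions are hypothetical names, and it assumes (as Jan's comment implies) that CPU bring-up re-asserts the "stable" flag whenever the hardware looks stable, unless a sticky "unstable" marker has been set.

```c
#include <stdbool.h>

/* Simplified userspace model; all names here are hypothetical. */
struct clock_state {
    bool sched_clock_stable; /* models the flag the patch clears */
    bool tsc_unstable;       /* models a sticky "marked unstable" state */
};

/*
 * Models per-CPU feature detection at bring-up (assumption): the flag is
 * set again when the hardware looks stable, unless the sticky marker
 * says otherwise.
 */
static void model_cpu_bringup(struct clock_state *s, bool hw_looks_stable)
{
    if (hw_looks_stable && !s->tsc_unstable)
        s->sched_clock_stable = true;
}

/* Models the approach in the patch: clear the flag once at init. */
static void model_clear_flag_once(struct clock_state *s)
{
    s->sched_clock_stable = false;
}

/* Models the sticky approach: record the decision, then clear the flag. */
static void model_mark_unstable(struct clock_state *s)
{
    s->tsc_unstable = true;
    s->sched_clock_stable = false;
}
```

In this model, clearing the flag once is undone by the next model_cpu_bringup() call on stable-looking hardware, while the sticky marker survives it.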
At 15:59 +0100 on 16 Apr (1334591984), David Vrabel wrote:
> On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of the
> >> underlying hardware. This does not make sense for Xen PV guests as:
> >> a) the hardware TSC is not used directly as the clock source; and b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to be
> >> stable and whether it should be stable across migration. For a clock
> >> source to be stable it must be: a) monotonic; b) synchronized across
> >> CPUs; and c) constant rate.
>
> Tim, Thomas, can you comment on the above paragraph? Is it correct?

How synchronized does it need to be? The adjustment of the rate happens
independently on different CPUs so you might be able to see small
differences if one CPU has made the adjustment but another hasn't yet.

That aside, on platforms where the real TSC is stable, AFAIK the xen PV
time will be too, _except_ across migration. I have no idea whether Xen
exports sufficient information to the guest for it to be able to tell
whether this is the case, though -- it needs to know not only that the
hardware is stable, but that _Xen_ thinks it's stable.

Across migration, the PV time might not be monotonic (because it's
always the wallclock time on the current host, not a VM-specific
attribute). Which is not to say that the linux-side code couldn't make
it monotonic, at least for small differences between hosts.

Tim.
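One way guest-side code could paper over small backward steps, as Tim suggests, is to remember the last value handed out and never return anything smaller. The following is a minimal sketch of that idea in C11; the function and variable names are hypothetical and it is not taken from the Linux or Xen sources.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest timestamp returned so far, shared by all callers. */
static _Atomic uint64_t last_returned_ns;

/*
 * raw_ns is a reading from an underlying clock that may step backwards
 * (for example after moving to another host). The return value never
 * decreases from one call to the next.
 */
static uint64_t monotonic_clamp(uint64_t raw_ns)
{
    uint64_t last = atomic_load(&last_returned_ns);

    while (raw_ns > last) {
        /* Try to publish the newer value; on failure 'last' is reloaded. */
        if (atomic_compare_exchange_weak(&last_returned_ns, &last, raw_ns))
            return raw_ns;
    }
    /* The raw reading is not ahead of what was already returned. */
    return last;
}
```

With this scheme the clock appears to stand still until the underlying source catches up, which is tolerable only for small differences between hosts, matching Tim's caveat.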
Konrad Rzeszutek Wilk
2012-Apr-16 15:17 UTC
Re: [PATCH] xen: always set the sched clock as unstable
On Mon, Apr 16, 2012 at 03:59:44PM +0100, David Vrabel wrote:
> On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of the
> >> underlying hardware. This does not make sense for Xen PV guests as:

In regards to PV dom0 is this still the case? Asking b/c your
patch will make dom0 be in the same category.

> >> a) the hardware TSC is not used directly as the clock source; and b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to be
> >> stable and whether it should be stable across migration. For a clock

I thought it was dependent on XEN_DOMCTL_settscinfo:
 - whether it gets emulated, or the guest can do rdtsc or pvrdtsc?

Which I think is determined by some 'timer=X' option in the guest config?

> >> source to be stable it must be: a) monotonic; b) synchronized across
> >> CPUs; and c) constant rate.
>
> Tim, Thomas, can you comment on the above paragraph? Is it correct?
>
> >> There have also been reports of systems with apparently unstable
> >> clocks where clearing sched_clock_stable has fixed problems with
> >> migrated VMs hanging.
> >>
> >> So, always set the sched clock as unstable when using the Xen clock
> >> source.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  arch/x86/xen/time.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> >> index 0296a95..8469b5a 100644
> >> --- a/arch/x86/xen/time.c
> >> +++ b/arch/x86/xen/time.c
> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >>  	do_settimeofday(&tp);
> >>
> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> +	sched_clock_stable = 0;
> >
> > This, unfortunately, is not sufficient afaict: If a CPU gets brought up
> > post-boot, the variable may need to be cleared again. Instead you
> > ought to call mark_tsc_unstable().
>
> Yeah, mark_tsc_unstable() is the right thing to do.
>
> David
Dan Magenheimer
2012-Apr-16 16:05 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable

Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

(Apologies for missing the original post... our Oracle mail server has
gone bonkers again... classifying nearly all (but not all) xen-devel
email as spam. This problem started when xen.org moved to a different
ISP last year, was supposedly fixed by Oracle IT, and has just started
being a problem again. Argh!)

> On 16/04/12 12:32, Jan Beulich wrote:
> >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> The sched clock was considered stable based on the capabilities of the
> >> underlying hardware. This does not make sense for Xen PV guests as:
> >> a) the hardware TSC is not used directly as the clock source; and b)
> >> guests may migrate to hosts with different hardware capabilities.
> >>
> >> It is not clear to me whether the Xen clock source is supposed to be
> >> stable and whether it should be stable across migration. For a clock
> >> source to be stable it must be: a) monotonic; b) synchronized across
> >> CPUs; and c) constant rate.
>
> Tim, Thomas, can you comment on the above paragraph? Is it correct?

(Sigh... I keep seeing clock-related things, wish I had more time to
spend on them, cursing, and going back to other things. But, I need
to comment further here...)

Hmmm... I spent a great deal of time on TSC support in the hypervisor
2-3 years ago. I worked primarily on PV, but Intel supposedly was
tracking everything on HVM as well. There's most likely a bug or two
still lurking but, for all guests, with the default tsc_mode, TSC is
provided by Xen as an absolutely stable clock source. If Xen determines
that the underlying hardware declares that TSC is stable, guest rdtsc
instructions are not trapped. If it is not, Xen emulates all guest
rdtsc instructions. After a migration or save/restore, TSC is always
emulated. The result is (ignoring possible bugs) that TSC as provided
by Xen is a) monotonic; b) synchronized across CPUs; and c) constant
rate. Even across migration/save/restore.

This should be true for Xen 4.0+ (but not for pre-Xen-4.0).

Please see docs/misc/tscmode.txt in the xen tree. Though it may appear
at first to be targeted at a different audience, all the relevant info
is in there if you read it all the way through. (If you have any
questions or disagreements on that doc, please start a new thread and
cc me directly since my list access is unreliable.)

> >> There have also been reports of systems with apparently unstable
> >> clocks where clearing sched_clock_stable has fixed problems with
> >> migrated VMs hanging.
> >>
> >> So, always set the sched clock as unstable when using the Xen clock
> >> source.
> >>
> >> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
> >> ---
> >>  arch/x86/xen/time.c |    1 +
> >>  1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> >> index 0296a95..8469b5a 100644
> >> --- a/arch/x86/xen/time.c
> >> +++ b/arch/x86/xen/time.c
> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >>  	do_settimeofday(&tp);
> >>
> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> +	sched_clock_stable = 0;
> >
> > This, unfortunately, is not sufficient afaict: If a CPU gets brought up
> > post-boot, the variable may need to be cleared again. Instead you
> > ought to call mark_tsc_unstable().
>
> Yeah, mark_tsc_unstable() is the right thing to do.

NACK!

No, no, no. The exact opposite is true. Like VMware, TSC is
stable. The issue is that Linux trusts other clock hardware more
completely than TSC so whenever there is a problem with another
clocksource, Linux blames TSC and marks TSC unstable. But TSC
on Xen 4.0+ is innocent. In fact, TSC is a better clocksource
choice than clocksource=xen (aka pvclock) because pvclock
indirectly depends on TSC.

For upstream kernels, the answer is to set clocksource=tsc
and tsc=reliable, like VMware enforces. See:

https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html

In fact, it might be wise for a Xen-savvy kernel to check to see
if it is running on Xen-4.0+ and, if so, force clocksource=tsc
and tsc=reliable.

There have been very odd rare problems reported in Xen time handling
for a very long time. These usually manifest as some kind of "TSC is
not stable" message from a guest Linux kernel, but the symptoms always
point away from TSC as the culprit. Forcing Xen-savvy guests to use
TSC will either make these problems go away (if they haven't already
been fixed) or allow us to find the obscure underlying hypervisor bugs
rather than paper over them.

Thanks,
Dan

P.S. For anyone new to this area, see VMware's classic document:
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf

P.P.S. Note this recent kernel issue which is related, but likely
not seen in Xen... it pre-requires cpu overcommitment at boot
time when TSC is being calibrated by the kernel.
https://lkml.org/lkml/2012/2/21/518
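Dan's remark that pvclock "indirectly depends on TSC" can be made concrete with a sketch of the usual scaling step: the hypervisor publishes a reference point plus a fixed-point multiplier, and the guest extrapolates from its own TSC reading. The struct below is a simplified, hypothetical stand-in for the real shared time record (field names are assumptions chosen for readability), so treat it as an illustration rather than the actual ABI.

```c
#include <stdint.h>

/* Hypothetical, simplified per-vCPU time record. */
struct time_info {
    uint64_t tsc_timestamp;  /* TSC value when system_time_ns was sampled */
    uint64_t system_time_ns; /* nanoseconds at that TSC value */
    uint32_t tsc_to_ns_mul;  /* 32.32 fixed-point nanoseconds per tick */
    int8_t   tsc_shift;      /* pre-scaling shift, may be negative */
};

/* Extrapolate the current time in nanoseconds from a TSC reading. */
static uint64_t scaled_time_ns(const struct time_info *ti, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - ti->tsc_timestamp;

    if (ti->tsc_shift >= 0)
        delta <<= ti->tsc_shift;
    else
        delta >>= -ti->tsc_shift;

    /* (delta * mul) >> 32, computed in two halves to avoid overflow. */
    uint64_t hi = (delta >> 32) * ti->tsc_to_ns_mul;
    uint64_t lo = (delta & 0xffffffffu) * ti->tsc_to_ns_mul;

    return ti->system_time_ns + hi + (lo >> 32);
}
```

Because tsc_now feeds directly into the result, any misbehaviour of the TSC between updates of the record shows up in the computed time as well.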
>>> On 16.04.12 at 18:05, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> On 16/04/12 12:32, Jan Beulich wrote:
>> >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
>> >> --- a/arch/x86/xen/time.c
>> >> +++ b/arch/x86/xen/time.c
>> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
>> >>  	do_settimeofday(&tp);
>> >>
>> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
>> >> +	sched_clock_stable = 0;
>> >
>> > This, unfortunately, is not sufficient afaict: If a CPU gets brought up
>> > post-boot, the variable may need to be cleared again. Instead you
>> > ought to call mark_tsc_unstable().
>>
>> Yeah, mark_tsc_unstable() is the right thing to do.
>
> NACK!
>
> No, no, no. The exact opposite is true. Like VMware, TSC is
> stable. The issue is that Linux trusts other clock hardware more
> completely than TSC so whenever there is a problem with another
> clocksource, Linux blames TSC and marks TSC unstable. But TSC
> on Xen 4.0+ is innocent. In fact, TSC is a better clocksource
> choice than clocksource=xen (aka pvclock) because pvclock
> indirectly depends on TSC.
>
> For upstream kernels, the answer is to set clocksource=tsc
> and tsc=reliable, like VMware enforces. See:
>
> https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html
>
> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.

Are you possibly mixing up PV and HVM cases? sched_clock_stable
getting set _is_ a problem in PV kernels - we had bug reports long
ago when this first appeared in arch/x86/kernel/cpu/intel.c.

I'm suspecting this because there's not supposed to be (and in
non-pv-ops there is none; in pv-ops I assume it simply has no effect)
clocksource=tsc in PV kernels.

Jan
Dan Magenheimer
2012-Apr-16 16:20 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Konrad Rzeszutek Wilk
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> On Mon, Apr 16, 2012 at 03:59:44PM +0100, David Vrabel wrote:
> > On 16/04/12 12:32, Jan Beulich wrote:
> > >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> > >> From: David Vrabel <david.vrabel@citrix.com>
> > >>
> > >> The sched clock was considered stable based on the capabilities of the
> > >> underlying hardware. This does not make sense for Xen PV guests as:
>
> In regards to PV dom0 is this still the case? Asking b/c your
> patch will make dom0 be in the same category.
>
> > >> a) the hardware TSC is not used directly as the clock source; and b)
> > >> guests may migrate to hosts with different hardware capabilities.
> > >>
> > >> It is not clear to me whether the Xen clock source is supposed to be
> > >> stable and whether it should be stable across migration. For a clock

Xen is -- and has always been -- responsible for emulating sufficient
clock hardware devices and presenting them to the guest AND ensuring
that this emulation works properly across migration/save/restore
(which is required because these transitions may be completely
transparent to the guest). Prior to Xen 4.0, TSC was not considered
to be a clocksource worthy of emulating and was passed through to PV
guests unchanged (and not fully handled for HVM either IIRC). At
Xen 4.0+, it is handled properly.

> I thought it was dependent on XEN_DOMCTL_settscinfo:
>  - whether it gets emulated, or the guest can do rdtsc or pvrdtsc?
>
> Which I think is determined by some 'timer=X' option in the guest config?

I think you may be thinking of tsc_mode. See docs/misc/tscmode.txt
in the Xen source. The default should work correctly.
David Vrabel
2012-Apr-16 16:26 UTC
Re: [PATCH] xen: always set the sched clock as unstable
On 16/04/12 17:05, Dan Magenheimer wrote:
>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

Fair enough.

> [A stable clock] should be true for Xen 4.0+ (but not for pre-Xen-4.0).

The original customer problem is on a host with Xen 3.4. What do you
recommend for Linux guests running on such hosts?

> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.

So, should the xen clocksource do:

    if Xen 4.0+
        clock is stable, use rdtsc only.
    else
        clock is unstable, use existing pvclock implementation.

David
At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> Hmmm... I spent a great deal of time on TSC support in the hypervisor
> 2-3 years ago. I worked primarily on PV, but Intel supposedly was tracking
> everything on HVM as well. There's most likely a bug or two still lurking
> but, for all guests, with the default tsc_mode, TSC is provided by Xen
> as an absolutely stable clock source. If Xen determines that the underlying
> hardware declares that TSC is stable, guest rdtsc instructions are not trapped.
> If it is not, Xen emulates all guest rdtsc instructions. After a migration or
> save/restore, TSC is always emulated. The result is (ignoring possible
> bugs) that TSC as provided by Xen is a) monotonic; b) synchronized across
> CPUs; and c) constant rate. Even across migration/save/restore.

AIUI, this thread is about the PV-time clock source, not about the TSC
itself. Even if the TSC is emulated (or in some other way made
"stable") the PV wallclock is not necessarily stable across migration.
But since migration is controlled by the kernel, presumably the kernel
can DTRT about it.

> In fact, it might be wise for a Xen-savvy kernel to check to see
> if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> and tsc=reliable.

That seems like overdoing it. Certainly it's not OK unless it can also
check that Xen is providing a stable TSC (i.e. that tscmode==1).

In the case where the PV clock has been selected, can it not be marked
unstable without also marking the TSC unstable?

Tim.
Dan Magenheimer
2012-Apr-16 17:22 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> >>> On 16.04.12 at 18:05, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> On 16/04/12 12:32, Jan Beulich wrote:
> >> >>>> On 13.04.12 at 20:20, David Vrabel <david.vrabel@citrix.com> wrote:
> >> >> --- a/arch/x86/xen/time.c
> >> >> +++ b/arch/x86/xen/time.c
> >> >> @@ -473,6 +473,7 @@ static void __init xen_time_init(void)
> >> >>  	do_settimeofday(&tp);
> >> >>
> >> >>  	setup_force_cpu_cap(X86_FEATURE_TSC);
> >> >> +	sched_clock_stable = 0;
> >> >
> >> > This, unfortunately, is not sufficient afaict: If a CPU gets brought up
> >> > post-boot, the variable may need to be cleared again. Instead you
> >> > ought to call mark_tsc_unstable().
> >>
> >> Yeah, mark_tsc_unstable() is the right thing to do.
> >
> > NACK!
> >
> > No, no, no. The exact opposite is true. Like VMware, TSC is
> > stable. The issue is that Linux trusts other clock hardware more
> > completely than TSC so whenever there is a problem with another
> > clocksource, Linux blames TSC and marks TSC unstable. But TSC
> > on Xen 4.0+ is innocent. In fact, TSC is a better clocksource
> > choice than clocksource=xen (aka pvclock) because pvclock
> > indirectly depends on TSC.
> >
> > For upstream kernels, the answer is to set clocksource=tsc
> > and tsc=reliable, like VMware enforces. See:
> >
> > https://lists.ubuntu.com/archives/kernel-team/2008-October/004283.html
> >
> > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
>
> Are you possibly mixing up PV and HVM cases? sched_clock_stable
> getting set _is_ a problem in PV kernels - we had bug reports long
> ago when this first appeared in arch/x86/kernel/cpu/intel.c.
>
> I'm suspecting this because there's not supposed to be (and in
> non-pv-ops there is no; in pv-ops I assume it simply has no effect)
> clocksource=tsc in PV kernels.

Hi Jan --

In upstream (and recent pv-ops) kernels, is there any need for
there to be a difference between HVM and PV in the clocksource
chosen? The pvclock algorithm was necessary for PV when non-TSC
hardware clocks were privileged and the only non-privileged
hardware clock (TSC) was badly broken in hardware and for
migration/save/restore. With TSC now working and stable, and now
that we are making changes in the upstream kernel that work for
both PV and HVM, is it time to drop pvclock (at least as the
default for PV)?

Certainly if an old (non-pv-ops) kernel is broken, something
like David's patch might be an acceptable workaround. I'm
just arguing against perpetuating pvclock-as-the-only-xen-clock
upstream.

Does that make sense?

Dan
Dan Magenheimer
2012-Apr-16 17:30 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: David Vrabel [mailto:david.vrabel@citrix.com]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> On 16/04/12 17:05, Dan Magenheimer wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
> >
> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
>
> Fair enough,
>
> > [A stable clock] should be true for Xen 4.0+ (but not for pre-Xen-4.0).
>
> The original customer problem is on a host with Xen 3.4. What do you
> recommend for Linux guests running such hosts?

For pre-Xen-4.0 and an unchanged PV guest, I don't know. If you can
back-patch the guest kernel with a workaround such as your patch,
great! I'm only arguing against the patch getting perpetuated
upstream.

> > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
>
> So, should the xen clocksource do:
>
>     if Xen 4.0+
>         clock is stable, use rdtsc only.
>     else
>         clock is unstable, use existing pvclock implementation.

Yes, that's what I propose. To clarify:

    if the guest can and does determine it is running on Xen 4.0+
        TSC is guaranteed by Xen to be stable, use clocksource=tsc tsc=reliable
    else
        Xen only guarantees that pvclock is stable, use pvclock
Dan Magenheimer
2012-Apr-16 17:52 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> > Hmmm... I spent a great deal of time on TSC support in the hypervisor
> > 2-3 years ago. I worked primarily on PV, but Intel supposedly was tracking
> > everything on HVM as well. There's most likely a bug or two still lurking
> > but, for all guests, with the default tsc_mode, TSC is provided by Xen
> > as an absolutely stable clock source. If Xen determines that the underlying
> > hardware declares that TSC is stable, guest rdtsc instructions are not trapped.
> > If it is not, Xen emulates all guest rdtsc instructions. After a migration or
> > save/restore, TSC is always emulated. The result is (ignoring possible
> > bugs) that TSC as provided by Xen is a) monotonic; b) synchronized across
> > CPUs; and c) constant rate. Even across migration/save/restore.
>
> AIUI, this thread is about the PV-time clock source, not about the TSC
> itself. Even if the TSC is emulated (or in some other way made
> "stable") the PV wallclock is not necessarily stable across migration.
> But since migration is controlled by the kernel, presumably the kernel
> can DTRT about it.

Under what circumstances is PV wallclock not stable across migration?

> > In fact, it might be wise for a Xen-savvy kernel to check to see
> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > and tsc=reliable.
>
> That seems like overdoing it. Certainly it's not OK unless it can also
> check that Xen is providing a stable TSC (i.e. that tscmode==1).

Xen guarantees a stable TSC for the default (tsc_mode==0) also.

If the vm.cfg file explicitly sets a guest tsc_mode==2, you are correct
that pvclock is still necessary. But as the documentation says:
tsc_mode==2 should be set if "it is certain that all apps running in
this VM are TSC-resilient and highest performance is required". In
the case we are talking about, the PV guest kernel itself isn't
TSC-resilient!

In any case, IIRC, there is a pvcpuid instruction to determine the
tsc_mode, so when the upstream kernel checks for Xen 4.0+, it could
also check to ensure the tsc_mode wasn't overridden and set to 2.
If it is set to 2, TSC should not be an available clocksource, as
the guest kernel would break on migration/save/restore.

> In the case where the PV clock has been selected, can it not be marked
> unstable without also marking the TSC unstable?

I'm not sure I understand... Are you talking about the HVM case of
an upstream kernel, maybe when the clocksource is manually overridden
on the kernel command line or after boot with sysfs?

If pvclock is necessary (e.g. old Xen), how would it be marked
unstable? (I didn't know there was code to do that.)

Dan
At 10:52 -0700 on 16 Apr (1334573568), Dan Magenheimer wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
> >
> > At 09:05 -0700 on 16 Apr (1334567132), Dan Magenheimer wrote:
> > > Hmmm... I spent a great deal of time on TSC support in the hypervisor
> > > 2-3 years ago. I worked primarily on PV, but Intel supposedly was tracking
> > > everything on HVM as well. There's most likely a bug or two still lurking
> > > but, for all guests, with the default tsc_mode, TSC is provided by Xen
> > > as an absolutely stable clock source. If Xen determines that the underlying
> > > hardware declares that TSC is stable, guest rdtsc instructions are not trapped.
> > > If it is not, Xen emulates all guest rdtsc instructions. After a migration or
> > > save/restore, TSC is always emulated. The result is (ignoring possible
> > > bugs) that TSC as provided by Xen is a) monotonic; b) synchronized across
> > > CPUs; and c) constant rate. Even across migration/save/restore.
> >
> > AIUI, this thread is about the PV-time clock source, not about the TSC
> > itself. Even if the TSC is emulated (or in some other way made
> > "stable") the PV wallclock is not necessarily stable across migration.
> > But since migration is controlled by the kernel, presumably the kernel
> > can DTRT about it.
>
> Under what circumstances is PV wallclock not stable across migration?

The wallclock is host-local, so I don't think it can be guaranteed to be
strictly monotonic across migration. But as I said that's OK because
the Xen code in the kernel is in control during migration.

> > > In fact, it might be wise for a Xen-savvy kernel to check to see
> > > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> > > and tsc=reliable.
> >
> > That seems like overdoing it. Certainly it's not OK unless it can also
> > check that Xen is providing a stable TSC (i.e. that tscmode==1).
>
> Xen guarantees a stable TSC for the default (tsc_mode==0) also.
>
> If the vm.cfg file explicitly sets a guest tsc_mode==2, you are correct
> that pvclock is still necessary. But as the documentation says:
> tsc_mode==2 should be set if "it is certain that all apps running in this
> VM are TSC-resilient and highest performance is required". In
> the case we are talking about, the PV guest kernel itself isn't
> TSC-resilient!

Only if we deliberately break it! :)

> In any case, IIRC, there is a pvcpuid instruction to determine the
> tsc_mode, so when the upstream kernel checks for Xen 4.0+, it could
> also check to ensure the tsc_mode wasn't overridden and set to 2.

Yes, that's what I was suggesting.

> > In the case where the PV clock has been selected, can it not be marked
> > unstable without also marking the TSC unstable?
>
> I'm not sure I understand...
>
> Are you talking about the HVM case of an upstream kernel, maybe
> when the clocksource is manually overridden on the kernel command
> line or after boot with sysfs?

I'm talking about any case where the clocksource == xen.

> If pvclock is necessary (e.g. old Xen), how would it be
> marked unstable? (I didn't know there was code to do that.)

I think I'm confused by terminology. Maybe David can correct me. My
understanding was that there is some concept inside linux of a time
source being 'stable', which requires it to be synchronized, monotonic
and constant-rate. The PV clock is two of those things (within a
reasonable tolerance) but may not be monotonic over migration. I was
suggesting that, however linux deals with that, it can probably do it
without changing its opinion of whether the TSC is stable.

If the PV clocksource works, and works in more configurations than TSC,
I don't see much advantage of deprecating it in favour of TSC. But I
don't have any huge objection to it either, I guess, as long as it only
happens when it's safe.

And on older Xens, or for tsc_mode==2, the kernel probably ought to mark
the TSC as unstable, because it is.

Cheers,

Tim.
On Mon, Apr 16, 2012 at 11:17 AM, Tim Deegan <tim@xen.org> wrote:
> I think I'm confused by terminology. Maybe David can correct me. My
> understanding was that there is some concept inside linux of a time
> source being 'stable', which requires it to be synchronized, monotonic
> and constant-rate. The PV clock is two of those things (within a
> reasonable tolerance) but may not be monotonic over migration. I was
> suggesting that, however linux deals with that, it can probably do it
> without changing its opinion of whether the TSC is stable.

In fact, sched_clock_stable only concerns one Intel processor
feature named "Invariant TSC" (a.k.a. Non-stop TSC).
I've reported the original issue to xen-devel, and proposed a patch
to fix the CPUID filter in the libxc of Xen.

I think masking the CPUID bit in the hypervisor is better than making this
change in the kernel: since Xen controls what is presented to the
guest, it doesn't make sense to present a feature to the guest and
hack the kernel to disable that feature at the same time.

I haven't dug much into the code, but here is the background (mostly
copied from my xen-devel post):

Recently we got some reports of migration hangs with the latest Debian
kernel (2.6.32-41 kernel package) on certain machines (but it's hard to
debug on them since they are customers' machines).

Booting dmesg snippet below:

[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 3.4.2 (preserve-AD)
[ 0.000000] NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 15 pages/cpu @c1608000 s37656 r0 d23784 u65536
[ 0.000000] pcpu-alloc: s37656 r0 d23784 u65536 alloc=16*4096
[ 0.000000] pcpu-alloc: [0] 0
[508119.807590] trying to map vcpu_info 0 at c1609010, mfn 992cac, offset 16
[508119.807593] cpu 0 using vcpu_info at c1609010
[508119.807594] Xen: using vcpu_info placement
[508119.807598] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32416

Dmesg shows that when booting, the printk timestamp jumped from 0 to a
big number ([508119.807590] in this case) immediately.

And when migrating:

[509508.914333] suspending xenstore...
[516212.055921] trying to map vcpu_info 0 at c1609010, mfn 895fd7, offset 16
[516212.055930] cpu 0 using vcpu_info at c1609010

The timestamp jumped again. We can reproduce the above issues on our Sandy
Bridge machines.

After this, a call trace and guest hang *may be* observed on some machines:

[516383.019499] INFO: task xenwatch:12 blocked for more than 120 seconds.
[516383.019566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[516383.019578] xenwatch D c1610e20 0 12 2 0x00000000
[516383.019591] c781eec0 00000246 c1610e58 c1610e20 c781f300 c1441e20 c1441e20 001cf000
[516383.019605] c781f07c c1610e20 00000000 00000001 c1441e20 c62e01c0 c1610e20 c62e01c0
[516383.019617] c127e18e c781f07c c7830020 c7830020 c1441e20 c1441e20 c127f2f1 c781f080
[516383.019629] Call Trace:
[516383.019640] [<c127e18e>] ? schedule+0x78f/0x7dc
[516383.019645] [<c127f2f1>] ? _spin_unlock_irqrestore+0xd/0xf
[516383.019649] [<c127e4a1>] ? schedule_timeout+0x20/0xb0
[516383.019656] [<c100573c>] ? xen_force_evtchn_callback+0xc/0x10
[516383.019660] [<c127e3aa>] ? wait_for_common+0xa4/0x100
[516383.019665] [<c1033315>] ? default_wake_function+0x0/0x8
[516383.019671] [<c104a144>] ? kthread_stop+0x4f/0x8e
[516383.019675] [<c1047883>] ? cleanup_workqueue_thread+0x3a/0x45
[516383.019679] [<c1047903>] ? destroy_workqueue+0x56/0x85
[516383.019684] [<c106a395>] ? stop_machine_destroy+0x23/0x37
[516383.019690] [<c11962d8>] ? shutdown_handler+0x200/0x22f
[516383.019694] [<c1197439>] ? xenwatch_thread+0xdc/0x103
[516383.019698] [<c104a322>] ? autoremove_wake_function+0x0/0x2d
[516383.019701] [<c119735d>] ? xenwatch_thread+0x0/0x103
[516383.019705] [<c104a0f0>] ? kthread+0x61/0x66
[516383.019709] [<c104a08f>] ? kthread+0x0/0x66
[516383.019714] [<c1008d87>] ? kernel_thread_helper+0x7/0x10

But I _cannot_ reproduce the call trace and hang on our Sandy Bridge.

So I think there may be *two* bugs in this issue: one causes the time
jump (details below), and the other is in the kernel, is sometimes
triggered by the first bug, and results in the migration failure.

I've spent some time identifying the timestamp jump issue, and finally
found it's due to Invariant TSC (CPUID Leaf 0x80000007 EDX:8, also
called non-stop TSC). The presence of the feature enables a
parameter in the kernel named sched_clock_stable. It seems this
parameter is unable to work with Xen's pvclock.
If sched_clock_stable is set, the value returned by xen_clocksource_read()
is returned by sched_clock_cpu() directly (rather than calculated
through sched_clock_local()), but CMIIW the value returned by
xen_clocksource_read() is based on host (vcpu) uptime rather than this
VM's uptime, which results in the timestamp jump.

I've compiled a kernel that forces sched_clock_stable=0, and it solved the
timestamp jump issue as expected. Luckily, it seems to have solved the
call trace and guest hang issue as well.

I've posted a patch to mask the CPUID leaf 0x80000007 in Xen. I think
the issue can be easily reproduced using a Westmere or Sandy Bridge
machine (my old colleagues at Intel said the feature likely existed
after Nehalem) running a newer version of a PV guest: check the guest
cpuinfo and you will see nonstop_tsc, and you will notice the abnormal
printk timestamps.

--
regards
Yang, Sheng

> If the PV clocksource works, and works in more configurations than TSC,
> I don't see much advantage of deprecating it in favour of TSC. But I
> don't have any huge objection to it either, I guess, as long as it only
> happens when it's safe.
>
> And on older Xens, or for tsc_mode==2, the kernel probably ought to mark
> the TSC as unstable, because it is.
>
> Cheers,
>
> Tim.
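
The behaviour Sheng describes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's sched_clock code: the struct, the function name toy_sched_clock_cpu() and the 1 ms tick length are all hypothetical, and the clamping rule is a simplification of the idea that the "unstable" path limits how far a reading may advance per tick. With the stable flag set, a raw reading based on host uptime is passed straight through, which is where a jump such as [508119.807590] would come from.

```c
#include <stdint.h>

/* Hypothetical tick length for this toy model: 1 ms, in nanoseconds. */
#define TOY_TICK_NS 1000000ULL

/* Toy per-CPU state (assumption for illustration; not a kernel structure). */
struct toy_sched_clock {
    int stable;          /* plays the role of sched_clock_stable */
    uint64_t tick_raw;   /* raw clocksource value seen at the last tick */
    uint64_t tick_base;  /* guest-local time at the last tick */
    uint64_t clock;      /* last value handed out */
};

static uint64_t toy_max(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* Turn a raw clocksource reading (ns) into a scheduler timestamp (ns). */
uint64_t toy_sched_clock_cpu(struct toy_sched_clock *c, uint64_t raw_now)
{
    uint64_t delta, lo, hi, value;

    /* "Stable" path: trust the raw value completely, no filtering. */
    if (c->stable)
        return raw_now;

    /* "Unstable" path: only the delta since the last tick is used,
     * and the result may not go backwards or advance by more than
     * one tick beyond the last tick's base. */
    delta = raw_now > c->tick_raw ? raw_now - c->tick_raw : 0;
    lo = toy_max(c->clock, c->tick_base);
    hi = toy_max(c->clock, c->tick_base + TOY_TICK_NS);

    value = c->tick_base + delta;
    if (value < lo)
        value = lo;
    if (value > hi)
        value = hi;

    c->clock = value;
    return value;
}
```

In this model a first raw reading of 508119807590000 ns is returned unchanged when stable is 1, but is clamped to one tick when stable is 0, which matches the observation that forcing sched_clock_stable=0 removed the timestamp jump.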
Dan Magenheimer
2012-Apr-17 00:29 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Sheng Yang [mailto:sheng@yasker.org]
> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable

Hi Sheng -- See reply at the very end...
> So I think there are maybe *two* bugs in this issue, one caused time
> jump(detail below), the other in the kernel triggered by the first bug
> sometime, thus result in migration fail.
>
> I've spent some time to identify the timestamp jump issue, and finally
> found it's due to Invariant TSC (CPUID Leaf 0x80000007 EDX:8, also
> called non-stop TSC). The presence of the feature would enable a
> parameter in the kernel named: sched_clock_stable. Seems this
> parameter is unable to work with Xen's pvclock. If
> sched_clock_stable() is set, value returned by xen_clocksource_read()
> would be returned as sched_clock_cpu() directly(rather than calculated
> through sched_clock_local()), but CMIIW the value returned by
> xen_clocksource_read() is based on host(vcpu) uptime rather than this
> VM's uptime, then result in the timestamp jump.
>
> I've compiled a kernel, force sched_clock_stable=0, then it solved the
> timestamp jump issue as expected. Luckily, seems it also solved the
> call trace and guest hang issue as well.
>
> I've posted a patch to mask the CPUID leaf 0x80000007 in Xen. I think
> the issue can be easily reproduced using a Westmere or SandyBridge
> machine(my old colleagues at Intel said the feature likely existed
> after Nehalem) running newer version of PV guest, check the guest
> cpuinfo you would see nonstop_tsc, and you would notice the abnormal
> timestamp of printk.

Yes definitely. I thought that I implemented this properly for
PV but I think maybe it never got implemented for HVM? See the
section titled "TSC INVARIANT BIT and NO_MIGRATE" in
docs/misc/tscmode.txt in the Xen source.

However, if "clocksource=tsc tsc=reliable" is selected for an HVM
domain, I think the results may be the same as if the Invariant TSC
bit is checked by the Linux kernel? So maybe the code for readjusting
the TSC to adjust to migration was also never implemented in HVM,
just in PV?

(I remember discussing this problem with Jun Nakajima on an
Oracle/Intel call a couple of years ago. Maybe it was discussed
but never implemented... at the time, I was primarily concerned
with and tested only for PV as that was Oracle's customer at the time.)

Anyway, please force "clocksource=tsc tsc=reliable" on your HVM
guest to see if it fails the same way as when the guest "sees"
that the Invariant TSC bit is set.

Thanks,
Dan

P.S. The Invariant TSC bit *did* exist on Nehalem, however there
definitely exists old firmware that did not properly align the TSCs
across all cores on boot, so the bit was present but "lied". Maybe
you are seeing the problems on a Nehalem system with broken firmware?
I know some Sun x86 systems shipped with broken firmware, so it
is very likely other system vendors did also.
>>> On 16.04.12 at 19:22, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> In upstream (and recent pv-ops) kernels, is there any need for there
> to be a difference between HVM and PV in the clocksource chosen? The

Yes, because RDTSC interception for PV guests is slow (using #GP
and requiring instruction decode).

> pvclock algorithm was necessary for PV when non-TSC hardware clocks
> were privileged and the only non-privileged hardware clock (TSC)
> was badly broken in hardware and for migration/save/restore.
> With TSC now working and stable, and now that we are making changes
> in the upstream kernel that work for both PV and HVM, is it
> time to drop pvclock (at least as the default for PV)?
>
> Certainly if an old (non-pv-ops) kernel is broken, something like
> David's patch might be an acceptable workaround. I'm just arguing
> against perpetuating pvclock-as-the-only-xen-clock upstream.

Afaict, the only uniformly reliable clocksource for PV guests is the
virtual one which pvclock builds upon. Raw TSC is definitely not an
option on NUMA systems (and PV guests aren't aware of the
NUMAness of the underlying system).

Jan
>>> On 16.04.12 at 19:30, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>>
>> On 16/04/12 17:05, Dan Magenheimer wrote:
>> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
>> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>> >
>> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
>>
>> Fair enough,
>>
>> > [A stable clock] should be true for Xen 4.0+ (but not for pre-Xen-4.0).
>>
>> The original customer problem is on a host with Xen 3.4. What do you
>> recommend for Linux guests running such hosts?
>
> For pre-Xen-4.0 and an unchanged PV guest, I don't know. If you can
> back-patch the guest kernel with a workaround such as your patch, great!
> I'm only arguing against the patch getting perpetuated upstream.
>
>> > In fact, it might be wise for a Xen-savvy kernel to check to see
>> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
>> > and tsc=reliable.
>>
>> So, should the xen clocksource do:
>>
>> if Xen 4.0+
>>     clock is stable, use rdtsc only.
>> else
>>     clock is unstable, use existing pvclock implementation.
>
> Yes, that's what I propose. To clarify:
>
> if the guest can and does determine it is running on Xen 4.0+

_and_ TSC reads are emulated (which I don't think they are by
default).

Jan

>     TSC is guaranteed by Xen to be stable, use clocksource=tsc tsc=reliable
> else
>     Xen only guarantees that pvclock is stable, use pvclock
At 16:01 -0700 on 16 Apr (1334592096), Sheng Yang wrote:
> So I think there are maybe *two* bugs in this issue, one caused time
> jump(detail below), the other in the kernel triggered by the first bug
> sometime, thus result in migration fail.
>
> I've spent some time to identify the timestamp jump issue, and finally
> found it's due to Invariant TSC (CPUID Leaf 0x80000007 EDX:8, also
> called non-stop TSC). The presence of the feature would enable a
> parameter in the kernel named: sched_clock_stable. Seems this
> parameter is unable to work with Xen's pvclock. If
> sched_clock_stable() is set, value returned by xen_clocksource_read()
> would be returned as sched_clock_cpu() directly(rather than calculated
> through sched_clock_local()), but CMIIW the value returned by
> xen_clocksource_read() is based on host(vcpu) uptime rather than this
> VM's uptime, then result in the timestamp jump.

OK - that seems like a kernel bug. Linux should not be modifying how it
treats the PV clocksource based on the 'Invariant TSC' bit.
(Conversely, the patch to pretend the TSC is not invariant just because
the PV clocksource is present also seems wrong, and the earlier patch
that just enforces sched_clock_stable=0 would be better.)

> I've compiled a kernel, force sched_clock_stable=0, then it solved the
> timestamp jump issue as expected. Luckily, seems it also solved the
> call trace and guest hang issue as well.
>
> I've posted a patch to mask the CPUID leaf 0x80000007 in Xen.

Well, as Dan says, if Xen is emulating RDTSC to provide a 'stable' TSC,
we shouldn't _also_ tell the guest that it's not stable. :)

OTOH, grepping for CONSTANT_TSC, NONSTOP_TSC, and TSC_RELIABLE, I don't
see anywhere even in xen-unstable where these bits are ever hidden from
the guest. I think it would be reasonable to mask this from PV guests
at least for tsc_mode == 2, and on older Xens.

Tim.
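
As a rough sketch of the masking Tim suggests (hide the Invariant TSC bit from PV guests for tsc_mode == 2 and on older Xens), the bit manipulation could look like the following. The leaf and bit position (0x80000007, EDX bit 8) come from Sheng's report; the helper names, the choice of "older" as "major version below 4", and the way the policy is expressed are assumptions for illustration only and are not taken from libxc or the hypervisor.

```c
#include <stdint.h>

/* CPUID leaf 0x80000007, EDX bit 8: Invariant TSC (as cited in the thread). */
#define INVARIANT_TSC_LEAF     0x80000007u
#define INVARIANT_TSC_EDX_BIT  (1u << 8)

/* Hypothetical helper: does this EDX value advertise Invariant TSC? */
int edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & INVARIANT_TSC_EDX_BIT) != 0;
}

/*
 * Hypothetical filter for the EDX value of leaf 0x80000007 shown to a
 * PV guest. Assumption: "older Xens" means a major version below 4.
 * The bit is cleared when Xen cannot promise a stable TSC to the guest.
 */
uint32_t filter_invariant_tsc_for_pv(uint32_t edx, int xen_major, int tsc_mode)
{
    if (xen_major < 4 || tsc_mode == 2)
        edx &= ~INVARIANT_TSC_EDX_BIT;
    return edx;
}
```

For example, an EDX value of 0x100 would be passed through unchanged on Xen 4.x with tsc_mode 0, and returned as 0 on Xen 3.4 or with tsc_mode 2, so the guest kernel would no longer see nonstop_tsc in those cases.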
Dan Magenheimer
2012-Apr-17 15:36 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> >>> On 16.04.12 at 19:22, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > In upstream (and recent pv-ops) kernels, is there any need for there
> > to be a difference between HVM and PV in the clocksource chosen? The
>
> Yes, because RDTSC interception for PV guests is slow (using #GP
> and requiring instruction decode).

"Slow" is relative. I showed (somewhere on xen-devel years ago)
that the emulation performance hit is much smaller than the
original developers expected and is detectable only with
certain applications that execute rdtsc ~100K/second. Furthermore,
the cycle count of an rdtsc has gone up on modern systems,
so the cost ratio of emulating rdtsc vs executing the raw
instruction is going down.

> > pvclock algorithm was necessary for PV when non-TSC hardware clocks
> > were privileged and the only non-privileged hardware clock (TSC)
> > was badly broken in hardware and for migration/save/restore.
> > With TSC now working and stable, and now that we are making changes
> > in the upstream kernel that work for both PV and HVM, is it
> > time to drop pvclock (at least as the default for PV)?
> >
> > Certainly if an old (non-pv-ops) kernel is broken, something like
> > David's patch might be an acceptable workaround. I'm just arguing
> > against perpetuating pvclock-as-the-only-xen-clock upstream.
>
> Afaict, the only uniformly reliable clocksource for PV guests is the
> virtual one which pvclock builds upon. Raw TSC is definitely not an
> option on NUMA systems (and PV guests aren't aware of the
> NUMAness of the underlying system).

You'll have to define NUMA. On "old" NUMA systems, where there
are multiple motherboards, your statement is true. On newer
systems where NUMA simply means there are multiple memory
controllers and all of them are cache-coherent, even when
there are multiple "motherboards" joined by HT or QPI,
processor and system vendors take great pains to ensure
that the clock signal (and thus TSC) is synchronized and
"stable" across all cpus.

But I agree there ARE exceptions... for those, I proposed a Xen
boot option that said "don't trust TSC even if all the evidence
implies that you can", but Keir shot it down (also years ago).

Dan
Dan Magenheimer
2012-Apr-17 15:42 UTC
Re: [PATCH] xen: always set the sched clock as unstable
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
>
> >>> On 16.04.12 at 19:30, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
> >>
> >> On 16/04/12 17:05, Dan Magenheimer wrote:
> >> >> From: David Vrabel [mailto:david.vrabel@citrix.com]
> >> >> Subject: Re: [Xen-devel] [PATCH] xen: always set the sched clock as unstable
> >> >
> >> > Nacked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
> >>
> >> Fair enough,
> >>
> >> > [A stable clock] should be true for Xen 4.0+ (but not for pre-Xen-4.0).
> >>
> >> The original customer problem is on a host with Xen 3.4. What do you
> >> recommend for Linux guests running such hosts?
> >
> > For pre-Xen-4.0 and an unchanged PV guest, I don't know. If you can
> > back-patch the guest kernel with a workaround such as your patch, great!
> > I'm only arguing against the patch getting perpetuated upstream.
> >
> >> > In fact, it might be wise for a Xen-savvy kernel to check to see
> >> > if it is running on Xen-4.0+ and, if so, force clocksource=tsc
> >> > and tsc=reliable.
> >>
> >> So, should the xen clocksource do:
> >>
> >> if Xen 4.0+
> >>     clock is stable, use rdtsc only.
> >> else
> >>     clock is unstable, use existing pvclock implementation.
> >
> > Yes, that's what I propose. To clarify:
> >
> > if the guest can and does determine it is running on Xen 4.0+
>
> _and_ TSC reads are emulated (which I don't think they are by
> default

They are emulated by default on any machine where Xen has
determined that TSC is untrustworthy AND always after migration.
So by definition (if not always in fact, see previous email),
and ignoring Xen bugs, Xen 4.0+ guarantees to guests that TSC
is a stable clock across all vcpus.
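
The selection rule proposed in this sub-thread (Xen 4.0+ with a tsc_mode other than 2 means TSC can be trusted, otherwise fall back to pvclock) can be written down as a small decision function. This is a minimal sketch of the proposal only, not code from the kernel or from any patch in this thread; the enum, the function name and the idea of passing the Xen major version and tsc_mode in as plain integers are hypothetical.

```c
/* Hypothetical result type for the guest's clocksource decision. */
enum guest_clocksource {
    GUEST_CLOCKSOURCE_PVCLOCK,  /* keep the existing Xen pvclock */
    GUEST_CLOCKSOURCE_TSC       /* clocksource=tsc tsc=reliable */
};

/*
 * Sketch of the rule discussed above:
 *  - pre-4.0 Xen gives no TSC stability guarantee, so use pvclock;
 *  - tsc_mode == 2 means the TSC is not kept stable across
 *    migration/save/restore, so use pvclock;
 *  - otherwise (e.g. the default tsc_mode 0, or tsc_mode 1) Xen is
 *    said to provide a stable TSC, so use it directly.
 */
enum guest_clocksource choose_guest_clocksource(int xen_major, int tsc_mode)
{
    if (xen_major >= 4 && tsc_mode != 2)
        return GUEST_CLOCKSOURCE_TSC;
    return GUEST_CLOCKSOURCE_PVCLOCK;
}
```

Under these assumptions a guest on Xen 3.4 (the original customer case) would always stay on pvclock, while a guest on Xen 4.x would switch to TSC unless the administrator had set tsc_mode to 2. Jan's objection about NUMA hosts and unemulated TSC reads is not modelled here.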