Jed Smith
2010-Jul-14 19:24 UTC
[Xen-devel] [PATCH] x86: unconditionally mark TSC unstable under Xen
Jeremy, Jan - what do you think? Is this a bad move? I feel like there is a consequence to this that I am unaware of, but it fixes my issue. -- x86: unconditionally mark TSC unstable under Xen In certain domU environments (new Intel), time will rewind and jump around by seconds or more, leading to inaccurate measurements kernel-wide. This patch unilaterally marks the TSC unstable under Xen, which prevents timing from jumping around on these processors without significant penalty in all domU environments. An example of this is the following testcase which runs many shell processes, each of which does a DNS lookup with dig(1): http://gist.github.com/449825 When run on a Debian testing x86_64 system without this change it gives: $ (cd /tmp;git clone git://gist.github.com/449825.git;cd 449825;git pull;time perl many-digs.pl) [...] real 0m7.063s user 268659840m0.951s sys 38524003m13.072s And with it: real 0m6.468s user 0m2.851s sys 0m6.789s The issue isn''t particular to that bit of code, it''ll also crop when running the Git test suite in parallel, or in the TIME+ top(1) reports. Which will eventually end up displaying times like "-596523h-14:-8" for most long-lived processes on the system. This closes bug #16314 - Erroneous idle times for processes: https://bugzilla.kernel.org/show_bug.cgi?id=16314 It was also reported on the Linode forums as "Xen timing wonkyness": http://www.linode.com/forums/viewtopic.php?t=5731 Signed-off-by: Jed Smith <jed@linode.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Tested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- arch/x86/kernel/cpu/intel.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 85f69cd..afc839a 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -90,8 +90,14 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c) if (c->x86_power & (1 << 8)) { set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC); +#ifndef CONFIG_XEN if (!check_tsc_unstable()) sched_clock_stable = 1; +#else + /* + * Under Xen, we cannot consider the TSC stable or it will + * go backwards in certain circumstances. Bug 16314. + */ + mark_tsc_unstable("Xen domain"); +#endif } /* -- 1.6.0.4 Regards, Jed Smith Systems Administrator Linode, LLC +1 (609) 593-7103 x1209 jed@linode.com _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jul-14 21:36 UTC
[Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 07/14/2010 12:24 PM, Jed Smith wrote:> Jeremy, Jan - what do you think? Is this a bad move? I feel like there > is a consequence to this that I am unaware of, but it fixes my issue. >Ah, well that''s interesting. There''s a couple of comments: 1. you can''t do this with just a compile-time test, since the same kernel can also boot native 2. nothing in a Xen PV domU environment should be using the tsc directly, so this shouldn''t have an effect. If something is using the tsc we should track it down. I wonder, however, if you''re getting the same result as Jan''s suggestion of making sched_clock unstable by making the tsc unstable. In that case, this patch may help: Subject: [PATCH] xen: disable xen_sched_clock by default xen_sched_clock only counts unstolen time. In principle this should be useful to the Linux scheduler so that it knows how much time a process actually consumed. But in practice this doesn''t work very well as the scheduler expects the sched_clock time to be synchronized between cpus. It also uses sched_clock to measure the time a task spends sleeping, in which case "unstolen time" isn''t meaningful. So just use plain xen_clocksource_read to return wallclock nanoseconds for sched_clock. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig index b83e119..6a09f01 100644 --- a/arch/x86/xen/Kconfig +++ b/arch/x86/xen/Kconfig @@ -29,6 +29,10 @@ config XEN_SAVE_RESTORE depends on XEN && PM default y +config XEN_SCHED_CLOCK + bool + default n + config XEN_DEBUG_FS bool "Enable Xen debug and tuning parameters in debugfs" depends on XEN && DEBUG_FS diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index 4b57c0b..242a230 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -939,7 +939,11 @@ static const struct pv_time_ops xen_time_ops __initdata = { .set_wallclock = xen_set_wallclock, .get_wallclock = xen_get_wallclock, .get_tsc_khz = xen_tsc_khz, +#ifdef CONFIG_XEN_SCHED_CLOCK .sched_clock = xen_sched_clock, +#else + .sched_clock = xen_clocksource_read, +#endif }; static const struct pv_cpu_ops xen_cpu_ops __initdata = { diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index 9d1f853..32dc3dc 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -154,6 +154,7 @@ static void do_stolen_accounting(void) account_idle_ticks(ticks); } +#ifdef CONFIG_XEN_SCHED_CLOCK /* * Xen sched_clock implementation. Returns the number of unstolen * nanoseconds, which is nanoseconds the VCPU spent in RUNNING+BLOCKED @@ -191,7 +192,7 @@ unsigned long long xen_sched_clock(void) return ret; } - +#endif /* Get the TSC speed from Xen */ unsigned long xen_tsc_khz(void) Thanks, J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jed Smith
2010-Jul-14 22:29 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On Jul 14, 2010, at 5:36 PM, Jeremy Fitzhardinge wrote:> On 07/14/2010 12:24 PM, Jed Smith wrote: >> Jeremy, Jan - what do you think? Is this a bad move? I feel like there >> is a consequence to this that I am unaware of, but it fixes my issue. >> > > Ah, well that''s interesting. > > There''s a couple of comments: > > 1. you can''t do this with just a compile-time test, since the same > kernel can also boot native > 2. nothing in a Xen PV domU environment should be using the tsc > directly, so this shouldn''t have an effect. If something is using > the tsc we should track it down. > > I wonder, however, if you''re getting the same result as Jan''s suggestion > of making sched_clock unstable by making the tsc unstable.That''s what I went for - I had initially just commented out the assignment of sched_clock_stable to 1, which fixed it, but this felt cleaner.> In that case, this patch may help:Will try and report. Regards, Jed Smith Systems Administrator Linode, LLC +1 (609) 593-7103 x1209 jed@linode.com _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jed Smith
2010-Jul-15 14:23 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On Jul 14, 2010, at 5:36 PM, Jeremy Fitzhardinge wrote:> In that case, this patch may help:And it does. I''ll roll it into our current kernels. Thanks! Tested-by: Jed Smith <jed@linode.com> Regards, Jed Smith Systems Administrator Linode, LLC +1 (609) 593-7103 x1209 jed@linode.com _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Stefano Stabellini
2010-Jul-15 15:57 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On Wed, 14 Jul 2010, Jeremy Fitzhardinge wrote:> On 07/14/2010 12:24 PM, Jed Smith wrote: > > Jeremy, Jan - what do you think? Is this a bad move? I feel like there > > is a consequence to this that I am unaware of, but it fixes my issue. > > > > Ah, well that''s interesting. > > There''s a couple of comments: > > 1. you can''t do this with just a compile-time test, since the same > kernel can also boot native > 2. nothing in a Xen PV domU environment should be using the tsc > directly, so this shouldn''t have an effect. If something is using > the tsc we should track it down. > > I wonder, however, if you''re getting the same result as Jan''s suggestion > of making sched_clock unstable by making the tsc unstable. > > In that case, this patch may help: > > Subject: [PATCH] xen: disable xen_sched_clock by default > > xen_sched_clock only counts unstolen time. In principle this should > be useful to the Linux scheduler so that it knows how much time a process > actually consumed. But in practice this doesn''t work very well as the > scheduler expects the sched_clock time to be synchronized between > cpus. It also uses sched_clock to measure the time a task spends > sleeping, in which case "unstolen time" isn''t meaningful. > > So just use plain xen_clocksource_read to return wallclock nanoseconds > for sched_clock. >I think that in this context is worth mentioning that xen_clocksource_read ends up calling pvclock_get_nsec_offset that calls native_read_tsc. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jul-15 16:30 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 07/15/2010 08:57 AM, Stefano Stabellini wrote:> On Wed, 14 Jul 2010, Jeremy Fitzhardinge wrote: > >> Subject: [PATCH] xen: disable xen_sched_clock by default >> >> xen_sched_clock only counts unstolen time. In principle this should >> be useful to the Linux scheduler so that it knows how much time a process >> actually consumed. But in practice this doesn''t work very well as the >> scheduler expects the sched_clock time to be synchronized between >> cpus. It also uses sched_clock to measure the time a task spends >> sleeping, in which case "unstolen time" isn''t meaningful. >> >> So just use plain xen_clocksource_read to return wallclock nanoseconds >> for sched_clock. >> >> > I think that in this context is worth mentioning that > xen_clocksource_read ends up calling pvclock_get_nsec_offset that calls > native_read_tsc. >That''s different. That''s not a general rdtsc, but a specific use of the tsc within the Xen clock ABI. We know and expect that the raw tsc value could be dubious, but we also have the Xen-provided corrections to apply to it. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Jul-15 16:40 UTC
RE: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] > Sent: Thursday, July 15, 2010 10:30 AM > Subject: Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC > unstable under Xen > > On 07/15/2010 08:57 AM, Stefano Stabellini wrote: > > On Wed, 14 Jul 2010, Jeremy Fitzhardinge wrote: > > > >> Subject: [PATCH] xen: disable xen_sched_clock by default > >> > >> xen_sched_clock only counts unstolen time. In principle this should > >> be useful to the Linux scheduler so that it knows how much time a > process > >> actually consumed. But in practice this doesn''t work very well as > the > >> scheduler expects the sched_clock time to be synchronized between > >> cpus. It also uses sched_clock to measure the time a task spends > >> sleeping, in which case "unstolen time" isn''t meaningful. > >> > >> So just use plain xen_clocksource_read to return wallclock > nanoseconds > >> for sched_clock. > >> > >> > > I think that in this context is worth mentioning that > > xen_clocksource_read ends up calling pvclock_get_nsec_offset that > calls > > native_read_tsc. > > That''s different. That''s not a general rdtsc, but a specific use of > the tsc within the Xen clock ABI. We know and expect that the raw tsc > value could be dubious, but we also have the Xen-provided corrections to > apply to it.Sorry if I''m a bit behind on this... Isn''t the real problem that, in a PV guest, the cpuid instructions that are testing the TSC-related CPUID bits are obtaining the actual hardware value, rather than what Xen would like the guest to believe? IOW, isn''t the correct fix to use pvcpuid instead of cpuid when xen_pvdomain() is true? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jul-15 16:45 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 07/15/2010 09:40 AM, Dan Magenheimer wrote:> Isn''t the real problem that, in a PV guest, the cpuid instructions > that are testing the TSC-related CPUID bits are obtaining the actual > hardware value, rather than what Xen would like the guest to believe? >No, because there shouldn''t be any "naked" rdtscs in the kernel.> IOW, isn''t the correct fix to use pvcpuid instead of cpuid when > xen_pvdomain() is true? >Every use of cpuid in the kernel goes via the cpuid pvop, which ends up doing the Xen cpuid rather than the native one. Usermode cpuid is still the "real" one, unless they explicitly use the Xen version. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Jul-15 17:14 UTC
RE: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
> > Isn''t the real problem that, in a PV guest, the cpuid instructions > > that are testing the TSC-related CPUID bits are obtaining the actual > > hardware value, rather than what Xen would like the guest to believe? > > No, because there shouldn''t be any "naked" rdtscs in the kernel. > > > IOW, isn''t the correct fix to use pvcpuid instead of cpuid when > > xen_pvdomain() is true? > > Every use of cpuid in the kernel goes via the cpuid pvop, which ends up > doing the Xen cpuid rather than the native one. Usermode cpuid is > still the "real" one, unless they explicitly use the Xen version.OK, then I''m confused. Either: - this is one of those recent Intel boxes where all the TSCs should be sync''ed but due to firmware issues they are not, in which case this is a Linux bug that has already been fixed upstream; or - this isn''t Xen 4.0+ but should be fixed in 4.0; or - this is Xen 4.0+ and the default tsc_mode is being overridden Otherwise, why is TSC not synchronized and pvclock always getting an offset of 0? Apologies if this was stated in a differently titled thread of this conversation but, Jed, what is your Xen version? _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jul-15 17:30 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 07/15/2010 10:14 AM, Dan Magenheimer wrote:>>> Isn''t the real problem that, in a PV guest, the cpuid instructions >>> that are testing the TSC-related CPUID bits are obtaining the actual >>> hardware value, rather than what Xen would like the guest to believe? >>> >> No, because there shouldn''t be any "naked" rdtscs in the kernel. >> >> >>> IOW, isn''t the correct fix to use pvcpuid instead of cpuid when >>> xen_pvdomain() is true? >>> >> Every use of cpuid in the kernel goes via the cpuid pvop, which ends up >> doing the Xen cpuid rather than the native one. Usermode cpuid is >> still the "real" one, unless they explicitly use the Xen version. >> > OK, then I''m confused. Either: > - this is one of those recent Intel boxes where all the TSCs should > be sync''ed but due to firmware issues they are not, in which case > this is a Linux bug that has already been fixed upstream; or > - this isn''t Xen 4.0+ but should be fixed in 4.0; or > - this is Xen 4.0+ and the default tsc_mode is being overridden > > Otherwise, why is TSC not synchronized and pvclock always getting > an offset of 0?No, this bug doesn''t really have anything to do with tsc sync issues. The situation is: * The scheduler uses its own timebase, called sched_clock * We have a pvop for sched_clock * The Xen implementation for sched_clock counts unstolen ns, rather than wallclock ns, since this is (somewhat, in theory) useful * However, the scheduler checks to see if the tsc is stable (because the default sched_clock is based on the tsc), and if so, assumes that sched_clock is synced across all cpus - but of course the amount of stolen time is different for each vcpu Unfortunately, while the idea of counting unstolen time is useful to see how much work got done in a timeslice, it pretty useless for counting how long something was asleep for (since you don''t care about how much time was "stolen" while you were asleep). And the scheduler uses the same timebase for measuring both. So the fix is to simply use plain Xen system time as the scheduler clock, as that will be synced across cpus. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Jul-15 17:48 UTC
RE: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
> > OK, then I''m confused. Either: > > - this is one of those recent Intel boxes where all the TSCs should > > be sync''ed but due to firmware issues they are not, in which case > > this is a Linux bug that has already been fixed upstream; or > > - this isn''t Xen 4.0+ but should be fixed in 4.0; or > > - this is Xen 4.0+ and the default tsc_mode is being overridden > > > > Otherwise, why is TSC not synchronized and pvclock always getting > > an offset of 0? > > No, this bug doesn''t really have anything to do with tsc sync issues. > > The situation is: > > * The scheduler uses its own timebase, called sched_clock > * We have a pvop for sched_clock > * The Xen implementation for sched_clock counts unstolen ns, rather > than wallclock ns, since this is (somewhat, in theory) useful > * However, the scheduler checks to see if the tsc is stable > (because > the default sched_clock is based on the tsc), and if so, assumes > that sched_clock is synced across all cpus - but of course the > amount of stolen time is different for each vcpu > > Unfortunately, while the idea of counting unstolen time is useful to > see > how much work got done in a timeslice, it pretty useless for counting > how long something was asleep for (since you don''t care about how much > time was "stolen" while you were asleep). And the scheduler uses the > same timebase for measuring both. > > So the fix is to simply use plain Xen system time as the scheduler > clock, as that will be synced across cpus.OK, that makes sense. Thanks for the thorough explanation. Maybe the xen_sched_clock code should be entirely removed rather than ifdef''d since it is no longer used and "(somewhat, in theory)" led to a strange bug? Or if you are confident that it will be useful in the future by some linux scheduler, maybe add some comments about how enabling it may cause the effects Jed saw. And maybe an even better answer is to submit a patch upstream so that the scheduler doesn''t use the same timebase for measuring both, since the kernel is making a bad assumption about real vs virtual time. I''d imagine KVM users might benefit from that also. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jul-15 18:05 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 07/15/2010 10:48 AM, Dan Magenheimer wrote:> Maybe the xen_sched_clock code should be entirely removed > rather than ifdef''d since it is no longer used and > "(somewhat, in theory)" led to a strange bug? Or if > you are confident that it will be useful in the future > by some linux scheduler, maybe add some comments about > how enabling it may cause the effects Jed saw.Yes, I can probably remove it altogether, though it isn''t actually selectable without manually editing the Kconfig file.> And maybe an even better answer is to submit a patch upstream > so that the scheduler doesn''t use the same timebase for > measuring both, since the kernel is making a bad assumption > about real vs virtual time. I''d imagine KVM users might benefit > from that also.Its unclear how useful it is anyway. I''ve discussed it with Peter Zijlstra from time to time, but making the scheduler use two timebases is non-trivial I think. Or perhaps more accurate to say that I don''t want to be getting into the scheduler, since it is not only a technical minefield. J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ævar Arnfjörð Bjarmason
2010-Aug-15 20:57 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On Thu, Jul 15, 2010 at 18:05, Jeremy Fitzhardinge <jeremy@goop.org> wrote:> On 07/15/2010 10:48 AM, Dan Magenheimer wrote: >> Maybe the xen_sched_clock code should be entirely removed >> rather than ifdef''d since it is no longer used and >> "(somewhat, in theory)" led to a strange bug? Or if >> you are confident that it will be useful in the future >> by some linux scheduler, maybe add some comments about >> how enabling it may cause the effects Jed saw. > > Yes, I can probably remove it altogether, though it isn''t actually > selectable without manually editing the Kconfig file. > >> And maybe an even better answer is to submit a patch upstream >> so that the scheduler doesn''t use the same timebase for >> measuring both, since the kernel is making a bad assumption >> about real vs virtual time. I''d imagine KVM users might benefit >> from that also. > > Its unclear how useful it is anyway. I''ve discussed it with Peter > Zijlstra from time to time, but making the scheduler use two timebases > is non-trivial I think. Or perhaps more accurate to say that I don''t > want to be getting into the scheduler, since it is not only a > technical minefield.Hi all, I was wondering what the status on this issue was. Maybe I''ve missed something, but I''ve looked through the lkml and xen archives and I haven''t found a re-rolled patch that addresses this issue. I''m not very familiar with this area (well, any area) of the kernel, but if there''s something needed to push this forward I''d be happy to help with that, or if this is already being addressed somewhere else a pointer to where that is would be welcome. Thanks. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Aug-15 23:40 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On 08/15/2010 01:57 PM, Ævar Arnfjörð Bjarmason wrote:> Hi all, I was wondering what the status on this issue was. Maybe I''ve > missed something, but I''ve looked through the lkml and xen archives and > I haven''t found a re-rolled patch that addresses this issue. > > I''m not very familiar with this area (well, any area) of the kernel, > but if there''s something needed to push this forward I''d be happy to > help with that, or if this is already being addressed somewhere else a > pointer to where that is would be welcome.A patch that fixes this is in mainline and all the supported stable kernels. Itis "xen: drop xen_sched_clock in favour of using plain wallclock time". J _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ævar Arnfjörð Bjarmason
2010-Aug-16 01:27 UTC
Re: [Xen-devel] Re: [PATCH] x86: unconditionally mark TSC unstable under Xen
On Sun, Aug 15, 2010 at 23:40, Jeremy Fitzhardinge <jeremy@goop.org> wrote:> On 08/15/2010 01:57 PM, Ævar Arnfjörð Bjarmason wrote: >> Hi all, I was wondering what the status on this issue was. Maybe I''ve >> missed something, but I''ve looked through the lkml and xen archives and >> I haven''t found a re-rolled patch that addresses this issue. >> >> I''m not very familiar with this area (well, any area) of the kernel, >> but if there''s something needed to push this forward I''d be happy to >> help with that, or if this is already being addressed somewhere else a >> pointer to where that is would be welcome. > > A patch that fixes this is in mainline and all the supported stable > kernels. Itis "xen: drop xen_sched_clock in favour of using plain > wallclock time".Thanks! I didn''t spot that when looking through your patches. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel