Sheng Yang
2012-Apr-12 19:22 UTC
Debian stable kernel got timer issue when running as PV guest
(Sorry for the duplicate mail; I made a typo in the mailing list address...)

Hi,

Recently we received some reports of Debian (2.6.32-41 package) guests hanging during migration on certain machines. I've identified one issue in Xen, but I think there is probably another issue in the kernel.

Here is the case.

[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 3.4.2 (preserve-AD)
[ 0.000000] NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:1 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 15 pages/cpu @c1608000 s37656 r0 d23784 u65536
[ 0.000000] pcpu-alloc: s37656 r0 d23784 u65536 alloc=16*4096
[ 0.000000] pcpu-alloc: [0] 0
[508119.807590] trying to map vcpu_info 0 at c1609010, mfn 992cac, offset 16
[508119.807593] cpu 0 using vcpu_info at c1609010
[508119.807594] Xen: using vcpu_info placement
[508119.807598] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32416

Dmesg shows that during boot the printk timestamp jumps from 0 to a large number ([508119.807590] in this case) immediately.

And when migrating:

[509508.914333] suspending xenstore...
[516212.055921] trying to map vcpu_info 0 at c1609010, mfn 895fd7, offset 16
[516212.055930] cpu 0 using vcpu_info at c1609010

The timestamp jumped again. We can reproduce the above issues on our Sandy Bridge machines.

After this, a call trace and a guest hang may be observed on some machines:

[516383.019499] INFO: task xenwatch:12 blocked for more than 120 seconds.
[516383.019566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[516383.019578] xenwatch D c1610e20 0 12 2 0x00000000
[516383.019591] c781eec0 00000246 c1610e58 c1610e20 c781f300 c1441e20 c1441e20 001cf000
[516383.019605] c781f07c c1610e20 00000000 00000001 c1441e20 c62e01c0 c1610e20 c62e01c0
[516383.019617] c127e18e c781f07c c7830020 c7830020 c1441e20 c1441e20 c127f2f1 c781f080
[516383.019629] Call Trace:
[516383.019640] [<c127e18e>] ? schedule+0x78f/0x7dc
[516383.019645] [<c127f2f1>] ? _spin_unlock_irqrestore+0xd/0xf
[516383.019649] [<c127e4a1>] ? schedule_timeout+0x20/0xb0
[516383.019656] [<c100573c>] ? xen_force_evtchn_callback+0xc/0x10
[516383.019660] [<c127e3aa>] ? wait_for_common+0xa4/0x100
[516383.019665] [<c1033315>] ? default_wake_function+0x0/0x8
[516383.019671] [<c104a144>] ? kthread_stop+0x4f/0x8e
[516383.019675] [<c1047883>] ? cleanup_workqueue_thread+0x3a/0x45
[516383.019679] [<c1047903>] ? destroy_workqueue+0x56/0x85
[516383.019684] [<c106a395>] ? stop_machine_destroy+0x23/0x37
[516383.019690] [<c11962d8>] ? shutdown_handler+0x200/0x22f
[516383.019694] [<c1197439>] ? xenwatch_thread+0xdc/0x103
[516383.019698] [<c104a322>] ? autoremove_wake_function+0x0/0x2d
[516383.019701] [<c119735d>] ? xenwatch_thread+0x0/0x103
[516383.019705] [<c104a0f0>] ? kthread+0x61/0x66
[516383.019709] [<c104a08f>] ? kthread+0x0/0x66
[516383.019714] [<c1008d87>] ? kernel_thread_helper+0x7/0x10

But I cannot reproduce the call trace and hang on our Sandy Bridge machines.

I've spent some time tracking down the timestamp jump, and finally found that it is due to Invariant TSC (CPUID leaf 0x80000007, EDX bit 8, also called non-stop TSC). The presence of this feature enables a kernel parameter named sched_clock_stable. It seems this parameter does not work with Xen's pvclock. If sched_clock_stable is set, the value returned by xen_clocksource_read() is returned by sched_clock_cpu() directly, but (CMIIW) the value returned by xen_clocksource_read() is based on the host (vcpu) uptime rather than this VM's uptime, which results in the timestamp jump.

I've compiled a kernel that forces sched_clock_stable=0, and it solved the timestamp jump issue as expected. Luckily, it seems to have solved the call trace and guest hang issue as well.

Attached is an (untested) patch to mask CPUID leaf 0x80000007. I think the issue can easily be reproduced on a Westmere or Sandy Bridge machine (my old colleagues at Intel said the feature has likely existed since Nehalem) running a newer PV guest: check the guest's cpuinfo and you will see nonstop_tsc, and you will notice the abnormal printk timestamps. Sorry, I don't have a Xen unstable environment at hand now, but I think this should be the case we saw.

BTW: the original environment is xen-3.4.2, but I found the feature remains unmasked in the latest xen-unstable tree.

--
regards
Yang, Sheng

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
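For readers who want to check the bit discussed above, here is a minimal user-space sketch in C. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`; the helper names `edx_has_invariant_tsc` and `cpu_has_invariant_tsc` are made up for illustration and come from neither the kernel nor Xen. Note that this reports what the CPUID instruction returns to the calling process, which is not necessarily what a PV kernel sees through Xen's filtered CPUID.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf that carries the Invariant TSC flag. */
#define LEAF_ADV_POWER_MGMT 0x80000007u
/* EDX bit 8 of that leaf: Invariant (non-stop) TSC. */
#define INVARIANT_TSC_BIT   (1u << 8)

/* Pure helper: returns 1 if the given EDX value has bit 8 set, else 0. */
static int edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & INVARIANT_TSC_BIT) != 0;
}

/*
 * Queries the CPU directly.
 * Returns 1 or 0 for the state of the bit, or -1 if the leaf cannot be
 * queried (non-x86 build, or the CPU does not report this extended leaf).
 */
static int cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(LEAF_ADV_POWER_MGMT, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_has_invariant_tsc(edx);
#else
    return -1;
#endif
}
```

Calling `cpu_has_invariant_tsc()` from a small `main` and printing the result is enough to compare machines; the pure helper is split out so the bit test can be checked without depending on the hardware.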
Jan Beulich
2012-Apr-13 07:56 UTC
Re: Debian stable kernel got timer issue when running as PV guest
>>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
> I've compiled a kernel, force sched_clock_stable=0, then it solved the
> timestamp jump issue as expected. Luckily, seems it also solved the call
> trace and guest hang issue as well.

And this is also how it should be fixed.

> Attachment is a (untested) patch to mask the CPUID leaf 0x80000007. I think
> the issue can be easily reproduced using a Westmere or SandyBridge
> machine(my old colleagues at Intel said the feature likely existed after
> Nehalem) running newer version of PV guest, check the guest cpuinfo you
> would see nonstop_tsc, and you would notice the abnormal timestamp of
> printk.

Masking the entire leaf is certainly out of the question. And even masking
the individual bit is questionable - a PV kernel simply shouldn't look at
it imo (other than possibly for reporting to user mode).

Jan
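The difference between the two masking options mentioned here can be sketched in a few lines of C. This is only an illustration of the bit arithmetic, not Xen's CPUID code: `struct cpuid_regs` and both function names are hypothetical, and which other bits a vendor defines in this leaf is left open.

```c
#include <stdint.h>

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* EDX bit 8 of leaf 0x80000007: Invariant (non-stop) TSC. */
#define INVARIANT_TSC_BIT (1u << 8)

/* Option 1: hide the whole leaf, including any other bits it carries. */
static void mask_whole_leaf(struct cpuid_regs *r)
{
    r->eax = r->ebx = r->ecx = r->edx = 0;
}

/* Option 2: hide only the Invariant TSC bit and leave the rest untouched. */
static void mask_invariant_tsc_only(struct cpuid_regs *r)
{
    r->edx &= ~INVARIANT_TSC_BIT;
}
```

With an input of `edx = 0x101`, the second function leaves `0x001` while the first leaves `0`, which is the extra information loss that makes masking the whole leaf the more drastic choice.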
David Vrabel
2012-Apr-13 10:37 UTC
Re: Debian stable kernel got timer issue when running as PV guest
On 13/04/12 08:56, Jan Beulich wrote:
>>>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
>> I've compiled a kernel, force sched_clock_stable=0, then it solved the
>> timestamp jump issue as expected. Luckily, seems it also solved the call
>> trace and guest hang issue as well.
>
> And this is also how it should be fixed.

Something like this? I've not tested it yet as I need to track down
some of the problem hardware and get it set up.

8<---------------
xen: always set the sched clock as unstable

It's not clear to me if the Xen clock source can be used as a stable
sched clock. Also, even if the guest is started on a system whose
underlying TSC is stable it may be migrated to one where it's not. So
never mark the sched clock as stable.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/xen/time.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 0296a95..b22cd9c 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -473,6 +473,9 @@ static void __init xen_time_init(void)
 	do_settimeofday(&tp);
 
 	setup_force_cpu_cap(X86_FEATURE_TSC);
+	setup_clear_cpu_cap(X86_FEATURE_CONSTANT_TSC);
+	setup_clear_cpu_cap(X86_FEATURE_NONSTOP_TSC);
+	sched_clock_stable = 0;
 
 	xen_setup_runstate_info(cpu);
 	xen_setup_timer(cpu);
Jan Beulich
2012-Apr-13 11:00 UTC
Re: Debian stable kernel got timer issue when running as PV guest
>>> On 13.04.12 at 12:37, David Vrabel <dvrabel@cantab.net> wrote:
> On 13/04/12 08:56, Jan Beulich wrote:
>>>>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
>>> I've compiled a kernel, force sched_clock_stable=0, then it solved the
>>> timestamp jump issue as expected. Luckily, seems it also solved the call
>>> trace and guest hang issue as well.
>>
>> And this is also how it should be fixed.
>
> Something like this? I've not tested it yet as I need to track down
> some of the problem hardware and get it set up.

Yeah, except that I'm not sure you really need to clear the feature
flags. Just making sure sched_clock_stable never gets set should be
enough; playing with the feature flags always implies that users will
see bigger differences in /proc/cpuinfo between native and Xen
kernels...

Jan
David Vrabel
2012-Apr-13 16:10 UTC
Re: Debian stable kernel got timer issue when running as PV guest
On 13/04/12 12:00, Jan Beulich wrote:
>>>> On 13.04.12 at 12:37, David Vrabel <dvrabel@cantab.net> wrote:
>> On 13/04/12 08:56, Jan Beulich wrote:
>>>>>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
>>>> I've compiled a kernel, force sched_clock_stable=0, then it solved the
>>>> timestamp jump issue as expected. Luckily, seems it also solved the call
>>>> trace and guest hang issue as well.
>>>
>>> And this is also how it should be fixed.
>>
>> Something like this? I've not tested it yet as I need to track down
>> some of the problem hardware and get it set up.
>
> Yeah, except that I'm not sure you really need to clear the feature
> flags. Just making sure sched_clock_stable never gets set should be
> enough; playing with the feature flags always implies that users will
> see bigger differences in /proc/cpuinfo between native and Xen
> kernels...

I have a system with both NONSTOP_TSC and CONSTANT_TSC so
sched_clock_stable should be true. VMs start and migrate fine with no
unexpected jumps in time. I think more digging is required here to find
out why time is screwy on this particular system.

David
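To compare systems like the one described here, it can help to check whether a guest actually lists both flags. The following is a rough C sketch, assuming the usual x86 /proc/cpuinfo layout with a line starting with "flags" and the lower-case flag names constant_tsc and nonstop_tsc; the function names are invented for this example.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` appears as a whole whitespace-separated token. */
static int has_flag(const char *flags, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = flags;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        /* Reject substring hits such as "tsc" inside "constant_tsc". */
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' ||
                   p[len] == '\t' || p[len] == '\n';

        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}

/*
 * Scans a /proc/cpuinfo-style file and inspects the first "flags" line.
 * Returns 1 if both constant_tsc and nonstop_tsc are listed, 0 if not,
 * and -1 if the file cannot be opened or has no "flags" line.
 * Assumes the flags line fits in the 4096-byte buffer.
 */
static int cpuinfo_has_stable_tsc_flags(const char *path)
{
    char line[4096];
    int result = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        const char *colon;

        if (strncmp(line, "flags", 5) != 0)
            continue;
        colon = strchr(line, ':');
        if (colon == NULL)
            continue;
        result = has_flag(colon + 1, "constant_tsc") &&
                 has_flag(colon + 1, "nonstop_tsc");
        break;
    }
    fclose(f);
    return result;
}
```

Calling `cpuinfo_has_stable_tsc_flags("/proc/cpuinfo")` inside the guest only shows what the kernel reports; it says nothing about whether sched_clock_stable was actually set, which would still have to be checked in the kernel itself.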
Sheng Yang
2012-Apr-13 16:15 UTC
Re: Debian stable kernel got timer issue when running as PV guest
On Fri, Apr 13, 2012 at 9:10 AM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 13/04/12 12:00, Jan Beulich wrote:
> >>>> On 13.04.12 at 12:37, David Vrabel <dvrabel@cantab.net> wrote:
> >> On 13/04/12 08:56, Jan Beulich wrote:
> >>>>>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
> >>>> I've compiled a kernel, force sched_clock_stable=0, then it solved the
> >>>> timestamp jump issue as expected. Luckily, seems it also solved the call
> >>>> trace and guest hang issue as well.
> >>>
> >>> And this is also how it should be fixed.
> >>
> >> Something like this? I've not tested it yet as I need to track down
> >> some of the problem hardware and get it set up.
> >
> > Yeah, except that I'm not sure you really need to clear the feature
> > flags. Just making sure sched_clock_stable never gets set should be
> > enough; playing with the feature flags always implies that users will
> > see bigger differences in /proc/cpuinfo between native and Xen
> > kernels...
>
> I have a system with both NONSTOP_TSC and CONSTANT_TSC so
> sched_clock_stable should be true. VMs start and migrate fine with no
> unexpected jumps in time. I think more digging is required here to find
> out why time is screwy on this particular system.

That's the reason I said there should be another (kernel) bug, triggered by this. In the original mail, I already said that on our Sandy Bridge machine I can only reproduce the printk timestamp jump, but not the migration hang.

Did you see the timestamp jump on the PV guest?

--
regards
Yang, Sheng
Sheng Yang
2012-Apr-13 17:27 UTC
Re: Debian stable kernel got timer issue when running as PV guest
On Fri, Apr 13, 2012 at 12:56 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 12.04.12 at 21:22, Sheng Yang <sheng@yasker.org> wrote:
> > I've compiled a kernel, force sched_clock_stable=0, then it solved the
> > timestamp jump issue as expected. Luckily, seems it also solved the call
> > trace and guest hang issue as well.
>
> And this is also how it should be fixed.
>
> > Attachment is a (untested) patch to mask the CPUID leaf 0x80000007. I think
> > the issue can be easily reproduced using a Westmere or SandyBridge
> > machine(my old colleagues at Intel said the feature likely existed after
> > Nehalem) running newer version of PV guest, check the guest cpuinfo you
> > would see nonstop_tsc, and you would notice the abnormal timestamp of
> > printk.
>
> Masking the entire leaf is certainly out of question. And even masking
> the individual bit is questionable - a PV kernel simply shouldn't look at
> it imo (for other than possibly reporting to user mode purposes).
>
> Jan

CPUID detection in the kernel is handled by the CPU vendor code, not by the Xen code. The way Xen controls it is through the CPUID it presents to the guest.

1. We could mask only the one bit. But currently this leaf carries only this one feature, so I don't think masking the whole leaf would be a big problem. I think it is already a problem that Xen handles PV guest CPU features with a blacklist rather than a whitelist, so when some new feature slips in (like this time), nobody knows what will happen. I am really thinking of something like:

    switch ( input[0] )
    case...
    case...
    +default: regs[0] = regs[1] = regs[2] = regs[3] = 0;

Maybe there is some reason that we didn't set a default value for the PV CPUID policy, but I can't see why.

2. If we present the CPU feature to the guest and then disable that feature inside the guest, what's the point? I don't think it is a good idea. What if something else interacts with this CPUID feature and we fail to disable it (e.g. something other than sched_clock_stable)? Simply not showing it would be a better and cleaner way to do it, as long as we agree that this feature is useless, even troublesome, for a PV guest.

--
regards
Yang, Sheng
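The whitelist idea in point 1 can be written out as a small, self-contained C sketch. It is only an illustration of the `default:` branch proposed above and is not Xen's actual PV CPUID policy: `struct cpuid_regs`, `pv_cpuid_whitelist`, and the particular set of leaves that pass through are all assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Whitelist-style filter: only leaves that have been reviewed are passed
 * through to the guest; every other leaf reads as all zeroes. The leaves
 * listed here are illustrative, not a recommended policy.
 */
static void pv_cpuid_whitelist(uint32_t leaf, struct cpuid_regs *regs)
{
    switch (leaf) {
    case 0x00000000u: /* highest basic leaf and vendor string */
    case 0x00000001u: /* basic feature flags */
    case 0x80000000u: /* highest extended leaf */
    case 0x80000001u: /* extended feature flags */
        /* Reviewed leaves pass through; per-bit masks would go here. */
        break;
    default:
        /* Unknown or unreviewed leaf: expose nothing to the guest. */
        regs->eax = regs->ebx = regs->ecx = regs->edx = 0;
        break;
    }
}
```

Under this sketch a query for leaf 0x80000007 comes back as zeroes because nobody has listed it, so a newly introduced feature stays hidden until it is added on purpose. The cost is that every useful leaf has to be reviewed and listed explicitly.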