Konrad Rzeszutek Wilk
2012-Mar-22 03:04 UTC
Re: Regression in v3.4-rc0 " BUG: soft lockup - CPU#0 stuck for 29s! [migration/0:6]..[<ffffffff810d3b8b>] stop_machine_cpu_stop+0x7b/0xf
On Wed, Mar 21, 2012 at 05:32:21PM +0100, Peter Zijlstra wrote:
> On Wed, 2012-03-21 at 17:30 +0100, Peter Zijlstra wrote:
> > On Wed, 2012-03-21 at 16:57 +0100, Peter Zijlstra wrote:
> > > On Wed, 2012-03-21 at 11:26 -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, Mar 20, 2012 at 07:53:22PM -0400, Konrad Rzeszutek Wilk wrote:
> > > > > Seeing this in v3.4-rc0 tree and didn't see that with v3.3:
> > > >
> > > > Hey Peter,
> > > >
> > > > Git bisection points this to the fault of
> > > > 5fbd036b552f633abb394a319f7c62a5c86a9cd7 " sched: Cleanup cpu_active madness"
> > > >
> > > > thoughts? (also attaching the .config)
> > >
> > > Argh.. so when is this? boot? No that's somewhat unexpected. I have one
> > > report of funnies during a hotplug bash that I'm looking into, but I
> > > haven't actually been able to reproduce that report myself either.
> >
> > is arch/x86/xen/smp.c:cpu_bringup() missing a call to
> > notify_cpu_starting() before doing set_cpu_online()?
> >
> > Also, shouldn't that also take the ipi_call_lock() around setting the
> > cpu online?
>
> And before you ask, yes all that should live in generic code... somehow.
> This per-arch replication of the cpu hotplug logic is driving me insane.

Thanks to Peter, here is the patch that fixes the regression.

The CPU hotplug code has now a callback to help bring up the CPU.
Without the call we end up getting:

 BUG: soft lockup - CPU#0 stuck for 29s! [migration/0:6]
 Modules linked in:
 CPU ] Pid: 6, comm: migration/0 Not tainted 3.3.0upstream-01180-ged378a5 #1 Dell Inc. PowerEdge T105 /0RR825
 RIP: e030:[<ffffffff810d3b8b>] [<ffffffff810d3b8b>] stop_machine_cpu_stop+0x7b/0xf0
 RSP: e02b:ffff8800ceaabdb0 EFLAGS: 00000293
 .. snip..
 Call Trace:
 [<ffffffff810d3b10>] ? stop_one_cpu_nowait+0x50/0x50
 [<ffffffff810d3841>] cpu_stopper_thread+0xf1/0x1c0
 [<ffffffff815a9776>] ? __schedule+0x3c6/0x760
 [<ffffffff815aa749>] ? _raw_spin_unlock_irqrestore+0x19/0x30
 [<ffffffff810d3750>] ? res_counter_charge+0x150/0x150
 [<ffffffff8108dc76>] kthread+0x96/0xa0
 [<ffffffff815b27e4>] kernel_thread_helper+0x4/0x10
 [<ffffffff815aacbc>] ? retint_restore_ar

This fixes it.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/smp.c | 6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 315d8fa..02900e8 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -75,8 +75,14 @@ static void __cpuinit cpu_bringup(void)
 
 	xen_setup_cpu_clockevents();
 
+	notify_cpu_starting(cpu);
+
+	ipi_call_lock();
 	set_cpu_online(cpu, true);
+	ipi_call_unlock();
+
 	this_cpu_write(cpu_state, CPU_ONLINE);
+
 	wmb();
 
 	/* We can take interrupts now: we're officially "up". */
-- 
1.7.7.5

On Wed, 2012-03-21 at 23:04 -0400, Konrad Rzeszutek Wilk wrote:
> The CPU hotplug code has now a callback to help bring up the CPU.
> Without the call we end up getting:

It's had this for a while now (since 2008, see e545a614). It's just that
generic infrastructure started using it only now.

> BUG: soft lockup - CPU#0 stuck for 29s! [migration/0:6]
> Modules linked in:
> CPU ] Pid: 6, comm: migration/0 Not tainted 3.3.0upstream-01180-ged378a5 #1 Dell Inc. PowerEdge T105 /0RR825
> RIP: e030:[<ffffffff810d3b8b>] [<ffffffff810d3b8b>] stop_machine_cpu_stop+0x7b/0xf0
> RSP: e02b:ffff8800ceaabdb0 EFLAGS: 00000293
> .. snip..
> Call Trace:
> [<ffffffff810d3b10>] ? stop_one_cpu_nowait+0x50/0x50
> [<ffffffff810d3841>] cpu_stopper_thread+0xf1/0x1c0
> [<ffffffff815a9776>] ? __schedule+0x3c6/0x760
> [<ffffffff815aa749>] ? _raw_spin_unlock_irqrestore+0x19/0x30
> [<ffffffff810d3750>] ? res_counter_charge+0x150/0x150
> [<ffffffff8108dc76>] kthread+0x96/0xa0
> [<ffffffff815b27e4>] kernel_thread_helper+0x4/0x10
> [<ffffffff815aacbc>] ? retint_restore_ar
>
> This fixes it.
>
> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
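
For readers following the thread, the ordering discussed above (run the "starting" callbacks first, then mark the CPU online while holding the IPI call lock) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: every model_* name is hypothetical, the lock is reduced to a boolean flag, and the "per-CPU state" stands in for whatever the generic infrastructure prepares in its callback.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS       4
#define MODEL_MAX_NOTIFIERS 4

/* Hypothetical callback type: invoked for a CPU that is starting up. */
typedef void (*model_starting_fn)(int cpu);

static bool model_online[MODEL_NR_CPUS];    /* stand-in for the online mask */
static bool model_prepared[MODEL_NR_CPUS];  /* state a callback sets up     */
static bool model_ipi_lock_held;            /* stand-in for the IPI lock    */
static model_starting_fn model_notifiers[MODEL_MAX_NOTIFIERS];
static int model_nr_notifiers;

/* Register a "CPU starting" callback; returns 0 on success, -1 if full. */
static int model_register_starting(model_starting_fn fn)
{
	if (model_nr_notifiers >= MODEL_MAX_NOTIFIERS)
		return -1;
	model_notifiers[model_nr_notifiers++] = fn;
	return 0;
}

/* Model of the notify step: run every registered callback for this CPU. */
static void model_notify_cpu_starting(int cpu)
{
	for (int i = 0; i < model_nr_notifiers; i++)
		model_notifiers[i](cpu);
}

/* In this model the online mask may only change while the lock is held. */
static void model_set_cpu_online(int cpu)
{
	assert(model_ipi_lock_held);
	model_online[cpu] = true;
}

/* Example callback: it must run before the CPU becomes visible as online. */
static void model_sched_starting(int cpu)
{
	assert(!model_online[cpu]);
	model_prepared[cpu] = true;
}

/* The ordering from the patch: notify, then set online under the lock. */
static void model_cpu_bringup(int cpu)
{
	model_notify_cpu_starting(cpu);

	model_ipi_lock_held = true;   /* models ipi_call_lock()   */
	model_set_cpu_online(cpu);
	model_ipi_lock_held = false;  /* models ipi_call_unlock() */
}

/* A CPU is usable only if it is online and its callbacks have run. */
static bool model_cpu_usable(int cpu)
{
	return model_online[cpu] && model_prepared[cpu];
}
```

In this model, skipping model_notify_cpu_starting() leaves a CPU that is online but never prepared, so model_cpu_usable() stays false for it. That is loosely analogous to the situation the patch addresses, where an online CPU had not gone through the starting callback that generic code now relies on.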