Glauber de Oliveira Costa
2006-Oct-09 21:22 UTC
[Xen-devel] [PATCH] BUG() on soft lockup upon suspend/resume
Hi,

In systems with vcpu > 1, a BUG due to a detected soft lockup seems to be triggered after system resume/suspend. This is probably due to the lack of seqlocking around the region that does the local time processing. The following patch fixes this.

-- 
Glauber de Oliveira Costa
Red Hat Inc.
"Free as in Freedom"

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
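The patch itself is not reproduced in this archive. As a rough userspace illustration of what "seqlocking" means in general, the sketch below uses C11 atomics: the writer makes a sequence counter odd while it updates the data, and a reader retries whenever it sees an odd counter or a counter that changed during its read. This is not the kernel's seqlock API and not the code from the patch; the struct, field, and function names are all hypothetical, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical time state; the field names are illustrative only. */
struct time_snapshot {
    uint64_t system_time_ns;
    uint64_t jiffies;
};

struct seq_time {
    atomic_uint seq;                 /* even: stable, odd: write in progress */
    _Atomic uint64_t system_time_ns;
    _Atomic uint64_t jiffies;
};

void seq_init(struct seq_time *st)
{
    atomic_init(&st->seq, 0u);
    atomic_init(&st->system_time_ns, 0);
    atomic_init(&st->jiffies, 0);
}

/* Writer side: make the counter odd before touching the data.
 * Assumes a single writer, so the counter must be even on entry. */
void seq_write_begin(struct seq_time *st)
{
    unsigned s = atomic_load_explicit(&st->seq, memory_order_relaxed);
    assert((s & 1u) == 0u);
    atomic_store_explicit(&st->seq, s + 1u, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

/* Writer side: make the counter even again, publishing the new data. */
void seq_write_end(struct seq_time *st)
{
    unsigned s = atomic_load_explicit(&st->seq, memory_order_relaxed);
    atomic_store_explicit(&st->seq, s + 1u, memory_order_release);
}

void seq_write(struct seq_time *st, uint64_t now_ns, uint64_t jiffies)
{
    seq_write_begin(st);
    atomic_store_explicit(&st->system_time_ns, now_ns, memory_order_relaxed);
    atomic_store_explicit(&st->jiffies, jiffies, memory_order_relaxed);
    seq_write_end(st);
}

/* Reader side: one attempt. Returns false if a write was in progress or
 * completed during the read; *out is only meaningful when true is returned. */
bool seq_try_read(struct seq_time *st, struct time_snapshot *out)
{
    unsigned s0 = atomic_load_explicit(&st->seq, memory_order_acquire);
    if (s0 & 1u)
        return false;
    out->system_time_ns =
        atomic_load_explicit(&st->system_time_ns, memory_order_relaxed);
    out->jiffies = atomic_load_explicit(&st->jiffies, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&st->seq, memory_order_relaxed) == s0;
}

/* Reader side: retry until a consistent snapshot is obtained. */
void seq_read(struct seq_time *st, struct time_snapshot *out)
{
    while (!seq_try_read(st, out))
        ;
}
```

Readers never block the writer in this scheme; they only pay with a retry. Whether the local time processing actually needs such protection is the question the rest of the thread takes up.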
Keir Fraser
2006-Oct-09 22:22 UTC
Re: [Xen-devel] [PATCH] BUG() on soft lockup upon suspend/resume
On 9/10/06 10:22 pm, "Glauber de Oliveira Costa" <gcosta@redhat.com> wrote:

> In systems with vcpu > 1, a BUG due to a detected soft lockup seems to be
> triggered after system resume/suspend. This is probably due to the lack of
> seqlocking around the region that does the local time processing.

We do SMP save/restore tests regularly and do not see this issue. It ought to be avoided by the fact that, when we bring up a CPU, we touch_softlockup_watchdog() in cpu_bringup(), before enabling interrupts. For CPU0 on resume, the touch is done in time_resume() in arch/i386/kernel/time-xen.c.

In general that local accounting work does not need to be done under the xtime_lock. Native x86 does not take the lock in smp_local_timer_interrupt() (apic.c), for example.

I think we need to understand the issue you are hitting a bit more before deciding on the right fix.

 -- Keir
Glauber de Oliveira Costa
2006-Oct-10 00:29 UTC
Re: [Xen-devel] [PATCH] BUG() on soft lockup upon suspend/resume
> > In systems with vcpu > 1, a BUG due to a detected soft lockup seems to be
> > triggered after system resume/suspend. This is probably due to the lack of
> > seqlocking around the region that does the local time processing.
>
> We do SMP save/restore tests regularly and do not see this issue. It ought
> to be avoided by the fact that, when we bring up a CPU, we
> touch_softlockup_watchdog() in cpu_bringup(), before enabling interrupts.
> For CPU0 on resume, the touch is done in time_resume() in
> arch/i386/kernel/time-xen.c.

This does not happen only once, when the system comes back. It happens a lot afterwards. So even if the first touch is right, I suspect this issue is more related to a situation in which we have already been resumed for a long time, with everything set up.

> I think we need to understand the issue you are hitting a bit more before
> deciding on the right fix.

Right, here is more info: I'm on an 8-way x86_64 machine, and this is the sort of info I see repeatedly:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ>
 [<ffffffff802ace9d>] softlockup_tick+0xf8/0x113
 [<ffffffff8026d591>] timer_interrupt+0x38a/0x3d8
 [<ffffffff80210e87>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802ad1e6>] __do_IRQ+0xa5/0x107
 [<ffffffff8028be7a>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026b4c9>] do_IRQ+0xe7/0xf5
 [<ffffffff8039386e>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025e2a2>] do_hypervisor_callback+0x1e/0x2c
 <EOI>
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026cb13>] raw_safe_halt+0x84/0xa8
 [<ffffffff8026a121>] xen_idle+0x38/0x4a
 [<ffffffff80248e66>] cpu_idle+0x97/0xba

It obviously never happens on CPU#0, but I see it on all the others (vcpus=4).

If you have any other opinion on what else may be causing this, it's very welcome. I'll keep investigating.

-- 
Glauber de Oliveira Costa
Red Hat Inc.
"Free as in Freedom"
Keir Fraser
2006-Oct-10 10:06 UTC
Re: [Xen-devel] [PATCH] BUG() on soft lockup upon suspend/resume
On 10/10/06 01:29, "Glauber de Oliveira Costa" <gcosta@redhat.com> wrote:

> BUG: soft lockup detected on CPU#1!
>
> Call Trace:

The trace indicates the CPU is idling, so certainly this is a bogus softlockup warning. I guess we already knew that. ;-)

> It obviously never happens on CPU#0, but I see it on all others (vcpus=4)

Probably worth instrumenting the warning message to print jiffies and the timestamp. Also print the timestamp values for all other CPUs and see how much they vary. We want to find out if one CPU is simply lagging in touching the softlockup watchdog, or if perhaps jiffies is updating in big jumps.

Given that this is so repro'able for you, it's weird no one else has reported it.

 -- Keir
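
The two hypotheses above (one CPU lagging in touching the watchdog versus jiffies advancing in big jumps) can be told apart with exactly the values Keir asks to print. The following is a small userspace model of that reasoning, not kernel code: it keeps a global tick counter and a per-CPU "last touched" timestamp, flags a CPU whose timestamp is older than a threshold, and checks whether the most recent jiffies step alone exceeded that threshold. All names are hypothetical, and the HZ value and 10-second threshold are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdio.h>

#define NR_CPUS 4                       /* matches vcpus=4 in the report */
#define HZ 250UL                        /* assumed tick rate */
#define SOFTLOCKUP_THRESH (10UL * HZ)   /* assumed 10 second threshold */

enum lag_cause { LAG_NONE, LAG_ONE_CPU, LAG_JIFFIES_JUMP };

struct watchdog_model {
    unsigned long jiffies;              /* global tick counter */
    unsigned long prev_jiffies;         /* value before the last advance */
    unsigned long touch_ts[NR_CPUS];    /* jiffies at each CPU's last touch */
};

void wd_init(struct watchdog_model *wd)
{
    wd->jiffies = 0;
    wd->prev_jiffies = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        wd->touch_ts[cpu] = 0;
}

/* Advance the global counter, remembering the previous value. */
void wd_advance(struct watchdog_model *wd, unsigned long delta)
{
    wd->prev_jiffies = wd->jiffies;
    wd->jiffies += delta;
}

/* Model of a CPU touching its watchdog timestamp. */
void wd_touch(struct watchdog_model *wd, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    wd->touch_ts[cpu] = wd->jiffies;
}

/* Unsigned subtraction keeps the comparison valid across counter wrap. */
int wd_is_stale(const struct watchdog_model *wd, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return wd->jiffies - wd->touch_ts[cpu] > SOFTLOCKUP_THRESH;
}

/* If the last step of jiffies was itself larger than the threshold, the
 * counter jumped; otherwise the CPU fell behind while time moved normally. */
enum lag_cause wd_classify(const struct watchdog_model *wd, int cpu)
{
    if (!wd_is_stale(wd, cpu))
        return LAG_NONE;
    if (wd->jiffies - wd->prev_jiffies > SOFTLOCKUP_THRESH)
        return LAG_JIFFIES_JUMP;
    return LAG_ONE_CPU;
}

/* The kind of extra detail suggested for the warning message. */
void wd_report(const struct watchdog_model *wd, int cpu, FILE *out)
{
    fprintf(out, "soft lockup on CPU#%d: jiffies=%lu prev_jiffies=%lu\n",
            cpu, wd->jiffies, wd->prev_jiffies);
    for (int i = 0; i < NR_CPUS; i++)
        fprintf(out, "  CPU#%d timestamp=%lu age=%lu\n",
                i, wd->touch_ts[i], wd->jiffies - wd->touch_ts[i]);
}
```

In this model, timestamps that are stale on every CPU at once together with a large last step point at jiffies jumping, while a single stale timestamp with normal steps points at one CPU lagging. A real diagnosis would still need the actual values printed from the running kernel.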