During a suspend/resume cycle (e.g., a migration), xen_console_resume() is called from xen_suspend() which is called with local irqs disabled by stop_machine(). xen_console_resume() calls rebind_evtchn_irq() which attempts to lock the irq_mapping_update_lock mutex. This produces the BUG listed below. The lock was changed to a mutex by 773659 (xen/irq: Alter the locking to use a mutex instead of a spinlock.). Clearly we can''t just revert this change but it''s not clear to me what the correct fix here is. Can xen_console_resume() be deferred to later in the resume process? David [ 39.877210] BUG: sleeping function called from invalid context at /anfs/drall/scratch/davidvr/x86/linux/kernel/mutex.c:413 [ 39.877210] in_atomic(): 1, irqs_disabled(): 1, pid: 7, name: migration/0 [ 39.877210] no locks held by migration/0/7. [ 39.877210] irq event stamp: 38 [ 39.877210] hardirqs last enabled at (37): [<ffffffff812d07b0>] _raw_spin_unlock_irq+0x30/0x50 [ 39.877210] hardirqs last disabled at (38): [<ffffffff81091ab5>] stop_machine_cpu_stop+0x95/0x110 [ 39.877210] softirqs last enabled at (0): [<ffffffff8103cd98>] copy_process.part.63+0x348/0x13e0 [ 39.877210] softirqs last disabled at (0): [< (null)>] (null) [ 39.877210] CPU: 0 PID: 7 Comm: migration/0 Not tainted 3.10.0-rc6.davidvr #82 [ 39.877210] 0000000000000000 ffff88000ec8fbc8 ffffffff812cc2cf ffff88000ec8fbe8 [ 39.877210] ffffffff810723b5 ffff88000ed16800 ffff88000ec8ffd8 ffff88000ec8fc68 [ 39.877210] ffffffff812cd181 ffff88000ec8fc18 ffffffff81092b17 ffff88000ed16800 [ 39.877210] Call Trace: [ 39.877210] [<ffffffff812cc2cf>] dump_stack+0x19/0x1b [ 39.877210] [<ffffffff810723b5>] __might_sleep+0xf5/0x130 [ 39.877210] [<ffffffff812cd181>] mutex_lock_nested+0x41/0x3b0 [ 39.877210] [<ffffffff81092b17>] ? __irq_put_desc_unlock+0x27/0x50 [ 39.877210] [<ffffffff8109410c>] ? __disable_irq_nosync+0x4c/0x70 [ 39.877210] [<ffffffff81091a01>] ? queue_stop_cpus_work+0xc1/0xe0 [ 39.877210] [<ffffffff811bc912>] rebind_evtchn_irq+0x42/0xc0 [ 39.877210] [<ffffffff811e3c80>] xen_console_resume+0x60/0x70 [ 39.877210] [<ffffffff811b87c2>] xen_suspend+0x92/0xb0 [ 39.877210] [<ffffffff81091acb>] stop_machine_cpu_stop+0xab/0x110 [ 39.877210] [<ffffffff81091a20>] ? queue_stop_cpus_work+0xe0/0xe0 [ 39.877210] [<ffffffff8109170b>] cpu_stopper_thread+0x8b/0x140 [ 39.877210] [<ffffffff812d073f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80 [ 39.877210] [<ffffffff81086f25>] ? trace_hardirqs_on_caller+0x105/0x1d0 [ 39.877210] [<ffffffff81086ffd>] ? trace_hardirqs_on+0xd/0x10 [ 39.877210] [<ffffffff812d0749>] ? _raw_spin_unlock_irqrestore+0x49/0x80 [ 39.877210] [<ffffffff8106a6c6>] smpboot_thread_fn+0x166/0x1f0 [ 39.877210] [<ffffffff8106a560>] ? lg_global_unlock+0x80/0x80 [ 39.877210] [<ffffffff8106271b>] kthread+0xdb/0xe0 [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90 [ 39.877210] [<ffffffff812d1eac>] ret_from_fork+0x7c/0xb0 [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90
On Wed, 2013-06-19 at 13:08 +0100, David Vrabel wrote:> During a suspend/resume cycle (e.g., a migration), xen_console_resume() > is called from xen_suspend() which is called with local irqs disabled by > stop_machine(). > > xen_console_resume() calls rebind_evtchn_irq() which attempts to lock > the irq_mapping_update_lock mutex. This produces the BUG listed below. > > The lock was changed to a mutex by 773659 (xen/irq: Alter the locking to > use a mutex instead of a spinlock.). Clearly we can''t just revert this > change but it''s not clear to me what the correct fix here is. > > Can xen_console_resume() be deferred to later in the resume process?Does that risk losing debug output from the early part of the resume? That would be worth avoiding. So I think it''s just a case of making sure that the rest of the driver is functional if we move the call later. It might even be possible to move that restore into the standard device model restore hook instead of this slightly special case. Does a similar issue not exist with console drivers for a native hibernate? I wonder how they get around that. I have a vague recollection of a debug option for leaving the console unsuspended over hibernate. Ian.> > David > > [ 39.877210] BUG: sleeping function called from invalid context at > /anfs/drall/scratch/davidvr/x86/linux/kernel/mutex.c:413 > [ 39.877210] in_atomic(): 1, irqs_disabled(): 1, pid: 7, name: migration/0 > [ 39.877210] no locks held by migration/0/7. > [ 39.877210] irq event stamp: 38 > [ 39.877210] hardirqs last enabled at (37): [<ffffffff812d07b0>] > _raw_spin_unlock_irq+0x30/0x50 > [ 39.877210] hardirqs last disabled at (38): [<ffffffff81091ab5>] > stop_machine_cpu_stop+0x95/0x110 > [ 39.877210] softirqs last enabled at (0): [<ffffffff8103cd98>] > copy_process.part.63+0x348/0x13e0 > [ 39.877210] softirqs last disabled at (0): [< (null)>] > (null) > [ 39.877210] CPU: 0 PID: 7 Comm: migration/0 Not tainted > 3.10.0-rc6.davidvr #82 > [ 39.877210] 0000000000000000 ffff88000ec8fbc8 ffffffff812cc2cf > ffff88000ec8fbe8 > [ 39.877210] ffffffff810723b5 ffff88000ed16800 ffff88000ec8ffd8 > ffff88000ec8fc68 > [ 39.877210] ffffffff812cd181 ffff88000ec8fc18 ffffffff81092b17 > ffff88000ed16800 > [ 39.877210] Call Trace: > [ 39.877210] [<ffffffff812cc2cf>] dump_stack+0x19/0x1b > [ 39.877210] [<ffffffff810723b5>] __might_sleep+0xf5/0x130 > [ 39.877210] [<ffffffff812cd181>] mutex_lock_nested+0x41/0x3b0 > [ 39.877210] [<ffffffff81092b17>] ? __irq_put_desc_unlock+0x27/0x50 > [ 39.877210] [<ffffffff8109410c>] ? __disable_irq_nosync+0x4c/0x70 > [ 39.877210] [<ffffffff81091a01>] ? queue_stop_cpus_work+0xc1/0xe0 > [ 39.877210] [<ffffffff811bc912>] rebind_evtchn_irq+0x42/0xc0 > [ 39.877210] [<ffffffff811e3c80>] xen_console_resume+0x60/0x70 > [ 39.877210] [<ffffffff811b87c2>] xen_suspend+0x92/0xb0 > [ 39.877210] [<ffffffff81091acb>] stop_machine_cpu_stop+0xab/0x110 > [ 39.877210] [<ffffffff81091a20>] ? queue_stop_cpus_work+0xe0/0xe0 > [ 39.877210] [<ffffffff8109170b>] cpu_stopper_thread+0x8b/0x140 > [ 39.877210] [<ffffffff812d073f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80 > [ 39.877210] [<ffffffff81086f25>] ? trace_hardirqs_on_caller+0x105/0x1d0 > [ 39.877210] [<ffffffff81086ffd>] ? trace_hardirqs_on+0xd/0x10 > [ 39.877210] [<ffffffff812d0749>] ? _raw_spin_unlock_irqrestore+0x49/0x80 > [ 39.877210] [<ffffffff8106a6c6>] smpboot_thread_fn+0x166/0x1f0 > [ 39.877210] [<ffffffff8106a560>] ? lg_global_unlock+0x80/0x80 > [ 39.877210] [<ffffffff8106271b>] kthread+0xdb/0xe0 > [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90 > [ 39.877210] [<ffffffff812d1eac>] ret_from_fork+0x7c/0xb0 > [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90 >
Konrad Rzeszutek Wilk
2013-Jun-19 15:45 UTC
Re: xen_console_resume() may sleep with irqs disabled
On Wed, Jun 19, 2013 at 02:05:43PM +0100, Ian Campbell wrote:> On Wed, 2013-06-19 at 13:08 +0100, David Vrabel wrote: > > During a suspend/resume cycle (e.g., a migration), xen_console_resume() > > is called from xen_suspend() which is called with local irqs disabled by > > stop_machine(). > > > > xen_console_resume() calls rebind_evtchn_irq() which attempts to lock > > the irq_mapping_update_lock mutex. This produces the BUG listed below. > > > > The lock was changed to a mutex by 773659 (xen/irq: Alter the locking to > > use a mutex instead of a spinlock.). Clearly we can''t just revert this > > change but it''s not clear to me what the correct fix here is. > > > > Can xen_console_resume() be deferred to later in the resume process? > > Does that risk losing debug output from the early part of the resume? > That would be worth avoiding. > > So I think it''s just a case of making sure that the rest of the driver > is functional if we move the call later. > > It might even be possible to move that restore into the standard device > model restore hook instead of this slightly special case. > > Does a similar issue not exist with console drivers for a native > hibernate? I wonder how they get around that. I have a vague > recollection of a debug option for leaving the console unsuspended over > hibernate.<nods> I think it was called ''no_console_suspend'' Which looks like you just need to sprinkle ''suspend_console()'' and ''resume_console()'' around in the manage.c code?> > Ian. > > > > > David > > > > [ 39.877210] BUG: sleeping function called from invalid context at > > /anfs/drall/scratch/davidvr/x86/linux/kernel/mutex.c:413 > > [ 39.877210] in_atomic(): 1, irqs_disabled(): 1, pid: 7, name: migration/0 > > [ 39.877210] no locks held by migration/0/7. > > [ 39.877210] irq event stamp: 38 > > [ 39.877210] hardirqs last enabled at (37): [<ffffffff812d07b0>] > > _raw_spin_unlock_irq+0x30/0x50 > > [ 39.877210] hardirqs last disabled at (38): [<ffffffff81091ab5>] > > stop_machine_cpu_stop+0x95/0x110 > > [ 39.877210] softirqs last enabled at (0): [<ffffffff8103cd98>] > > copy_process.part.63+0x348/0x13e0 > > [ 39.877210] softirqs last disabled at (0): [< (null)>] > > (null) > > [ 39.877210] CPU: 0 PID: 7 Comm: migration/0 Not tainted > > 3.10.0-rc6.davidvr #82 > > [ 39.877210] 0000000000000000 ffff88000ec8fbc8 ffffffff812cc2cf > > ffff88000ec8fbe8 > > [ 39.877210] ffffffff810723b5 ffff88000ed16800 ffff88000ec8ffd8 > > ffff88000ec8fc68 > > [ 39.877210] ffffffff812cd181 ffff88000ec8fc18 ffffffff81092b17 > > ffff88000ed16800 > > [ 39.877210] Call Trace: > > [ 39.877210] [<ffffffff812cc2cf>] dump_stack+0x19/0x1b > > [ 39.877210] [<ffffffff810723b5>] __might_sleep+0xf5/0x130 > > [ 39.877210] [<ffffffff812cd181>] mutex_lock_nested+0x41/0x3b0 > > [ 39.877210] [<ffffffff81092b17>] ? __irq_put_desc_unlock+0x27/0x50 > > [ 39.877210] [<ffffffff8109410c>] ? __disable_irq_nosync+0x4c/0x70 > > [ 39.877210] [<ffffffff81091a01>] ? queue_stop_cpus_work+0xc1/0xe0 > > [ 39.877210] [<ffffffff811bc912>] rebind_evtchn_irq+0x42/0xc0 > > [ 39.877210] [<ffffffff811e3c80>] xen_console_resume+0x60/0x70 > > [ 39.877210] [<ffffffff811b87c2>] xen_suspend+0x92/0xb0 > > [ 39.877210] [<ffffffff81091acb>] stop_machine_cpu_stop+0xab/0x110 > > [ 39.877210] [<ffffffff81091a20>] ? queue_stop_cpus_work+0xe0/0xe0 > > [ 39.877210] [<ffffffff8109170b>] cpu_stopper_thread+0x8b/0x140 > > [ 39.877210] [<ffffffff812d073f>] ? _raw_spin_unlock_irqrestore+0x3f/0x80 > > [ 39.877210] [<ffffffff81086f25>] ? trace_hardirqs_on_caller+0x105/0x1d0 > > [ 39.877210] [<ffffffff81086ffd>] ? trace_hardirqs_on+0xd/0x10 > > [ 39.877210] [<ffffffff812d0749>] ? _raw_spin_unlock_irqrestore+0x49/0x80 > > [ 39.877210] [<ffffffff8106a6c6>] smpboot_thread_fn+0x166/0x1f0 > > [ 39.877210] [<ffffffff8106a560>] ? lg_global_unlock+0x80/0x80 > > [ 39.877210] [<ffffffff8106271b>] kthread+0xdb/0xe0 > > [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90 > > [ 39.877210] [<ffffffff812d1eac>] ret_from_fork+0x7c/0xb0 > > [ 39.877210] [<ffffffff81062640>] ? __kthread_parkme+0x90/0x90 > > > >