Rick Boone
2010-Jun-08  20:16 UTC
[Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
Hey,
I'm running into an issue with a pv_ops kernel (2.6.31.13) and Xen 4.0 - 
my domUs are continually locking up under heavy I/O load. My issue 
seems similar to what these guys are reporting:
https://bugzilla.redhat.com/show_bug.cgi?id=551552
https://bugzilla.redhat.com/show_bug.cgi?id=526627
https://bugzilla.redhat.com/show_bug.cgi?id=550724
Any solutions/ideas that haven't been covered in those reports? I've 
turned off the tickless kernel and have also set the guest I/O scheduler to 
"noop", but the machines are still crashing. I'm using LVM-backed block 
devices on the dom0. For a while, I didn't have the kernel set to panic 
on a hung task, and from digging around while the kernel was still up, I 
was able to determine that the device that seems to be causing issues is 
one that sees a lot of I/O (it's receiving all of the logs from a heavily 
used web server).
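In case it helps anyone reproduce this, here is roughly what I mean by the scheduler and hung-task settings; a minimal sketch run inside the domU (xvda is just an example device name, substitute the busy guest disk):

  # check and switch the I/O scheduler on the busy guest disk
  cat /sys/block/xvda/queue/scheduler
  echo noop > /sys/block/xvda/queue/scheduler

  # make the guest panic when khungtaskd finds a blocked task,
  # instead of only logging the INFO: messages shown below
  sysctl kernel.hung_task_panic=1
  sysctl kernel.hung_task_timeout_secs=120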
Here are a couple of my tracebacks:
1)
INFO: task pdflush:36 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
  ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
  00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
Call Trace:
  [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
  [<ffffffff813f1ee0>] io_schedule+0x37/0x59
  [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
  [<ffffffff810c2e77>] sync_page+0x5a/0x72
  [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
  [<ffffffff810c2e1d>] ? sync_page+0x0/0x72
  [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
  [<ffffffff810c2dce>] __lock_page+0x71/0x8c
  [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
  [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
  [<ffffffff810cac23>] ? __writepage+0x0/0x5a
  [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
  [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
  [<ffffffff810cb230>] do_writepages+0x3f/0x5e
  [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
  [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
  [<ffffffff810cc726>] ? pdflush+0x0/0x286
  [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
  [<ffffffff811269bc>] writeback_inodes+0x79/0xdf
  [<ffffffff81107819>] ? sync_supers+0xb3/0xce
  [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
  [<ffffffff810cc84c>] ? pdflush+0x126/0x286
  [<ffffffff810cc889>] pdflush+0x163/0x286
  [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
  [<ffffffff810cc726>] ? pdflush+0x0/0x286
  [<ffffffff810754ce>] kthread+0x9e/0xa8
  [<ffffffff8101606a>] child_rip+0xa/0x20
  [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
  [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
  [<ffffffff81016060>] ? child_rip+0x0/0x20
1 lock held by pdflush/36:
  #0:  (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>] writeback_inodes+0x63/0xdf
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
Call Trace:
  [<ffffffff8105c8f3>] panic+0xb2/0x168
  [<ffffffff81085881>] ? print_lock+0x96/0xb1
  [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
  [<ffffffff8101907a>] ? show_stack+0x2a/0x40
  [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
  [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
  [<ffffffff810b146c>] watchdog+0x209/0x258
  [<ffffffff810b12d8>] ? watchdog+0x75/0x258
  [<ffffffff8104a45f>] ? complete+0x52/0x71
  [<ffffffff810b1263>] ? watchdog+0x0/0x258
  [<ffffffff810754ce>] kthread+0x9e/0xa8
  [<ffffffff8101606a>] child_rip+0xa/0x20
  [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
  [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
  [<ffffffff81016060>] ? child_rip+0x0/0x20
---------------------------------------------
2)
INFO: task kjournald:951 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
  ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
  ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
  0000000000000001 0000000000000200 0000000000000001 000000000160015f
Call Trace:
  [<ffffffff813f1ee0>] io_schedule+0x37/0x59
  [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
  [<ffffffff8112cb73>] sync_buffer+0x51/0x69
  [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
  [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
  [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
  [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
  [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
  [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
  [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
  [<ffffffff8100fa82>] ? check_events+0x12/0x20
  [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
  [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
  [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
  [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
  [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
  [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
  [<ffffffff810754ce>] kthread+0x9e/0xa8
  [<ffffffff8101606a>] child_rip+0xa/0x20
  [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
  [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
  [<ffffffff81016060>] ? child_rip+0x0/0x20
no locks held by kjournald/951.
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
Call Trace:
  [<ffffffff8105c8f3>] panic+0xb2/0x168
  [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
  [<ffffffff8101907a>] ? show_stack+0x2a/0x40
  [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
  [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
  [<ffffffff810b146c>] watchdog+0x209/0x258
  [<ffffffff810b12d8>] ? watchdog+0x75/0x258
  [<ffffffff8104a45f>] ? complete+0x52/0x71
  [<ffffffff810b1263>] ? watchdog+0x0/0x258
  [<ffffffff810754ce>] kthread+0x9e/0xa8
  [<ffffffff8101606a>] child_rip+0xa/0x20
  [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
  [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
  [<ffffffff81016060>] ? child_rip+0x0/0x20
-- Rick Boone
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jun-08  23:17 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
On 06/08/2010 01:16 PM, Rick Boone wrote:
> I'm running into an issue with a pv_ops kernel (2.6.31.13) and Xen 4.0 -
> my domUs are continually locking up under heavy I/O load. My issue
> seems similar to what these guys are reporting:
> [...]

Does booting the domU with "clocksource=jiffies" help at all?

Have you tried a 2.6.32-based domU? I used to see these kinds of
lockups with older kernels on my linode domU, but have seen none since
moving it to Fedora 12 with a 2.6.32 kernel.

Are you using a distro kernel, or one you've built yourself?

    J
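For anyone trying the same "clocksource=jiffies" test: on a pv_ops domU the parameter goes on the guest kernel command line, e.g. via the extra= line of the xm guest config; a sketch (the config syntax shown is only illustrative):

  # in the domU config file, append the diagnostic clocksource:
  #   extra = "clocksource=jiffies"
  # then, inside the guest, confirm which clocksource is actually in use:
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource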
Rick Boone
2010-Jun-10  18:42 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
I booted into 2.6.32.12 yesterday, but I'm still getting a hung task
crash. (Initially, I got a soft lockup crash on one of the CPUs after
switching to 2.6.32.12, but it seems I was able to resolve that by
switching the kernel to tickless.)

I built the kernel myself (for both dom0 and domU), from the
xen/stable-2.6.31.x and xen/stable-2.6.32.x branches.

I tried "clocksource=jiffies" earlier, but didn't give it a fair enough
shake. The time drift ended up becoming pretty bad, even with ntp, so I
had to switch out of it. I'll give it another try today; I'm sure I can
probably figure out the drift issue.

-- Rick

On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:
> [...]
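A rough sketch for keeping an eye on the drift issue mentioned above while testing clocksource=jiffies, using standard ntp tooling inside the guest (pool.ntp.org is only an example server):

  # show offset and jitter against the configured NTP peers
  ntpq -pn
  # one-off comparison against a reference server, query only (does not step the clock)
  ntpdate -q pool.ntp.org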
Jeremy Fitzhardinge
2010-Jun-10  22:47 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
On 06/10/2010 11:42 AM, Rick Boone wrote:
> [...]
> I tried "clocksource=jiffies" earlier, but didn't give it a fair enough
> shake. The time drift ended up becoming pretty bad, even with ntp, so I
> had to switch out of it. I'll give it another try today; I'm sure I can
> probably figure out the drift issue.

Yes, jiffies is a horrible clocksource; it is just a diagnostic test
(not even a workaround), since it seems to have some effect.

I'm really not sure what the root cause of the problem is; it seems that
something is deadlocked in the I/O scheduler, and everything else ends up
getting hung as a result. If you can get the full output of sysrq-t while
the domain is hung, it may help to work out who the culprit is. (I've
never managed to get a complete dump in a hung state, but that was with
an older kernel with some bugs in the console code which truncated the
output; 2.6.32 should be OK.)

    J
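For capturing that sysrq-t output from a hung domU, something along these lines should do it (a sketch; "webserver" is a placeholder domain name, and kernel.sysrq may need to be enabled in the guest beforehand):

  # inside the guest, ahead of time: enable magic sysrq
  sysctl kernel.sysrq=1
  # from the dom0, once the guest hangs: inject sysrq 't' and watch its console
  xm sysrq webserver t
  xm console webserver
  # or, if the guest is still somewhat responsive, trigger the dump from inside it
  echo t > /proc/sysrq-trigger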