Rick Boone
2010-Jun-08 20:16 UTC
[Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
Hey, I''m running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 - my domU''s are continually locking up, under heavy IO load. My issue seems similar to what these guys are reportiing: https://bugzilla.redhat.com/show_bug.cgi?id=551552 https://bugzilla.redhat.com/show_bug.cgi?id=526627 https://bugzilla.redhat.com/show_bug.cgi?id=550724 Any solutions/ideas that haven''t been covered in those reports? I''ve turned off tickless kernel, and have also set the guest I/O scheduler to "noop", but machines are still crashing. I''m using LVM-backed block devices on the dom0. For awhile, I didn''t have the kernel set to panic on a hung task, and from digging around while the kernel was still up, I was able to determine that the device that seems to be causing issues is one that sees a lot of IO (it''s receiving all of the logs on a heavily used web server). Here''s a couple of my tracebacks: 1) INFO: task pdflush:36 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0 ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec Call Trace: [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72 [<ffffffff813f1ee0>] io_schedule+0x37/0x59 [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71 [<ffffffff810c2e77>] sync_page+0x5a/0x72 [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3 [<ffffffff810c2e1d>] ? sync_page+0x0/0x72 [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144 [<ffffffff810c2dce>] __lock_page+0x71/0x8c [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf [<ffffffff810cac23>] ? __writepage+0x0/0x5a [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36 [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f [<ffffffff810cb230>] do_writepages+0x3f/0x5e [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7 [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355 [<ffffffff810cc726>] ? pdflush+0x0/0x286 [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32 [<ffffffff811269bc>] writeback_inodes+0x79/0xdf [<ffffffff81107819>] ? sync_supers+0xb3/0xce [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a [<ffffffff810cc84c>] ? pdflush+0x126/0x286 [<ffffffff810cc889>] pdflush+0x163/0x286 [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a [<ffffffff810cc726>] ? pdflush+0x0/0x286 [<ffffffff810754ce>] kthread+0x9e/0xa8 [<ffffffff8101606a>] child_rip+0xa/0x20 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 [<ffffffff81016060>] ? child_rip+0x0/0x20 1 lock held by pdflush/36: #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>] writeback_inodes+0x63/0xdf Kernel panic - not syncing: hung_task: blocked tasks Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4 Call Trace: [<ffffffff8105c8f3>] panic+0xb2/0x168 [<ffffffff81085881>] ? print_lock+0x96/0xb1 [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9 [<ffffffff8101907a>] ? show_stack+0x2a/0x40 [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 [<ffffffff810b146c>] watchdog+0x209/0x258 [<ffffffff810b12d8>] ? watchdog+0x75/0x258 [<ffffffff8104a45f>] ? complete+0x52/0x71 [<ffffffff810b1263>] ? watchdog+0x0/0x258 [<ffffffff810754ce>] kthread+0x9e/0xa8 [<ffffffff8101606a>] child_rip+0xa/0x20 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 [<ffffffff81016060>] ? child_rip+0x0/0x20 --------------------------------------------- 2) INFO: task kjournald:951 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00 ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001 0000000000000001 0000000000000200 0000000000000001 000000000160015f Call Trace: [<ffffffff813f1ee0>] io_schedule+0x37/0x59 [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36 [<ffffffff8112cb73>] sync_buffer+0x51/0x69 [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48 [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd] [<ffffffff8100fa82>] ? check_events+0x12/0x20 [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84 [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0 [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd] [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd] [<ffffffff810754ce>] kthread+0x9e/0xa8 [<ffffffff8101606a>] child_rip+0xa/0x20 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 [<ffffffff81016060>] ? child_rip+0x0/0x20 no locks held by kjournald/951. Kernel panic - not syncing: hung_task: blocked tasks Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18 Call Trace: [<ffffffff8105c8f3>] panic+0xb2/0x168 [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9 [<ffffffff8101907a>] ? show_stack+0x2a/0x40 [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 [<ffffffff810b146c>] watchdog+0x209/0x258 [<ffffffff810b12d8>] ? watchdog+0x75/0x258 [<ffffffff8104a45f>] ? complete+0x52/0x71 [<ffffffff810b1263>] ? watchdog+0x0/0x258 [<ffffffff810754ce>] kthread+0x9e/0xa8 [<ffffffff8101606a>] child_rip+0xa/0x20 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 [<ffffffff81016060>] ? child_rip+0x0/0x20 -- Rick Boone _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jun-08 23:17 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
On 06/08/2010 01:16 PM, Rick Boone wrote:> Hey, > > I''m running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 - > my domU''s are continually locking up, under heavy IO load. My issue > seems similar to what these guys are reportiing: > > https://bugzilla.redhat.com/show_bug.cgi?id=551552 > https://bugzilla.redhat.com/show_bug.cgi?id=526627 > https://bugzilla.redhat.com/show_bug.cgi?id=550724 > > Any solutions/ideas that haven''t been covered in those reports? I''ve > turned off tickless kernel, and have also set the guest I/O scheduler > to "noop", but machines are still crashing. I''m using LVM-backed block > devices on the dom0. For awhile, I didn''t have the kernel set to panic > on a hung task, and from digging around while the kernel was still up, > I was able to determine that the device that seems to be causing > issues is one that sees a lot of IO (it''s receiving all of the logs on > a heavily used web server).Does booting the domU with "clocksource=jiffies" help at all? Have you tried a 2.6.32-based domU? I used to see these kinds of lockups with older kernels on my linode domU, but have seen none since moving it to Fedora 12 with a 2.6.32 kernel. Are you using a distro kernel, or one you''ve built yourself? J> > Here''s a couple of my tracebacks: > 1) > > INFO: task pdflush:36 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0 > ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94 > 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec > Call Trace: > [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72 > [<ffffffff813f1ee0>] io_schedule+0x37/0x59 > [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71 > [<ffffffff810c2e77>] sync_page+0x5a/0x72 > [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3 > [<ffffffff810c2e1d>] ? sync_page+0x0/0x72 > [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144 > [<ffffffff810c2dce>] __lock_page+0x71/0x8c > [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 > [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf > [<ffffffff810cac23>] ? __writepage+0x0/0x5a > [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36 > [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f > [<ffffffff810cb230>] do_writepages+0x3f/0x5e > [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7 > [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355 > [<ffffffff810cc726>] ? pdflush+0x0/0x286 > [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32 > [<ffffffff811269bc>] writeback_inodes+0x79/0xdf > [<ffffffff81107819>] ? sync_supers+0xb3/0xce > [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a > [<ffffffff810cc84c>] ? pdflush+0x126/0x286 > [<ffffffff810cc889>] pdflush+0x163/0x286 > [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a > [<ffffffff810cc726>] ? pdflush+0x0/0x286 > [<ffffffff810754ce>] kthread+0x9e/0xa8 > [<ffffffff8101606a>] child_rip+0xa/0x20 > [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 > [<ffffffff81016060>] ? child_rip+0x0/0x20 > 1 lock held by pdflush/36: > #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>] > writeback_inodes+0x63/0xdf > Kernel panic - not syncing: hung_task: blocked tasks > Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4 > Call Trace: > [<ffffffff8105c8f3>] panic+0xb2/0x168 > [<ffffffff81085881>] ? print_lock+0x96/0xb1 > [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9 > [<ffffffff8101907a>] ? show_stack+0x2a/0x40 > [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 > [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 > [<ffffffff810b146c>] watchdog+0x209/0x258 > [<ffffffff810b12d8>] ? watchdog+0x75/0x258 > [<ffffffff8104a45f>] ? complete+0x52/0x71 > [<ffffffff810b1263>] ? watchdog+0x0/0x258 > [<ffffffff810754ce>] kthread+0x9e/0xa8 > [<ffffffff8101606a>] child_rip+0xa/0x20 > [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 > [<ffffffff81016060>] ? child_rip+0x0/0x20 > > --------------------------------------------- > > 2) > INFO: task kjournald:951 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00 > ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001 > 0000000000000001 0000000000000200 0000000000000001 000000000160015f > Call Trace: > [<ffffffff813f1ee0>] io_schedule+0x37/0x59 > [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36 > [<ffffffff8112cb73>] sync_buffer+0x51/0x69 > [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c > [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 > [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 > [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e > [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 > [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48 > [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd] > [<ffffffff8100fa82>] ? check_events+0x12/0x20 > [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 > [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f > [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84 > [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0 > [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd] > [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e > [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd] > [<ffffffff810754ce>] kthread+0x9e/0xa8 > [<ffffffff8101606a>] child_rip+0xa/0x20 > [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 > [<ffffffff81016060>] ? child_rip+0x0/0x20 > no locks held by kjournald/951. > Kernel panic - not syncing: hung_task: blocked tasks > Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18 > Call Trace: > [<ffffffff8105c8f3>] panic+0xb2/0x168 > [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9 > [<ffffffff8101907a>] ? show_stack+0x2a/0x40 > [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 > [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 > [<ffffffff810b146c>] watchdog+0x209/0x258 > [<ffffffff810b12d8>] ? watchdog+0x75/0x258 > [<ffffffff8104a45f>] ? complete+0x52/0x71 > [<ffffffff810b1263>] ? watchdog+0x0/0x258 > [<ffffffff810754ce>] kthread+0x9e/0xa8 > [<ffffffff8101606a>] child_rip+0xa/0x20 > [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b > [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 > [<ffffffff81016060>] ? child_rip+0x0/0x20 > > -- Rick Boone > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Rick Boone
2010-Jun-10 18:42 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
I booted into 2.6.32.12 yesterday, but I''m still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu''s, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless) I built the kernel myself (for both dom0 and domU), from the xen/stable-2.6.31.x and xen/stable-2.6.32.x branches. I tried "clocksource=jiffies" earlier, but didnt give it a fair enough shake. The time drift ended up becoming pretty bad, even with ntp, so I had to switch out of it. I''ll give it another try today; I''m sure I can probably figure out the drift issue. -- Rick On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:> On 06/08/2010 01:16 PM, Rick Boone wrote: > >> Hey, >> >> I''m running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 - >> my domU''s are continually locking up, under heavy IO load. My issue >> seems similar to what these guys are reportiing: >> >> https://bugzilla.redhat.com/show_bug.cgi?id=551552 >> https://bugzilla.redhat.com/show_bug.cgi?id=526627 >> https://bugzilla.redhat.com/show_bug.cgi?id=550724 >> >> Any solutions/ideas that haven''t been covered in those reports? I''ve >> turned off tickless kernel, and have also set the guest I/O scheduler >> to "noop", but machines are still crashing. I''m using LVM-backed block >> devices on the dom0. For awhile, I didn''t have the kernel set to panic >> on a hung task, and from digging around while the kernel was still up, >> I was able to determine that the device that seems to be causing >> issues is one that sees a lot of IO (it''s receiving all of the logs on >> a heavily used web server). >> > Does booting the domU with "clocksource=jiffies" help at all? > > Have you tried a 2.6.32-based domU? I used to see these kinds of > lockups with older kernels on my linode domU, but have seen none since > moving it to Fedora 12 with a 2.6.32 kernel. > > Are you using a distro kernel, or one you''ve built yourself? > > J > > >> Here''s a couple of my tracebacks: >> 1) >> >> INFO: task pdflush:36 blocked for more than 120 seconds. >> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0 >> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94 >> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec >> Call Trace: >> [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72 >> [<ffffffff813f1ee0>] io_schedule+0x37/0x59 >> [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71 >> [<ffffffff810c2e77>] sync_page+0x5a/0x72 >> [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3 >> [<ffffffff810c2e1d>] ? sync_page+0x0/0x72 >> [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144 >> [<ffffffff810c2dce>] __lock_page+0x71/0x8c >> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 >> [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf >> [<ffffffff810cac23>] ? __writepage+0x0/0x5a >> [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36 >> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 >> [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f >> [<ffffffff810cb230>] do_writepages+0x3f/0x5e >> [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7 >> [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355 >> [<ffffffff810cc726>] ? pdflush+0x0/0x286 >> [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32 >> [<ffffffff811269bc>] writeback_inodes+0x79/0xdf >> [<ffffffff81107819>] ? sync_supers+0xb3/0xce >> [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a >> [<ffffffff810cc84c>] ? pdflush+0x126/0x286 >> [<ffffffff810cc889>] pdflush+0x163/0x286 >> [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a >> [<ffffffff810cc726>] ? pdflush+0x0/0x286 >> [<ffffffff810754ce>] kthread+0x9e/0xa8 >> [<ffffffff8101606a>] child_rip+0xa/0x20 >> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >> [<ffffffff81016060>] ? child_rip+0x0/0x20 >> 1 lock held by pdflush/36: >> #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>] >> writeback_inodes+0x63/0xdf >> Kernel panic - not syncing: hung_task: blocked tasks >> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4 >> Call Trace: >> [<ffffffff8105c8f3>] panic+0xb2/0x168 >> [<ffffffff81085881>] ? print_lock+0x96/0xb1 >> [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9 >> [<ffffffff8101907a>] ? show_stack+0x2a/0x40 >> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 >> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 >> [<ffffffff810b146c>] watchdog+0x209/0x258 >> [<ffffffff810b12d8>] ? watchdog+0x75/0x258 >> [<ffffffff8104a45f>] ? complete+0x52/0x71 >> [<ffffffff810b1263>] ? watchdog+0x0/0x258 >> [<ffffffff810754ce>] kthread+0x9e/0xa8 >> [<ffffffff8101606a>] child_rip+0xa/0x20 >> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >> [<ffffffff81016060>] ? child_rip+0x0/0x20 >> >> --------------------------------------------- >> >> 2) >> INFO: task kjournald:951 blocked for more than 120 seconds. >> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00 >> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001 >> 0000000000000001 0000000000000200 0000000000000001 000000000160015f >> Call Trace: >> [<ffffffff813f1ee0>] io_schedule+0x37/0x59 >> [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36 >> [<ffffffff8112cb73>] sync_buffer+0x51/0x69 >> [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c >> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 >> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 >> [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e >> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 >> [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48 >> [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd] >> [<ffffffff8100fa82>] ? check_events+0x12/0x20 >> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 >> [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f >> [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84 >> [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0 >> [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd] >> [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e >> [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd] >> [<ffffffff810754ce>] kthread+0x9e/0xa8 >> [<ffffffff8101606a>] child_rip+0xa/0x20 >> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >> [<ffffffff81016060>] ? child_rip+0x0/0x20 >> no locks held by kjournald/951. >> Kernel panic - not syncing: hung_task: blocked tasks >> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18 >> Call Trace: >> [<ffffffff8105c8f3>] panic+0xb2/0x168 >> [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9 >> [<ffffffff8101907a>] ? show_stack+0x2a/0x40 >> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 >> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 >> [<ffffffff810b146c>] watchdog+0x209/0x258 >> [<ffffffff810b12d8>] ? watchdog+0x75/0x258 >> [<ffffffff8104a45f>] ? complete+0x52/0x71 >> [<ffffffff810b1263>] ? watchdog+0x0/0x258 >> [<ffffffff810754ce>] kthread+0x9e/0xa8 >> [<ffffffff8101606a>] child_rip+0xa/0x20 >> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >> [<ffffffff81016060>] ? child_rip+0x0/0x20 >> >> -- Rick Boone >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xensource.com >> http://lists.xensource.com/xen-devel >> >> >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jeremy Fitzhardinge
2010-Jun-10 22:47 UTC
Re: [Xen-devel] I/O related (?) domU crash on xen 4.0 + pv_ops
On 06/10/2010 11:42 AM, Rick Boone wrote:> I booted into 2.6.32.12 yesterday, but I''m still getting a hung task > crash. (Initially, I got a soft lockup crash on one of the cpu''s, > after switching to 2.6.32.12, but it seems I was able to resolve that > by switching the kernel to tickless) > > I built the kernel myself (for both dom0 and domU), from the > xen/stable-2.6.31.x and xen/stable-2.6.32.x branches. > > I tried "clocksource=jiffies" earlier, but didnt give it a fair enough > shake. The time drift ended up becoming pretty bad, even with ntp, so > I had to switch out of it. I''ll give it another try today; I''m sure I > can probably figure out the drift issue.Yes, jiffies is a horrible clocksource; it is just a diagnostic test (not even a workaround), since it seems to have some effect. I''m really not sure what the root cause of the problem is; it seems that something is deadlocked in the io scheduler, and everything else ends up getting hung as a result. If you can get the full output of sysrq-t while the domain is hung it may help try to work out who the culprit is (I''ve never managed to get a complete dump in a hung state, but that was with an older kernel with some bugs in the console code which truncated the output; 2.6.32 should be OK). J> > -- Rick > > On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote: >> On 06/08/2010 01:16 PM, Rick Boone wrote: >> >>> Hey, >>> >>> I''m running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 - >>> my domU''s are continually locking up, under heavy IO load. My issue >>> seems similar to what these guys are reportiing: >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=551552 >>> https://bugzilla.redhat.com/show_bug.cgi?id=526627 >>> https://bugzilla.redhat.com/show_bug.cgi?id=550724 >>> >>> Any solutions/ideas that haven''t been covered in those reports? I''ve >>> turned off tickless kernel, and have also set the guest I/O scheduler >>> to "noop", but machines are still crashing. I''m using LVM-backed block >>> devices on the dom0. For awhile, I didn''t have the kernel set to panic >>> on a hung task, and from digging around while the kernel was still up, >>> I was able to determine that the device that seems to be causing >>> issues is one that sees a lot of IO (it''s receiving all of the logs on >>> a heavily used web server). >>> >> Does booting the domU with "clocksource=jiffies" help at all? >> >> Have you tried a 2.6.32-based domU? I used to see these kinds of >> lockups with older kernels on my linode domU, but have seen none since >> moving it to Fedora 12 with a 2.6.32 kernel. >> >> Are you using a distro kernel, or one you''ve built yourself? >> >> J >> >> >>> Here''s a couple of my tracebacks: >>> 1) >>> >>> INFO: task pdflush:36 blocked for more than 120 seconds. >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0 >>> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94 >>> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec >>> Call Trace: >>> [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72 >>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59 >>> [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71 >>> [<ffffffff810c2e77>] sync_page+0x5a/0x72 >>> [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3 >>> [<ffffffff810c2e1d>] ? sync_page+0x0/0x72 >>> [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144 >>> [<ffffffff810c2dce>] __lock_page+0x71/0x8c >>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 >>> [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf >>> [<ffffffff810cac23>] ? __writepage+0x0/0x5a >>> [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36 >>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 >>> [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f >>> [<ffffffff810cb230>] do_writepages+0x3f/0x5e >>> [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7 >>> [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355 >>> [<ffffffff810cc726>] ? pdflush+0x0/0x286 >>> [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32 >>> [<ffffffff811269bc>] writeback_inodes+0x79/0xdf >>> [<ffffffff81107819>] ? sync_supers+0xb3/0xce >>> [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a >>> [<ffffffff810cc84c>] ? pdflush+0x126/0x286 >>> [<ffffffff810cc889>] pdflush+0x163/0x286 >>> [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a >>> [<ffffffff810cc726>] ? pdflush+0x0/0x286 >>> [<ffffffff810754ce>] kthread+0x9e/0xa8 >>> [<ffffffff8101606a>] child_rip+0xa/0x20 >>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >>> [<ffffffff81016060>] ? child_rip+0x0/0x20 >>> 1 lock held by pdflush/36: >>> #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>] >>> writeback_inodes+0x63/0xdf >>> Kernel panic - not syncing: hung_task: blocked tasks >>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4 >>> Call Trace: >>> [<ffffffff8105c8f3>] panic+0xb2/0x168 >>> [<ffffffff81085881>] ? print_lock+0x96/0xb1 >>> [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9 >>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40 >>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 >>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 >>> [<ffffffff810b146c>] watchdog+0x209/0x258 >>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258 >>> [<ffffffff8104a45f>] ? complete+0x52/0x71 >>> [<ffffffff810b1263>] ? watchdog+0x0/0x258 >>> [<ffffffff810754ce>] kthread+0x9e/0xa8 >>> [<ffffffff8101606a>] child_rip+0xa/0x20 >>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >>> [<ffffffff81016060>] ? child_rip+0x0/0x20 >>> >>> --------------------------------------------- >>> >>> 2) >>> INFO: task kjournald:951 blocked for more than 120 seconds. >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00 >>> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001 >>> 0000000000000001 0000000000000200 0000000000000001 000000000160015f >>> Call Trace: >>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59 >>> [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36 >>> [<ffffffff8112cb73>] sync_buffer+0x51/0x69 >>> [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c >>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 >>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69 >>> [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e >>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51 >>> [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48 >>> [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd] >>> [<ffffffff8100fa82>] ? check_events+0x12/0x20 >>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1 >>> [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f >>> [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84 >>> [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0 >>> [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd] >>> [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e >>> [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd] >>> [<ffffffff810754ce>] kthread+0x9e/0xa8 >>> [<ffffffff8101606a>] child_rip+0xa/0x20 >>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >>> [<ffffffff81016060>] ? child_rip+0x0/0x20 >>> no locks held by kjournald/951. >>> Kernel panic - not syncing: hung_task: blocked tasks >>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18 >>> Call Trace: >>> [<ffffffff8105c8f3>] panic+0xb2/0x168 >>> [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9 >>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40 >>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87 >>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49 >>> [<ffffffff810b146c>] watchdog+0x209/0x258 >>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258 >>> [<ffffffff8104a45f>] ? complete+0x52/0x71 >>> [<ffffffff810b1263>] ? watchdog+0x0/0x258 >>> [<ffffffff810754ce>] kthread+0x9e/0xa8 >>> [<ffffffff8101606a>] child_rip+0xa/0x20 >>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b >>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6 >>> [<ffffffff81016060>] ? child_rip+0x0/0x20 >>> >>> -- Rick Boone >>> >>> >>> _______________________________________________ >>> Xen-devel mailing list >>> Xen-devel@lists.xensource.com >>> http://lists.xensource.com/xen-devel >>> >>> >>_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel