List,

I ran into a hang issue (race condition: CPU usage is high while the server is
idle, meaning that btrfs is hanging, and iowait is high as well) running
2.6.34 on Debian lenny on an x86_64 server (dual Opteron 275 w/ 16GB RAM).
The btrfs filesystem lives on 18x300GB SCSI spindles, configured as RAID-0,
as shown below:

Label: none  uuid: bc6442c6-2fe2-4236-a5aa-6b7841234c52
	Total devices 18 FS bytes used 2.94TB
	devid    5 size 279.39GB used 208.33GB path /dev/cciss/c1d0
	devid   17 size 279.39GB used 208.34GB path /dev/cciss/c1d8
	devid   16 size 279.39GB used 209.33GB path /dev/cciss/c1d7
	devid    4 size 279.39GB used 208.33GB path /dev/cciss/c0d4
	devid    1 size 279.39GB used 233.72GB path /dev/cciss/c0d1
	devid   13 size 279.39GB used 208.33GB path /dev/cciss/c1d4
	devid    8 size 279.39GB used 208.33GB path /dev/cciss/c1d11
	devid   12 size 279.39GB used 208.33GB path /dev/cciss/c1d3
	devid    3 size 279.39GB used 208.33GB path /dev/cciss/c0d3
	devid    9 size 279.39GB used 208.33GB path /dev/cciss/c1d12
	devid    6 size 279.39GB used 208.33GB path /dev/cciss/c1d1
	devid   11 size 279.39GB used 208.33GB path /dev/cciss/c1d2
	devid   14 size 279.39GB used 208.33GB path /dev/cciss/c1d5
	devid    2 size 279.39GB used 233.70GB path /dev/cciss/c0d2
	devid   15 size 279.39GB used 209.33GB path /dev/cciss/c1d6
	devid   10 size 279.39GB used 208.33GB path /dev/cciss/c1d13
	devid    7 size 279.39GB used 208.33GB path /dev/cciss/c1d10
	devid   18 size 279.39GB used 208.34GB path /dev/cciss/c1d9
Btrfs v0.19-16-g075587c-dirty

The filesystem, mounted on /mnt/btrfs, is hanging: no existing or new process
can access it, although 'df' still displays the disk usage (3TB out of 5).
The disks appear to be physically healthy. Please note that a significant
number of files were placed on this filesystem, between 20 and 30 million.

The relevant kernel messages are shown below; the hung-task watchdog logged
the same three traces repeatedly, so one copy of each is included:

INFO: task btrfs-submit-0:4220 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-submit- D 000000010042e12f     0  4220      2 0x00000000
 ffff8803e584ac70 0000000000000046 0000000000004000 0000000000011680
 ffff8803f7349fd8 ffff8803f7349fd8 ffff8803e584ac70 0000000000011680
 0000000000000001 ffff8803ff99d250 ffffffff8149f020 0000000081150ab0
Call Trace:
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff811470be>] ? get_request_wait+0xab/0x140
 [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81143a4d>] ? elv_rq_merge_ok+0x89/0x97
 [<ffffffff8114a245>] ? blk_recount_segments+0x17/0x27
 [<ffffffff81147429>] ? __make_request+0x2d6/0x3fc
 [<ffffffff81145b16>] ? generic_make_request+0x207/0x268
 [<ffffffff81145c12>] ? submit_bio+0x9b/0xa2
 [<ffffffffa01aa081>] ? btrfs_requeue_work+0xd7/0xe1 [btrfs]
 [<ffffffffa01a5365>] ? run_scheduled_bios+0x297/0x48f [btrfs]
 [<ffffffffa01aa687>] ? worker_loop+0x17c/0x452 [btrfs]
 [<ffffffffa01aa50b>] ? worker_loop+0x0/0x452 [btrfs]
 [<ffffffff81040331>] ? kthread+0x79/0x81
 [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810402b8>] ? kthread+0x0/0x81
 [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10

INFO: task btrfs-transacti:4230 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transac D 000000010042e1cc     0  4230      2 0x00000000
 ffff8803e544d300 0000000000000046 0000000000004000 0000000000011680
 ffff8803f3531fd8 ffff8803f3531fd8 ffff8803e544d300 0000000000011680
 ffff8803fe488240 00000000000004c1 ffff8803ff8d7340 0000000381147502
Call Trace:
 [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff810b418e>] ? sync_buffer+0x3b/0x3f
 [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
 [<ffffffff810b4153>] ? sync_buffer+0x0/0x3f
 [<ffffffff81309054>] ? out_of_line_wait_on_bit+0x6b/0x77
 [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
 [<ffffffffa0186635>] ? write_dev_supers+0xf3/0x225 [btrfs]
 [<ffffffffa018693b>] ? write_all_supers+0x1d4/0x22c [btrfs]
 [<ffffffffa01898a1>] ? btrfs_commit_transaction+0x4fe/0x5e1 [btrfs]
 [<ffffffff810406f4>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffffa0185628>] ? transaction_kthread+0x16b/0x1fd [btrfs]
 [<ffffffffa01854bd>] ? transaction_kthread+0x0/0x1fd [btrfs]
 [<ffffffff81040331>] ? kthread+0x79/0x81
 [<ffffffff81003674>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810402b8>] ? kthread+0x0/0x81
 [<ffffffff81003670>] ? kernel_thread_helper+0x0/0x10

INFO: task tar:31615 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tar           D 000000010042dee1     0 31615   4269 0x00000000
 ffff8803ffa74d70 0000000000000082 0000000000004000 0000000000011680
 ffff88010046dfd8 ffff88010046dfd8 ffff8803ffa74d70 0000000000011680
 ffff880361cdd480 0000000161cdd480 ffff8803ff8becf0 0000000200005fff
Call Trace:
 [<ffffffff8106d2af>] ? sync_page+0x0/0x45
 [<ffffffff813089f3>] ? io_schedule+0x71/0xb1
 [<ffffffff8106d2f0>] ? sync_page+0x41/0x45
 [<ffffffff81308fba>] ? __wait_on_bit+0x41/0x70
 [<ffffffff8106d483>] ? wait_on_page_bit+0x6b/0x71
 [<ffffffff81040722>] ? wake_bit_function+0x0/0x23
 [<ffffffffa0192917>] ? prepare_pages+0xe0/0x244 [btrfs]
 [<ffffffffa017e85a>] ? btrfs_check_data_free_space+0x69/0x206 [btrfs]
 [<ffffffffa0192f6a>] ? btrfs_file_write+0x405/0x711 [btrfs]
 [<ffffffff811ab83d>] ? tty_write+0x213/0x22e
 [<ffffffff810951b8>] ? vfs_write+0xad/0x149
 [<ffffffff81095310>] ? sys_write+0x45/0x6e
 [<ffffffff810028eb>] ? system_call_fastpath+0x16/0x1b

Jerome J. Ibanes
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> List,
>
> I ran into a hang issue (race condition: cpu is high when the server is
> idle, meaning that btrfs is hanging, and IOwait is high as well) running
> 2.6.34 on debian/lenny on a x86_64 server (dual Opteron 275 w/ 16GB ram).
> [device list and hung-task traces snipped]

This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
This is reproducible in our setup.

Thanks,
Shaohua
On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
>> [original report snipped]
>
> This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> This is reproducible in our setup.

I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
The code in the first do-while loop in btrfs_commit_transaction
sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
btrfs_run_ordered_operations(). All of these functions may call
cond_resched().
On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > [original report snipped]
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> The code in the first do-while loop in btrfs_commit_transaction
> sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().

Hi,
When I test random write, I see a lot of threads jump into btree_writepages()
and do nothing, and I/O throughput is zero for some time. Looks like there is
a livelock. See the code of btree_writepages():

	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		u64 num_dirty;
		unsigned long thresh = 32 * 1024 * 1024;

		if (wbc->for_kupdate)
			return 0;

		/* this is a bit racy, but that's ok */
		num_dirty = root->fs_info->dirty_metadata_bytes;
>>>>>>		if (num_dirty < thresh)
			return 0;
	}

The marked line is quite intrusive. In my test, the livelock is caused by the
thresh check: dirty_metadata_bytes stays below 32M. Without it, I can't see
the livelock. Not sure if this is related to the hang.

Thanks,
Shaohua
On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> [quoted report snipped]
>
> I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> The code in the first do-while loop in btrfs_commit_transaction
> sets the current process to the TASK_UNINTERRUPTIBLE state, then calls
> btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> btrfs_run_ordered_operations(). All of these functions may call
> cond_resched().

The TASK_UNINTERRUPTIBLE problem was fixed with 2.6.35-rc1. You can find the
changes in the master branch of the btrfs-unstable repo.

-chris
On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> > >> [original report and call trace snipped -- see the start of the thread]
> > > This looks like the issue we saw too, http://lkml.org/lkml/2010/6/8/375.
> > > This is reproducible in our setup.
> >
> > I think I know the cause of http://lkml.org/lkml/2010/6/8/375.
> > The code in the first do-while loop in btrfs_commit_transaction
> > sets the current process to TASK_UNINTERRUPTIBLE state, then calls
> > btrfs_start_delalloc_inodes, btrfs_wait_ordered_extents and
> > btrfs_run_ordered_operations(). All of these functions may call
> > cond_resched().
>
> Hi,
> When I test random write, I saw a lot of threads jump into btree_writepages()
> and do nothing, and IO throughput is zero for some time. Looks like there is
> a live lock. See the code of btree_writepages():
>
>	if (wbc->sync_mode == WB_SYNC_NONE) {
>		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
>		u64 num_dirty;
>		unsigned long thresh = 32 * 1024 * 1024;
>
>		if (wbc->for_kupdate)
>			return 0;
>
>		/* this is a bit racy, but that's ok */
>		num_dirty = root->fs_info->dirty_metadata_bytes;
> >>>>>>	if (num_dirty < thresh)
>			return 0;
>	}
>
> The marked line is quite intrusive. In my test, the live lock is caused by
> the thresh check. The dirty_metadata_bytes < 32M. Without it, I can't see
> the live lock. Not sure if this is related to the hang.

How much ram do you have?  The goal of the check is to avoid writing
metadata blocks because once we write them we have to do more IO to cow
them again if they are changed later.

It shouldn't be looping hard in btrfs there, what was the workload?

-chris
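To make the semantics of the quoted check concrete, here is a small userspace model of the same gating logic (a sketch only, with the kernel types stubbed out): background writeback of btree pages is skipped while dirty metadata stays under the 32MB threshold, so hot metadata is not written out only to be COWed again on the next change.

```c
#include <assert.h>
#include <stdint.h>

/* Stub of the writeback sync mode; in the kernel this lives in
 * struct writeback_control. */
enum sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Returns nonzero when background writeback should be skipped,
 * mirroring the early "return 0" paths in the quoted code. */
static int should_skip_writeback(enum sync_mode mode, int for_kupdate,
                                 uint64_t dirty_metadata_bytes)
{
	const uint64_t thresh = 32ULL * 1024 * 1024;

	if (mode != WB_SYNC_NONE)
		return 0;	/* sync (WB_SYNC_ALL) writeback always proceeds */
	if (for_kupdate)
		return 1;	/* periodic kupdate flushes are skipped */
	return dirty_metadata_bytes < thresh;	/* below 32MB: keep it dirty */
}
```

The live lock Shaohua describes follows from the last line: with RAM pressure forcing writeback but `dirty_metadata_bytes` stuck below 32MB, every caller bails out early and no progress is made.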
On Mon, 14 Jun 2010, Chris Mason wrote:
> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
>> [quoted report and btree_writepages() discussion snipped]
>
> How much ram do you have?  The goal of the check is to avoid writing
> metadata blocks because once we write them we have to do more IO to cow
> them again if they are changed later.

This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory).

> It shouldn't be looping hard in btrfs there, what was the workload?

The workload was the extraction of large tarballs (one at a time, about
300+ files extracted per second from a single tarball, which is pretty
good). As you might expect, the disks were tested (read and write) for
physical errors before I reported this bug.

Jerome J. Ibanes
On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
> On Mon, 14 Jun 2010, Chris Mason wrote:
>> [quoted report and earlier discussion snipped]
>
> This server has 16GB of ram on a x86_64 (dual opteron 275, ecc memory).
>
> The workload was the extraction of large tarballs (one at a time, about
> 300+ files extracted per second from a single tarball, which is pretty
> good). As you might expect, the disks were tested (read and write) for
> physical errors before I reported this bug.

I think Zheng is right and this one will get fixed by the latest code.
The spinning writepage part should be a different problem.

-chris
> On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
>> [quoted report and earlier discussion snipped]
>
> I think Zheng is right and this one will get fixed by the latest code.
> The spinning writepage part should be a different problem.

I'm trying to repro with 2.6.35-rc3, expect results within 24 hours.

Jerome J. Ibanes
On Mon, 14 Jun 2010, Jerome Ibanes wrote:
>> On Mon, Jun 14, 2010 at 11:12:53AM -0700, Jerome Ibanes wrote:
>>> [quoted report and earlier discussion snipped]
>>
>> I think Zheng is right and this one will get fixed by the latest code.
>> The spinning writepage part should be a different problem.
>
> I'm trying to repro with 2.6.35-rc3, expect results within 24 hours.

I can no longer repro this issue (after 48 hours of filesystem stress)
under 2.6.35-rc3 on a filesystem with over 50 million files. I will
reopen this thread should this reoccur.

Jerome J. Ibanes
On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote:
> On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
>> [quoted report and btree_writepages() discussion snipped]
>
> How much ram do you have?  The goal of the check is to avoid writing
> metadata blocks because once we write them we have to do more IO to cow
> them again if they are changed later.
>
> It shouldn't be looping hard in btrfs there, what was the workload?

This is a fio randomwrite. Yep, I limited memory to a small size (~500M),
because it makes it easy for me to reproduce a 'xxx blocked for more than
120 seconds' issue. I can understand small memory could be an issue, but
this still looks intrusive, right?

The issue Yanmin reported is under 2.6.35-rc1, so it might not be the
'TASK_UNINTERRUPTIBLE' issue, but we will try -rc3 too.
Thanks, Shaohua -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 17, 2010 at 09:41:18AM +0800, Shaohua Li wrote:
> On Mon, Jun 14, 2010 at 09:28:29PM +0800, Chris Mason wrote:
> > On Sun, Jun 13, 2010 at 02:50:06PM +0800, Shaohua Li wrote:
> > > On Fri, Jun 11, 2010 at 10:32:07AM +0800, Yan, Zheng wrote:
> > > > On Fri, Jun 11, 2010 at 9:12 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > > On Fri, Jun 11, 2010 at 01:41:41AM +0800, Jerome Ibanes wrote:
> > > > > > [ original hang report, device listing and traces snipped;
> > > > > >   quoted in full earlier in the thread ]
>
> [ earlier discussion of the btree_writepages() thresh check and the
>   TASK_UNINTERRUPTIBLE/cond_resched() issue snipped ]
>
> The issue Yanmin reported is under 2.6.35-rc1, so it might not be the
> TASK_UNINTERRUPTIBLE issue, but we will try -rc3 too.

I still get the message below with 2.6.35-rc3. The system is still running,
because my fio test finished even with the message.

INFO: task flush-btrfs-134:14144 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-btrfs-1 D 0000000100346d5c  4480 14144      2 0x00000000
 ffff88016fd51530 0000000000000046 0000000000000001 ffff880100000000
 ffff88023ef0f100 0000000000013ac0 0000000000013ac0 0000000000004000
 0000000000013ac0 ffff88018be2dfd8 ffff88016fd51530 ffff88018be2dfd8
Call Trace:
 [<ffffffff8124be61>] ? wait_block_group_cache_progress+0xc0/0xe4
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff81052977>] ? autoremove_wake_function+0x0/0x2a
 [<ffffffff8124f956>] ? find_free_extent+0x694/0x9c4
 [<ffffffff8124fd53>] ? btrfs_reserve_extent+0xcd/0x189
 [<ffffffff81262803>] ? cow_file_range+0x19e/0x2fc
 [<ffffffff81262fe0>] ? run_delalloc_range+0xa7/0x393
 [<ffffffff8127832e>] ? test_range_bit+0x2b/0x127
 [<ffffffff8127b44e>] ? find_lock_delalloc_range+0x1af/0x1d1
 [<ffffffff8127b655>] ? __extent_writepage+0x1e5/0x61f
 [<ffffffff812c1fac>] ? prio_tree_next+0x1c0/0x221
 [<ffffffff812bf3c8>] ? cpumask_any_but+0x28/0x37
 [<ffffffff810b4af7>] ? page_mkclean+0x120/0x148
 [<ffffffff8127bed2>] ? extent_write_cache_pages.clone.0+0x15e/0x26c
 [<ffffffff8127c0db>] ? extent_writepages+0x41/0x5a
 [<ffffffff81264319>] ? btrfs_get_extent+0x0/0x798
 [<ffffffff810e6fcb>] ? writeback_single_inode+0xd1/0x2e8
 [<ffffffff810e7d02>] ? writeback_inodes_wb+0x40d/0x51f
 [<ffffffff810e7f47>] ? wb_writeback+0x133/0x1b2
 [<ffffffff810e81ae>] ? wb_do_writeback+0x148/0x15e
 [<ffffffff810e81fe>] ? bdi_writeback_task+0x3a/0x113
 [<ffffffff81052897>] ? bit_waitqueue+0x14/0xa4
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff810a9b7c>] ? bdi_start_fn+0x63/0xc2
 [<ffffffff810a9b19>] ? bdi_start_fn+0x0/0xc2
 [<ffffffff81052525>] ? kthread+0x75/0x7d
 [<ffffffff81003654>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810524b0>] ? kthread+0x0/0x7d
 [<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10