Hi,

I'm hitting a btrfs locking issue with 3.7.0-rc8.

The btrfs filesystem in question is backing a Ceph OSD under a heavy
write load from many cephfs clients.

I reported this issue a while ago:
http://www.spinics.net/lists/linux-btrfs/msg19370.html
when I was testing what I thought might be part of the 3.7 btrfs patch
queue, using Josef Bacik's btrfs-next tree.

I spent some time attempting to bisect the btrfs patch queue just before
it was merged for 3.7, but got nowhere due to false negatives.

I've just been able to get back to testing 3.7-rc, and found that I can
still trigger the issue.

First I get this lockdep splat:

[ 1184.201331] ============================================
[ 1184.206716] [ INFO: possible recursive locking detected ]
[ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
[ 1184.217156] ---------------------------------------------
[ 1184.222544] ceph-osd/42177 is trying to acquire lock:
[ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.237270]
[ 1184.237270] but task is already holding lock:
[ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.252786]
[ 1184.252786] other info that might help us debug this:
[ 1184.259303] Possible unsafe locking scenario:
[ 1184.259303]
[ 1184.265220] CPU0
[ 1184.267680] ----
[ 1184.270133] lock(&fs_info->chunk_mutex);
[ 1184.274276] lock(&fs_info->chunk_mutex);
[ 1184.278417]
[ 1184.278417] *** DEADLOCK ***
[ 1184.278417]
[ 1184.284325] May be due to missing lock nesting notation
[ 1184.284325]
[ 1184.291099] 4 locks held by ceph-osd/42177:
[ 1184.295277] #0: (sb_writers#7){.+.+.+}, at: [<ffffffffa0584664>] btrfs_file_aio_write+0x64/0x320 [btrfs]
[ 1184.305103] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffffa058466e>] btrfs_file_aio_write+0x6e/0x320 [btrfs]
[ 1184.316108] #2: (sb_internal){.+.+..}, at: [<ffffffffa05746f4>] start_transaction+0x1c4/0x450 [btrfs]
[ 1184.325632] #3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.335761]
[ 1184.335761] stack backtrace:
[ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 #438
[ 1184.347508] Call Trace:
[ 1184.349962] [<ffffffff81042eca>] ? vprintk_emit+0x42a/0x4c0
[ 1184.355619] [<ffffffff8109b579>] print_deadlock_bug+0xe9/0x100
[ 1184.361556] [<ffffffff8109d246>] validate_chain+0x596/0x750
[ 1184.367222] [<ffffffff8109d849>] __lock_acquire+0x449/0x510
[ 1184.372894] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.379417] [<ffffffff8109d9d9>] lock_acquire+0xc9/0x120
[ 1184.384855] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.391377] [<ffffffff8109d849>] ? __lock_acquire+0x449/0x510
[ 1184.397204] [<ffffffff814ab17d>] __mutex_lock_common+0x5d/0x3a0
[ 1184.403221] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.409762] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.416323] [<ffffffffa055a5c9>] ? do_chunk_alloc+0x179/0x340 [btrfs]
[ 1184.422849] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60
[ 1184.428640] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
[ 1184.435018] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs]
[ 1184.441555] [<ffffffffa05569e9>] ?
btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.449051] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1184.455843] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1184.462828] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1184.469471] [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1184.476962] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1184.483426] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] [ 1184.490067] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1184.497199] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 [ 1184.503151] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1184.510125] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1184.516928] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1184.523374] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] [ 1184.530179] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] [ 1184.536555] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1184.544070] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1184.551574] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1184.558987] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1184.566042] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1184.572944] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1184.579984] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 [ 1184.585985] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 [ 1184.591645] [<ffffffff81183112>] ? fget_light+0x122/0x170 [ 1184.597131] [<ffffffff81166166>] vfs_writev+0x46/0x60 [ 1184.602266] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 [ 1184.607398] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1184.613832] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b and then a little later: [ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds. [ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1319.478227] ceph-osd D 0000000000000246 0 42177 1 0x00000000 [ 1319.485330] ffff880402c4d358 0000000000000046 00007fffeccc49d0 ffff880402c4dfd8 [ 1319.492852] ffff880402c4c010 ffff880402c4c000 ffff880402c4c000 ffff880402c4c000 [ 1319.500449] ffff880402c4dfd8 ffff880402c4c000 ffff880c2151bec0 ffff8804e9403ec0 [ 1319.508012] Call Trace: [ 1319.510495] [<ffffffff814acc5d>] schedule+0x5d/0x60 [ 1319.515478] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 [ 1319.521967] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 [ 1319.528090] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1319.534659] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1319.541264] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 [ 1319.547118] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs] [ 1319.553532] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs] [ 1319.560122] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1319.567620] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1319.574460] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1319.581464] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1319.588117] [<ffffffffa05a50d3>] ? 
btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1319.595644] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1319.602110] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] [ 1319.608767] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1319.615897] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 [ 1319.621910] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1319.628893] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1319.635736] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1319.642184] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] [ 1319.649009] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] [ 1319.655366] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1319.662935] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1319.670465] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1319.677891] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1319.685000] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1319.691952] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1319.699054] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 [ 1319.705118] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 [ 1319.710828] [<ffffffff81183112>] ? fget_light+0x122/0x170 [ 1319.716324] [<ffffffff81166166>] vfs_writev+0x46/0x60 [ 1319.721466] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 [ 1319.726674] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1319.733198] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b [ 1319.739274] INFO: lockdep is turned off. [ 1319.743261] INFO: task ceph-osd:42236 blocked for more than 120 seconds. [ 1319.749953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1319.757812] ceph-osd D 0000000000000246 0 42236 1 0x00000000 [ 1319.765056] ffff88052e7bdcc8 0000000000000046 ffff880624666120 ffff88052e7bdfd8 [ 1319.772785] ffff88052e7bc010 ffff88052e7bc000 ffff88052e7bc000 ffff88052e7bc000 [ 1319.780332] ffff88052e7bdfd8 ffff88052e7bc000 ffff88062453bec0 ffff88052e7b5e20 [ 1319.787905] Call Trace: [ 1319.790368] [<ffffffff814acc5d>] schedule+0x5d/0x60 [ 1319.795335] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 [ 1319.801762] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 [ 1319.807861] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] [ 1319.814143] [<ffffffff811703cf>] ? final_putname+0x3f/0x50 [ 1319.819724] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] [ 1319.825982] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 [ 1319.831731] [<ffffffffa0548b4f>] btrfs_statfs+0x5f/0x240 [btrfs] [ 1319.837846] [<ffffffff811968c0>] statfs_by_dentry+0x60/0x90 [ 1319.843508] [<ffffffff81196a0b>] vfs_statfs+0x1b/0xb0 [ 1319.848659] [<ffffffff81196c00>] user_statfs+0x40/0x60 [ 1319.853887] [<ffffffff81196caa>] sys_statfs+0x2a/0x50 [ 1319.859036] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b [ 1319.865035] INFO: lockdep is turned off. and then later still: [ 1439.546768] INFO: task btrfs-transacti:27591 blocked for more than 120 seconds. [ 1439.554139] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 1439.561949] btrfs-transacti D ffff880993231810 0 27591 2 0x00000000 [ 1439.569205] ffff8808e79d3c58 0000000000000046 0000000000000000 ffff8808e79d3fd8 [ 1439.576820] ffff8808e79d2010 ffff8808e79d2000 ffff8808e79d2000 ffff8808e79d2000 [ 1439.584384] ffff8808e79d3fd8 ffff8808e79d2000 ffff880b3f165e20 ffff8808d8cdbec0 [ 1439.592040] Call Trace: [ 1439.594573] [<ffffffff814acc5d>] schedule+0x5d/0x60 [ 1439.599646] [<ffffffff814aac3d>] schedule_timeout+0x3d/0x260 [ 1439.605402] [<ffffffff8109c410>] ? trace_hardirqs_on_caller+0x20/0x1d0 [ 1439.612023] [<ffffffff8109c5cd>] ? trace_hardirqs_on+0xd/0x10 [ 1439.617939] [<ffffffff8106781a>] ? prepare_to_wait+0x7a/0x90 [ 1439.623788] [<ffffffffa0573a46>] btrfs_commit_transaction+0x336/0xa10 [btrfs] [ 1439.631128] [<ffffffffa05748fb>] ? start_transaction+0x3cb/0x450 [btrfs] [ 1439.637943] [<ffffffff810674e0>] ? wake_up_bit+0x40/0x40 [ 1439.643380] [<ffffffffa056c223>] transaction_kthread+0x123/0x210 [btrfs] [ 1439.650259] [<ffffffffa056c100>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] [ 1439.657134] [<ffffffff81066dd1>] kthread+0xe1/0xf0 [ 1439.662109] [<ffffffff81066cf0>] ? __init_kthread_worker+0x70/0x70 [ 1439.668398] [<ffffffff814b6f9c>] ret_from_fork+0x7c/0xb0 [ 1439.673858] [<ffffffff81066cf0>] ? __init_kthread_worker+0x70/0x70 [ 1439.680181] INFO: lockdep is turned off. [ 1439.685095] INFO: task ceph-osd:42177 blocked for more than 120 seconds. [ 1439.691847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1439.699743] ceph-osd D 0000000000000246 0 42177 1 0x00000000 [ 1439.706941] ffff880402c4d358 0000000000000046 00007fffeccc49d0 ffff880402c4dfd8 [ 1439.714494] ffff880402c4c010 ffff880402c4c000 ffff880402c4c000 ffff880402c4c000 [ 1439.722039] ffff880402c4dfd8 ffff880402c4c000 ffff880c2151bec0 ffff8804e9403ec0 [ 1439.729688] Call Trace: [ 1439.732217] [<ffffffff814acc5d>] schedule+0x5d/0x60 [ 1439.737179] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 [ 1439.743705] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 [ 1439.749843] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1439.756481] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] [ 1439.763037] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 [ 1439.768854] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs] [ 1439.775290] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs] [ 1439.781932] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1439.789534] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1439.796474] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1439.803499] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1439.810253] [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1439.817877] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1439.824427] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] [ 1439.831162] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1439.838423] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 [ 1439.844514] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1439.851619] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1439.858535] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1439.865079] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] [ 1439.871939] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] [ 1439.878375] [<ffffffffa05569e9>] ? 
btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1439.885933] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1439.893514] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1439.901018] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1439.908226] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1439.915235] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1439.922375] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 [ 1439.928475] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 [ 1439.934197] [<ffffffff81183112>] ? fget_light+0x122/0x170 [ 1439.939738] [<ffffffff81166166>] vfs_writev+0x46/0x60 [ 1439.944948] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 [ 1439.950139] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1439.956621] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b [ 1439.962742] INFO: lockdep is turned off. [ 1439.966769] INFO: task ceph-osd:42236 blocked for more than 120 seconds. [ 1439.973558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1439.981513] ceph-osd D 0000000000000246 0 42236 1 0x00000000 [ 1439.988788] ffff88052e7bdcc8 0000000000000046 ffff880624666120 ffff88052e7bdfd8 [ 1439.996611] ffff88052e7bc010 ffff88052e7bc000 ffff88052e7bc000 ffff88052e7bc000 [ 1440.004237] ffff88052e7bdfd8 ffff88052e7bc000 ffff88062453bec0 ffff88052e7b5e20 [ 1440.011956] Call Trace: [ 1440.014516] [<ffffffff814acc5d>] schedule+0x5d/0x60 [ 1440.019578] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 [ 1440.026054] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 [ 1440.032245] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] [ 1440.038624] [<ffffffff811703cf>] ? final_putname+0x3f/0x50 [ 1440.044323] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] [ 1440.050952] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 [ 1440.056805] [<ffffffffa0548b4f>] btrfs_statfs+0x5f/0x240 [btrfs] [ 1440.063029] [<ffffffff811968c0>] statfs_by_dentry+0x60/0x90 [ 1440.068739] [<ffffffff81196a0b>] vfs_statfs+0x1b/0xb0 [ 1440.073962] [<ffffffff81196c00>] user_statfs+0x40/0x60 [ 1440.079203] [<ffffffff81196caa>] sys_statfs+0x2a/0x50 [ 1440.084401] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b [ 1440.090816] INFO: lockdep is turned off. Please let me know what I can do to help resolve this issue. Thanks -- Jim -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/05/2012 09:07 AM, Jim Schutt wrote:
> Hi,
>
> I'm hitting a btrfs locking issue with 3.7.0-rc8.
>
> The btrfs filesystem in question is backing a Ceph OSD
> under a heavy write load from many cephfs clients.
>
> I reported this issue a while ago:
> http://www.spinics.net/lists/linux-btrfs/msg19370.html
> when I was testing what I thought might be part of the
> 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.
>
> I spent some time attempting to bisect the btrfs patch queue
> just before it was merged for 3.7, but got nowhere due to
> false negatives.
>
> I've just been able to get back to testing 3.7-rc, and found
> that I can still trigger the issue.
>
> First I get this lockdep splat:
>
> [ 1184.201331] ============================================
> [ 1184.206716] [ INFO: possible recursive locking detected ]
> [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
> [ 1184.217156] ---------------------------------------------
> [ 1184.222544] ceph-osd/42177 is trying to acquire lock:
> [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
> [ 1184.237270]
> [ 1184.237270] but task is already holding lock:
> [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
> [ 1184.252786]
> [ 1184.252786] other info that might help us debug this:
> [ 1184.259303] Possible unsafe locking scenario:
> [ 1184.259303]
> [ 1184.265220] CPU0
> [ 1184.267680] ----
> [ 1184.270133] lock(&fs_info->chunk_mutex);
> [ 1184.274276] lock(&fs_info->chunk_mutex);
> [ 1184.278417]
> [ 1184.278417] *** DEADLOCK ***

To try to debug this, I applied this patch:

-----cut here -----
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c72ead8..86da319 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1470,6 +1470,10 @@ struct btrfs_fs_info {
 	int backup_root_index;
 	int num_tolerated_disk_barrier_failures;
+
+	/* protect against recursive do_chunk_alloc() */
+	const void *chunk_alloc_task;
+	const struct btrfs_space_info *prev_sinfo;
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7cda519..d61e93d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2087,6 +2087,9 @@ int open_ctree(struct super_block *sb,
 	fs_info->btree_inode->i_ino = BTRFS_BTREE_INODE_OBJECTID;
 	set_nlink(fs_info->btree_inode, 1);
+
+	fs_info->chunk_alloc_task = NULL;
+	fs_info->prev_sinfo = NULL;
 
 	/*
 	 * we set the i_size on the btree inode to the max possible int.
 	 * the real end of the address space is determined by all of
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d3e2c1..1e0e421 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3582,6 +3582,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	}
 	BUG_ON(!space_info); /* Logic error */
 
+	if (mutex_is_locked(&fs_info->chunk_mutex) &&
+	    unlikely(fs_info->chunk_alloc_task == current)) {
+		WARN_ONCE(1, "do_chunk_alloc() about to recursively acquire "
+			  "fs_info->chunk_mutex: impending deadlock avoided!\n"
+			  "outer call space_info = %p flags %#llx\n"
+			  "nested call space_info = %p flags %#llx\n",
+			  fs_info->prev_sinfo, fs_info->prev_sinfo->flags,
+			  space_info, space_info->flags);
+		return -EDEADLK;
+	}
 again:
 	spin_lock(&space_info->lock);
 	if (force < space_info->force_alloc)
@@ -3603,6 +3613,8 @@ again:
 	spin_unlock(&space_info->lock);
 
 	mutex_lock(&fs_info->chunk_mutex);
+	fs_info->chunk_alloc_task = current;
+	fs_info->prev_sinfo = space_info;
 
 	/*
 	 * The chunk_mutex is held throughout the entirety of a chunk
@@ -3655,6 +3667,8 @@ again:
 	space_info->chunk_alloc = 0;
 	spin_unlock(&space_info->lock);
 out:
+	fs_info->chunk_alloc_task = NULL;
+	fs_info->prev_sinfo = NULL;
 	mutex_unlock(&fs_info->chunk_mutex);
 	return ret;
 }
-----cut here -----

After a few trials, I got this during a test:

[ 1614.213516] ------------[ cut here ]------------
[ 1614.218185] WARNING: at fs/btrfs/extent-tree.c:3592 do_chunk_alloc+0xec/0x3f0 [btrfs]()
[ 1614.226228] Hardware name: X8DTH-i/6/iF/6F
[ 1614.230484] do_chunk_alloc() about to recursively acquire fs_info->chunk_mutex: impending deadlock avoided!
[ 1614.230484] outer call space_info = ffff880c214dac00 flags 0x1
[ 1614.230484] nested call space_info = ffff880c214dbc00 flags 0x4
[ 1614.251968] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix libata button coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[ 1614.320138] Pid: 84332, comm: ceph-osd Not tainted 3.7.0-rc8-00031-g2ec1a5f #458
[ 1614.327564] Call Trace:
[ 1614.330054] [<ffffffff8103ff04>] warn_slowpath_common+0x94/0xc0
[ 1614.336145] [<ffffffff8103ffe6>] warn_slowpath_fmt+0x46/0x50
[ 1614.342081] [<ffffffffa052053c>] do_chunk_alloc+0xec/0x3f0 [btrfs]
[ 1614.348431] [<ffffffffa052966c>] find_free_extent+0xa3c/0xb70 [btrfs]
[ 1614.355013] [<ffffffffa051c9e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs]
[ 1614.362580] [<ffffffffa0529822>] btrfs_reserve_extent+0x82/0x190 [btrfs]
[ 1614.369404] [<ffffffffa05299b5>] btrfs_alloc_free_block+0x85/0x230 [btrfs]
[ 1614.376417] [<ffffffffa0515b6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs]
[ 1614.383127] [<ffffffffa056b2b3>] ?
btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1614.390716] [<ffffffffa051656d>] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1614.397288] [<ffffffffa0519460>] btrfs_search_slot+0x360/0x730 [btrfs] [ 1614.403988] [<ffffffffa051a26d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1614.411202] [<ffffffff81157413>] ? kmem_cache_alloc+0xd3/0x170 [ 1614.417248] [<ffffffffa055cfe5>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1614.424271] [<ffffffffa0561197>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1614.431142] [<ffffffffa056129e>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1614.437625] [<ffffffffa0520391>] ? check_system_chunk+0x71/0x130 [btrfs] [ 1614.444468] [<ffffffffa05207be>] do_chunk_alloc+0x36e/0x3f0 [btrfs] [ 1614.450853] [<ffffffffa051c9e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1614.458422] [<ffffffffa052440c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1614.466046] [<ffffffffa0525fb0>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1614.473573] [<ffffffffa054a39e>] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1614.480691] [<ffffffffa054a8c9>] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1614.487642] [<ffffffffa053a550>] ? btrfs_end_transaction+0x10/0x20 [btrfs] [ 1614.494723] [<ffffffffa054ed73>] ? __btrfs_setxattr+0xf3/0x110 [btrfs] [ 1614.501398] [<ffffffff814abf0e>] ? mutex_unlock+0xe/0x10 [ 1614.506918] [<ffffffffa054a6c0>] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1614.514098] [<ffffffff81162754>] do_sync_readv_writev+0x94/0xe0 [ 1614.520240] [<ffffffff81163813>] do_readv_writev+0xe3/0x1e0 [ 1614.526005] [<ffffffff81180ca2>] ? fget_light+0x122/0x170 [ 1614.531537] [<ffffffff81163956>] vfs_writev+0x46/0x60 [ 1614.536752] [<ffffffff81163a8f>] sys_writev+0x5f/0xc0 [ 1614.542167] [<ffffffff812637de>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1614.548678] [<ffffffff814b77c2>] system_call_fastpath+0x16/0x1b [ 1614.554884] ---[ end trace 5c724b2e8e635fff ]--- So, evidently a do_chunk_alloc() call for BTRFS_BLOCK_GROUP_DATA ended up needing more space for metadata, resulting in a recursive call to do_chunk_alloc() for BTRFS_BLOCK_GROUP_METADATA. Since the metadata allocation fails, I get some aborted transactions: [ 1614.564332] WARNING: at fs/btrfs/super.c:246 __btrfs_abort_transaction+0x60/0x110 [btrfs]() [ 1614.572784] Hardware name: X8DTH-i/6/iF/6F [ 1614.577028] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix libata button coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000 [ 1614.647280] Pid: 84332, comm: ceph-osd Tainted: G W 3.7.0-rc8-00031-g2ec1a5f #458 [ 1614.655759] Call Trace: [ 1614.658359] [<ffffffff8103ff04>] warn_slowpath_common+0x94/0xc0 [ 1614.664505] [<ffffffff8103ffe6>] warn_slowpath_fmt+0x46/0x50 [ 1614.670357] [<ffffffffa050e5c0>] __btrfs_abort_transaction+0x60/0x110 [btrfs] [ 1614.677681] [<ffffffffa05296ad>] find_free_extent+0xa7d/0xb70 [btrfs] [ 1614.684309] [<ffffffffa051c9e9>] ? 
btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1614.691889] [<ffffffffa0529822>] btrfs_reserve_extent+0x82/0x190 [btrfs] [ 1614.698803] [<ffffffffa05299b5>] btrfs_alloc_free_block+0x85/0x230 [btrfs] [ 1614.705871] [<ffffffffa0515b6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] [ 1614.712666] [<ffffffffa056b2b3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] [ 1614.720393] [<ffffffffa051656d>] btrfs_cow_block+0x18d/0x230 [btrfs] [ 1614.727042] [<ffffffffa0519460>] btrfs_search_slot+0x360/0x730 [btrfs] [ 1614.733884] [<ffffffffa051a26d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] [ 1614.741202] [<ffffffff81157413>] ? kmem_cache_alloc+0xd3/0x170 [ 1614.747361] [<ffffffffa055cfe5>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] [ 1614.754573] [<ffffffffa0561197>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] [ 1614.761638] [<ffffffffa056129e>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] [ 1614.768288] [<ffffffffa0520391>] ? check_system_chunk+0x71/0x130 [btrfs] [ 1614.775403] [<ffffffffa05207be>] do_chunk_alloc+0x36e/0x3f0 [btrfs] [ 1614.781998] [<ffffffffa051c9e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] [ 1614.789694] [<ffffffffa052440c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] [ 1614.797419] [<ffffffffa0525fb0>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] [ 1614.805054] [<ffffffffa054a39e>] __btrfs_buffered_write+0x15e/0x350 [btrfs] [ 1614.812341] [<ffffffffa054a8c9>] btrfs_file_aio_write+0x209/0x320 [btrfs] [ 1614.819510] [<ffffffffa053a550>] ? btrfs_end_transaction+0x10/0x20 [btrfs] [ 1614.826800] [<ffffffffa054ed73>] ? __btrfs_setxattr+0xf3/0x110 [btrfs] [ 1614.833847] [<ffffffff814abf0e>] ? mutex_unlock+0xe/0x10 [ 1614.839543] [<ffffffffa054a6c0>] ? __btrfs_direct_write+0x130/0x130 [btrfs] [ 1614.847074] [<ffffffff81162754>] do_sync_readv_writev+0x94/0xe0 [ 1614.853294] [<ffffffff81163813>] do_readv_writev+0xe3/0x1e0 [ 1614.859233] [<ffffffff81180ca2>] ? fget_light+0x122/0x170 [ 1614.865064] [<ffffffff81163956>] vfs_writev+0x46/0x60 [ 1614.870338] [<ffffffff81163a8f>] sys_writev+0x5f/0xc0 [ 1614.875708] [<ffffffff812637de>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 1614.882356] [<ffffffff814b77c2>] system_call_fastpath+0x16/0x1b [ 1614.888536] ---[ end trace 5c724b2e8e636000 ]--- [ 1614.893309] BTRFS warning (device dm-64): find_free_extent:5878: Aborting unused transaction(error 35). [ 1614.902879] BTRFS warning (device dm-64): __btrfs_alloc_chunk:3488: Aborting unused transaction(error 35). [ 1614.912806] BTRFS warning (device dm-64): find_free_extent:5878: Aborting unused transaction(Object already exists). 
[ 1614.916914] divide error: 0000 [#1] SMP [ 1614.916955] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix libata button coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000 [ 1614.916961] CPU 21 [ 1614.916961] Pid: 84341, comm: ceph-osd Tainted: G W 3.7.0-rc8-00031-g2ec1a5f #458 Supermicro X8DTH-i/6/iF/6F/X8DTH [ 1614.916985] RIP: 0010:[<ffffffffa056031d>] [<ffffffffa056031d>] __btrfs_map_block+0xcd/0x670 [btrfs] [ 1614.916986] RSP: 0018:ffff8804dad7bad8 EFLAGS: 00010246 [ 1614.916987] RAX: 0000000002c00000 RBX: 0000000002c00000 RCX: 0000000040000000 [ 1614.916988] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1614.916989] RBP: ffff8804dad7bb68 R08: 0000000000000001 R09: 0000000000000000 [ 1614.916990] R10: ffff880c14f99990 R11: 0000000000000000 R12: ffff8804dad7bb90 [ 1614.916991] R13: ffff880b49e5c080 R14: ffff880c16324298 R15: ffff8804dad7be68 [ 1614.916992] FS: 00007fffe6cba700(0000) GS:ffff880c3fd20000(0000) knlGS:0000000000000000 [ 1614.916993] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1614.916994] CR2: ffffffffff600400 CR3: 00000009cb6ba000 CR4: 00000000000007e0 [ 1614.916995] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1614.916997] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1614.916998] Process ceph-osd (pid: 84341, threadinfo ffff8804dad7a000, task ffff8804dad5de20) [ 1614.916999] Stack: [ 1614.917004] ffff8804dad5e4d8 000000ec54c11738 ffff880c14f99990 ffff880754c11cc0 [ 1614.917008] 0000000000000000 ffffffff81110eed 0000000000000000 0000000000000000 [ 1614.917011] 00000000dad7bb98 0000000000000046 0000000000000000 ffffffff81110eed [ 1614.917011] Call Trace: [ 1614.917019] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1614.917021] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1614.917032] [<ffffffffa05608ce>] btrfs_map_block+0xe/0x10 [btrfs] [ 1614.917042] [<ffffffffa053d387>] btrfs_merge_bio_hook+0x57/0x80 [btrfs] [ 1614.917052] [<ffffffffa0551c23>] submit_extent_page+0xc3/0x1d0 [btrfs] [ 1614.917054] [<ffffffff81107960>] ? find_get_pages_tag+0x1e0/0x1e0 [ 1614.917068] [<ffffffffa0557aff>] __extent_writepage+0x69f/0x760 [btrfs] [ 1614.917081] [<ffffffffa0551f50>] ? extent_io_tree_init+0x90/0x90 [btrfs] [ 1614.917091] [<ffffffffa0557fd2>] extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs] [ 1614.917102] [<ffffffffa05581af>] extent_writepages+0x4f/0x70 [btrfs] [ 1614.917112] [<ffffffffa05436b0>] ? btrfs_lookup+0x70/0x70 [btrfs] [ 1614.917116] [<ffffffff81180ca2>] ? 
fget_light+0x122/0x170 [ 1614.917127] [<ffffffffa053de77>] btrfs_writepages+0x27/0x30 [btrfs] [ 1614.917130] [<ffffffff81112cb3>] do_writepages+0x23/0x40 [ 1614.917134] [<ffffffff8110702e>] __filemap_fdatawrite_range+0x4e/0x50 [ 1614.917137] [<ffffffff811072e3>] filemap_fdatawrite_range+0x13/0x20 [ 1614.917143] [<ffffffff81193399>] sys_sync_file_range+0x109/0x170 [ 1614.917148] [<ffffffff814b77c2>] system_call_fastpath+0x16/0x1b [ 1614.917167] Code: 66 0f 1f 44 00 00 4d 8b 6a 60 48 29 c3 8b 45 c4 41 39 45 18 b8 00 00 00 00 0f 4d 45 c4 31 d2 89 45 c4 49 63 75 10 48 89 d8 89 f7 <48> f7 f7 49 89 c6 48 89 45 c8 4c 0f af f6 4c 39 f3 73 10 0f 0b [ 1614.917178] RIP [<ffffffffa056031d>] __btrfs_map_block+0xcd/0x670 [btrfs] [ 1614.917179] RSP <ffff8804dad7bad8> [ 1614.917687] ---[ end trace 5c724b2e8e636001 ]--- But from there, things go downhill: [ 1615.971959] divide error: 0000 [#2] SMP [ 1615.975920] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix libata button coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000 [ 1616.041983] CPU 10 [ 1616.043917] Pid: 33589, comm: kworker/10:0 Tainted: G D W 3.7.0-rc8-00031-g2ec1a5f #458 Supermicro X8DTH-i/6/iF/6F/X8DTH [ 1616.055643] RIP: 0010:[<ffffffffa056031d>] [<ffffffffa056031d>] __btrfs_map_block+0xcd/0x670 [btrfs] [ 1616.064894] RSP: 0018:ffff880c1704f778 EFLAGS: 00010246 [ 1616.070198] RAX: 0000000003000000 RBX: 0000000003000000 RCX: 0000000040000000 [ 1616.077312] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1616.084434] RBP: ffff880c1704f808 R08: 0000000000000000 R09: 0000000000000000 [ 1616.091556] R10: ffff880c14f99990 R11: 0000000000000000 R12: ffff880c1704f830 [ 1616.098669] R13: ffff880b49e5c080 R14: ffff880c16324298 R15: ffff880c1704fb08 [ 1616.105783] FS: 0000000000000000(0000) GS:ffff880c3fc80000(0000) knlGS:0000000000000000 [ 1616.113855] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1616.119603] CR2: 00007ffff7ddc020 CR3: 00000009cb6ba000 CR4: 00000000000007e0 [ 1616.126719] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1616.133839] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1616.140951] Process kworker/10:0 (pid: 33589, threadinfo ffff880c1704e000, task ffff880c19d48000) [ 1616.149795] Stack: [ 1616.151808] ffff8804f8b71000 ffff880754c44c60 ffff880c14f99990 ffff880754c451e8 [ 1616.159280] 0000000000000000 ffffffff81110eed 0000000000000000 0000000000000000 [ 1616.166749] 000000001704f838 0000000000000046 0000000000000000 ffffffff81110eed [ 1616.174217] Call Trace: [ 1616.176679] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1616.183199] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1616.189745] [<ffffffffa05608ce>] btrfs_map_block+0xe/0x10 [btrfs] [ 1616.195935] [<ffffffffa053d387>] btrfs_merge_bio_hook+0x57/0x80 [btrfs] [ 1616.202651] [<ffffffffa0551c23>] submit_extent_page+0xc3/0x1d0 [btrfs] [ 1616.209257] [<ffffffff81107960>] ? 
find_get_pages_tag+0x1e0/0x1e0 [ 1616.215447] [<ffffffffa0557aff>] __extent_writepage+0x69f/0x760 [btrfs] [ 1616.222158] [<ffffffffa0551f50>] ? extent_io_tree_init+0x90/0x90 [btrfs] [ 1616.228953] [<ffffffffa0557fd2>] extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs] [ 1616.236875] [<ffffffffa05581af>] extent_writepages+0x4f/0x70 [btrfs] [ 1616.243325] [<ffffffffa05436b0>] ? btrfs_lookup+0x70/0x70 [btrfs] [ 1616.249499] [<ffffffff8117c660>] ? igrab+0x40/0x70 [ 1616.254393] [<ffffffffa053de77>] btrfs_writepages+0x27/0x30 [btrfs] [ 1616.260735] [<ffffffff81112cb3>] do_writepages+0x23/0x40 [ 1616.266127] [<ffffffff8110702e>] __filemap_fdatawrite_range+0x4e/0x50 [ 1616.272674] [<ffffffff814aebbb>] ? _raw_spin_unlock+0x2b/0x50 [ 1616.278499] [<ffffffff8110724c>] filemap_flush+0x1c/0x20 [ 1616.283905] [<ffffffffa053be64>] btrfs_start_delalloc_inodes+0xd4/0x240 [btrfs] [ 1616.291301] [<ffffffffa0539a95>] btrfs_commit_transaction+0x2c5/0xa10 [btrfs] [ 1616.298521] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.305479] [<ffffffff81067c70>] ? wake_up_bit+0x40/0x40 [ 1616.310888] [<ffffffffa053a5d6>] do_async_commit+0x76/0x90 [btrfs] [ 1616.317168] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.324117] [<ffffffff8105fd33>] process_one_work+0x2d3/0x4b0 [ 1616.329947] [<ffffffff8105fc98>] ? process_one_work+0x238/0x4b0 [ 1616.335980] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.342937] [<ffffffff8106219a>] worker_thread+0x20a/0x330 [ 1616.348512] [<ffffffff81061f90>] ? manage_workers+0x170/0x170 [ 1616.354340] [<ffffffff81067561>] kthread+0xe1/0xf0 [ 1616.359228] [<ffffffff81067480>] ? __init_kthread_worker+0x70/0x70 [ 1616.365496] [<ffffffff814b771c>] ret_from_fork+0x7c/0xb0 [ 1616.370905] [<ffffffff81067480>] ? 
__init_kthread_worker+0x70/0x70 [ 1616.377155] Code: 66 0f 1f 44 00 00 4d 8b 6a 60 48 29 c3 8b 45 c4 41 39 45 18 b8 00 00 00 00 0f 4d 45 c4 31 d2 89 45 c4 49 63 75 10 48 89 d8 89 f7 <48> f7 f7 49 89 c6 48 89 45 c8 4c 0f af f6 4c 39 f3 73 10 0f 0b [ 1616.397533] RIP [<ffffffffa056031d>] __btrfs_map_block+0xcd/0x670 [btrfs] [ 1616.404461] RSP <ffff880c1704f778> [ 1616.410313] ---[ end trace 5c724b2e8e636002 ]--- [ 1616.419293] BUG: unable to handle kernel paging request at ffffffffffffffa8 [ 1616.426291] IP: [<ffffffff81066be0>] kthread_data+0x10/0x20 [ 1616.431886] PGD 1a0d067 PUD 1a0e067 PMD 0 [ 1616.436059] Oops: 0000 [#3] SMP [ 1616.439335] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput joydev sg hid_generic sd_mod iTCO_wdt iTCO_vendor_support ata_piix libata button coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000 [ 1616.504982] CPU 10 [ 1616.506911] Pid: 33589, comm: kworker/10:0 Tainted: G D W 3.7.0-rc8-00031-g2ec1a5f #458 Supermicro X8DTH-i/6/iF/6F/X8DTH [ 1616.518630] RIP: 0010:[<ffffffff81066be0>] [<ffffffff81066be0>] kthread_data+0x10/0x20 [ 1616.526632] RSP: 0018:ffff880c1704f448 EFLAGS: 00010082 [ 1616.531930] RAX: 0000000000000000 RBX: ffff880c3fc932c0 RCX: 000000000000000a [ 1616.539045] RDX: ffffffff81ccaa20 RSI: 000000000000000a RDI: ffff880c19d48000 [ 1616.546161] RBP: ffff880c1704f448 R08: ffff880c19d48070 R09: 0000000000000001 [ 1616.553271] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000a [ 1616.560385] R13: 000000000000000a R14: 0000000000000001 R15: 0000000000000000 [ 1616.567498] FS: 0000000000000000(0000) GS:ffff880c3fc80000(0000) knlGS:0000000000000000 [ 1616.575562] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1616.581284] CR2: ffffffffffffffa8 CR3: 0000000b7429b000 CR4: 00000000000007e0 [ 1616.588390] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1616.595502] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1616.602609] Process kworker/10:0 (pid: 33589, threadinfo ffff880c1704e000, task ffff880c19d48000) [ 1616.611451] Stack: [ 1616.613464] ffff880c1704f478 ffffffff8105f861 000000000000000a ffff880c3fc932c0 [ 1616.620896] 000000000000000a ffff880c19d483c0 ffff880c1704f508 ffffffff814acb4e [ 1616.628329] 0000000000000018 ffff880c1704ffd8 ffff880c1704e010 ffff880c1704e000 [ 1616.635765] Call Trace: [ 1616.638217] [<ffffffff8105f861>] wq_worker_sleeping+0x21/0xa0 [ 1616.644031] [<ffffffff814acb4e>] __schedule+0x17e/0x690 [ 1616.649327] [<ffffffff814ad3dd>] schedule+0x5d/0x60 [ 1616.654281] [<ffffffff810456c3>] do_exit+0x3e3/0x430 [ 1616.659320] [<ffffffff814afe68>] oops_end+0xd8/0xf0 [ 1616.664273] [<ffffffff8100593a>] die+0x6a/0x80 [ 1616.668792] [<ffffffff814af7be>] do_trap+0x7e/0x170 [ 1616.673747] [<ffffffff81002f45>] do_divide_error+0x95/0xa0 [ 1616.679321] [<ffffffffa056031d>] ? __btrfs_map_block+0xcd/0x670 [btrfs] [ 1616.686003] [<ffffffff8126381d>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 1616.692510] [<ffffffff814af08c>] ? 
restore_args+0x30/0x30 [ 1616.697983] [<ffffffff814b8758>] divide_error+0x18/0x20 [ 1616.703292] [<ffffffffa056031d>] ? __btrfs_map_block+0xcd/0x670 [btrfs] [ 1616.709981] [<ffffffffa05602a1>] ? __btrfs_map_block+0x51/0x670 [btrfs] [ 1616.716663] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1616.723169] [<ffffffff81110eed>] ? test_set_page_writeback+0x6d/0x170 [ 1616.729688] [<ffffffffa05608ce>] btrfs_map_block+0xe/0x10 [btrfs] [ 1616.735859] [<ffffffffa053d387>] btrfs_merge_bio_hook+0x57/0x80 [btrfs] [ 1616.742551] [<ffffffffa0551c23>] submit_extent_page+0xc3/0x1d0 [btrfs] [ 1616.749144] [<ffffffff81107960>] ? find_get_pages_tag+0x1e0/0x1e0 [ 1616.755316] [<ffffffffa0557aff>] __extent_writepage+0x69f/0x760 [btrfs] [ 1616.762008] [<ffffffffa0551f50>] ? extent_io_tree_init+0x90/0x90 [btrfs] [ 1616.768784] [<ffffffffa0557fd2>] extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs] [ 1616.776686] [<ffffffffa05581af>] extent_writepages+0x4f/0x70 [btrfs] [ 1616.783114] [<ffffffffa05436b0>] ? btrfs_lookup+0x70/0x70 [btrfs] [ 1616.789276] [<ffffffff8117c660>] ? igrab+0x40/0x70 [ 1616.794151] [<ffffffffa053de77>] btrfs_writepages+0x27/0x30 [btrfs] [ 1616.800486] [<ffffffff81112cb3>] do_writepages+0x23/0x40 [ 1616.805872] [<ffffffff8110702e>] __filemap_fdatawrite_range+0x4e/0x50 [ 1616.812380] [<ffffffff814aebbb>] ? _raw_spin_unlock+0x2b/0x50 [ 1616.818196] [<ffffffff8110724c>] filemap_flush+0x1c/0x20 [ 1616.823591] [<ffffffffa053be64>] btrfs_start_delalloc_inodes+0xd4/0x240 [btrfs] [ 1616.830974] [<ffffffffa0539a95>] btrfs_commit_transaction+0x2c5/0xa10 [btrfs] [ 1616.838182] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.845121] [<ffffffff81067c70>] ? wake_up_bit+0x40/0x40 [ 1616.850517] [<ffffffffa053a5d6>] do_async_commit+0x76/0x90 [btrfs] [ 1616.856774] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.863712] [<ffffffff8105fd33>] process_one_work+0x2d3/0x4b0 [ 1616.869530] [<ffffffff8105fc98>] ? process_one_work+0x238/0x4b0 [ 1616.875531] [<ffffffffa053a560>] ? btrfs_end_transaction+0x20/0x20 [btrfs] [ 1616.882469] [<ffffffff8106219a>] worker_thread+0x20a/0x330 [ 1616.888028] [<ffffffff81061f90>] ? manage_workers+0x170/0x170 [ 1616.893844] [<ffffffff81067561>] kthread+0xe1/0xf0 [ 1616.898711] [<ffffffff81067480>] ? __init_kthread_worker+0x70/0x70 [ 1616.904960] [<ffffffff814b771c>] ret_from_fork+0x7c/0xb0 [ 1616.910345] [<ffffffff81067480>] ? __init_kthread_worker+0x70/0x70 [ 1616.916593] Code: 68 03 00 00 48 8b 40 98 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 68 03 00 00 <48> 8b 40 a8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 [ 1616.936489] RIP [<ffffffff81066be0>] kthread_data+0x10/0x20 [ 1616.942150] RSP <ffff880c1704f448> [ 1616.945626] CR2: ffffffffffffffa8 [ 1616.948930] ---[ end trace 5c724b2e8e636003 ]--- [ 1616.953536] Fixing recursive fault but reboot is needed! Is there some way to avoid this do_chunk_alloc() deadlock by learning before a call to do_chunk_alloc() for more data space that more metadata space will be needed, and doing that call first? Or, is there some other way to fix this problem? Thanks -- Jim
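P.S. In case it helps to make my first question above more concrete, here is
the sort of thing I was imagining.  This is a completely untested sketch, not
a patch -- the wrapper name is made up, I have not double-checked the exact
do_chunk_alloc() signature or force flags against 3.7, and I am ignoring the
question of where it would actually have to hook in.  The idea is just to
satisfy the metadata allocation that the dev-extent insertion might trigger
*before* chunk_mutex is taken for the data chunk, so the nested
do_chunk_alloc() never happens:

/*
 * Untested illustration only: pre-allocate metadata space before a
 * DATA chunk allocation, so the btree inserts done while chunk_mutex
 * is held (btrfs_alloc_dev_extent() and friends) are much less likely
 * to need a new metadata chunk of their own.
 */
static int chunk_alloc_data_safe(struct btrfs_trans_handle *trans,
				 struct btrfs_root *extent_root,
				 u64 flags, int force)
{
	int ret;

	if (flags & BTRFS_BLOCK_GROUP_DATA) {
		/* chunk_mutex is not held yet, so this call cannot recurse */
		ret = do_chunk_alloc(trans, extent_root,
				     btrfs_get_alloc_profile(extent_root, 0),
				     CHUNK_ALLOC_NO_FORCE);
		if (ret < 0)
			return ret;
	}

	return do_chunk_alloc(trans, extent_root, flags, force);
}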
On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
> Hi,
>
> I'm hitting a btrfs locking issue with 3.7.0-rc8.
>
> The btrfs filesystem in question is backing a Ceph OSD
> under a heavy write load from many cephfs clients.
>
> I reported this issue a while ago:
> http://www.spinics.net/lists/linux-btrfs/msg19370.html
> when I was testing what I thought might be part of the
> 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.
>
> I spent some time attempting to bisect the btrfs patch queue
> just before it was merged for 3.7, but got nowhere due to
> false negatives.
>
> I've just been able to get back to testing 3.7-rc, and found
> that I can still trigger the issue.

Hi Jim,

Could you please apply the following patch to test if it works?

(It's against 3.7-rc8.)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d3e2c1..100289b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
 
 	if (data)
 		flags = BTRFS_BLOCK_GROUP_DATA;
-	else if (root == root->fs_info->chunk_root)
+	else if (root == root->fs_info->chunk_root ||
+		 root == root->fs_info->dev_root)
 		flags = BTRFS_BLOCK_GROUP_SYSTEM;
 	else
 		flags = BTRFS_BLOCK_GROUP_METADATA;
@@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
 		num_dev = 1;	/* DUP or single */
 
 	/* metadata for updaing devices and chunk tree */
+	num_dev = num_dev << 1;
 	return btrfs_calc_trans_metadata_size(root, num_dev + 1);
 }
 
@@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
 
 	fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
 	fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
-	fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
+	fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv;
 	fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
 	fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;

thanks,
liubo

>
> First I get this lockdep splat:
>
> [ 1184.201331] ============================================
> [ 1184.206716] [ INFO: possible recursive locking detected ]
> [ 1184.212111] 3.7.0-rc8-00013-gdf2fc24 #438 Not tainted
> [ 1184.217156] ---------------------------------------------
> [ 1184.222544] ceph-osd/42177 is trying to acquire lock:
> [ 1184.227589] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
> [ 1184.237270]
> [ 1184.237270] but task is already holding lock:
> [ 1184.243114] (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs]
> [ 1184.252786]
> [ 1184.252786] other info that might help us debug this:
> [ 1184.259303] Possible unsafe locking scenario:
> [ 1184.259303]
> [ 1184.265220] CPU0
> [ 1184.267680] ----
> [ 1184.270133] lock(&fs_info->chunk_mutex);
> [ 1184.274276] lock(&fs_info->chunk_mutex);
> [ 1184.278417]
> [ 1184.278417] *** DEADLOCK ***
> [ 1184.278417]
> [ 1184.284325] May be due to missing lock nesting notation
> [ 1184.284325]
> [ 1184.291099] 4 locks held by ceph-osd/42177:
> [ 1184.295277] #0: (sb_writers#7){.+.+.+}, at: [<ffffffffa0584664>] btrfs_file_aio_write+0x64/0x320 [btrfs]
> [ 1184.305103] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffffa058466e>] btrfs_file_aio_write+0x6e/0x320 [btrfs]
> [ 1184.316108] #2: (sb_internal){.+.+..}, at: [<ffffffffa05746f4>] start_transaction+0x1c4/0x450 [btrfs]
> [ 1184.325632] #3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340
[btrfs] > [ 1184.335761] > [ 1184.335761] stack backtrace: > [ 1184.340126] Pid: 42177, comm: ceph-osd Not tainted 3.7.0-rc8-00013-gdf2fc24 #438 > [ 1184.347508] Call Trace: > [ 1184.349962] [<ffffffff81042eca>] ? vprintk_emit+0x42a/0x4c0 > [ 1184.355619] [<ffffffff8109b579>] print_deadlock_bug+0xe9/0x100 > [ 1184.361556] [<ffffffff8109d246>] validate_chain+0x596/0x750 > [ 1184.367222] [<ffffffff8109d849>] __lock_acquire+0x449/0x510 > [ 1184.372894] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1184.379417] [<ffffffff8109d9d9>] lock_acquire+0xc9/0x120 > [ 1184.384855] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1184.391377] [<ffffffff8109d849>] ? __lock_acquire+0x449/0x510 > [ 1184.397204] [<ffffffff814ab17d>] __mutex_lock_common+0x5d/0x3a0 > [ 1184.403221] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1184.409762] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1184.416323] [<ffffffffa055a5c9>] ? do_chunk_alloc+0x179/0x340 [btrfs] > [ 1184.422849] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 > [ 1184.428640] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs] > [ 1184.435018] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs] > [ 1184.441555] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1184.449051] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] > [ 1184.455843] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] > [ 1184.462828] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] > [ 1184.469471] [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] > [ 1184.476962] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] > [ 1184.483426] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] > [ 1184.490067] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] > [ 1184.497199] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 > [ 1184.503151] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] > [ 1184.510125] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] > [ 1184.516928] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] > [ 1184.523374] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] > [ 1184.530179] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] > [ 1184.536555] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1184.544070] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] > [ 1184.551574] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] > [ 1184.558987] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] > [ 1184.566042] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] > [ 1184.572944] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] > [ 1184.579984] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 > [ 1184.585985] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 > [ 1184.591645] [<ffffffff81183112>] ? fget_light+0x122/0x170 > [ 1184.597131] [<ffffffff81166166>] vfs_writev+0x46/0x60 > [ 1184.602266] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 > [ 1184.607398] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 1184.613832] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b > > and then a little later: > > [ 1319.463719] INFO: task ceph-osd:42177 blocked for more than 120 seconds. > [ 1319.470409] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [ 1319.478227] ceph-osd D 0000000000000246 0 42177 1 0x00000000 > [ 1319.485330] ffff880402c4d358 0000000000000046 00007fffeccc49d0 ffff880402c4dfd8 > [ 1319.492852] ffff880402c4c010 ffff880402c4c000 ffff880402c4c000 ffff880402c4c000 > [ 1319.500449] ffff880402c4dfd8 ffff880402c4c000 ffff880c2151bec0 ffff8804e9403ec0 > [ 1319.508012] Call Trace: > [ 1319.510495] [<ffffffff814acc5d>] schedule+0x5d/0x60 > [ 1319.515478] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 > [ 1319.521967] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 > [ 1319.528090] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1319.534659] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1319.541264] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 > [ 1319.547118] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs] > [ 1319.553532] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs] > [ 1319.560122] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1319.567620] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] > [ 1319.574460] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] > [ 1319.581464] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] > [ 1319.588117] [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] > [ 1319.595644] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] > [ 1319.602110] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] > [ 1319.608767] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] > [ 1319.615897] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 > [ 1319.621910] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] > [ 1319.628893] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] > [ 1319.635736] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] > [ 1319.642184] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] > [ 1319.649009] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] > [ 1319.655366] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1319.662935] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] > [ 1319.670465] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] > [ 1319.677891] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] > [ 1319.685000] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] > [ 1319.691952] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] > [ 1319.699054] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 > [ 1319.705118] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 > [ 1319.710828] [<ffffffff81183112>] ? fget_light+0x122/0x170 > [ 1319.716324] [<ffffffff81166166>] vfs_writev+0x46/0x60 > [ 1319.721466] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 > [ 1319.726674] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 1319.733198] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b > [ 1319.739274] INFO: lockdep is turned off. > [ 1319.743261] INFO: task ceph-osd:42236 blocked for more than 120 seconds. > [ 1319.749953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [ 1319.757812] ceph-osd D 0000000000000246 0 42236 1 0x00000000 > [ 1319.765056] ffff88052e7bdcc8 0000000000000046 ffff880624666120 ffff88052e7bdfd8 > [ 1319.772785] ffff88052e7bc010 ffff88052e7bc000 ffff88052e7bc000 ffff88052e7bc000 > [ 1319.780332] ffff88052e7bdfd8 ffff88052e7bc000 ffff88062453bec0 ffff88052e7b5e20 > [ 1319.787905] Call Trace: > [ 1319.790368] [<ffffffff814acc5d>] schedule+0x5d/0x60 > [ 1319.795335] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 > [ 1319.801762] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 > [ 1319.807861] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] > [ 1319.814143] [<ffffffff811703cf>] ? final_putname+0x3f/0x50 > [ 1319.819724] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs] > [ 1319.825982] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 > [ 1319.831731] [<ffffffffa0548b4f>] btrfs_statfs+0x5f/0x240 [btrfs] > [ 1319.837846] [<ffffffff811968c0>] statfs_by_dentry+0x60/0x90 > [ 1319.843508] [<ffffffff81196a0b>] vfs_statfs+0x1b/0xb0 > [ 1319.848659] [<ffffffff81196c00>] user_statfs+0x40/0x60 > [ 1319.853887] [<ffffffff81196caa>] sys_statfs+0x2a/0x50 > [ 1319.859036] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b > [ 1319.865035] INFO: lockdep is turned off. > > and then later still: > > [ 1439.546768] INFO: task btrfs-transacti:27591 blocked for more than 120 seconds. > [ 1439.554139] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 1439.561949] btrfs-transacti D ffff880993231810 0 27591 2 0x00000000 > [ 1439.569205] ffff8808e79d3c58 0000000000000046 0000000000000000 ffff8808e79d3fd8 > [ 1439.576820] ffff8808e79d2010 ffff8808e79d2000 ffff8808e79d2000 ffff8808e79d2000 > [ 1439.584384] ffff8808e79d3fd8 ffff8808e79d2000 ffff880b3f165e20 ffff8808d8cdbec0 > [ 1439.592040] Call Trace: > [ 1439.594573] [<ffffffff814acc5d>] schedule+0x5d/0x60 > [ 1439.599646] [<ffffffff814aac3d>] schedule_timeout+0x3d/0x260 > [ 1439.605402] [<ffffffff8109c410>] ? trace_hardirqs_on_caller+0x20/0x1d0 > [ 1439.612023] [<ffffffff8109c5cd>] ? trace_hardirqs_on+0xd/0x10 > [ 1439.617939] [<ffffffff8106781a>] ? prepare_to_wait+0x7a/0x90 > [ 1439.623788] [<ffffffffa0573a46>] btrfs_commit_transaction+0x336/0xa10 [btrfs] > [ 1439.631128] [<ffffffffa05748fb>] ? start_transaction+0x3cb/0x450 [btrfs] > [ 1439.637943] [<ffffffff810674e0>] ? wake_up_bit+0x40/0x40 > [ 1439.643380] [<ffffffffa056c223>] transaction_kthread+0x123/0x210 [btrfs] > [ 1439.650259] [<ffffffffa056c100>] ? btrfs_bio_wq_end_io+0x90/0x90 [btrfs] > [ 1439.657134] [<ffffffff81066dd1>] kthread+0xe1/0xf0 > [ 1439.662109] [<ffffffff81066cf0>] ? __init_kthread_worker+0x70/0x70 > [ 1439.668398] [<ffffffff814b6f9c>] ret_from_fork+0x7c/0xb0 > [ 1439.673858] [<ffffffff81066cf0>] ? __init_kthread_worker+0x70/0x70 > [ 1439.680181] INFO: lockdep is turned off. > [ 1439.685095] INFO: task ceph-osd:42177 blocked for more than 120 seconds. > [ 1439.691847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [ 1439.699743] ceph-osd D 0000000000000246 0 42177 1 0x00000000 > [ 1439.706941] ffff880402c4d358 0000000000000046 00007fffeccc49d0 ffff880402c4dfd8 > [ 1439.714494] ffff880402c4c010 ffff880402c4c000 ffff880402c4c000 ffff880402c4c000 > [ 1439.722039] ffff880402c4dfd8 ffff880402c4c000 ffff880c2151bec0 ffff8804e9403ec0 > [ 1439.729688] Call Trace: > [ 1439.732217] [<ffffffff814acc5d>] schedule+0x5d/0x60 > [ 1439.737179] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40 > [ 1439.743705] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0 > [ 1439.749843] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1439.756481] [<ffffffffa055a5d3>] ? do_chunk_alloc+0x183/0x340 [btrfs] > [ 1439.763037] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60 > [ 1439.768854] [<ffffffffa055a5d3>] do_chunk_alloc+0x183/0x340 [btrfs] > [ 1439.775290] [<ffffffffa05635bc>] find_free_extent+0xa3c/0xb70 [btrfs] > [ 1439.781932] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1439.789534] [<ffffffffa0563772>] btrfs_reserve_extent+0x82/0x190 [btrfs] > [ 1439.796474] [<ffffffffa0563905>] btrfs_alloc_free_block+0x85/0x230 [btrfs] > [ 1439.803499] [<ffffffffa054fb6a>] __btrfs_cow_block+0x14a/0x4b0 [btrfs] > [ 1439.810253] [<ffffffffa05a50d3>] ? btrfs_set_lock_blocking_rw+0xe3/0x160 [btrfs] > [ 1439.817877] [<ffffffffa055056d>] btrfs_cow_block+0x18d/0x230 [btrfs] > [ 1439.824427] [<ffffffffa0553460>] btrfs_search_slot+0x360/0x730 [btrfs] > [ 1439.831162] [<ffffffffa055426d>] btrfs_insert_empty_items+0x8d/0x100 [btrfs] > [ 1439.838423] [<ffffffff81159be3>] ? kmem_cache_alloc+0xd3/0x170 > [ 1439.844514] [<ffffffffa0596f25>] btrfs_alloc_dev_extent+0xb5/0x190 [btrfs] > [ 1439.851619] [<ffffffffa059b0d7>] __btrfs_alloc_chunk+0x6a7/0x750 [btrfs] > [ 1439.858535] [<ffffffffa059b1de>] btrfs_alloc_chunk+0x5e/0x90 [btrfs] > [ 1439.865079] [<ffffffffa055a391>] ? check_system_chunk+0x71/0x130 [btrfs] > [ 1439.871939] [<ffffffffa055a71e>] do_chunk_alloc+0x2ce/0x340 [btrfs] > [ 1439.878375] [<ffffffffa05569e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs] > [ 1439.885933] [<ffffffffa055e35c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs] > [ 1439.893514] [<ffffffffa055ff00>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs] > [ 1439.901018] [<ffffffffa05842de>] __btrfs_buffered_write+0x15e/0x350 [btrfs] > [ 1439.908226] [<ffffffffa0584809>] btrfs_file_aio_write+0x209/0x320 [btrfs] > [ 1439.915235] [<ffffffffa0584600>] ? __btrfs_direct_write+0x130/0x130 [btrfs] > [ 1439.922375] [<ffffffff81165014>] do_sync_readv_writev+0x94/0xe0 > [ 1439.928475] [<ffffffff81166023>] do_readv_writev+0xe3/0x1e0 > [ 1439.934197] [<ffffffff81183112>] ? fget_light+0x122/0x170 > [ 1439.939738] [<ffffffff81166166>] vfs_writev+0x46/0x60 > [ 1439.944948] [<ffffffff8116629f>] sys_writev+0x5f/0xc0 > [ 1439.950139] [<ffffffff81264b6e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [ 1439.956621] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b > [ 1439.962742] INFO: lockdep is turned off. > [ 1439.966769] INFO: task ceph-osd:42236 blocked for more than 120 seconds. > [ 1439.973558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> [ 1439.981513] ceph-osd D 0000000000000246 0 42236 1 0x00000000
> [ 1439.988788] ffff88052e7bdcc8 0000000000000046 ffff880624666120 ffff88052e7bdfd8
> [ 1439.996611] ffff88052e7bc010 ffff88052e7bc000 ffff88052e7bc000 ffff88052e7bc000
> [ 1440.004237] ffff88052e7bdfd8 ffff88052e7bc000 ffff88062453bec0 ffff88052e7b5e20
> [ 1440.011956] Call Trace:
> [ 1440.014516] [<ffffffff814acc5d>] schedule+0x5d/0x60
> [ 1440.019578] [<ffffffff814acf67>] schedule_preempt_disabled+0x27/0x40
> [ 1440.026054] [<ffffffff814ab375>] __mutex_lock_common+0x255/0x3a0
> [ 1440.032245] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs]
> [ 1440.038624] [<ffffffff811703cf>] ? final_putname+0x3f/0x50
> [ 1440.044323] [<ffffffffa0548b4f>] ? btrfs_statfs+0x5f/0x240 [btrfs]
> [ 1440.050952] [<ffffffff814ab62a>] mutex_lock_nested+0x4a/0x60
> [ 1440.056805] [<ffffffffa0548b4f>] btrfs_statfs+0x5f/0x240 [btrfs]
> [ 1440.063029] [<ffffffff811968c0>] statfs_by_dentry+0x60/0x90
> [ 1440.068739] [<ffffffff81196a0b>] vfs_statfs+0x1b/0xb0
> [ 1440.073962] [<ffffffff81196c00>] user_statfs+0x40/0x60
> [ 1440.079203] [<ffffffff81196caa>] sys_statfs+0x2a/0x50
> [ 1440.084401] [<ffffffff814b7042>] system_call_fastpath+0x16/0x1b
> [ 1440.090816] INFO: lockdep is turned off.
>
> Please let me know what I can do to help resolve this issue.
>
> Thanks -- Jim
On 12/09/2012 07:04 AM, Liu Bo wrote:
> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
>> Hi,
>>
>> I'm hitting a btrfs locking issue with 3.7.0-rc8.
>>
>> The btrfs filesystem in question is backing a Ceph OSD
>> under a heavy write load from many cephfs clients.
>>
>> I reported this issue a while ago:
>> http://www.spinics.net/lists/linux-btrfs/msg19370.html
>> when I was testing what I thought might be part of the
>> 3.7 btrfs patch queue, using Josef Bacik's btrfs-next tree.
>>
>> I spent some time attempting to bisect the btrfs patch queue
>> just before it was merged for 3.7, but got nowhere due to
>> false negatives.
>>
>> I've just been able to get back to testing 3.7-rc, and found
>> that I can still trigger the issue.
>
> Hi Jim,
>
> Could you please apply the following patch to test if it works?

Hi,

So far, with your patch applied I've been unable to reproduce
the recursive deadlock. Thanks a lot for this patch!
This issue has been troubling me for a while.

I've been trying to learn more about btrfs internals -
if you have the time to answer a couple questions about
your patch, I'd really appreciate it.

>
> (It's against 3.7-rc8.)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d3e2c1..100289b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
>
>      if (data)
>              flags = BTRFS_BLOCK_GROUP_DATA;
> -    else if (root == root->fs_info->chunk_root)
> +    else if (root == root->fs_info->chunk_root ||
> +             root == root->fs_info->dev_root)
>              flags = BTRFS_BLOCK_GROUP_SYSTEM;
>      else
>              flags = BTRFS_BLOCK_GROUP_METADATA;
> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
>              num_dev = 1;    /* DUP or single */
>
>      /* metadata for updaing devices and chunk tree */
> +    num_dev = num_dev << 1;

AFAICS this is doubling the size of the reserve, which
reduces the chance of a recursive do_chunk_alloc(), right?

>      return btrfs_calc_trans_metadata_size(root, num_dev + 1);

btrfs_calc_trans_metadata_size(root, num_items) multiplies its
num_items argument by another factor of three - do you know if
there is some rationale behind that number, or is it
perhaps also an empirically determined factor?

What I'm wondering is: if the size of the reserve is
empirically determined, will it need to be increased again
later, when machines are more capable and can handle a higher
load?

Do you think it's feasible to modify the locking for
do_chunk_alloc to allow it to recurse without deadlock?

Thanks -- Jim

> }
>
> @@ -4351,7 +4353,7 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
>
>      fs_info->extent_root->block_rsv = &fs_info->global_block_rsv;
>      fs_info->csum_root->block_rsv = &fs_info->global_block_rsv;
> -    fs_info->dev_root->block_rsv = &fs_info->global_block_rsv;
> +    fs_info->dev_root->block_rsv = &fs_info->chunk_block_rsv;
>      fs_info->tree_root->block_rsv = &fs_info->global_block_rsv;
>      fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
>
>
> thanks,
> liubo
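
P.S. To make my question about the numbers concrete, here is the back-of-envelope
calculation I did while trying to understand the new threshold. It assumes I'm
reading btrfs_calc_trans_metadata_size() in 3.7's ctree.h correctly (a per-item
cost of (leafsize + nodesize * (BTRFS_MAX_LEVEL - 1)) * 3) and it assumes 4K
leaves/nodes - please correct me if either assumption is off:

/* throwaway userspace sketch, not kernel code */
#include <stdio.h>

#define BTRFS_MAX_LEVEL 8

static unsigned long long system_chunk_thresh(unsigned long long leafsize,
                                              unsigned long long nodesize,
                                              unsigned long long num_dev)
{
        /* my reading of the per-item cost in btrfs_calc_trans_metadata_size() */
        unsigned long long per_item =
                (leafsize + nodesize * (BTRFS_MAX_LEVEL - 1)) * 3;

        num_dev <<= 1;          /* the doubling your patch adds */
        return per_item * (num_dev + 1);
}

int main(void)
{
        /* 4K leaves/nodes: per item = (4096 + 7 * 4096) * 3 = 96K */
        printf("single/DUP: %llu\n", system_chunk_thresh(4096, 4096, 1)); /* 294912 = 288K */
        printf("RAID1:      %llu\n", system_chunk_thresh(4096, 4096, 2)); /* 491520 = 480K */
        return 0;
}

If that arithmetic is right, a single-device filesystem goes from a 192K threshold
to 288K with your patch.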
On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
> On 12/09/2012 07:04 AM, Liu Bo wrote:
> > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
> > Hi Jim,
> >
> > Could you please apply the following patch to test if it works?
>
> Hi,
>
> So far, with your patch applied I've been unable to reproduce
> the recursive deadlock. Thanks a lot for this patch!
> This issue has been troubling me for a while.

Hi Jim,

Good news for us :)

> I've been trying to learn more about btrfs internals -
> if you have the time to answer a couple questions about
> your patch, I'd really appreciate it.

See below.

> >
> > (It's against 3.7-rc8.)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 3d3e2c1..100289b 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
> >
> >      if (data)
> >              flags = BTRFS_BLOCK_GROUP_DATA;
> > -    else if (root == root->fs_info->chunk_root)
> > +    else if (root == root->fs_info->chunk_root ||
> > +             root == root->fs_info->dev_root)
> >              flags = BTRFS_BLOCK_GROUP_SYSTEM;
> >      else
> >              flags = BTRFS_BLOCK_GROUP_METADATA;
> > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
> >              num_dev = 1;    /* DUP or single */
> >
> >      /* metadata for updaing devices and chunk tree */
> > +    num_dev = num_dev << 1;
>
> AFAICS this is doubling the size of the reserve, which
> reduces the chance of a recursive do_chunk_alloc(), right?

Not like that - we hit the deadlock because updating the device tree also
uses METADATA chunks, and that update may happen while we're actually
allocating a METADATA chunk. So the patch I sent you makes updating the
device tree use the SYSTEM chunk instead, and we have some code that checks
whether it is enough before allocating a chunk (if not, we'll allocate a
SYSTEM chunk first).

Here I double the size just because the worst case of allocating a
DATA/METADATA chunk -may- result in:

1) adding a SYSTEM chunk +
2) adding a dev extent per chunk stripe +
3) updating each chunk stripe's bytes_used

> >      return btrfs_calc_trans_metadata_size(root, num_dev + 1);
>
> btrfs_calc_trans_metadata_size(root, num_items) multiplies its
> num_items argument by another factor of three - do you know if
> there is some rationale behind that number, or is it
> perhaps also an empirically determined factor?

The height of Btrfs's metadata btree is at most 8:
a leaf is on level 0, while a node is at most on level 7.

Each btree update may result in COWing a node and its sibling nodes,
which is where the factor of tree comes from.

> What I'm wondering is: if the size of the reserve is
> empirically determined, will it need to be increased again
> later, when machines are more capable and can handle a higher
> load?
>
> Do you think it's feasible to modify the locking for
> do_chunk_alloc to allow it to recurse without deadlock?

Well, it could be, but IMO it'll bring us complexity, and so worse
maintenance.

Any questions? Feel free to ask.

thanks,
liubo
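
P.S. To map this back onto the lockdep report at the top of the thread, the
recursion we're talking about is, condensed from that trace (outermost call
first):

  do_chunk_alloc()                      <- takes fs_info->chunk_mutex
    check_system_chunk()
    btrfs_alloc_chunk()
      __btrfs_alloc_chunk()
        btrfs_alloc_dev_extent()        <- inserts items into the device tree
          btrfs_insert_empty_items()
            btrfs_search_slot()
              btrfs_cow_block()         <- needs a new metadata block for the COW
                btrfs_alloc_free_block()
                  btrfs_reserve_extent()
                    find_free_extent()
                      do_chunk_alloc()  <- wants chunk_mutex again -> deadlock

Roughly speaking, with the patch the device tree's blocks come out of the
SYSTEM space that check_system_chunk() has already made sure is big enough,
so the inner do_chunk_alloc() is never needed.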
On Wed, Dec 12, 2012 at 09:37:37AM +0800, Liu Bo wrote:
> On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
> > On 12/09/2012 07:04 AM, Liu Bo wrote:
> > > On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
> > > Hi Jim,
> > >
> > > Could you please apply the following patch to test if it works?
> >
> > Hi,
> >
> > So far, with your patch applied I've been unable to reproduce
> > the recursive deadlock. Thanks a lot for this patch!
> > This issue has been troubling me for a while.
>
> Hi Jim,
>
> Good news for us :)
>
> > I've been trying to learn more about btrfs internals -
> > if you have the time to answer a couple questions about
> > your patch, I'd really appreciate it.
>
> See below.
>
> > >
> > > (It's against 3.7-rc8.)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 3d3e2c1..100289b 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
> > >
> > >      if (data)
> > >              flags = BTRFS_BLOCK_GROUP_DATA;
> > > -    else if (root == root->fs_info->chunk_root)
> > > +    else if (root == root->fs_info->chunk_root ||
> > > +             root == root->fs_info->dev_root)
> > >              flags = BTRFS_BLOCK_GROUP_SYSTEM;
> > >      else
> > >              flags = BTRFS_BLOCK_GROUP_METADATA;
> > > @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
> > >              num_dev = 1;    /* DUP or single */
> > >
> > >      /* metadata for updaing devices and chunk tree */
> > > +    num_dev = num_dev << 1;
> >
> > AFAICS this is doubling the size of the reserve, which
> > reduces the chance of a recursive do_chunk_alloc(), right?
>
> Not like that - we hit the deadlock because updating the device tree also
> uses METADATA chunks, and that update may happen while we're actually
> allocating a METADATA chunk. So the patch I sent you makes updating the
> device tree use the SYSTEM chunk instead, and we have some code that checks
> whether it is enough before allocating a chunk (if not, we'll allocate a
> SYSTEM chunk first).
>
> Here I double the size just because the worst case of allocating a
> DATA/METADATA chunk -may- result in:
>
> 1) adding a SYSTEM chunk +
> 2) adding a dev extent per chunk stripe +
> 3) updating each chunk stripe's bytes_used
>
> > >      return btrfs_calc_trans_metadata_size(root, num_dev + 1);
> >
> > btrfs_calc_trans_metadata_size(root, num_items) multiplies its
> > num_items argument by another factor of three - do you know if
> > there is some rationale behind that number, or is it
> > perhaps also an empirically determined factor?
>
> The height of Btrfs's metadata btree is at most 8:
> a leaf is on level 0, while a node is at most on level 7.
>
> Each btree update may result in COWing a node and its sibling nodes,
> which is where the factor of tree comes from.

s/tree/three/g

> > What I'm wondering is: if the size of the reserve is
> > empirically determined, will it need to be increased again
> > later, when machines are more capable and can handle a higher
> > load?
> >
> > Do you think it's feasible to modify the locking for
> > do_chunk_alloc to allow it to recurse without deadlock?
>
> Well, it could be, but IMO it'll bring us complexity, and so worse
> maintenance.
>
> Any questions? Feel free to ask.
>
> thanks,
> liubo
On 12/11/2012 06:37 PM, Liu Bo wrote:
> On Tue, Dec 11, 2012 at 09:33:15AM -0700, Jim Schutt wrote:
>> On 12/09/2012 07:04 AM, Liu Bo wrote:
>>> On Wed, Dec 05, 2012 at 09:07:05AM -0700, Jim Schutt wrote:
>>> Hi Jim,
>>>
>>> Could you please apply the following patch to test if it works?
>>
>> Hi,
>>
>> So far, with your patch applied I've been unable to reproduce
>> the recursive deadlock. Thanks a lot for this patch!
>> This issue has been troubling me for a while.
>
> Hi Jim,
>
> Good news for us :)
>
>> I've been trying to learn more about btrfs internals -
>> if you have the time to answer a couple questions about
>> your patch, I'd really appreciate it.
>
> See below.
>
>>>
>>> (It's against 3.7-rc8.)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3d3e2c1..100289b 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -3346,7 +3346,8 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
>>>
>>>      if (data)
>>>              flags = BTRFS_BLOCK_GROUP_DATA;
>>> -    else if (root == root->fs_info->chunk_root)
>>> +    else if (root == root->fs_info->chunk_root ||
>>> +             root == root->fs_info->dev_root)
>>>              flags = BTRFS_BLOCK_GROUP_SYSTEM;
>>>      else
>>>              flags = BTRFS_BLOCK_GROUP_METADATA;
>>> @@ -3535,6 +3536,7 @@ static u64 get_system_chunk_thresh(struct btrfs_root *root, u64 type)
>>>              num_dev = 1;    /* DUP or single */
>>>
>>>      /* metadata for updaing devices and chunk tree */
>>> +    num_dev = num_dev << 1;
>>
>> AFAICS this is doubling the size of the reserve, which
>> reduces the chance of a recursive do_chunk_alloc(), right?
>
> Not like that - we hit the deadlock because updating the device tree also
> uses METADATA chunks, and that update may happen while we're actually
> allocating a METADATA chunk. So the patch I sent you makes updating the
> device tree use the SYSTEM chunk instead, and we have some code that checks
> whether it is enough before allocating a chunk (if not, we'll allocate a
> SYSTEM chunk first).
>
> Here I double the size just because the worst case of allocating a
> DATA/METADATA chunk -may- result in:
>
> 1) adding a SYSTEM chunk +
> 2) adding a dev extent per chunk stripe +
> 3) updating each chunk stripe's bytes_used
>
>>>      return btrfs_calc_trans_metadata_size(root, num_dev + 1);
>>
>> btrfs_calc_trans_metadata_size(root, num_items) multiplies its
>> num_items argument by another factor of three - do you know if
>> there is some rationale behind that number, or is it
>> perhaps also an empirically determined factor?
>
> The height of Btrfs's metadata btree is at most 8:
> a leaf is on level 0, while a node is at most on level 7.
>
> Each btree update may result in COWing a node and its sibling nodes,
> which is where the factor of tree comes from.
>
>> What I'm wondering is: if the size of the reserve is
>> empirically determined, will it need to be increased again
>> later, when machines are more capable and can handle a higher
>> load?
>>
>> Do you think it's feasible to modify the locking for
>> do_chunk_alloc to allow it to recurse without deadlock?
>
> Well, it could be, but IMO it'll bring us complexity, and so worse
> maintenance.
>
> Any questions? Feel free to ask.

Your response was very helpful. Thanks a lot!

-- Jim

>
> thanks,
> liubo
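
P.S. Mostly for my own notes, counting your worst case for a RAID1 chunk (two
stripes), and guessing at which btree items each of your three cases touches
(please correct me if the bookkeeping is wrong):

    1 item  - adding the SYSTEM chunk
  + 2 items - one new dev extent per stripe
  + 2 items - updating bytes_used in each stripe's device item
  ---------
    5 items = (num_dev << 1) + 1 with num_dev = 2

which matches what get_system_chunk_thresh() now reserves for with your patch.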