David Sterba
2013-Apr-18 14:42 UTC
[bug] 3.9-rc7+next: NULL deref in btrfs_next_old_leaf/btrfs_search_slot
Hi, xfstests loop has hit this after a day, failing test was 276. The sources are btrfs-next/linus-base branch. I''ve hit this some time ago with 3.9.0-rc4-default+ . [64394.422743] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 [64394.426716] IP: [<ffffffffa0010e0f>] btrfs_search_slot+0xbf/0x9e0 [btrfs] [64394.426716] PGD 79201067 PUD 771c4067 PMD 0 [64394.426716] Oops: 0002 [#1] SMP [64394.426716] Modules linked in: aoe dm_crypt loop btrfs [64394.426716] CPU 1 [64394.426716] Pid: 32556, comm: btrfs-cache-1 Not tainted 3.9.0-rc7-default+ #292 Intel Corporation Santa Rosa platform/Matanzas [64394.426716] RIP: 0010:[<ffffffffa0010e0f>] [<ffffffffa0010e0f>] btrfs_search_slot+0xbf/0x9e0 [btrfs] [64394.426716] RSP: 0018:ffff88004cb4fb68 EFLAGS: 00010202 [64394.426716] RAX: 0000000000000000 RBX: ffff880003ec5b18 RCX: ffff880077893000 [64394.426716] RDX: 0000000000000000 RSI: ffff880077893000 RDI: 0000000000000000 [64394.426716] RBP: ffff88004cb4fbf8 R08: 0000000000000000 R09: 0000000000000000 [64394.426716] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [64394.426716] R13: 0000000000000030 R14: 0000000000000000 R15: 0000160000000000 [64394.426716] FS: 0000000000000000(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000 [64394.426716] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [64394.426716] CR2: 0000000000000078 CR3: 00000000760cf000 CR4: 00000000000007e0 [64394.426716] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [64394.426716] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [64394.426716] Process btrfs-cache-1 (pid: 32556, threadinfo ffff88004cb4e000, task ffff8800250f0a80) [64394.426716] Stack: [64394.426716] 000000794cb4fc58 ffff880077893000 0000100000000001 ffffffff00000000 [64394.426716] 0000000000000001 0000000000000000 00000000364d0000 ffff88004cb4fc58 [64394.426716] 000000006cfbb7a8 ffff880077893000 0000000000000000 00000032ffffffff [64394.426716] Call Trace: [64394.426716] [<ffffffffa0014977>] btrfs_next_old_leaf+0x247/0x4e0 [btrfs] [64394.426716] [<ffffffffa0014c20>] btrfs_next_leaf+0x10/0x20 [btrfs] [64394.426716] [<ffffffffa0019775>] caching_thread+0x1f5/0x3c0 [btrfs] [64394.426716] [<ffffffffa0063cc4>] worker_loop+0xc4/0x5a0 [btrfs] [64394.426716] [<ffffffffa0063c00>] ? btrfs_queue_worker+0x330/0x330 [btrfs] [64394.426716] [<ffffffff8107669e>] kthread+0xde/0xf0 [64394.426716] [<ffffffff810765c0>] ? flush_kthread_worker+0x1e0/0x1e0 [64394.426716] [<ffffffff81967eac>] ret_from_fork+0x7c/0xb0 [64394.426716] [<ffffffff810765c0>] ? flush_kthread_worker+0x1e0/0x1e0 [64394.426716] Code: 8c ff ff ff ff 45 0f b6 e4 a8 10 49 bf 00 00 00 00 00 16 00 00 44 89 65 88 0f 84 a6 01 00 00 48 8b 4d b8 48 8b 41 08 48 89 45 c0 <f0> ff 40 78 48 8b 45 c0 4c 89 fa 48 b9 b7 6d db b6 6d db b6 6d [64394.426716] RIP [<ffffffffa0010e0f>] btrfs_search_slot+0xbf/0x9e0 [btrfs] [64394.426716] RSP <ffff88004cb4fb68> [64394.426716] CR2: 0000000000000078 Happens probably during umount of the scratch partition: 32562 pts/0 D+ 0:00 /bin/umount /dev/sda9 [<ffffffff813a55c3>] call_rwsem_down_write_failed+0x13/0x20 [<ffffffffa00227ff>] btrfs_free_block_groups+0x2f/0x390 [btrfs] [<ffffffffa0030606>] close_ctree+0x146/0x240 [btrfs] [<ffffffffa0003bf9>] btrfs_put_super+0x19/0x20 [btrfs] [<ffffffff81163572>] generic_shutdown_super+0x62/0xf0 [<ffffffff81163696>] kill_anon_super+0x16/0x30 [<ffffffffa0004b2a>] btrfs_kill_super+0x1a/0x90 [btrfs] [<ffffffff811638ed>] deactivate_locked_super+0x3d/0x90 [<ffffffff8116458a>] deactivate_super+0x4a/0x70 [<ffffffff81180720>] mntput_no_expire+0x100/0x160 [<ffffffff811817bc>] sys_umount+0xcc/0x3c0 [<ffffffff81967f59>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff /dev/sda5 on /mnt/a1 type btrfs (rw,noatime,compress-force=lzo,space_cache,enospc_debug) $ btrfs fi df /mnt/a1 Data, RAID10: total=17.95GB, used=41.12MB Data: total=8.00MB, used=0.00 System, RAID10: total=16.00MB, used=4.00KB System: total=4.00MB, used=0.00 Metadata, RAID10: total=2.00GB, used=2.71MB Metadata: total=8.00MB, used=0.00 previous test parameters were same as now: MKFS_OPTIONS -- -m single -d single /dev/sda9 MOUNT_OPTIONS -- -o enospc_debug,space_cache,noatime,compress-force=lzo /dev/sda9 /mnt/a2 david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-Apr-18 14:49 UTC
Re: [bug] 3.9-rc7+next: NULL deref in btrfs_next_old_leaf/btrfs_search_slot
On Thu, Apr 18, 2013 at 04:42:18PM +0200, David Sterba wrote:> xfstests loop has hit this after a day, failing test was 276.sorry it''s test 273 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-Apr-25 17:18 UTC
Re: [bug] 3.9-rc7+next: NULL deref in btrfs_next_old_leaf/btrfs_search_slot
On Thu, Apr 18, 2013 at 04:42:18PM +0200, David Sterba wrote:> [64394.422743] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 > [64394.426716] RIP: 0010:[<ffffffffa0010e0f>] [<ffffffffa0010e0f>] btrfs_search_slot+0xbf/0x9e0 [btrfs] > [64394.426716] [<ffffffffa0014977>] btrfs_next_old_leaf+0x247/0x4e0 [btrfs] > [64394.426716] [<ffffffffa0014c20>] btrfs_next_leaf+0x10/0x20 [btrfs]The bisection set was reduced to these patches (on top of cmason/for-linus 4bc4bee45): 1 Btrfs: cleanup unused function Liu Bo 2 btrfs: enhance superblock checks David Sterba 3 Btrfs: add some free space cache tests Josef Bacik 4 btrfs: merge save_error_info helpers into one David Sterba 5 btrfs: clean up transaction abort messages David Sterba 6 Btrfs: cleanup unused arguments of btrfs_csum_data Liu Bo 7 Btrfs: use helper to cleanup tree roots Liu Bo 8 Btrfs: add a incompatible format change for smaller metadata extent ref Josef Bacik The bisecting process points to patch 8, ie the ''first bad'', but the reproducer has proved to be unreliable and I think it''s very sensitive to scheduling timing. Reproducer is to simply run 273 in a loop on a single/single filesystem, the null deref happens during umount. The tricky part is that it does not happen every time, I must not touch the testbox nor let any process run except ''dstat''. It usually crashed on first or second test run, but I''ve left it up to 10 to be sure. I don''t think it''s caused by the skinny metadata, because it does not touch the affected functions, but somehow helps to make it visible. Although the test never crashed when patch 7 Btrfs: use helper to cleanup tree roots http://git.kernel.org/cgit/linux/kernel/git/josef/btrfs-next.git/commit/?id=3fa215686c574d26f43f1bcf6c9f69658e02908f was on top (I once left it running overnight, all fine), that''s my main suspect: --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3483,20 +3483,7 @@ int close_ctree(struct btrfs_root *root) percpu_counter_sum(&fs_info->delalloc_bytes)); } - free_extent_buffer(fs_info->extent_root->node); - free_extent_buffer(fs_info->extent_root->commit_root); - free_extent_buffer(fs_info->tree_root->node); - free_extent_buffer(fs_info->tree_root->commit_root); - free_extent_buffer(fs_info->chunk_root->node); - free_extent_buffer(fs_info->chunk_root->commit_root); - free_extent_buffer(fs_info->dev_root->node); - free_extent_buffer(fs_info->dev_root->commit_root); - free_extent_buffer(fs_info->csum_root->node); - free_extent_buffer(fs_info->csum_root->commit_root); - if (fs_info->quota_root) { - free_extent_buffer(fs_info->quota_root->node); - free_extent_buffer(fs_info->quota_root->commit_root); - } + free_root_pointers(fs_info, 1); and if you look what free_root_pointers does, sets all the pointers to NULL. The crash site: gdb) l *(btrfs_search_slot+0xb6) 0x11076 is in btrfs_search_slot (/home/dsterba/linux-2.6/arch/x86/include/asm/atomic.h:95). 90 * 91 * Atomically increments @v by 1. 92 */ 93 static inline void atomic_inc(atomic_t *v) 94 { 95 asm volatile(LOCK_PREFIX "incl %0" 96 : "+m" (v->counter)); 97 } (gdb) l *(btrfs_search_slot+0xb5) 0x11075 is in btrfs_search_slot (fs/btrfs/ctree.c:2513). 2508 if (p->search_commit_root) { 2509 /* 2510 * the commit roots are read only 2511 * so we always do read locks 2512 */ 2513 b = root->commit_root; 2514 extent_buffer_get(b); 2515 level = btrfs_header_level(b); 2516 if (!p->skip_locking) 2517 btrfs_tree_read_lock(b); so it''s the extent_buffer_get() call, the offset of ->refs field is 120 = 0x78, matches "NULL pointer dereference at 0000000000000078". So root->commit_root is NULL. If we look what happens in disk-io.c::close_ctree: 3476 fs_info->closing = 2; 3477 smp_mb(); ... 3486 free_root_pointers(fs_info, 1); 3487 3488 btrfs_free_block_groups(fs_info); 3489 3490 del_fs_roots(fs_info); 3491 3492 iput(fs_info->btree_inode); 3493 3494 btrfs_stop_workers(&fs_info->generic_worker); ... 3507 btrfs_stop_workers(&fs_info->caching_workers); Line 3477 assures that caching thread will exit the main loop when it sees fs_closing > 1, but if it''s past this check yet has to do some work, and close_tree() calls free_root_pointers(), bang. As we can see the caching thread goes down later. What I see here is a use-after-free that has been undetected so far and only exposed by patch 7. This diff looks like the fix (on top of patch 8): --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3486,14 +3486,14 @@ int close_ctree(struct btrfs_root *root) percpu_counter_sum(&fs_info->delalloc_bytes)); } - free_root_pointers(fs_info, 1); - btrfs_free_block_groups(fs_info); del_fs_roots(fs_info); iput(fs_info->btree_inode); + free_root_pointers(fs_info, 1); + btrfs_stop_workers(&fs_info->generic_worker); btrfs_stop_workers(&fs_info->fixup_workers); btrfs_stop_workers(&fs_info->delalloc_workers); --- david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Liu Bo
2013-Apr-27 02:50 UTC
Re: [bug] 3.9-rc7+next: NULL deref in btrfs_next_old_leaf/btrfs_search_slot
On Thu, Apr 25, 2013 at 07:18:59PM +0200, David Sterba wrote:> On Thu, Apr 18, 2013 at 04:42:18PM +0200, David Sterba wrote: > > [64394.422743] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 > > [64394.426716] RIP: 0010:[<ffffffffa0010e0f>] [<ffffffffa0010e0f>] btrfs_search_slot+0xbf/0x9e0 [btrfs] > > [64394.426716] [<ffffffffa0014977>] btrfs_next_old_leaf+0x247/0x4e0 [btrfs] > > [64394.426716] [<ffffffffa0014c20>] btrfs_next_leaf+0x10/0x20 [btrfs] > > The bisection set was reduced to these patches (on top of > cmason/for-linus 4bc4bee45): > > 1 Btrfs: cleanup unused function Liu Bo > 2 btrfs: enhance superblock checks David Sterba > 3 Btrfs: add some free space cache tests Josef Bacik > 4 btrfs: merge save_error_info helpers into one David Sterba > 5 btrfs: clean up transaction abort messages David Sterba > 6 Btrfs: cleanup unused arguments of btrfs_csum_data Liu Bo > 7 Btrfs: use helper to cleanup tree roots Liu Bo > 8 Btrfs: add a incompatible format change for smaller metadata extent ref Josef Bacik > > The bisecting process points to patch 8, ie the ''first bad'', but the > reproducer has proved to be unreliable and I think it''s very sensitive > to scheduling timing. > > Reproducer is to simply run 273 in a loop on a single/single filesystem, > the null deref happens during umount. The tricky part is that it does not > happen every time, I must not touch the testbox nor let any process run > except ''dstat''. It usually crashed on first or second test run, but I''ve > left it up to 10 to be sure. > > I don''t think it''s caused by the skinny metadata, because it does not > touch the affected functions, but somehow helps to make it visible. > > Although the test never crashed when patch 7 > Btrfs: use helper to cleanup tree roots > http://git.kernel.org/cgit/linux/kernel/git/josef/btrfs-next.git/commit/?id=3fa215686c574d26f43f1bcf6c9f69658e02908f > was on top (I once left it running overnight, all fine), that''s my main > suspect: > > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3483,20 +3483,7 @@ int close_ctree(struct btrfs_root *root) > percpu_counter_sum(&fs_info->delalloc_bytes)); > } > > - free_extent_buffer(fs_info->extent_root->node); > - free_extent_buffer(fs_info->extent_root->commit_root); > - free_extent_buffer(fs_info->tree_root->node); > - free_extent_buffer(fs_info->tree_root->commit_root); > - free_extent_buffer(fs_info->chunk_root->node); > - free_extent_buffer(fs_info->chunk_root->commit_root); > - free_extent_buffer(fs_info->dev_root->node); > - free_extent_buffer(fs_info->dev_root->commit_root); > - free_extent_buffer(fs_info->csum_root->node); > - free_extent_buffer(fs_info->csum_root->commit_root); > - if (fs_info->quota_root) { > - free_extent_buffer(fs_info->quota_root->node); > - free_extent_buffer(fs_info->quota_root->commit_root); > - } > + free_root_pointers(fs_info, 1); > > and if you look what free_root_pointers does, sets all the pointers to NULL. > > The crash site: > > gdb) l *(btrfs_search_slot+0xb6) > 0x11076 is in btrfs_search_slot (/home/dsterba/linux-2.6/arch/x86/include/asm/atomic.h:95). > 90 * > 91 * Atomically increments @v by 1. > 92 */ > 93 static inline void atomic_inc(atomic_t *v) > 94 { > 95 asm volatile(LOCK_PREFIX "incl %0" > 96 : "+m" (v->counter)); > 97 } > > (gdb) l *(btrfs_search_slot+0xb5) > 0x11075 is in btrfs_search_slot (fs/btrfs/ctree.c:2513). > 2508 if (p->search_commit_root) { > 2509 /* > 2510 * the commit roots are read only > 2511 * so we always do read locks > 2512 */ > 2513 b = root->commit_root; > 2514 extent_buffer_get(b); > 2515 level = btrfs_header_level(b); > 2516 if (!p->skip_locking) > 2517 btrfs_tree_read_lock(b); > > so it''s the extent_buffer_get() call, the offset of ->refs field is 120 = 0x78, > matches "NULL pointer dereference at 0000000000000078". So root->commit_root > is NULL. > > If we look what happens in disk-io.c::close_ctree: > > 3476 fs_info->closing = 2; > 3477 smp_mb(); > ... > 3486 free_root_pointers(fs_info, 1); > 3487 > 3488 btrfs_free_block_groups(fs_info); > 3489 > 3490 del_fs_roots(fs_info); > 3491 > 3492 iput(fs_info->btree_inode); > 3493 > 3494 btrfs_stop_workers(&fs_info->generic_worker); > ... > 3507 btrfs_stop_workers(&fs_info->caching_workers); > > Line 3477 assures that caching thread will exit the main loop when it sees > fs_closing > 1, but if it''s past this check yet has to do some work, and > close_tree() calls free_root_pointers(), bang. As we can see the caching thread > goes down later. > > What I see here is a use-after-free that has been undetected so far and only > exposed by patch 7. > > This diff looks like the fix (on top of patch 8): > > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3486,14 +3486,14 @@ int close_ctree(struct btrfs_root *root) > percpu_counter_sum(&fs_info->delalloc_bytes)); > } > > - free_root_pointers(fs_info, 1); > - > btrfs_free_block_groups(fs_info); > > del_fs_roots(fs_info); > > iput(fs_info->btree_inode); > > + free_root_pointers(fs_info, 1); > + > btrfs_stop_workers(&fs_info->generic_worker); > btrfs_stop_workers(&fs_info->fixup_workers); > btrfs_stop_workers(&fs_info->delalloc_workers); > --- > > > davidThanks for tracking it Dave, I''m trying to reproduce it here. thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Sterba
2013-Apr-29 22:17 UTC
Re: [bug] 3.9-rc7+next: NULL deref in btrfs_next_old_leaf/btrfs_search_slot
On Sat, Apr 27, 2013 at 10:50:43AM +0800, Liu Bo wrote:> > @@ -3486,14 +3486,14 @@ int close_ctree(struct btrfs_root *root) > > percpu_counter_sum(&fs_info->delalloc_bytes)); > > } > > > > - free_root_pointers(fs_info, 1); > > - > > btrfs_free_block_groups(fs_info); > > > > del_fs_roots(fs_info); > > > > iput(fs_info->btree_inode); > > > > + free_root_pointers(fs_info, 1); > > + > > btrfs_stop_workers(&fs_info->generic_worker); > > btrfs_stop_workers(&fs_info->fixup_workers); > > btrfs_stop_workers(&fs_info->delalloc_workers); > > --- > > Thanks for tracking it Dave, I''m trying to reproduce it here.Adding sleeps between free_root_pointers and stopping workers should leave enough space for the threads to work and touch the freed data. The proposed fix is not entirely correct, it just reduces the race window. Freeing roots must come after stopping the workers, same holds for any of the other cleanup functions that may access released resources. But for example the free block groups need the caching thread alive so it cannot be trivially fixed by moving everything after stop_workers and must be decided case by case. There is another instance of a similar sequence that is mis-ordered and has to be fixed as well. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html