Alex Lyakas
2013-Jan-28 16:11 UTC
backref for an extent not found in send_root (!backref_ctx->found_itself)
Hi Jan,

I have a set of unit tests (part of a larger system) for the send/receive functionality, with which I am able to hit this error:

  Jan 28 18:01:00 687-dev kernel: [16968.451358] btrfs: ERROR did not
  find backref in send_root. inode=259, offset=139264, disk_byte=4263936
  found extent=4263936

As the code states, this could indicate a bug in backref walking. It reproduces with the "for-linus" branch.

Typically this happens when a snapshot is deleted, a new snapshot with the same name is created immediately, and then "btrfs send" is issued without a parent (i.e., a full send) on this snapshot.

To debug this further, we can do one of two things:
# I can apply patches/debug prints and reproduce
# I can work to isolate the unit test into a bash script and send you a script that reproduces

Please let me know.

Thanks,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
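The failing sequence described above — delete a snapshot, immediately recreate one under the same name, then full-send it — can be sketched roughly as follows. This is a hypothetical dry-run sketch, not Alex's actual test: the mount points and subvolume names are invented, and `run` only prints the commands instead of executing them.

```shell
#!/bin/bash
# Dry-run sketch of the reported scenario (hypothetical paths/names).
# "run" prints each command instead of executing it, so this reads as
# a recipe without needing a scratch filesystem or root privileges.
run() { printf '+ %s\n' "$*"; }

SRC=/mnt/src    # source btrfs mount (placeholder)
DST=/mnt/dst    # destination btrfs mount (placeholder)

# Delete the old snapshot; the cleaner thread drops its tree blocks
# asynchronously, so backrefs to the dead root linger for a while.
run btrfs subvolume delete "$SRC/snap"

# Immediately recreate a read-only snapshot under the same name.
run btrfs subvolume snapshot -r "$SRC/vol" "$SRC/snap"

# A full send (no -p parent) then races against the cleaner's
# backref removal, which is where the error is hit.
run btrfs send "$SRC/snap"
```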
Jan Schmidt
2013-Jan-29 09:07 UTC
Re: backref for an extent not found in send_root (!backref_ctx->found_itself)
Hi Alex,

On Mon, January 28, 2013 at 17:11 (+0100), Alex Lyakas wrote:
> Hi Jan,
> I have a set of unit tests (part of a larger system) for the
> send-receive functionality, with which I am able to hit this error:
> [...]
> To debug this further, we can do one of two things:
> # I can apply patches/debug prints & reproduce
> # I can work to isolate the unit test into a bash script and send you
> a script that reproduces

I'd prefer #2 of the above. You can also send me the unit tests you've got if I can get them running without multiple days of setup.

I'm guessing that this is more likely going to end up in send.c than in backref.c; perhaps Alexander would like to trace this one down. But anyway, send me a reproducer (in private, if you don't want to publish it) and we'll see what's going on.

Thanks,
-Jan
Alex Lyakas
2013-Jan-31 19:24 UTC
Re: backref for an extent not found in send_root (!backref_ctx->found_itself)
Hi Jan,

attached are bash scripts to repro the issue. Some instructions on how to run them:
- create 2 btrfs filesystems with "mkfs.btrfs /dev/sdXXX". I don't think that size matters.
- mount them in /mnt/src and /mnt/dst
- mount options: noatime,nodatasum,nodatacow,nospace_cache
- put the 3 scripts into one directory and cd to it
- run btrfs_init_tests.sh (it sets up a small file tree for tests)
- run btrfs_test_first_ref_jan.sh

After about 20-30 seconds, it hits the error I mentioned and the script stops. It happens on the "for-linus" branch, top commit 1eafa6c73791e4f312324ddad9cbcaf6a1b6052b.

I suspect the issue might be that the test schedules a lot of subvolumes for deletion, and once the cleaner thread kicks in and also starts doing backref work, the problem happens.

Another small note: there is an issue in the btrfs-progs subvolume listing code (also used by send). When it finds a ROOT_ITEM in the root tree that is not linked with a ROOT_REF/ROOT_BACKREF (i.e., one scheduled for deletion), it gets confused and exits. Miao sent a patch to fix it here:
http://www.spinics.net/lists/linux-btrfs/msg19767.html
I don't think it got merged into progs yet (progs are really behind:()

If you want a quick fix, add code like this to the beginning of __list_subvol_fill_paths (but Miao sent a better patch):

	/*
	 * due to change in __list_subvol_search(), root_lookup
	 * might contain subvolumes with ref_tree==0 (in deletion).
	 */
again:
	n = rb_first(&root_lookup->root);
	while (n) {
		struct root_info *entry = rb_entry(n, struct root_info, rb_node);

		if (entry->ref_tree == 0) {
			fprintf(stderr, "__list_subvol_fill_paths: drop root_id=%llu, because it has no ref_tree\n",
				entry->root_id);
			rb_erase(n, &root_lookup->root);
			free(entry);
			goto again;
		}
		n = rb_next(n);
	}

Otherwise, "btrfs send" might fail, but this is not the failure we are looking for:)

Thanks,
Alex.
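For reference, the setup steps listed above amount to something like the following dry-run sketch (the device names /dev/sdb and /dev/sdc are placeholders not taken from the mail, and `run` prints the commands rather than executing them):

```shell
#!/bin/bash
# Dry-run of the setup instructions; device paths are placeholders.
run() { printf '+ %s\n' "$*"; }

OPTS=noatime,nodatasum,nodatacow,nospace_cache

run mkfs.btrfs /dev/sdb
run mkfs.btrfs /dev/sdc
run mount -o "$OPTS" /dev/sdb /mnt/src
run mount -o "$OPTS" /dev/sdc /mnt/dst

# From the directory holding the three scripts:
run ./btrfs_init_tests.sh          # sets up a small file tree
run ./btrfs_test_first_ref_jan.sh  # hits the error after ~20-30 seconds
```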
On Tue, Jan 29, 2013 at 11:07 AM, Jan Schmidt <list.btrfs@jan-o-sch.net> wrote:
> Hi Alex,
>
> On Mon, January 28, 2013 at 17:11 (+0100), Alex Lyakas wrote:
>> [...]
>
> I'd prefer #2 of the above. You can also send me the unit tests you've got if I
> can get them running without multiple days of setup.
>
> I'm guessing that this is more likely going to end up in send.c than in
> backref.c, perhaps Alexander would like to trace this one down. But anyway, send
> me a reproducer (in private, if you don't want to publish it) and we'll see
> what's going on.
>
> Thanks,
> -Jan
Alex Lyakas
2013-Feb-19 07:58 UTC
Re: backref for an extent not found in send_root (!backref_ctx->found_itself)
Hi Jan,

any luck reproducing this?

Alex.

On Thu, Jan 31, 2013 at 9:24 PM, Alex Lyakas <alex.btrfs@zadarastorage.com> wrote:
> Hi Jan,
> attached are bash scripts to repro the issue.
> [...]
Jan Schmidt
2013-Feb-19 08:11 UTC
Re: backref for an extent not found in send_root (!backref_ctx->found_itself)
Hi Alex,

On Tue, February 19, 2013 at 08:58 (+0100), Alex Lyakas wrote:
> any luck reproducing this?

Just yesterday, yes. I was busy doing non-btrfs things. I've got an idea about the problem, but unfortunately I'm again busy doing other things right now. That will hopefully be the last distraction before fixing this.

Thanks again for your good reproducer!
-Jan
Jan Schmidt
2013-Feb-21 15:35 UTC
[PATCH] Btrfs: fix backref walking race with tree deletions
When a subvolume is removed, we remove the root item from the root tree,
while the tree blocks and backrefs remain for a while. When backref walking
comes across one of those orphan tree blocks, it can find a backref for a
no longer existing root. This is all good; we only must tolerate
__resolve_indirect_ref returning an error and continue with the good refs
found.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
---
 fs/btrfs/backref.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 04edf69..bd605c8 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -352,11 +352,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 		err = __resolve_indirect_ref(fs_info, search_commit_root,
 					     time_seq, ref, parents,
 					     extent_item_pos);
-		if (err) {
-			if (ret == 0)
-				ret = err;
+		if (err)
 			continue;
-		}
 
 		/* we put the first parent into the ref at hand */
 		ULIST_ITER_INIT(&uiter);
-- 
1.7.1
Alex Lyakas
2013-Mar-12 11:47 UTC
Re: [PATCH] Btrfs: fix backref walking race with tree deletions
Hi Jan,

I have been testing your patch, and the reproducer now definitely passes. However, my system tests still hit the issue, but unfortunately I am unable to isolate it into a bash script. All bash scripts work alright:)

I have added some prints to attempt to debug this, and here is what I see. The extent in question is:

	item 16 key (4378624 EXTENT_ITEM 8192) itemoff 3163 itemsize 66
		extent refs 2 gen 6 flags 1
		extent data backref root 261 objectid 276 offset 0 count 1
		shared data backref parent 4214784 count 1

It is shared by three subvolumes: 257, 261 and 262, where 261 and 262 share the same leaf in the file tree; 257 has its own leaf. The inode in all three subvolumes is 276, and this is the only EXTENT_DATA of this inode.

file tree key (257 ROOT_ITEM 6)
...
leaf 4214784 items 38 free space 461 generation 6 owner 256
...
	item 29 key (276 INODE_ITEM 0) itemoff 1897 itemsize 160
		inode generation 6 transid 6 size 5777 block group 0 mode 100644 links 1
	item 30 key (276 INODE_REF 271) itemoff 1871 itemsize 26
		inode ref index 6 namelen 16 name: file_with_xattr1
	item 31 key (276 XATTR_ITEM 3895912174) itemoff 1815 itemsize 56
		location key (0 UNKNOWN.0 0) type 8
		namelen 16 datalen 10 name: user.ztest_attr1
		data attr1_XXX
	item 32 key (276 XATTR_ITEM 4217771290) itemoff 1759 itemsize 56
		location key (0 UNKNOWN.0 0) type 8
		namelen 16 datalen 10 name: user.ztest_attr2
		data attr2_YYY
	item 33 key (276 EXTENT_DATA 0) itemoff 1706 itemsize 53
		extent data disk byte 4378624 nr 8192
		extent data offset 0 nr 8192 ram 8192
		extent compression 0

file tree key (261 ROOT_ITEM 10)
...
leaf 4431872 items 38 free space 441 generation 11 owner 261
...
	item 29 key (276 INODE_ITEM 0) itemoff 1877 itemsize 160
		inode generation 6 transid 6 size 5777 block group 0 mode 100644 links 1
	item 30 key (276 INODE_REF 271) itemoff 1851 itemsize 26
		inode ref index 6 namelen 16 name: file_with_xattr1
	item 31 key (276 XATTR_ITEM 3895912174) itemoff 1795 itemsize 56
		location key (0 UNKNOWN.0 0) type 8
		namelen 16 datalen 10 name: user.ztest_attr1
		data attr1_XXX
	item 32 key (276 XATTR_ITEM 4217771290) itemoff 1739 itemsize 56
		location key (0 UNKNOWN.0 0) type 8
		namelen 16 datalen 10 name: user.ztest_attr2
		data attr2_YYY
	item 33 key (276 EXTENT_DATA 0) itemoff 1686 itemsize 53
		extent data disk byte 4378624 nr 8192
		extent data offset 0 nr 8192 ram 8192
		extent compression 0

file tree key (262 ROOT_ITEM 11)
...
leaf 4431872 items 38 free space 441 generation 11 owner 261
...
	(same leaf 4431872 as root 261, identical items 29-33)

During the failure, roots 261 and 262 are found, but not root 257, which is the root being sent.
The send has no parent in this case (a full send):

btrfs [find_extent_clone] Search [rt=257 ino=276 off=0 len=8192] [found extent=4378624 extent_item_pos=0]
btrfs [iterate_extent_inodes] resolving for extent 4378624 pos=0
btrfs [iterate_extent_inodes] extent 4378624 pos=0 found 1 leafs
btrfs [iterate_extent_inodes] extent 4378624 pos=0 root 262 references leaf 4431872
btrfs [iterate_extent_inodes] extent 4378624 pos=0 root 261 references leaf 4431872
btrfs: ERROR did not find backref in send_root. inode=276, offset=0, disk_byte=4378624 found extent=4378624

So btrfs_find_all_leafs() finds one leaf, 4431872, which is shared between 261 and 262, but it does not find leaf 4214784, which is the leaf of subvolume 257.

The tree dump I showed you was taken after the test failed, and at this point, if I try "btrfs send", everything is resolved alright:

btrfs [find_extent_clone] Search [rt=257 ino=277 off=0 len=8192] [found extent=4386816 extent_item_pos=0]
btrfs [iterate_extent_inodes] resolving for extent 4386816 pos=0
btrfs [iterate_extent_inodes] extent 4386816 pos=0 found 2 leafs
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 262 references leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 261 references leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 257 references leaf 4214784

Can you advise on how to debug this further?

Thanks,
Alex.

On Thu, Feb 21, 2013 at 5:35 PM, Jan Schmidt <list.btrfs@jan-o-sch.net> wrote:
> When a subvolume is removed, we remove the root item from the root tree,
> while the tree blocks and backrefs remain for a while. When backref walking
> comes across one of those orphan tree blocks, it can find a backref for a
> no longer existing root. This is all good, we only must tolerate
> __resolve_indirect_ref returning an error and continue with the good refs
> found.
>
> [...]
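Tree dumps like the ones Alex pasted can be produced offline with btrfs-debug-tree from btrfs-progs. A dry-run sketch (the device path and dump file name are placeholders, and `run` only prints the commands):

```shell
#!/bin/bash
# Dry-run sketch: dumping metadata trees to inspect extent backrefs.
# /dev/sdb stands in for the unmounted source device.
run() { printf '+ %s\n' "$*"; }

# Dump all trees of the filesystem (best done on a quiesced device).
run btrfs-debug-tree /dev/sdb

# The extent item and its data backrefs can then be located in the
# saved dump, e.g. by searching for the extent's key:
run grep -n '4378624 EXTENT_ITEM' dump.txt
```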
Jan Schmidt
2013-Mar-19 16:32 UTC
Re: [PATCH] Btrfs: fix backref walking race with tree deletions
Hi Alex,

On Tue, March 12, 2013 at 12:47 (+0100), Alex Lyakas wrote:
> Hi Jan,
> I have been testing your patch, and definitely the reproducer now passes.
> However, my system tests still hit the issue, but unfortunately I am
> unable to isolate it into a bash script. All bash scripts work
> alright:)

Okay, I think I've got something here: I enhanced xfstest 276 (you just got cc-ed on that patch mail) and it reproducibly fails here. Can you please check with the new xfstest 276 and your debug printks whether it triggers the same error you're seeing with your local test setup?

Unfortunately, it sometimes needs a second or even third iteration to fail. I just use

	# while [ $? -eq 0 ]; do ./check 276; done

for testing, and it generally fails on the very first iteration here.

One way or another, I'm going to fix whatever the reason for the failing test is. It would make me a tiny piece happier if you told me that you're hitting the same bug - but any answer is appreciated :-)

Thanks!
-Jan
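Jan's one-liner relies on `$?` happening to be 0 before the first iteration; an equivalent loop that avoids this is `while ./check 276; do :; done`. A self-contained sketch, with a stub standing in for the real xfstests `./check` (which needs a configured test environment and is not runnable here):

```shell
#!/bin/bash
# Loop a test until it fails, without depending on the initial $?.
# check_stub stands in for xfstests' "./check 276"; it is rigged to
# fail on the third run so the sketch terminates deterministically.
n=0
check_stub() { n=$((n + 1)); [ "$n" -lt 3 ]; }

while check_stub; do :; done
echo "test failed on iteration $n"
# prints: test failed on iteration 3
```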