Josef Bacik
2011-Dec-13 17:55 UTC
[PATCH] Btrfs: don''t panic if orphan item already exists
I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in a loop. This is because we will add an orphan item, do the truncate, the truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re left with an orphan item still in the fs. Then we come back later to do another truncate and it blows up because we already have an orphan item. This is ok so just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> --- fs/btrfs/inode.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ae5b354a..76e6c24 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2047,7 +2047,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode) /* insert an orphan item to track this unlinked/truncated file */ if (insert >= 1) { ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode)); - BUG_ON(ret); + BUG_ON(ret && ret != -EEXIST); } /* insert an orphan item to track subvolume contains orphan files */ -- 1.7.5.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Phillip Susi
2011-Dec-13 19:03 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On 12/13/2011 12:55 PM, Josef Bacik wrote:> I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in > a loop. This is because we will add an orphan item, do the truncate, the > truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re > left with an orphan item still in the fs. Then we come back later to do another > truncate and it blows up because we already have an orphan item. This is ok so > just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks,Wouldn''t it be better to fix the underlying bug, and remove the orphan item when the truncate fails? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-13 19:09 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote:> On 12/13/2011 12:55 PM, Josef Bacik wrote: > >I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in > >a loop. This is because we will add an orphan item, do the truncate, the > >truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re > >left with an orphan item still in the fs. Then we come back later to do another > >truncate and it blows up because we already have an orphan item. This is ok so > >just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, > > Wouldn''t it be better to fix the underlying bug, and remove the > orphan item when the truncate fails? >No because we still need the thing to be cleaned up. If the truncate fails we need to leave the orphan item there so the next time the fs is mounted the inode is cleaned up, that''s not a bug. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/14/2011 03:09 AM, Josef Bacik wrote:> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote: >> On 12/13/2011 12:55 PM, Josef Bacik wrote: >>> I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in >>> a loop. This is because we will add an orphan item, do the truncate, the >>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re >>> left with an orphan item still in the fs. Then we come back later to do another >>> truncate and it blows up because we already have an orphan item. This is ok so >>> just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, >> >> Wouldn''t it be better to fix the underlying bug, and remove the >> orphan item when the truncate fails? >> > > No because we still need the thing to be cleaned up. If the truncate fails we > need to leave the orphan item there so the next time the fs is mounted the inode > is cleaned up, that''s not a bug. Thanks, > > JosefHi, Josef I''m digging this issue too, actually xfstests 083 also can trigger this BUG_ON while run loops. and I agreed with Phillip''s opinion that we''d better "fix the underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call btrfs_orphan_del to clean the memory orphan list so that the next orphan item insert will be skipped. But, there is still a trouble. The user will get the fail result while the orphan inode still left in the fs. It''s strange. So in the end of the btrfs_truncate, if the btrfs_update_inode is successed, I will delete the orphan inode anyway. what do you think of this idea? I''ll make a patch if you do not have any comment. BTW, 083 will always make the btrfs_truncate fail with btrfs_truncate_inode_items for ENOSPC when the disk is almost full. thanks wubo> -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-14 09:46 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote:> On 12/14/2011 03:09 AM, Josef Bacik wrote: >> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote: >>> On 12/13/2011 12:55 PM, Josef Bacik wrote: >>>> I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in >>>> a loop. This is because we will add an orphan item, do the truncate, the >>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re >>>> left with an orphan item still in the fs. Then we come back later to do another >>>> truncate and it blows up because we already have an orphan item. This is ok so >>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, >>> >>> Wouldn''t it be better to fix the underlying bug, and remove the >>> orphan item when the truncate fails? >>> >> >> No because we still need the thing to be cleaned up. If the truncate fails we >> need to leave the orphan item there so the next time the fs is mounted the inode >> is cleaned up, that''s not a bug. Thanks, >> >> Josef > > Hi, Josef > > I''m digging this issue too, actually xfstests 083 also can trigger this BUG_ON > while run loops. and I agreed with Phillip''s opinion that we''d better "fix the > underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call > btrfs_orphan_del to clean the memory orphan list so that the next orphan item > insert will be skipped. > > But, there is still a trouble. The user will get the fail result while the orphan > inode still left in the fs. It''s strange. So in the end of the btrfs_truncate, > if the btrfs_update_inode is successed, I will delete the orphan inode anyway.Another reason for that we should fix the underlying bug: File0 | i_size v +-----------------------------------------------+ | | +-----------------------------------------------+ The user truncated File0, but failed when doing truncation: File0 | i_size | real size v v +---------------------------------------+ | | +---------------------------------------+ The user did pre-allocation for File0 (keep size): File0 | i_size | pre-allocated extent | v v v +---------------------------------------+-----------------------+ | | | +---------------------------------------+-----------------------+ And then, the user umounted and mount the file system again. Because we left the orphan item in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not the expected result for users. Thanks Miao> > what do you think of this idea? I''ll make a patch if you do not have any comment. > > BTW, 083 will always make the btrfs_truncate fail with btrfs_truncate_inode_items > for ENOSPC when the disk is almost full. > > thanks > wubo > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 14:58 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 05:46:37PM +0800, Miao Xie wrote:> On wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote: > > On 12/14/2011 03:09 AM, Josef Bacik wrote: > >> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote: > >>> On 12/13/2011 12:55 PM, Josef Bacik wrote: > >>>> I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in > >>>> a loop. This is because we will add an orphan item, do the truncate, the > >>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re > >>>> left with an orphan item still in the fs. Then we come back later to do another > >>>> truncate and it blows up because we already have an orphan item. This is ok so > >>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, > >>> > >>> Wouldn''t it be better to fix the underlying bug, and remove the > >>> orphan item when the truncate fails? > >>> > >> > >> No because we still need the thing to be cleaned up. If the truncate fails we > >> need to leave the orphan item there so the next time the fs is mounted the inode > >> is cleaned up, that''s not a bug. Thanks, > >> > >> Josef > > > > Hi, Josef > > > > I''m digging this issue too, actually xfstests 083 also can trigger this BUG_ON > > while run loops. and I agreed with Phillip''s opinion that we''d better "fix the > > underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call > > btrfs_orphan_del to clean the memory orphan list so that the next orphan item > > insert will be skipped. > >There is no "underlying bug", there is a shitty situation, the shitty situation where we dont have enough room to free up space. There is no getting around this, there is no fix for this, we are fucked and there is nothing we can do about it. We have to clean it up from the in memory list because otherwise we will complain when we go to destroy the inode, this is to catch the case that we _didn''t_ do the cleanup properly, so that has to stay. There is nothing wrong with leaving the orphan item there. The next time we mount we will truncate to the i_size, so if later on down the road we were able to free up space and we could write to the inode again then i_size will grow and everything will be a-ok. There is little dange in leaving the orphan item there. The only danger is as Miao describes below...> > But, there is still a trouble. The user will get the fail result while the orphan > > inode still left in the fs. It''s strange. So in the end of the btrfs_truncate, > > if the btrfs_update_inode is successed, I will delete the orphan inode anyway. > > Another reason for that we should fix the underlying bug: > File0 | i_size > v > +-----------------------------------------------+ > | | > +-----------------------------------------------+ > > The user truncated File0, but failed when doing truncation: > File0 | i_size | real size > v v > +---------------------------------------+ > | | > +---------------------------------------+ > > The user did pre-allocation for File0 (keep size): > File0 | i_size | pre-allocated extent | > v v v > +---------------------------------------+-----------------------+ > | | | > +---------------------------------------+-----------------------+ > > And then, the user umounted and mount the file system again. Because we left the orphan item > in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not > the expected result for users. >This is the only case where we will truncate space that we think should be there. I''m ok with it, because there is no actual data here, we''re just trimming pre-allocated space. The only other option is to keep track of inodes that have an orphan item and if we try to do anything to the inode (fallocate, write etc) we try and delete the orphan item first. That is a lot of call paths to have if (!list_empty(BTRFS_I(inode)->i_orphan)) delete_orphan_item() not to mention the case that the inode has left cache and we''ve re-read it back in, so everytime we read back in an inode we''ll have to search for its corresponding orphan item to see if we didn''t clean it up at some point in the past, which is going to mean an extra search on every inode lookup which is going to suck, just to deal with the case that somebody used fallocate with FALLOC_FL_KEEP_SIZE. So my patch is the lesser of two evils, either we leave the damn thing in there and deal with the remote chance that somebody preallocated space on the end of the file after failing to do a truncate, or we put a bunch of performance sucking checks and an extra btree search on every inode lookup to deal with this remote chance. My choice is my patch. Josef "really needs a vacation" Bacik -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Phillip Susi
2011-Dec-14 15:14 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On 12/14/2011 9:58 AM, Josef Bacik wrote:> There is no "underlying bug", there is a shitty situation, the shitty situationMaybe my assumptions are wrong somewhere then. You add the orphan item to make sure that the truncate will be finalized even if the system crashes before the transaction commits right? So if truncate() fails with -ENOSPC, then you shouldn''t be trying to finalize the truncate on the next mount, should you ( because the call did not succeed )? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 15:27 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:> On 12/14/2011 9:58 AM, Josef Bacik wrote: > >There is no "underlying bug", there is a shitty situation, the shitty situation > > Maybe my assumptions are wrong somewhere then. You add the orphan > item to make sure that the truncate will be finalized even if the > system crashes before the transaction commits right? So if > truncate() fails with -ENOSPC, then you shouldn''t be trying to > finalize the truncate on the next mount, should you ( because the > call did not succeed )? >Except consider the case that the program was written intelligently and checks for errors on truncate. So he writes 100G, truncates to 50M, and the truncate fails and he closes the file and exits. Then somewhere down the road the inode is evicted from cache and we reboot the box. Next time the box comes up it only looks like a 50M file, except we''re still taking up 100G of disk space, and we have no idea there''s space there and it''s still taken up in the allocator so it will just look like we''ve lost ~100G of space. This is why it''s left there, so everything can be cleaned up. Course now that I''ve had the chance to think and calm down a little bit there may be another option. We could probably keep track of how much we''ve deleted in btrfs_truncate_inode_items, and then if we fail to complete the truncate we just set the i_size to whatever it is when we stopped truncating, update the inode and remove the orphan item. I''ll look into this and see how tricky it is to do. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 15:34 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote:> On 12/14/2011 9:58 AM, Josef Bacik wrote: > >There is no "underlying bug", there is a shitty situation, the shitty situation > > Maybe my assumptions are wrong somewhere then. You add the orphan > item to make sure that the truncate will be finalized even if the > system crashes before the transaction commits right? So if > truncate() fails with -ENOSPC, then you shouldn''t be trying to > finalize the truncate on the next mount, should you ( because the > call did not succeed )? >Yes because otherwise we''ll leak space since the i_size has been updated already. The other option is to make btrfs_truncate_inode_items update i_size as we truncate so if it fails we can delete the orphan item and then update the inode with the new i_size, that way we don''t leave the orphan item on disk and we don''t leak space. I''ll see how doable this is. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 15:35 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote:> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote: > > On 12/14/2011 9:58 AM, Josef Bacik wrote: > > >There is no "underlying bug", there is a shitty situation, the shitty situation > > > > Maybe my assumptions are wrong somewhere then. You add the orphan > > item to make sure that the truncate will be finalized even if the > > system crashes before the transaction commits right? So if > > truncate() fails with -ENOSPC, then you shouldn''t be trying to > > finalize the truncate on the next mount, should you ( because the > > call did not succeed )? > > > > Yes because otherwise we''ll leak space since the i_size has been updated > already. The other option is to make btrfs_truncate_inode_items update i_size > as we truncate so if it fails we can delete the orphan item and then update the > inode with the new i_size, that way we don''t leave the orphan item on disk and > we don''t leak space. I''ll see how doable this is. Thanks, >Sorry for the double reply, my email is being wonky and it didn''t look like my original response went out. Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Phillip Susi
2011-Dec-14 15:41 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On 12/14/2011 10:27 AM, Josef Bacik wrote:> Except consider the case that the program was written intelligently and checks > for errors on truncate. So he writes 100G, truncates to 50M, and the truncate > fails and he closes the file and exits. Then somewhere down the road the inode > is evicted from cache and we reboot the box. Next time the box comes up it only > looks like a 50M file, except we''re still taking up 100G of disk space, and we > have no idea there''s space there and it''s still taken up in the allocator so it > will just look like we''ve lost ~100G of space. This is why it''s left there, so > everything can be cleaned up.I''m a little confused here. Is there a commit somewhere in there? How can the 100g allocation be committed, but not the i_size of the inode? Shouldn''t either both or neither be committed? If both are committed, and then the truncate fails, then I would expect the system to come back up after a crash with the file still at 100g. That is, as long as the orphan item is not left in place after the failed truncate. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 15:46 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 10:41:05AM -0500, Phillip Susi wrote:> On 12/14/2011 10:27 AM, Josef Bacik wrote: > >Except consider the case that the program was written intelligently and checks > >for errors on truncate. So he writes 100G, truncates to 50M, and the truncate > >fails and he closes the file and exits. Then somewhere down the road the inode > >is evicted from cache and we reboot the box. Next time the box comes up it only > >looks like a 50M file, except we''re still taking up 100G of disk space, and we > >have no idea there''s space there and it''s still taken up in the allocator so it > >will just look like we''ve lost ~100G of space. This is why it''s left there, so > >everything can be cleaned up. > > I''m a little confused here. Is there a commit somewhere in there? > How can the 100g allocation be committed, but not the i_size of the > inode? Shouldn''t either both or neither be committed? If both are > committed, and then the truncate fails, then I would expect the > system to come back up after a crash with the file still at 100g. > That is, as long as the orphan item is not left in place after the > failed truncate. >100g allocation succeeds unmount mount truncate to 50m i_size is set to 50m truncate fails orphan item left unmount mount file looks like its only 50m but still has 100g of extents taking up space orphan cleanup happens and the inode is truncated and the extra space is cleaned up Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason
2011-Dec-14 16:45 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote:> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote: > > On 12/14/2011 9:58 AM, Josef Bacik wrote: > > >There is no "underlying bug", there is a shitty situation, the shitty situation > > > > Maybe my assumptions are wrong somewhere then. You add the orphan > > item to make sure that the truncate will be finalized even if the > > system crashes before the transaction commits right? So if > > truncate() fails with -ENOSPC, then you shouldn''t be trying to > > finalize the truncate on the next mount, should you ( because the > > call did not succeed )? > > > > Yes because otherwise we''ll leak space since the i_size has been updated > already. The other option is to make btrfs_truncate_inode_items update i_size > as we truncate so if it fails we can delete the orphan item and then update the > inode with the new i_size, that way we don''t leave the orphan item on disk and > we don''t leak space. I''ll see how doable this is. Thanks,If we fail with enospc though we''re very likely to not be able to update the inode item. It may work, but the failure case will still be there where we can''t make i_size match the file contents. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Dec-14 16:47 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On Wed, Dec 14, 2011 at 11:45:24AM -0500, Chris Mason wrote:> On Wed, Dec 14, 2011 at 10:34:45AM -0500, Josef Bacik wrote: > > On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote: > > > On 12/14/2011 9:58 AM, Josef Bacik wrote: > > > >There is no "underlying bug", there is a shitty situation, the shitty situation > > > > > > Maybe my assumptions are wrong somewhere then. You add the orphan > > > item to make sure that the truncate will be finalized even if the > > > system crashes before the transaction commits right? So if > > > truncate() fails with -ENOSPC, then you shouldn''t be trying to > > > finalize the truncate on the next mount, should you ( because the > > > call did not succeed )? > > > > > > > Yes because otherwise we''ll leak space since the i_size has been updated > > already. The other option is to make btrfs_truncate_inode_items update i_size > > as we truncate so if it fails we can delete the orphan item and then update the > > inode with the new i_size, that way we don''t leave the orphan item on disk and > > we don''t leak space. I''ll see how doable this is. Thanks, > > If we fail with enospc though we''re very likely to not be able to update > the inode item. It may work, but the failure case will still be there > where we can''t make i_size match the file contents. >Yeah I was thinking we just grab a reservation to update the inode just in case early on, so if that fails we just update the in memory i_size and we''re good to go, and then if the transaction fails during the actual truncate we can still do a btrfs_join_transaction() and use our saved reservation. Course we''re still screwed if our failure was ENOMEM... Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Phillip Susi
2011-Dec-14 19:59 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On 12/14/2011 10:46 AM, Josef Bacik wrote:> file looks like its only 50m but still has 100g of extents taking up space > orphan cleanup happens and the inode is truncated and the extra space is cleaned > upYes, but isn''t the only reason that the i_size change actually hit the disk is because of the orphan item? So with no orphan item, the i_size remains at 100g. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-15 01:42 UTC
Re: [PATCH] Btrfs: don''t panic if orphan item already exists
On wed, 14 Dec 2011 10:34:45 -0500, Josef Bacik wrote:> On Wed, Dec 14, 2011 at 10:14:13AM -0500, Phillip Susi wrote: >> On 12/14/2011 9:58 AM, Josef Bacik wrote: >>> There is no "underlying bug", there is a shitty situation, the shitty situation >> >> Maybe my assumptions are wrong somewhere then. You add the orphan >> item to make sure that the truncate will be finalized even if the >> system crashes before the transaction commits right? So if >> truncate() fails with -ENOSPC, then you shouldn''t be trying to >> finalize the truncate on the next mount, should you ( because the >> call did not succeed )? >> > > Yes because otherwise we''ll leak space since the i_size has been updated > already. The other option is to make btrfs_truncate_inode_items update i_size > as we truncate so if it fails we can delete the orphan item and then update the > inode with the new i_size, that way we don''t leave the orphan item on disk and > we don''t leak space. I''ll see how doable this is. Thanks,Agree. I have made a patch based on this idea, which is under test. Thanks Miao> > Josef > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/14/2011 10:58 PM, Josef Bacik wrote:> On Wed, Dec 14, 2011 at 05:46:37PM +0800, Miao Xie wrote: >> On wed, 14 Dec 2011 10:07:39 +0800, WuBo wrote: >>> On 12/14/2011 03:09 AM, Josef Bacik wrote: >>>> On Tue, Dec 13, 2011 at 02:03:14PM -0500, Phillip Susi wrote: >>>>> On 12/13/2011 12:55 PM, Josef Bacik wrote: >>>>>> I''ve been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in >>>>>> a loop. This is because we will add an orphan item, do the truncate, the >>>>>> truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we''re >>>>>> left with an orphan item still in the fs. Then we come back later to do another >>>>>> truncate and it blows up because we already have an orphan item. This is ok so >>>>>> just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, >>>>> >>>>> Wouldn''t it be better to fix the underlying bug, and remove the >>>>> orphan item when the truncate fails? >>>>> >>>> >>>> No because we still need the thing to be cleaned up. If the truncate fails we >>>> need to leave the orphan item there so the next time the fs is mounted the inode >>>> is cleaned up, that''s not a bug. Thanks, >>>> >>>> Josef >>> >>> Hi, Josef >>> >>> I''m digging this issue too, actually xfstests 083 also can trigger this BUG_ON >>> while run loops. and I agreed with Phillip''s opinion that we''d better "fix the >>> underlying bug". If the btrfs_truncate faild with ENOSPC, we should not even call >>> btrfs_orphan_del to clean the memory orphan list so that the next orphan item >>> insert will be skipped. >>> > > There is no "underlying bug", there is a shitty situation, the shitty situation > where we dont have enough room to free up space. There is no getting around > this, there is no fix for this, we are fucked and there is nothing we can do > about it. We have to clean it up from the in memory list because otherwise we > will complain when we go to destroy the inode, this is to catch the case that we > _didn''t_ do the cleanup properly, so that has to stay.Oh I made a mistake, yes the orphan inode in memory list should be deleted. thanks for your explain.> > There is nothing wrong with leaving the orphan item there. The next time we > mount we will truncate to the i_size, so if later on down the road we were able > to free up space and we could write to the inode again then i_size will grow and > everything will be a-ok. There is little dange in leaving the orphan item > there. The only danger is as Miao describes below...If we left the orphan inode, I also worried another situation: The disk is almost full, and there are a lot of inode truncate failing, and then left a lot of orphan inodes in fs. In the next mount, btrfs should clean all of these orphan inodes even should do truncate again, which also need reserve space. but the disk is full and mount will be failed. So I think we''d better remove the orphan inode at the end of truncate. "update the new i_size and remove the orphan item" is a good idea. thanks wubo> >>> But, there is still a trouble. The user will get the fail result while the orphan >>> inode still left in the fs. It''s strange. So in the end of the btrfs_truncate, >>> if the btrfs_update_inode is successed, I will delete the orphan inode anyway. >> >> Another reason for that we should fix the underlying bug: >> File0 | i_size >> v >> +-----------------------------------------------+ >> | | >> +-----------------------------------------------+ >> >> The user truncated File0, but failed when doing truncation: >> File0 | i_size | real size >> v v >> +---------------------------------------+ >> | | >> +---------------------------------------+ >> >> The user did pre-allocation for File0 (keep size): >> File0 | i_size | pre-allocated extent | >> v v v >> +---------------------------------------+-----------------------+ >> | | | >> +---------------------------------------+-----------------------+ >> >> And then, the user umounted and mount the file system again. Because we left the orphan item >> in the file system, btrfs will drop the pre-allocated extent when mounting it. It is not >> the expected result for users. >> > > This is the only case where we will truncate space that we think should be > there. I''m ok with it, because there is no actual data here, we''re just > trimming pre-allocated space. > > The only other option is to keep track of inodes that have an orphan item and if > we try to do anything to the inode (fallocate, write etc) we try and delete the > orphan item first. That is a lot of call paths to have > > if (!list_empty(BTRFS_I(inode)->i_orphan)) > delete_orphan_item() > > not to mention the case that the inode has left cache and we''ve re-read it back > in, so everytime we read back in an inode we''ll have to search for its > corresponding orphan item to see if we didn''t clean it up at some point in the > past, which is going to mean an extra search on every inode lookup which is > going to suck, just to deal with the case that somebody used fallocate with > FALLOC_FL_KEEP_SIZE. > > So my patch is the lesser of two evils, either we leave the damn thing in there > and deal with the remote chance that somebody preallocated space on the end of > the file after failing to do a truncate, or we put a bunch of performance > sucking checks and an extra btree search on every inode lookup to deal with this > remote chance. My choice is my patch. > > Josef "really needs a vacation" Bacik >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html