Miao Xie
2011-Dec-06 05:35 UTC
[PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
The reason the deadlock is that: Task Btrfs-cleaner umount() down_write(&s->s_umount) close_ctree() wait for the end of btrfs-cleaner start_transaction reserve space shrink_delalloc() writeback_inodes_sb_nr_if_idle() down_read(&sb->s_umount) So, the deadlock has happened. We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs is on remounting or umounting. At this time, we will use the sync function of btrfs to sync all the delalloc file. It may waste lots of time, but as a corner case, we needn''t care. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> --- fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++- 1 files changed, 22 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 813c6bb..86c295d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3372,6 +3372,27 @@ out: return ret; } +void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root, + unsigned long nr_pages) +{ + struct super_block *sb = root->fs_info->sb; + + if (writeback_in_progress(sb->s_bdi)) + return; + + /* + * If we can not get s_umount, it means the fs is on remounting or + * umounting. At this time, we just sync all the delalloc file. + */ + if (down_read_trylock(&sb->s_umount)) { + writeback_inodes_sb_nr(sb, nr_pages); + up_read(&sb->s_umount); + } else { + btrfs_start_delalloc_inodes(root, 0); + btrfs_wait_ordered_extents(root, 0, 0); + } +} + /* * shrink metadata reservation for delalloc */ @@ -3416,7 +3437,7 @@ static int shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, smp_mb(); nr_pages = min_t(unsigned long, nr_pages, root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT); - writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages); + btrfs_writeback_inodes_sb_nr(root, nr_pages); spin_lock(&space_info->lock); if (reserved > space_info->bytes_may_use) -- 1.7.6.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro
2011-Dec-06 05:49 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
> +void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root, > + unsigned long nr_pages) > +{ > + struct super_block *sb = root->fs_info->sb; > + > + if (writeback_in_progress(sb->s_bdi)) > + return; > + > + /* > + * If we can not get s_umount, it means the fs is on remounting or > + * umounting. At this time, we just sync all the delalloc file. > + */ > + if (down_read_trylock(&sb->s_umount)) { > + writeback_inodes_sb_nr(sb, nr_pages); > + up_read(&sb->s_umount); > + } else { > + btrfs_start_delalloc_inodes(root, 0); > + btrfs_wait_ordered_extents(root, 0, 0); > + } > +}If that can race with umount, what prevents sb, its ->s_bdi et.al. being freed under you? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-06 06:52 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On tue, 6 Dec 2011 05:49:06 +0000, Al Viro wrote:> >> +void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root, >> + unsigned long nr_pages) >> +{ >> + struct super_block *sb = root->fs_info->sb; >> + >> + if (writeback_in_progress(sb->s_bdi)) >> + return; >> + >> + /* >> + * If we can not get s_umount, it means the fs is on remounting or >> + * umounting. At this time, we just sync all the delalloc file. >> + */ >> + if (down_read_trylock(&sb->s_umount)) { >> + writeback_inodes_sb_nr(sb, nr_pages); >> + up_read(&sb->s_umount); >> + } else { >> + btrfs_start_delalloc_inodes(root, 0); >> + btrfs_wait_ordered_extents(root, 0, 0); >> + } >> +} > > If that can race with umount, what prevents sb, its ->s_bdi et.al. being freed > under you?In fact, it happened. See the following mail. http://marc.info/?l=linux-btrfs&m=131495252725296&w=2 The above function is called when some one want to modify the meta-data. Btrfs will wait until all the meta-data operations end, and then free ->s_bdi and the other objects. So we needn''t worry about those objects. (Maybe I misunderstood what you said. If yes, I''m sorry) Thanks Miao> -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Dec-06 09:59 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On Tue, Dec 06, 2011 at 01:35:47PM +0800, Miao Xie wrote:> The reason the deadlock is that: > Task Btrfs-cleaner > umount() > down_write(&s->s_umount) > close_ctree() > wait for the end of > btrfs-cleaner > start_transaction > reserve space > shrink_delalloc() > writeback_inodes_sb_nr_if_idle() > down_read(&sb->s_umount) > So, the deadlock has happened. > > We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs > is on remounting or umounting. At this time, we will use the sync function of > btrfs to sync all the delalloc file. It may waste lots of time, but as a > corner case, we needn''t care.I can''t see why you need the writeout when the trylocks fails. Umount needs to take care of writing out all pending file data anyway, so doing it from the cleaner thread in addition doesn''t sound like it would help. So I''d rather suggest to move the trylock into writeback_inodes_sb_nr_if_idle, and while you''re at it also rewrite writeback_inodes_sb_if_idle that ext4 is using to sit on top of writeback_inodes_sb_nr_if_idle to share that logic, and drop the unused writeback_inodes_sb_nr export. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-06 11:06 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On tue, 6 Dec 2011 04:59:23 -0500, Christoph Hellwig wrote:> On Tue, Dec 06, 2011 at 01:35:47PM +0800, Miao Xie wrote: >> The reason the deadlock is that: >> Task Btrfs-cleaner >> umount() >> down_write(&s->s_umount) >> close_ctree() >> wait for the end of >> btrfs-cleaner >> start_transaction >> reserve space >> shrink_delalloc() >> writeback_inodes_sb_nr_if_idle() >> down_read(&sb->s_umount) >> So, the deadlock has happened. >> >> We fix it by trying to lock >s_umount, if _trylock_ fails, it means the fs >> is on remounting or umounting. At this time, we will use the sync function of >> btrfs to sync all the delalloc file. It may waste lots of time, but as a >> corner case, we needn''t care. > > I can''t see why you need the writeout when the trylocks fails. Umount > needs to take care of writing out all pending file data anyway, so doing > it from the cleaner thread in addition doesn''t sound like it would help.umount invokes sync_fs() and write out all the dirty file data. For the other file systems, its OK because the file system does not introduce dirty pages by itself. But btrfs is different. Its automatic defragment will make lots of dirty pages after sync_fs() and reserve lots of meta-data space for those pages. And then the cleaner thread may find there is no enough space to reserve, it must sync the dirty file data and release the reserved space which is for the dirty file data.> > So I''d rather suggest to move the trylock into > writeback_inodes_sb_nr_if_idle, and while you''re at it also rewrite > writeback_inodes_sb_if_idle that ext4 is using to sit on top of > writeback_inodes_sb_nr_if_idle to share that logic, and drop the > unused writeback_inodes_sb_nr export.It is a good way. I will try it. (Someone is using this way to fix the other deadlock between freeze and writeback) Thanks Miao> -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2011-Dec-06 11:23 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote:> > I can''t see why you need the writeout when the trylocks fails. Umount > > needs to take care of writing out all pending file data anyway, so doing > > it from the cleaner thread in addition doesn''t sound like it would help. > > umount invokes sync_fs() and write out all the dirty file data. For the > other file systems, its OK because the file system does not introduce dirty pages > by itself. But btrfs is different. Its automatic defragment will make lots of dirty > pages after sync_fs() and reserve lots of meta-data space for those pages. > And then the cleaner thread may find there is no enough space to reserve, it must > sync the dirty file data and release the reserved space which is for the dirty > file data.I think the safest way to fix is is to write out all dirty data again once the cleaner thread has been safely stopped. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason
2011-Dec-06 21:36 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote:> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote: > > > I can''t see why you need the writeout when the trylocks fails. Umount > > > needs to take care of writing out all pending file data anyway, so doing > > > it from the cleaner thread in addition doesn''t sound like it would help. > > > > umount invokes sync_fs() and write out all the dirty file data. For the > > other file systems, its OK because the file system does not introduce dirty pages > > by itself. But btrfs is different. Its automatic defragment will make lots of dirty > > pages after sync_fs() and reserve lots of meta-data space for those pages. > > And then the cleaner thread may find there is no enough space to reserve, it must > > sync the dirty file data and release the reserved space which is for the dirty > > file data. > > I think the safest way to fix is is to write out all dirty data again > once the cleaner thread has been safely stopped. >Said another way we want to stop the autodefrag code before the unmount is ready to continue. We also want to stop balancing, scrub etc. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-07 02:31 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote:> On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote: >> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote: >>>> I can''t see why you need the writeout when the trylocks fails. Umount >>>> needs to take care of writing out all pending file data anyway, so doing >>>> it from the cleaner thread in addition doesn''t sound like it would help. >>> >>> umount invokes sync_fs() and write out all the dirty file data. For the >>> other file systems, its OK because the file system does not introduce dirty pages >>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty >>> pages after sync_fs() and reserve lots of meta-data space for those pages. >>> And then the cleaner thread may find there is no enough space to reserve, it must >>> sync the dirty file data and release the reserved space which is for the dirty >>> file data. >> >> I think the safest way to fix is is to write out all dirty data again >> once the cleaner thread has been safely stopped. >> > > Said another way we want to stop the autodefrag code before the unmount > is ready to continue. We also want to stop balancing, scrub etc.But there is no good interface to do it before umount gets s_umount lock. I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush can help us to fix the bug perfectly. Thanks Miao> > -chris > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ilya Dryomov
2011-Dec-07 11:11 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On Wed, Dec 07, 2011 at 10:31:35AM +0800, Miao Xie wrote:> On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote: > > On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote: > >> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote: > >>>> I can''t see why you need the writeout when the trylocks fails. Umount > >>>> needs to take care of writing out all pending file data anyway, so doing > >>>> it from the cleaner thread in addition doesn''t sound like it would help. > >>> > >>> umount invokes sync_fs() and write out all the dirty file data. For the > >>> other file systems, its OK because the file system does not introduce dirty pages > >>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty > >>> pages after sync_fs() and reserve lots of meta-data space for those pages. > >>> And then the cleaner thread may find there is no enough space to reserve, it must > >>> sync the dirty file data and release the reserved space which is for the dirty > >>> file data. > >> > >> I think the safest way to fix is is to write out all dirty data again > >> once the cleaner thread has been safely stopped. > >> > > > > Said another way we want to stop the autodefrag code before the unmount > > is ready to continue. We also want to stop balancing, scrub etc. > > But there is no good interface to do it before umount gets s_umount lock. > I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush > can help us to fix the bug perfectly.But it won''t fix the umount while balancing family of deadlocks (they are really of the same nature, vfs grabs s_umount mutex and we need it to proceed). (Balance cancelling code is part of restriper patches, it''s just a hook in close_ctree() that waits until we are done relocating a chunk - very similar to cleaner wait) One example would be that balancing code while dirtying pages calls balance_dirty_pages_ratelimited() for each dirtied page, as it should. And if balance_dirty_pages() then decides to initiate writeback we are stuck schedule()ing forever, because writeback can''t proceed w/o read-taking s_umount mutex which is fully held by vfs - it just skips the relocation inode. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2011-Dec-08 03:46 UTC
Re: [PATCH 2/2] Btrfs: fix deadlock on sb->s_umount when doing umount
On Wed, 7 Dec 2011 13:11:58 +0200, Ilya Dryomov wrote:> On Wed, Dec 07, 2011 at 10:31:35AM +0800, Miao Xie wrote: >> On tue, 6 Dec 2011 16:36:11 -0500, Chris Mason wrote: >>> On Tue, Dec 06, 2011 at 06:23:23AM -0500, Christoph Hellwig wrote: >>>> On Tue, Dec 06, 2011 at 07:06:40PM +0800, Miao Xie wrote: >>>>>> I can''t see why you need the writeout when the trylocks fails. Umount >>>>>> needs to take care of writing out all pending file data anyway, so doing >>>>>> it from the cleaner thread in addition doesn''t sound like it would help. >>>>> >>>>> umount invokes sync_fs() and write out all the dirty file data. For the >>>>> other file systems, its OK because the file system does not introduce dirty pages >>>>> by itself. But btrfs is different. Its automatic defragment will make lots of dirty >>>>> pages after sync_fs() and reserve lots of meta-data space for those pages. >>>>> And then the cleaner thread may find there is no enough space to reserve, it must >>>>> sync the dirty file data and release the reserved space which is for the dirty >>>>> file data. >>>> >>>> I think the safest way to fix is is to write out all dirty data again >>>> once the cleaner thread has been safely stopped. >>>> >>> >>> Said another way we want to stop the autodefrag code before the unmount >>> is ready to continue. We also want to stop balancing, scrub etc. >> >> But there is no good interface to do it before umount gets s_umount lock. >> I think trylock(in writeback_inodes_sb_nr_if_idle()) + dirty data flush >> can help us to fix the bug perfectly. > > But it won''t fix the umount while balancing family of deadlocks (they > are really of the same nature, vfs grabs s_umount mutex and we need it > to proceed). (Balance cancelling code is part of restriper patches, > it''s just a hook in close_ctree() that waits until we are done > relocating a chunk - very similar to cleaner wait)I will change the logic, we will add a while loop to check if something is running(xxx_running is not zero), if yes, invoke btrfs_sync_fs() to do dirty page flush.> > One example would be that balancing code while dirtying pages calls > balance_dirty_pages_ratelimited() for each dirtied page, as it should. > And if balance_dirty_pages() then decides to initiate writeback we are > stuck schedule()ing forever, because writeback can''t proceed w/o > read-taking s_umount mutex which is fully held by vfs - it just skips > the relocation inode.AFAIK, balance_dirty_pages() just wake up the flush thread, and the flush thread also doesn''t grab s_umount. So we needn''t worry about it.I think. Thanks Miao -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html