I saw a lock order warning on ext4 trigger. This should solve it.
Raciness shouldn''t matter much, because writeback can stop just
after we make the test and return anyway (so the API is racy anyway).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Index: linux-2.6/fs/fs-writeback.c
==================================================================---
linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
@@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * May be called inside i_lock. May not start writeback if locks cannot
+ * be acquired.
  */
 int writeback_inodes_sb_if_idle(struct super_block *sb)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb(sb);
-		up_read(&sb->s_umount);
-		return 1;
-	} else
-		return 0;
+		if (down_read_trylock(&sb->s_umount)) {
+			writeback_inodes_sb(sb);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
 
@@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * May be called inside i_lock. May not start writeback if locks cannot
+ * be acquired.
  */
 int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
 				   unsigned long nr)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb_nr(sb, nr);
-		up_read(&sb->s_umount);
-		return 1;
-	} else
-		return 0;
+		if (down_read_trylock(&sb->s_umount)) {
+			writeback_inodes_sb_nr(sb, nr);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
 
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Tue 16-11-10 22:00:58, Nick Piggin wrote:> I saw a lock order warning on ext4 trigger. This should solve it. > Raciness shouldn''t matter much, because writeback can stop just > after we make the test and return anyway (so the API is racy anyway).Hmm, for now the fix is OK. Ultimately, we probably want to call writeback_inodes_sb() directly from all the callers. They all just want to reduce uncertainty of delayed allocation reservations by writing delayed data and actually wait for some of the writeback to happen before they retry again the allocation. Although the callers generally cannot get umount_sem because they hold other locks, they have the superblock well pinned so grabbing umount_sem makes sense mostly to make assertions happy. But as I''m thinking about it, trylock *is* maybe the right answer to this anyway... So Acked-by: Jan Kara <jack@suse.cz> Honza> > Signed-off-by: Nick Piggin <npiggin@kernel.dk> > > Index: linux-2.6/fs/fs-writeback.c > ==================================================================> --- linux-2.6.orig/fs/fs-writeback.c 2010-11-16 21:44:32.000000000 +1100 > +++ linux-2.6/fs/fs-writeback.c 2010-11-16 21:49:37.000000000 +1100 > @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb); > * > * Invoke writeback_inodes_sb if no writeback is currently underway. > * Returns 1 if writeback was started, 0 if not. > + * > + * May be called inside i_lock. May not start writeback if locks cannot > + * be acquired. > */ > int writeback_inodes_sb_if_idle(struct super_block *sb) > { > if (!writeback_in_progress(sb->s_bdi)) { > - down_read(&sb->s_umount); > - writeback_inodes_sb(sb); > - up_read(&sb->s_umount); > - return 1; > - } else > - return 0; > + if (down_read_trylock(&sb->s_umount)) { > + writeback_inodes_sb(sb); > + up_read(&sb->s_umount); > + return 1; > + } > + } > + return 0; > } > EXPORT_SYMBOL(writeback_inodes_sb_if_idle); > > @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl > * > * Invoke writeback_inodes_sb if no writeback is currently underway. > * Returns 1 if writeback was started, 0 if not. > + * > + * May be called inside i_lock. May not start writeback if locks cannot > + * be acquired. > */ > int writeback_inodes_sb_nr_if_idle(struct super_block *sb, > unsigned long nr) > { > if (!writeback_in_progress(sb->s_bdi)) { > - down_read(&sb->s_umount); > - writeback_inodes_sb_nr(sb, nr); > - up_read(&sb->s_umount); > - return 1; > - } else > - return 0; > + if (down_read_trylock(&sb->s_umount)) { > + writeback_inodes_sb_nr(sb, nr); > + up_read(&sb->s_umount); > + return 1; > + } > + } > + return 0; > } > EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 16 Nov 2010 22:00:58 +1100 Nick Piggin <npiggin@kernel.dk> wrote:> I saw a lock order warning on ext4 trigger. This should solve it.Send us the trace, please. The code comment implies that someone is calling down_read() under i_lock? That would be bad, and I''d expect it to have produced a might_sleep() warning, not a lockdep trace. And I don''t see how we can call writeback_inodes_sb() under i_lock anyway, so I don''t really have a clue what''s going on here!> Raciness shouldn''t matter much, because writeback can stop just > after we make the test and return anyway (so the API is racy anyway). > > Signed-off-by: Nick Piggin <npiggin@kernel.dk> > > Index: linux-2.6/fs/fs-writeback.c > ==================================================================> --- linux-2.6.orig/fs/fs-writeback.c 2010-11-16 21:44:32.000000000 +1100 > +++ linux-2.6/fs/fs-writeback.c 2010-11-16 21:49:37.000000000 +1100 > @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb); > * > * Invoke writeback_inodes_sb if no writeback is currently underway. > * Returns 1 if writeback was started, 0 if not. > + * > + * May be called inside i_lock. May not start writeback if locks cannot > + * be acquired. > */ > int writeback_inodes_sb_if_idle(struct super_block *sb) > { > if (!writeback_in_progress(sb->s_bdi)) { > - down_read(&sb->s_umount); > - writeback_inodes_sb(sb); > - up_read(&sb->s_umount); > - return 1; > - } else > - return 0; > + if (down_read_trylock(&sb->s_umount)) { > + writeback_inodes_sb(sb); > + up_read(&sb->s_umount); > + return 1; > + } > + } > + return 0;And it''s pretty generous to describe a s/down_read/down_read_trylock/ as a "fix". Terms like "bandaid" and "workaround" come to mind. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 16, 2010 at 12:32:52PM -0800, Andrew Morton wrote:> On Tue, 16 Nov 2010 22:00:58 +1100 > Nick Piggin <npiggin@kernel.dk> wrote: > > > I saw a lock order warning on ext4 trigger. This should solve it. > > Send us the trace, please.I lost it, sorry.> The code comment implies that someone is calling down_read() under > i_lock? That would be bad, and I''d expect it to have produced a > might_sleep() warning, not a lockdep trace.Sorry not i_lock, i_mutex. writeback_inodes_sb_if_idle is called by ext4''s write_begin function which is called with i_mutex held from generic_file_buffered_write, I believe is the trace.> And I don''t see how we can call writeback_inodes_sb() under i_lock > anyway, so I don''t really have a clue what''s going on here! > > > Raciness shouldn''t matter much, because writeback can stop just > > after we make the test and return anyway (so the API is racy anyway).> > > > Signed-off-by: Nick Piggin <npiggin@kernel.dk> > > > > Index: linux-2.6/fs/fs-writeback.c > > ==================================================================> > --- linux-2.6.orig/fs/fs-writeback.c 2010-11-16 21:44:32.000000000 +1100 > > +++ linux-2.6/fs/fs-writeback.c 2010-11-16 21:49:37.000000000 +1100 > > @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb); > > * > > * Invoke writeback_inodes_sb if no writeback is currently underway. > > * Returns 1 if writeback was started, 0 if not. > > + * > > + * May be called inside i_lock. May not start writeback if locks cannot > > + * be acquired. > > */ > > int writeback_inodes_sb_if_idle(struct super_block *sb) > > { > > if (!writeback_in_progress(sb->s_bdi)) { > > - down_read(&sb->s_umount); > > - writeback_inodes_sb(sb); > > - up_read(&sb->s_umount); > > - return 1; > > - } else > > - return 0; > > + if (down_read_trylock(&sb->s_umount)) { > > + writeback_inodes_sb(sb); > > + up_read(&sb->s_umount); > > + return 1; > > + } > > + } > > + return 0; > > And it''s pretty generous to describe a s/down_read/down_read_trylock/ > as a "fix". Terms like "bandaid" and "workaround" come to mind.As much as the writeback_inodes_sb_if_idle API itself is a bandaid, I suppose. (it doesn''t do any rate limiting of the dirtier, it''s racy, it doesn''t specify how much to writeback, it''s synchronous, etc). Anyway, I don''t know, there''s not much other option for 2.6.37 AFAIKS. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/16/10 7:01 AM, Jan Kara wrote:> On Tue 16-11-10 22:00:58, Nick Piggin wrote: >> I saw a lock order warning on ext4 trigger. This should solve it. >> Raciness shouldn''t matter much, because writeback can stop just >> after we make the test and return anyway (so the API is racy anyway). > Hmm, for now the fix is OK. Ultimately, we probably want to call > writeback_inodes_sb() directly from all the callers. They all just want to > reduce uncertainty of delayed allocation reservations by writing delayed > data and actually wait for some of the writeback to happen before they > retry again the allocation.For ext4, at least, it''s just best-effort. We''re not actually out of space yet when this starts pushing. But it helps us avoid enospc: commit c8afb44682fcef6273e8b8eb19fab13ddd05b386 Author: Eric Sandeen <sandeen@redhat.com> Date: Wed Dec 23 07:58:12 2009 -0500 ext4: flush delalloc blocks when space is low Creating many small files in rapid succession on a small filesystem can lead to spurious ENOSPC; on a 104MB filesystem: for i in `seq 1 22500`; do echo -n > $SCRATCH_MNT/$i echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i done leads to ENOSPC even though after a sync, 40% of the fs is free again. <snip> We don''t need it to be synchronous - in fact I didn''t think it was ... ext4 should probably use btrfs''s new variant and just get rid of the one I put in, for a very large system/filesystem it could end up doing a rather insane amount of IO when the fs starts to get full. as for the locking problems ... sorry about that! -Eric> Although the callers generally cannot get umount_sem because they hold > other locks, they have the superblock well pinned so grabbing umount_sem > makes sense mostly to make assertions happy. But as I''m thinking about it, > trylock *is* maybe the right answer to this anyway... > > So > Acked-by: Jan Kara <jack@suse.cz> > > Honza >> >> Signed-off-by: Nick Piggin <npiggin@kernel.dk> >> >> Index: linux-2.6/fs/fs-writeback.c >> ==================================================================>> --- linux-2.6.orig/fs/fs-writeback.c 2010-11-16 21:44:32.000000000 +1100 >> +++ linux-2.6/fs/fs-writeback.c 2010-11-16 21:49:37.000000000 +1100 >> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb); >> * >> * Invoke writeback_inodes_sb if no writeback is currently underway. >> * Returns 1 if writeback was started, 0 if not. >> + * >> + * May be called inside i_lock. May not start writeback if locks cannot >> + * be acquired. >> */ >> int writeback_inodes_sb_if_idle(struct super_block *sb) >> { >> if (!writeback_in_progress(sb->s_bdi)) { >> - down_read(&sb->s_umount); >> - writeback_inodes_sb(sb); >> - up_read(&sb->s_umount); >> - return 1; >> - } else >> - return 0; >> + if (down_read_trylock(&sb->s_umount)) { >> + writeback_inodes_sb(sb); >> + up_read(&sb->s_umount); >> + return 1; >> + } >> + } >> + return 0; >> } >> EXPORT_SYMBOL(writeback_inodes_sb_if_idle); >> >> @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl >> * >> * Invoke writeback_inodes_sb if no writeback is currently underway. >> * Returns 1 if writeback was started, 0 if not. >> + * >> + * May be called inside i_lock. May not start writeback if locks cannot >> + * be acquired. >> */ >> int writeback_inodes_sb_nr_if_idle(struct super_block *sb, >> unsigned long nr) >> { >> if (!writeback_in_progress(sb->s_bdi)) { >> - down_read(&sb->s_umount); >> - writeback_inodes_sb_nr(sb, nr); >> - up_read(&sb->s_umount); >> - return 1; >> - } else >> - return 0; >> + if (down_read_trylock(&sb->s_umount)) { >> + writeback_inodes_sb_nr(sb, nr); >> + up_read(&sb->s_umount); >> + return 1; >> + } >> + } >> + return 0; >> } >> EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle); >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 16, 2010 at 10:30:37PM -0600, Eric Sandeen wrote:> On 11/16/10 7:01 AM, Jan Kara wrote: > > On Tue 16-11-10 22:00:58, Nick Piggin wrote: > >> I saw a lock order warning on ext4 trigger. This should solve it. > >> Raciness shouldn''t matter much, because writeback can stop just > >> after we make the test and return anyway (so the API is racy anyway). > > Hmm, for now the fix is OK. Ultimately, we probably want to call > > writeback_inodes_sb() directly from all the callers. They all just want to > > reduce uncertainty of delayed allocation reservations by writing delayed > > data and actually wait for some of the writeback to happen before they > > retry again the allocation. > > For ext4, at least, it''s just best-effort. We''re not actually out of > space yet when this starts pushing. But it helps us avoid enospc: > > commit c8afb44682fcef6273e8b8eb19fab13ddd05b386 > Author: Eric Sandeen <sandeen@redhat.com> > Date: Wed Dec 23 07:58:12 2009 -0500 > > ext4: flush delalloc blocks when space is low > > Creating many small files in rapid succession on a small > filesystem can lead to spurious ENOSPC; on a 104MB filesystem: > > for i in `seq 1 22500`; do > echo -n > $SCRATCH_MNT/$i > echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i > done > > leads to ENOSPC even though after a sync, 40% of the fs is free > again. > > <snip> > > We don''t need it to be synchronous - in fact I didn''t think it was ...By synchronous, I just mean that the caller is the one who pushes the data into writeout. It _may_ be better if it was done by background writeback, with a feedback loop to throttle the caller (preferably placed outside any locks it is holding). To be pragmatic, I think the thing is fine to actually solve the problem at hand. I was just saying that it has a tiny little hackish feeling anyway, so a trylock will be right at home there :)> ext4 should probably use btrfs''s new variant and just get rid of the > one I put in, for a very large system/filesystem it could end up doing > a rather insane amount of IO when the fs starts to get full. > > as for the locking problems ... sorry about that!That''s no problem. So is that an ack? :) -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/16/10 10:38 PM, Nick Piggin wrote:> On Tue, Nov 16, 2010 at 10:30:37PM -0600, Eric Sandeen wrote: >> On 11/16/10 7:01 AM, Jan Kara wrote: >>> On Tue 16-11-10 22:00:58, Nick Piggin wrote: >>>> I saw a lock order warning on ext4 trigger. This should solve it. >>>> Raciness shouldn''t matter much, because writeback can stop just >>>> after we make the test and return anyway (so the API is racy anyway). >>> Hmm, for now the fix is OK. Ultimately, we probably want to call >>> writeback_inodes_sb() directly from all the callers. They all just want to >>> reduce uncertainty of delayed allocation reservations by writing delayed >>> data and actually wait for some of the writeback to happen before they >>> retry again the allocation. >> >> For ext4, at least, it''s just best-effort. We''re not actually out of >> space yet when this starts pushing. But it helps us avoid enospc: >> >> commit c8afb44682fcef6273e8b8eb19fab13ddd05b386 >> Author: Eric Sandeen <sandeen@redhat.com> >> Date: Wed Dec 23 07:58:12 2009 -0500 >> >> ext4: flush delalloc blocks when space is low >> >> Creating many small files in rapid succession on a small >> filesystem can lead to spurious ENOSPC; on a 104MB filesystem: >> >> for i in `seq 1 22500`; do >> echo -n > $SCRATCH_MNT/$i >> echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i >> done >> >> leads to ENOSPC even though after a sync, 40% of the fs is free >> again. >> >> <snip> >> >> We don''t need it to be synchronous - in fact I didn''t think it was ... > > By synchronous, I just mean that the caller is the one who pushes > the data into writeout. It _may_ be better if it was done by background > writeback, with a feedback loop to throttle the caller (preferably > placed outside any locks it is holding). > > To be pragmatic, I think the thing is fine to actually solve the > problem at hand. I was just saying that it has a tiny little hackish > feeling anyway, so a trylock will be right at home there :) > > >> ext4 should probably use btrfs''s new variant and just get rid of the >> one I put in, for a very large system/filesystem it could end up doing >> a rather insane amount of IO when the fs starts to get full. >> >> as for the locking problems ... sorry about that! > > That''s no problem. So is that an ack? :) >I''d like to test it with the original case it was supposed to solve; will do that tomorrow. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:> On 11/16/10 10:38 PM, Nick Piggin wrote: > >> as for the locking problems ... sorry about that! > > > > That''s no problem. So is that an ack? :) > > > > I''d like to test it with the original case it was supposed to solve; will > do that tomorrow.OK, but it shouldn''t make much difference, unless there is a lot of strange activity happening on the sb (like mount / umount / remount / freeze / etc). -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > > On 11/16/10 10:38 PM, Nick Piggin wrote: > > >> as for the locking problems ... sorry about that! > > > > > > That''s no problem. So is that an ack? :) > > > > > > > I''d like to test it with the original case it was supposed to solve; will > > do that tomorrow. > > OK, but it shouldn''t make much difference, unless there is a lot of > strange activity happening on the sb (like mount / umount / remount / > freeze / etc).This makes sense to me as well. Acked-by: "Theodore Ts''o" <tytso@mit.edu> So how do we want to send this patch to Linus? It''s a writeback change, so through some mm tree? Or it lives in fs/fs-writeback.c (which I always thought was weird; why is it not in mm/?), so maybe through the VFS tree, even though I don''t think Al would really care about this patch. Or I can push it to Linus through the ext4 tree.... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/17/10 12:10 AM, Nick Piggin wrote:> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: >> On 11/16/10 10:38 PM, Nick Piggin wrote: >>>> as for the locking problems ... sorry about that! >>> >>> That''s no problem. So is that an ack? :) >>> >> >> I''d like to test it with the original case it was supposed to solve; will >> do that tomorrow. > > OK, but it shouldn''t make much difference, unless there is a lot of > strange activity happening on the sb (like mount / umount / remount / > freeze / etc). >Ok, good point. And since I failed in my promise anyway, I won''t hold you up; Acked-by: Eric Sandeen <sandeen@redhat.com> Thanks, -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts''o" <tytso@mit.edu> wrote:> On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote: > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > > > On 11/16/10 10:38 PM, Nick Piggin wrote: > > > >> as for the locking problems ... sorry about that! > > > > > > > > That''s no problem. So is that an ack? :) > > > > > > > > > > I''d like to test it with the original case it was supposed to solve; will > > > do that tomorrow. > > > > OK, but it shouldn''t make much difference, unless there is a lot of > > strange activity happening on the sb (like mount / umount / remount / > > freeze / etc). > > This makes sense to me as well. > > Acked-by: "Theodore Ts''o" <tytso@mit.edu> > > So how do we want to send this patch to Linus? It''s a writeback > change, so through some mm tree?It''s in my todo pile. Even though the patch sucks, but not as much as its changelog does. Am not particularly happy merging an alleged bugfix where the bug is, and I quote, "I saw a lock order warning on ext4 trigger". I mean, wtf? How is anyone supposed to review the code based on that?? Or to understand it a year from now? When I get to it I''ll troll this email thread and might be able to kludge together a description which might be able to fool people into thinking it makes sense. And someone at least needs to fix the comment: s/i_lock/i_mutex/. I guess that would be me.> Or it lives in fs/fs-writeback.c > (which I always thought was weird; why is it not in mm/?),The idea was that the writeback walk goes superblocks->inodes->address_spaces->pages->bios->drivers, so that code needs to be spread over fs->mm->block->drivers. The dividing line between fs and block is approximately at the "address_spaces" mark. Yes, that line is a bit smeary. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:> On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts''o" <tytso@mit.edu> wrote: > > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote: > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > > > > On 11/16/10 10:38 PM, Nick Piggin wrote: > > > > >> as for the locking problems ... sorry about that! > > > > > > > > > > That''s no problem. So is that an ack? :) > > > > > > > > > > > > > I''d like to test it with the original case it was supposed to solve; will > > > > do that tomorrow. > > > > > > OK, but it shouldn''t make much difference, unless there is a lot of > > > strange activity happening on the sb (like mount / umount / remount / > > > freeze / etc). > > > > This makes sense to me as well. > > > > Acked-by: "Theodore Ts''o" <tytso@mit.edu> > > > > So how do we want to send this patch to Linus? It''s a writeback > > change, so through some mm tree? > > It''s in my todo pile. Even though the patch sucks, but not as much as > its changelog does. Am not particularly happy merging an alleged > bugfix where the bug is, and I quote, "I saw a lock order warning on > ext4 trigger". I mean, wtf? How is anyone supposed to review the code > based on that?? Or to understand it a year from now?Sorry bout the confusion, it was supposed to be "i_mutex", and then it would have been a bit more obvious.> When I get to it I''ll troll this email thread and might be able to > kludge together a description which might be able to fool people into > thinking it makes sense."Lock order reversal between s_umount and i_mutex". i_mutex nests inside s_umount in some writeback paths (it was the end io handler to convert unwritten extents IIRC). But hmm, wouldn''t that be a bug? We aren''t allowed to take i_mutex inside writeback, are we? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote:> On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote: > > On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts''o" <tytso@mit.edu> wrote: > > > > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote: > > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > > > > > On 11/16/10 10:38 PM, Nick Piggin wrote: > > > > > >> as for the locking problems ... sorry about that! > > > > > > > > > > > > That''s no problem. So is that an ack? :) > > > > > > > > > > > > > > > > I''d like to test it with the original case it was supposed to solve; will > > > > > do that tomorrow. > > > > > > > > OK, but it shouldn''t make much difference, unless there is a lot of > > > > strange activity happening on the sb (like mount / umount / remount / > > > > freeze / etc). > > > > > > This makes sense to me as well. > > > > > > Acked-by: "Theodore Ts''o" <tytso@mit.edu> > > > > > > So how do we want to send this patch to Linus? It''s a writeback > > > change, so through some mm tree? > > > > It''s in my todo pile. Even though the patch sucks, but not as much as > > its changelog does. Am not particularly happy merging an alleged > > bugfix where the bug is, and I quote, "I saw a lock order warning on > > ext4 trigger". I mean, wtf? How is anyone supposed to review the code > > based on that?? Or to understand it a year from now? > > Sorry bout the confusion, it was supposed to be "i_mutex", and then it > would have been a bit more obvious. > > > > When I get to it I''ll troll this email thread and might be able to > > kludge together a description which might be able to fool people into > > thinking it makes sense. > > "Lock order reversal between s_umount and i_mutex". > > i_mutex nests inside s_umount in some writeback paths (it was the end > io handler to convert unwritten extents IIRC). But hmm, wouldn''t that > be a bug? We aren''t allowed to take i_mutex inside writeback, are we?I''m not sure that s_umount versus i_mutex has come up before. Logically I''d expect i_mutex to nest inside s_umount. Because s_umount is a per-superblock thing, and i_mutex is a per-file thing, and files live under superblocks. Nesting s_umount outside i_mutex creates complex deadlock graphs between the various i_mutexes, I think. Someone tell me if btrfs has the same bug, via its call to writeback_inodes_sb_nr_if_idle()? I don''t see why these functions need s_umount at all, if they''re called from within ->write_begin against an inode on that superblock. If the superblock can get itself disappeared while we''re running ->write_begin on it, we have problems, no? In which case I''d suggest just removing the down_read(s_umount) and specifying that the caller must pin the superblock via some means. Only we can''t do that because we need to hold s_umount until the bdi_queue_work() worker has done its work. The fact that a call to ->write_begin can randomly return with s_umount held, to be randomly released at some random time in the future is a bit ugly, isn''t it? write_begin is a pretty low-level, per-inode thing. It''d be better if we pinned these superblocks via refcounting, not via holding s_umount but even then, having ->write_begin randomly bump sb refcounts for random periods of time is still pretty ugly. What a pickle. Can we just delete writeback_inodes_sb_nr_if_idle() and writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is pretty handwavy - do we know that deleting these things would make a jot of difference? And why _do_ we need to hold s_umount during the bdi_queue_work() handover? Would simply bumping s_count suffice? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote:> On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > > On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote: > > > On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts''o" <tytso@mit.edu> wrote: > > > > > > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote: > > > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > > > > > > On 11/16/10 10:38 PM, Nick Piggin wrote: > > > > > > >> as for the locking problems ... sorry about that! > > > > > > > > > > > > > > That''s no problem. So is that an ack? :) > > > > > > > > > > > > > > > > > > > I''d like to test it with the original case it was supposed to solve; will > > > > > > do that tomorrow. > > > > > > > > > > OK, but it shouldn''t make much difference, unless there is a lot of > > > > > strange activity happening on the sb (like mount / umount / remount / > > > > > freeze / etc). > > > > > > > > This makes sense to me as well. > > > > > > > > Acked-by: "Theodore Ts''o" <tytso@mit.edu> > > > > > > > > So how do we want to send this patch to Linus? It''s a writeback > > > > change, so through some mm tree? > > > > > > It''s in my todo pile. Even though the patch sucks, but not as much as > > > its changelog does. Am not particularly happy merging an alleged > > > bugfix where the bug is, and I quote, "I saw a lock order warning on > > > ext4 trigger". I mean, wtf? How is anyone supposed to review the code > > > based on that?? Or to understand it a year from now? > > > > Sorry bout the confusion, it was supposed to be "i_mutex", and then it > > would have been a bit more obvious. > > > > > > > When I get to it I''ll troll this email thread and might be able to > > > kludge together a description which might be able to fool people into > > > thinking it makes sense. > > > > "Lock order reversal between s_umount and i_mutex". > > > > i_mutex nests inside s_umount in some writeback paths (it was the end > > io handler to convert unwritten extents IIRC). But hmm, wouldn''t that > > be a bug? We aren''t allowed to take i_mutex inside writeback, are we? > > I''m not sure that s_umount versus i_mutex has come up before.Well, we appear to have i_mutex inside s_umount in ext4 as well, from unwritten extent conversion in writeback completion (that''s where the lock order was recorded in my trace, I''ll try to get another one). I don''t know if that''s a good idea. Definitely we can''t take i_mutex inside writeback if it can be triggered recursively from something like memory allocation in buffered write path. So long as it is asynchronous, on the completion path, it is probably OK.> Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > is a per-superblock thing, and i_mutex is a per-file thing, and files > live under superblocks. Nesting s_umount outside i_mutex creates > complex deadlock graphs between the various i_mutexes, I think.You mean i_mutex outside s_umount?> Someone tell me if btrfs has the same bug, via its call to > writeback_inodes_sb_nr_if_idle()? > > I don''t see why these functions need s_umount at all, if they''re called > from within ->write_begin against an inode on that superblock. If the > superblock can get itself disappeared while we''re running ->write_begin > on it, we have problems, no? > > In which case I''d suggest just removing the down_read(s_umount) and > specifying that the caller must pin the superblock via some means. > > Only we can''t do that because we need to hold s_umount until the > bdi_queue_work() worker has done its work. > > The fact that a call to ->write_begin can randomly return with s_umount > held, to be randomly released at some random time in the future is a > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > thing.Yeah that whole writeback_inodes_if_idle is nasty> It''d be better if we pinned these superblocks via refcounting, not via > holding s_umount but even then, having ->write_begin randomly bump sb > refcounts for random periods of time is still pretty ugly. > > What a pickle. > > Can we just delete writeback_inodes_sb_nr_if_idle() and > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > pretty handwavy - do we know that deleting these things would make a > jot of difference? > > And why _do_ we need to hold s_umount during the bdi_queue_work() > handover? Would simply bumping s_count suffice?s_count just prevents it from going away, but s_umount is still needed to keep umount, remount,ro, freezing etc activity away. I don''t think there is an easy way to do it. Perhaps filesystem should have access to the dirty throttling path, kick writeback or delayed allocation etc as needed, and throttle against outstanding work that needs to be done, going through the normal writeback paths? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Nov 18, 2010, at 3:18 AM, Nick Piggin wrote:> s_count just prevents it from going away, but s_umount is still needed > to keep umount, remount,ro, freezing etc activity away. I don''t think > there is an easy way to do it.Hmm.... what about encoding the fact that we are in the process of unmounting the file system as a flag to keep remount, freeing, etc. away? The equivalent of the inode''s I_FREEING flag? After all, it''s not like we want freeze to wait until the umount is finished, and then continue on its merry way with the remount operation. We want it to fail, because we''re unmounting the file system. So maybe the real problem is a mutex really isn''t the right abstraction? -- Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/18/10 12:28 AM, Andrew Morton wrote:> On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > >> On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote: >>> On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts''o" <tytso@mit.edu> wrote: >>> >>>> On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote: >>>>> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: >>>>>> On 11/16/10 10:38 PM, Nick Piggin wrote: >>>>>>>> as for the locking problems ... sorry about that! >>>>>>> >>>>>>> That''s no problem. So is that an ack? :) >>>>>>> >>>>>> >>>>>> I''d like to test it with the original case it was supposed to solve; will >>>>>> do that tomorrow. >>>>> >>>>> OK, but it shouldn''t make much difference, unless there is a lot of >>>>> strange activity happening on the sb (like mount / umount / remount / >>>>> freeze / etc). >>>> >>>> This makes sense to me as well. >>>> >>>> Acked-by: "Theodore Ts''o" <tytso@mit.edu> >>>> >>>> So how do we want to send this patch to Linus? It''s a writeback >>>> change, so through some mm tree? >>> >>> It''s in my todo pile. Even though the patch sucks, but not as much as >>> its changelog does. Am not particularly happy merging an alleged >>> bugfix where the bug is, and I quote, "I saw a lock order warning on >>> ext4 trigger". I mean, wtf? How is anyone supposed to review the code >>> based on that?? Or to understand it a year from now? >> >> Sorry bout the confusion, it was supposed to be "i_mutex", and then it >> would have been a bit more obvious. >> >> >>> When I get to it I''ll troll this email thread and might be able to >>> kludge together a description which might be able to fool people into >>> thinking it makes sense. >> >> "Lock order reversal between s_umount and i_mutex". >> >> i_mutex nests inside s_umount in some writeback paths (it was the end >> io handler to convert unwritten extents IIRC). But hmm, wouldn''t that >> be a bug? We aren''t allowed to take i_mutex inside writeback, are we? > > I''m not sure that s_umount versus i_mutex has come up before. > > Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > is a per-superblock thing, and i_mutex is a per-file thing, and files > live under superblocks. Nesting s_umount outside i_mutex creates > complex deadlock graphs between the various i_mutexes, I think. > > Someone tell me if btrfs has the same bug, via its call to > writeback_inodes_sb_nr_if_idle()? > > I don''t see why these functions need s_umount at all, if they''re called > from within ->write_begin against an inode on that superblock. If the > superblock can get itself disappeared while we''re running ->write_begin > on it, we have problems, no? > > In which case I''d suggest just removing the down_read(s_umount) and > specifying that the caller must pin the superblock via some means. > > Only we can''t do that because we need to hold s_umount until the > bdi_queue_work() worker has done its work. > > The fact that a call to ->write_begin can randomly return with s_umount > held, to be randomly released at some random time in the future is a > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > thing. > > It''d be better if we pinned these superblocks via refcounting, not via > holding s_umount but even then, having ->write_begin randomly bump sb > refcounts for random periods of time is still pretty ugly. > > What a pickle. > > Can we just delete writeback_inodes_sb_nr_if_idle() and > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > pretty handwavy - do we know that deleting these things would make a > jot of difference?Really? I thought it was pretty decent ;) Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." shows the problem clearly, IIRC. I should have included that in the changelog, I suppose, sorry. -Eric> And why _do_ we need to hold s_umount during the bdi_queue_work() > handover? Would simply bumping s_count suffice? >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:> > Can we just delete writeback_inodes_sb_nr_if_idle() and > > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > > pretty handwavy - do we know that deleting these things would make a > > jot of difference? > > Really? I thought it was pretty decent ;) > > Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." > shows the problem clearly, IIRC. I should have included that in the > changelog, I suppose, sorry.Your email didn''t really impart any information :( I suppose I could accidentally delete those nasty little functions in a drivers/parport patch then wait and see if anyone notices. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 19:18:22 +1100 Nick Piggin <npiggin@kernel.dk> wrote:> On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote: > > > Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > > is a per-superblock thing, and i_mutex is a per-file thing, and files > > live under superblocks. Nesting s_umount outside i_mutex creates > > complex deadlock graphs between the various i_mutexes, I think. > > You mean i_mutex outside s_umount? >Nope. i_mutex should nest inside s_umount. Just as inodes nest inside superblocks! Seems logical to me ;)> > Someone tell me if btrfs has the same bug, via its call to > > writeback_inodes_sb_nr_if_idle()? > > > > I don''t see why these functions need s_umount at all, if they''re called > > from within ->write_begin against an inode on that superblock. If the > > superblock can get itself disappeared while we''re running ->write_begin > > on it, we have problems, no? > > > > In which case I''d suggest just removing the down_read(s_umount) and > > specifying that the caller must pin the superblock via some means. > > > > Only we can''t do that because we need to hold s_umount until the > > bdi_queue_work() worker has done its work. > > > > The fact that a call to ->write_begin can randomly return with s_umount > > held, to be randomly released at some random time in the future is a > > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > > thing. > > Yeah that whole writeback_inodes_if_idle is nasty > > > > It''d be better if we pinned these superblocks via refcounting, not via > > holding s_umount but even then, having ->write_begin randomly bump sb > > refcounts for random periods of time is still pretty ugly. > > > > What a pickle. > > > > Can we just delete writeback_inodes_sb_nr_if_idle() and > > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > > pretty handwavy - do we know that deleting these things would make a > > jot of difference? > > > > And why _do_ we need to hold s_umount during the bdi_queue_work() > > handover? Would simply bumping s_count suffice? > > s_count just prevents it from going away, but s_umount is still needed > to keep umount, remount,ro, freezing etc activity away. I don''t think > there is an easy way to do it. > > Perhaps filesystem should have access to the dirty throttling path, kick > writeback or delayed allocation etc as needed, and throttle against > outstanding work that needs to be done, going through the normal > writeback paths?I just cannot believe that we need s_mount inside ->write_begin. Is it really the case that someone can come along and unmount or remount or freeze our filesystem while some other process is down performing a ->write_begin against one of its files? Kidding? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/18/10 11:10 AM, Andrew Morton wrote:> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > >>> Can we just delete writeback_inodes_sb_nr_if_idle() and >>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is >>> pretty handwavy - do we know that deleting these things would make a >>> jot of difference? >> >> Really? I thought it was pretty decent ;) >> >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." >> shows the problem clearly, IIRC. I should have included that in the >> changelog, I suppose, sorry. > > Your email didn''t really impart any information :( > > I suppose I could accidentally delete those nasty little functions in a > drivers/parport patch then wait and see if anyone notices. >Um, ok, then, to answer the question directly : No, please don''t delete those functions, it will break ENOSPC handling in ext4 as shown by xfstests regression test #204 ... -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/18/10 12:04 PM, Eric Sandeen wrote:> On 11/18/10 11:10 AM, Andrew Morton wrote: >> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: >> >>>> Can we just delete writeback_inodes_sb_nr_if_idle() and >>>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is >>>> pretty handwavy - do we know that deleting these things would make a >>>> jot of difference? >>> >>> Really? I thought it was pretty decent ;) >>> >>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." >>> shows the problem clearly, IIRC. I should have included that in the >>> changelog, I suppose, sorry. >> >> Your email didn''t really impart any information :( >> >> I suppose I could accidentally delete those nasty little functions in a >> drivers/parport patch then wait and see if anyone notices. >> > > Um, ok, then, to answer the question directly : > > No, please don''t delete those functions, it will break ENOSPC handling > in ext4 as shown by xfstests regression test #204 ...Further - What is going on here is that with delayed allocation, ext4 takes reservations against free blocks based on the data blocks it must write out, and the worst-case metadata that the writeout may take. Getting writeback failing with ENOSPC would be bad. But then we wind up with a bunch of unflushed writes sitting on huge metadata reservations, and start hitting ENOSPC due to that worst-case reservation. After a sync we have tons of free space again, because the worst-case space reservations turned into usually best-case reality. That''s what the function is used for; once we start filling up the fs, we proactively flush data to free up the worst-case metadata reservations. Dropping it will put us back in the bad situation. If there are other ideas to fix it, I''m all ears, but this worked. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Andrew Morton''s message of 2010-11-18 01:28:34 -0500:> I''m not sure that s_umount versus i_mutex has come up before. > > Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > is a per-superblock thing, and i_mutex is a per-file thing, and files > live under superblocks. Nesting s_umount outside i_mutex creates > complex deadlock graphs between the various i_mutexes, I think. > > Someone tell me if btrfs has the same bug, via its call to > writeback_inodes_sb_nr_if_idle()?Btrfs is using the call differently, we kick off delalloc at transaction start time when many fewer locks are held. Since transaction start can happen with the inode mutex held, we''ll end up taking the s_unmount with the inode mutex held too. But, we never take the inode lock internally in the writeback paths.> > I don''t see why these functions need s_umount at all, if they''re called > from within ->write_begin against an inode on that superblock. If the > superblock can get itself disappeared while we''re running ->write_begin > on it, we have problems, no? > > In which case I''d suggest just removing the down_read(s_umount) and > specifying that the caller must pin the superblock via some means. > > Only we can''t do that because we need to hold s_umount until the > bdi_queue_work() worker has done its work. > > The fact that a call to ->write_begin can randomly return with s_umount > held, to be randomly released at some random time in the future is a > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > thing. > > It''d be better if we pinned these superblocks via refcounting, not via > holding s_umount but even then, having ->write_begin randomly bump sb > refcounts for random periods of time is still pretty ugly. > > What a pickle. > > Can we just delete writeback_inodes_sb_nr_if_idle() and > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > pretty handwavy - do we know that deleting these things would make a > jot of difference? > > And why _do_ we need to hold s_umount during the bdi_queue_work() > handover? Would simply bumping s_count suffice? >We don''t need to keep the super in ram, we need to keep the FS mounted so that writepage and friends continue to do useful things. s_count isn''t enough for that...but when the bdi stuff is passed an SB from something that has the SB explicitly pinned, we should be able to safely skip the locking. Since these functions are only used in that context it makes good sense to try_lock them or drop the lock completely. I think the only reason we need the lock: void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr) { ... WARN_ON(!rwsem_is_locked(&sb->s_umount)); We''re not going to go from rw to ro with dirty pages unless something horrible has gone wrong (eios all around in the FS), so I''m not sure why we need the lock at all? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote:> On 11/18/10 11:10 AM, Andrew Morton wrote: > > On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > > > >>> Can we just delete writeback_inodes_sb_nr_if_idle() and > >>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > >>> pretty handwavy - do we know that deleting these things would make a > >>> jot of difference? > >> > >> Really? I thought it was pretty decent ;) > >> > >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." > >> shows the problem clearly, IIRC. I should have included that in the > >> changelog, I suppose, sorry. > > > > Your email didn''t really impart any information :( > > > > I suppose I could accidentally delete those nasty little functions in a > > drivers/parport patch then wait and see if anyone notices. > > > > Um, ok, then, to answer the question directly : > > No, please don''t delete those functions, it will break ENOSPC handling > in ext4 as shown by xfstests regression test #204 ... >If those functions "fix" a testcase then it was by sheer luck, and the fs''s ENOSPC handling is still busted. For a start writeback_inodes_sb_if_idle() is a no-op if the device isn''t idle! Secondly, if the device _was_ idle, writeback_inodes_sb_if_idle() uses a work handoff to another thread, which means that the work might not get executed for another six weeks. So no, your ENOSPC handling is still busted and I''ll be doing you a favour when I send that parport patch. No? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Eric Sandeen''s message of 2010-11-18 13:24:57 -0500:> > Um, ok, then, to answer the question directly : > > > > No, please don''t delete those functions, it will break ENOSPC handling > > in ext4 as shown by xfstests regression test #204 ... > > Further - > > What is going on here is that with delayed allocation, ext4 takes reservations > against free blocks based on the data blocks it must write out, and the > worst-case metadata that the writeout may take. Getting writeback failing > with ENOSPC would be bad. > > But then we wind up with a bunch of unflushed writes sitting on huge > metadata reservations, and start hitting ENOSPC due to that worst-case > reservation. After a sync we have tons of free space again, because > the worst-case space reservations turned into usually best-case > reality. > > That''s what the function is used for; once we start filling up the > fs, we proactively flush data to free up the worst-case metadata > reservations. > > Dropping it will put us back in the bad situation. > > If there are other ideas to fix it, I''m all ears, but this worked.s/ext4/btrfs/g We do the accounting and kick IO in different places, but the idea is pretty much the same. Some of the reservations are freed when we start the IO and some are freed when the IO is done. I understand that XFS is similar but does the writeback from its own internal radix tree in the sky. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Andrew Morton''s message of 2010-11-18 13:36:38 -0500:> On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > > > On 11/18/10 11:10 AM, Andrew Morton wrote: > > > On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > > > > > >>> Can we just delete writeback_inodes_sb_nr_if_idle() and > > >>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > > >>> pretty handwavy - do we know that deleting these things would make a > > >>> jot of difference? > > >> > > >> Really? I thought it was pretty decent ;) > > >> > > >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." > > >> shows the problem clearly, IIRC. I should have included that in the > > >> changelog, I suppose, sorry. > > > > > > Your email didn''t really impart any information :( > > > > > > I suppose I could accidentally delete those nasty little functions in a > > > drivers/parport patch then wait and see if anyone notices. > > > > > > > Um, ok, then, to answer the question directly : > > > > No, please don''t delete those functions, it will break ENOSPC handling > > in ext4 as shown by xfstests regression test #204 ... > > > > If those functions "fix" a testcase then it was by sheer luck, and the > fs''s ENOSPC handling is still busted. > > For a start writeback_inodes_sb_if_idle() is a no-op if the device > isn''t idle! Secondly, if the device _was_ idle, > writeback_inodes_sb_if_idle() uses a work handoff to another thread, > which means that the work might not get executed for another six weeks. > > So no, your ENOSPC handling is still busted and I''ll be doing you a > favour when I send that parport patch.Btrfs uses it with this cool looping construct. It''s an innovative combination of while, 1, schedule_timeout(), and if all goes well, break. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 17, 2010 at 10:06:13PM -0500, Ted Ts''o wrote:> This makes sense to me as well. > > Acked-by: "Theodore Ts''o" <tytso@mit.edu> > > So how do we want to send this patch to Linus? It''s a writeback > change, so through some mm tree? Or it lives in fs/fs-writeback.c > (which I always thought was weird; why is it not in mm/?), so maybe > through the VFS tree, even though I don''t think Al would really care > about this patch.As in "don''t really like", TBH. I''ll take it (with saner commit message and comment in the source), but I really wonder if we are just asking for more trouble down the road. Specifically, I *really* want to see locking rules for that sucker spelled out. What can be held by caller? -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 11/18/10 12:36 PM, Andrew Morton wrote:> On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > >> On 11/18/10 11:10 AM, Andrew Morton wrote: >>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: >>> >>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and >>>>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is >>>>> pretty handwavy - do we know that deleting these things would make a >>>>> jot of difference? >>>> >>>> Really? I thought it was pretty decent ;) >>>> >>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." >>>> shows the problem clearly, IIRC. I should have included that in the >>>> changelog, I suppose, sorry. >>> >>> Your email didn''t really impart any information :( >>> >>> I suppose I could accidentally delete those nasty little functions in a >>> drivers/parport patch then wait and see if anyone notices. >>> >> >> Um, ok, then, to answer the question directly : >> >> No, please don''t delete those functions, it will break ENOSPC handling >> in ext4 as shown by xfstests regression test #204 ... >> > > If those functions "fix" a testcase then it was by sheer luck, and the > fs''s ENOSPC handling is still busted. > > For a start writeback_inodes_sb_if_idle() is a no-op if the device > isn''t idle!so writeback is already in progress and it''s already doing what we need, right? Space is being freed up as we speak in that case.> Secondly, if the device _was_ idle, > writeback_inodes_sb_if_idle() uses a work handoff to another thread, > which means that the work might not get executed for another six weeks.We start it quite early, before things are critical. Yeah, it''s not bulletproof but it is tons better. -Eric> So no, your ENOSPC handling is still busted and I''ll be doing you a > favour when I send that parport patch. > > No? >-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 13:02:43 -0600 Eric Sandeen <sandeen@redhat.com> wrote:> On 11/18/10 12:36 PM, Andrew Morton wrote: > > On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > > > >> On 11/18/10 11:10 AM, Andrew Morton wrote: > >>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote: > >>> > >>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and > >>>>> writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > >>>>> pretty handwavy - do we know that deleting these things would make a > >>>>> jot of difference? > >>>> > >>>> Really? I thought it was pretty decent ;) > >>>> > >>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems." > >>>> shows the problem clearly, IIRC. I should have included that in the > >>>> changelog, I suppose, sorry. > >>> > >>> Your email didn''t really impart any information :( > >>> > >>> I suppose I could accidentally delete those nasty little functions in a > >>> drivers/parport patch then wait and see if anyone notices. > >>> > >> > >> Um, ok, then, to answer the question directly : > >> > >> No, please don''t delete those functions, it will break ENOSPC handling > >> in ext4 as shown by xfstests regression test #204 ... > >> > > > > If those functions "fix" a testcase then it was by sheer luck, and the > > fs''s ENOSPC handling is still busted. > > > > For a start writeback_inodes_sb_if_idle() is a no-op if the device > > isn''t idle! > > so writeback is already in progress and it''s already doing what we need, > right? Space is being freed up as we speak in that case.With no guarantee that it''s being freed at a sufficient rate.> > Secondly, if the device _was_ idle, > > writeback_inodes_sb_if_idle() uses a work handoff to another thread, > > which means that the work might not get executed for another six weeks. > > We start it quite early, before things are critical. > > Yeah, it''s not bulletproof but it is tons better.Translation: "it papers over a bug". Look, if this was a little best-effort poke-writeback-now performance tweak then fine. But as an attempt to prevent a synchronous and bogus ENOSPC error it''s just hopeless. Guys, fix the thing for real, and take that hack out. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 18 Nov 2010 13:51:15 -0500 Chris Mason <chris.mason@oracle.com> wrote:> > If those functions "fix" a testcase then it was by sheer luck, and the > > fs''s ENOSPC handling is still busted. > > > > For a start writeback_inodes_sb_if_idle() is a no-op if the device > > isn''t idle! Secondly, if the device _was_ idle, > > writeback_inodes_sb_if_idle() uses a work handoff to another thread, > > which means that the work might not get executed for another six weeks. > > > > So no, your ENOSPC handling is still busted and I''ll be doing you a > > favour when I send that parport patch. > > Btrfs uses it with this cool looping construct. It''s an innovative > combination of while, 1, schedule_timeout(), and if all goes well, break.If the calling code can do that then it doesn''t need to pass the work off to another thread at all. Just sychronously call writeback_inodes_sb(), then bye-bye goes writeback_inodes_sb_if_idle(), to great sighs of relief. And again: as btrfs is effectively making a synchronous call to writeback_inodes_sb() via schedule(), then surely it does not need to take s_umount to protect its own darn superblock!! -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Andrew Morton''s message of 2010-11-18 15:22:38 -0500:> On Thu, 18 Nov 2010 13:51:15 -0500 > Chris Mason <chris.mason@oracle.com> wrote: > > > > If those functions "fix" a testcase then it was by sheer luck, and the > > > fs''s ENOSPC handling is still busted. > > > > > > For a start writeback_inodes_sb_if_idle() is a no-op if the device > > > isn''t idle! Secondly, if the device _was_ idle, > > > writeback_inodes_sb_if_idle() uses a work handoff to another thread, > > > which means that the work might not get executed for another six weeks. > > > > > > So no, your ENOSPC handling is still busted and I''ll be doing you a > > > favour when I send that parport patch. > > > > Btrfs uses it with this cool looping construct. It''s an innovative > > combination of while, 1, schedule_timeout(), and if all goes well, break. > > If the calling code can do that then it doesn''t need to pass the work > off to another thread at all. Just sychronously call > writeback_inodes_sb(), then bye-bye goes writeback_inodes_sb_if_idle(), > to great sighs of relief.I think if_idle() is a different discussion than the lock. writeback_inodes_sb will not actually writeback inodes. It calls bdi_queue_task, which puts a work request into the list of things the bdi flusher thread is already doing, and then it waits for that work to finish. So, if the bdi flusher thread is already doing work on this bdi, it will finish that work (at which point our ENOSPC problems may be solved) and then it will do the work we''ve added. The btrfs enospc code is calling this while we start transactions, so it might happen once per process calling write(). So if you have 50 procs, we don''t want 50 work requests pigging our bdi work queue. We just want to let the flusher do its thing and wait for it to make progress. schedule_timeout isn''t a very pretty way to wait for that progress, but we haven''t had a pretty way to wait for IO...ever (don''t say congestion_wait, it really just broke down to schedule_timeout()). Without the if_idle variant, we''ll end up doing 50 calls to writeback_inodes_sb, one after the other, even if the 1st call was enough.> > And again: as btrfs is effectively making a synchronous call to > writeback_inodes_sb() via schedule(), then surely it does not need to > take s_umount to protect its own darn superblock!! >Definitely agree here. I''d love to get rid of the s_umount lock. Outside of the WARN_ON in __writeback_inodes_sb, I don''t think we need it. Christoph added the s_umount to protect against mount -o remount,ro, but I think we can deal with that via an extra check while actually walking the inodes/pages? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu 18-11-10 13:33:54, Chris Mason wrote:> > Can we just delete writeback_inodes_sb_nr_if_idle() and > > writeback_inodes_sb_if_idle()? The changelog for 17bd55d037a02 is > > pretty handwavy - do we know that deleting these things would make a > > jot of difference? > > > > And why _do_ we need to hold s_umount during the bdi_queue_work() > > handover? Would simply bumping s_count suffice? > > > > We don''t need to keep the super in ram, we need to keep the FS mounted > so that writepage and friends continue to do useful things. s_count > isn''t enough for that...but when the bdi stuff is passed an SB from > something that has the SB explicitly pinned, we should be able to safely > skip the locking. > > Since these functions are only used in that context it makes good sense > to try_lock them or drop the lock completely. > > I think the only reason we need the lock: > > void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr) > { > ... > WARN_ON(!rwsem_is_locked(&sb->s_umount));The above is the only reason why we need the lock in the call from ->write_begin(). Yes. But:> We''re not going to go from rw to ro with dirty pages unless something > horrible has gone wrong (eios all around in the FS), so I''m not sure why > we need the lock at all?One of the nasty cases is for example: Writeback thread decides inode needs writeout, so the thread gets inode reference. Then someone calls umount and gets EBUSY because writeback thread is working. Kind of unexpected... So generally we *do* need a serialization of writeback thread and umount / remount. Just in that particular case where we call the function from ->write_begin(), s_umount is overkill... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 17-11-10 22:28:34, Andrew Morton wrote:> I''m not sure that s_umount versus i_mutex has come up before. > > Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > is a per-superblock thing, and i_mutex is a per-file thing, and files > live under superblocks. Nesting s_umount outside i_mutex creates > complex deadlock graphs between the various i_mutexes, I think. > > Someone tell me if btrfs has the same bug, via its call to > writeback_inodes_sb_nr_if_idle()? > > I don''t see why these functions need s_umount at all, if they''re called > from within ->write_begin against an inode on that superblock. If the > superblock can get itself disappeared while we''re running ->write_begin > on it, we have problems, no?As I wrote to Chris, the function just needs exclusion from umount / remount happening (and want to stop umount from returning EBUSY when writeback thread is writing something out). When the function is called from ->write_begin this is no issue as you properly noted so s_umount is not needed in that particular case.> In which case I''d suggest just removing the down_read(s_umount) and > specifying that the caller must pin the superblock via some means.Possibly, but currently the advantage is that we can have WARN_ON in the writeback code that complains if someone starts writeback without properly pinned superblock and we cannot easily do that with your general rule. I''m not saying that should stop us from changing the rule but it was kind of nice.> Only we can''t do that because we need to hold s_umount until the > bdi_queue_work() worker has done its work. > > The fact that a call to ->write_begin can randomly return with s_umount > held, to be randomly released at some random time in the future is a > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > thing.I guess you missed that writeback_inodes_sb_nr() (called from _if_idle variants) does: bdi_queue_work(sb->s_bdi, &work); wait_for_completion(&done); So we return only after all the IO has been submitted and unlock s_umount in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves because we are holding i_mutex and we need to get and put references to other inodes while doing writeback (those would be really horrible lock dependencies - writeback thread can put the last reference to an unlinked inode...). In fact, as I''m speaking about it, pushing things to writeback thread and waiting on the work does not help a bit with the locking issues (we didn''t wait for the work previously but that had other issues). Bug, sigh. What might be better interface for usecases like above is to allow filesystem to kick flusher thread to start doing background writeback (regardless of dirty limits). Then the caller can wait for some delayed allocation reservations to get freed (easy enough to check in ->writepage() and wake the waiters) - possibly with a reasonable timeout so that we don''t stall forever. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 18, 2010 at 09:58:31AM -0800, Andrew Morton wrote:> On Thu, 18 Nov 2010 19:18:22 +1100 Nick Piggin <npiggin@kernel.dk> wrote: > > > On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote: > > > > > Logically I''d expect i_mutex to nest inside s_umount. Because s_umount > > > is a per-superblock thing, and i_mutex is a per-file thing, and files > > > live under superblocks. Nesting s_umount outside i_mutex creates > > > complex deadlock graphs between the various i_mutexes, I think. > > > > You mean i_mutex outside s_umount? > > > > Nope. i_mutex should nest inside s_umount. Just as inodes nest inside > superblocks! Seems logical to me ;)Right, but your last sentence seemed to suggest that the natural ordering creates deadlocks :)> > > And why _do_ we need to hold s_umount during the bdi_queue_work() > > > handover? Would simply bumping s_count suffice? > > > > s_count just prevents it from going away, but s_umount is still needed > > to keep umount, remount,ro, freezing etc activity away. I don''t think > > there is an easy way to do it. > > > > Perhaps filesystem should have access to the dirty throttling path, kick > > writeback or delayed allocation etc as needed, and throttle against > > outstanding work that needs to be done, going through the normal > > writeback paths? > > I just cannot believe that we need s_mount inside ->write_begin. Is it > really the case that someone can come along and unmount or remount or > freeze our filesystem while some other process is down performing a > ->write_begin against one of its files? Kidding?Not for the work handoff either? If that is all waited on synchronously before ->write_end returns, then no we shouldn''t need any more locks of course. But asynch writeout needs a mutex rather than refcount so the umount has something to block against and not just fail. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:> On Wed 17-11-10 22:28:34, Andrew Morton wrote: > > The fact that a call to ->write_begin can randomly return with s_umount > > held, to be randomly released at some random time in the future is a > > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > > thing. > I guess you missed that writeback_inodes_sb_nr() (called from _if_idle > variants) does: > bdi_queue_work(sb->s_bdi, &work); > wait_for_completion(&done); > So we return only after all the IO has been submitted and unlock s_umount > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves > because we are holding i_mutex and we need to get and put references > to other inodes while doing writeback (those would be really horrible lock > dependencies - writeback thread can put the last reference to an unlinked > inode...).But if we''re waiting for it, with the lock held, then surely it can deadlock just the same as if we submit it ourself? BTW. are you taking i_mutex inside writeback? I mutex can be held while entering page reclaim, and thus writepage... so it could be a bug too.> In fact, as I''m speaking about it, pushing things to writeback thread and > waiting on the work does not help a bit with the locking issues (we didn''t > wait for the work previously but that had other issues). Bug, sigh. > > What might be better interface for usecases like above is to allow > filesystem to kick flusher thread to start doing background writeback > (regardless of dirty limits). Then the caller can wait for some delayed > allocation reservations to get freed (easy enough to check in > ->writepage() and wake the waiters) - possibly with a reasonable timeout > so that we don''t stall forever.We really need to throttle the producer without any locks held, no? So the filesystem would like to to hook into dirtying path somewhere without i_mutex held (and without implementing your own .write). Eg. around the dirty throttling calls somewhere I suppose. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Nov 19, 2010, at 12:10 AM, Nick Piggin wrote:> But asynch writeout needs a mutex rather than refcount so the umount > has something to block against and not just fail.Or we use a completion handler instead of a mutex for umount? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri 19-11-10 16:16:19, Nick Piggin wrote:> On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote: > > On Wed 17-11-10 22:28:34, Andrew Morton wrote: > > > The fact that a call to ->write_begin can randomly return with s_umount > > > held, to be randomly released at some random time in the future is a > > > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > > > thing. > > I guess you missed that writeback_inodes_sb_nr() (called from _if_idle > > variants) does: > > bdi_queue_work(sb->s_bdi, &work); > > wait_for_completion(&done); > > So we return only after all the IO has been submitted and unlock s_umount > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves > > because we are holding i_mutex and we need to get and put references > > to other inodes while doing writeback (those would be really horrible lock > > dependencies - writeback thread can put the last reference to an unlinked > > inode...). > > But if we''re waiting for it, with the lock held, then surely it can > deadlock just the same as if we submit it ourself?Yes, that''s what I realized as well and what needs fixing. It was an unintentional consequence of locking changes Christoph did to the writeback code to fix other races.> BTW. are you taking i_mutex inside writeback? I mutex can be held > while entering page reclaim, and thus writepage... so it could be a > bug too.No, writeback does not need i_mutex.> > In fact, as I''m speaking about it, pushing things to writeback thread and > > waiting on the work does not help a bit with the locking issues (we didn''t > > wait for the work previously but that had other issues). Bug, sigh. > > > > What might be better interface for usecases like above is to allow > > filesystem to kick flusher thread to start doing background writeback > > (regardless of dirty limits). Then the caller can wait for some delayed > > allocation reservations to get freed (easy enough to check in > > ->writepage() and wake the waiters) - possibly with a reasonable timeout > > so that we don''t stall forever. > > We really need to throttle the producer without any locks held, no? > So the filesystem would like to to hook into dirtying path somewhere > without i_mutex held (and without implementing your own .write). Eg. > around the dirty throttling calls somewhere I suppose.Well, the particular case where we need to do delayed allocation but the filesystem has already all blocks reserved is hard to detect without any locks. At least we need some locks to find out how many blocks do we need to reserve for that write(). So for filesystems it would solve the locking issues if we could bail out of write() path up to the point where we hold no locks, wait there / do necessary writeback and then restart the write... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Nov 2010 21:18:13 -0600 Eric Sandeen <sandeen@redhat.com> wrote:> On 11/17/10 12:10 AM, Nick Piggin wrote: > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote: > >> On 11/16/10 10:38 PM, Nick Piggin wrote: > >>>> as for the locking problems ... sorry about that! > >>> > >>> That''s no problem. So is that an ack? :) > >>> > >> > >> I''d like to test it with the original case it was supposed to solve; will > >> do that tomorrow. > > > > OK, but it shouldn''t make much difference, unless there is a lot of > > strange activity happening on the sb (like mount / umount / remount / > > freeze / etc). > > > > Ok, good point. And since I failed in my promise anyway, I won''t > hold you up; > > Acked-by: Eric Sandeen <sandeen@redhat.com>That patch still sucks! We all breathlessly await v2, which will have a complete changelog and approximately correct comments. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote:> On Fri 19-11-10 16:16:19, Nick Piggin wrote: > > On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote: > > > On Wed 17-11-10 22:28:34, Andrew Morton wrote: > > > > The fact that a call to ->write_begin can randomly return with s_umount > > > > held, to be randomly released at some random time in the future is a > > > > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > > > > thing. > > > I guess you missed that writeback_inodes_sb_nr() (called from _if_idle > > > variants) does: > > > bdi_queue_work(sb->s_bdi, &work); > > > wait_for_completion(&done); > > > So we return only after all the IO has been submitted and unlock s_umount > > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves > > > because we are holding i_mutex and we need to get and put references > > > to other inodes while doing writeback (those would be really horrible lock > > > dependencies - writeback thread can put the last reference to an unlinked > > > inode...). > > > > But if we''re waiting for it, with the lock held, then surely it can > > deadlock just the same as if we submit it ourself? > Yes, that''s what I realized as well and what needs fixing. It was an > unintentional consequence of locking changes Christoph did to the writeback > code to fix other races. > > > BTW. are you taking i_mutex inside writeback? I mutex can be held > > while entering page reclaim, and thus writepage... so it could be a > > bug too. > No, writeback does not need i_mutex.I did in fact see i_mutex from writeback, which is how the lock order reversal was noticed in the first place. I haven''t had much luck in reproducing it yet. It did come from end_io_work, I believe. There is another deadlock in here, by the looks (although this is not the one which triggers lockdep -- the workqueue coupling means it defeats lockdep detection). Buffered write holds i_lock, and waits for inode writeback submission, but the work queue seems to be blocked on i_mutex (ext4_end_io_work), and so it deadlocks. Note that this is an AA deadlock, different from ABBA one relating to s_umount lock (but very similar call chains IIRC). [ 748.406457] SysRq : Show Blocked State [ 748.406685] task PC stack pid father [ 748.406868] kworker/9:1 D 0000000000000000 6296 118 2 0x00000000 [ 748.407173] ffff88012acb3c90 0000000000000046 0000000000000000 ffff88012b1cec80 [ 748.407686] 0000000000000002 ffff88012acb2000 ffff88012acb2000 000000000000d6c8 [ 748.408200] ffff88012acb2010 ffff88012acb3fd8 ffff880129c7dd00 ffff88012b1cec80 [ 748.408711] Call Trace: [ 748.408866] [<ffffffff81039431>] ? get_parent_ip+0x11/0x50 [ 748.409033] [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480 [ 748.409205] [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409379] [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409546] [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50 [ 748.409717] [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4] [ 748.409883] [<ffffffff81068378>] process_one_work+0x1b8/0x5a0 [ 748.410046] [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0 [ 748.410219] [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4] [ 748.410386] [<ffffffff81069675>] worker_thread+0x175/0x3a0 [ 748.410549] [<ffffffff81069500>] ? worker_thread+0x0/0x3a0 [ 748.410712] [<ffffffff8106e246>] kthread+0x96/0xa0 [ 748.410874] [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10 [ 748.411038] [<ffffffff81039878>] ? finish_task_switch+0x78/0x110 [ 748.411202] [<ffffffff816036c0>] ? restore_args+0x0/0x30 [ 748.411364] [<ffffffff8106e1b0>] ? kthread+0x0/0xa0 [ 748.411524] [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10 [ 748.606853] dbench D 0000000000000000 2872 2916 1 0x00000004 [ 748.607154] ffff880129ec58b8 0000000000000046 ffff880129ec5838 ffffffff810806fd [ 748.607665] ffff880129ec5898 ffff880129ec4000 ffff880129ec4000 000000000000d6c8 [ 748.608176] ffff880129ec4010 ffff880129ec5fd8 ffff88010a2d0000 ffff88011ff69f00 [ 748.608686] Call Trace: [ 748.608837] [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10 [ 748.609001] [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350 [ 748.609164] [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0 [ 748.609328] [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60 [ 748.609493] [<ffffffff81039431>] ? get_parent_ip+0x11/0x50 [ 748.609656] [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0 [ 748.609820] [<ffffffff815ffacd>] wait_for_common+0x10d/0x190 [ 748.609984] [<ffffffff810426e0>] ? default_wake_function+0x0/0x10 [ 748.610149] [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40 [ 748.610314] [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20 [ 748.610479] [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120 [ 748.610646] [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60 [ 748.610811] [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60 [ 748.610978] [<ffffffff8113d6da>] writeback_inodes_sb_if_idle+0x3a/0x60 [ 748.611153] [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310 [ext4] [ 748.611321] [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Here is another one. Both i_mutex in writeback and s_umount in write(2)
underneath i_mutex seem like an interesting idea. The former you can
probably get away with (provided you solve the previous AA deadlock),
but the latter seems too problematic. I think my trylock patch solves
it.
[  409.479214] ======================================================[ 
409.479527] [ INFO: possible circular locking dependency detected ]
[  409.479689] 2.6.37-rc3+ #26[  409.479837]
-------------------------------------------------------
[  409.479998] umount/4020 is trying to acquire lock:[  409.480155]
(ext4-dio-unwritten){+.+...}, at: [<ffffffff81067a80>]
flush_workqueue+0x0/0x540
[  409.480178] [  409.480178] but task is already holding lock:
[  409.480178]  (&type->s_umount_key#23){++++..}, at:
[<ffffffff8111b40d>] deactivate_super+0x3d
/0x60
[  409.480178] [  409.480178] which lock already depends on the new
lock.
[  409.480178] 
[  409.480178] 
[  409.480178] the existing dependency chain (in reverse order) is:
[  409.480178] 
[  409.480178] -> #3 (&type->s_umount_key#23){++++..}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81601ba2>] down_read+0x42/0x60
[  409.480178]        [<ffffffff8113d6d2>]
writeback_inodes_sb_if_idle+0x32/0x60
[  409.480178]        [<ffffffffa003fef8>]
ext4_da_write_begin+0x208/0x2d0 [ext4]
[  409.480178]        [<ffffffff810cbbf4>]
generic_file_buffered_write+0x114/0x2a0
[  409.480178]        [<ffffffff810cc5e0>]
__generic_file_aio_write+0x240/0x470
[  409.480178]        [<ffffffff810cc876>]
generic_file_aio_write+0x66/0xd0
[  409.480178]        [<ffffffffa0034fad>] ext4_file_write+0x3d/0xd0
[ext4]
[  409.480178]        [<ffffffff81117702>] do_sync_write+0xd2/0x110
[  409.480178]        [<ffffffff811179c8>] vfs_write+0xc8/0x190
[  409.480178]        [<ffffffff8111851a>] sys_pwrite64+0x7a/0x90
[  409.480178]        [<ffffffff8100312b>]
system_call_fastpath+0x16/0x1b
[  409.480178] 
[  409.480178] -> #2 (&sb->s_type->i_mutex_key#13){+.+.+.}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81601329>]
__mutex_lock_common+0x59/0x480
[  409.480178]        [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
[  409.480178]        [<ffffffffa0042107>] ext4_end_io_work+0x37/0xb0
[ext4]
[  409.480178]        [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
[  409.480178]        [<ffffffff81069675>] worker_thread+0x175/0x3a0
[  409.480178]        [<ffffffff8106e246>] kthread+0x96/0xa0
[  409.480178]        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[  409.480178] 
[  409.480178] -> #1 ((&io->work)){+.+...}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81068364>] process_one_work+0x1a4/0x5a0
[  409.480178]        [<ffffffff81069675>] worker_thread+0x175/0x3a0
[  409.480178]        [<ffffffff8106e246>] kthread+0x96/0xa0
[  409.480178]        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[  409.480178] 
[  409.480178] -> #0 (ext4-dio-unwritten){+.+...}:
[  409.480178]        [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
[  409.480178]        [<ffffffffa004f5db>] ext4_sync_fs+0x3b/0x100
[ext4]
[  409.480178]        [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
[  409.480178]        [<ffffffff81143132>] sync_filesystem+0x32/0x60
[  409.480178]        [<ffffffff8111a97f>]
generic_shutdown_super+0x2f/0x100
[  409.480178]        [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
[  409.480178]        [<ffffffff8111b1e5>]
deactivate_locked_super+0x45/0x60
[  409.480178]        [<ffffffff8111b415>] deactivate_super+0x45/0x60
[  409.480178]        [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
[  409.480178]        [<ffffffff811376a9>] sys_umount+0x79/0x3a0
[  409.480178]        [<ffffffff8100312b>]
system_call_fastpath+0x16/0x1b
[  409.480178] [  409.480178] other info that might help us debug this:
[  409.480178] 
[  409.480178] 1 lock held by umount/4020:
[  409.480178]  #0:  (&type->s_umount_key#23){++++..}, at:
[<ffffffff8111b40d>] deactivate_super
+0x3d/0x60
[  409.480178] 
[  409.480178] stack backtrace:
[  409.480178] Pid: 4020, comm: umount Not tainted 2.6.37-rc3+ #26
[  409.480178] Call Trace:
[  409.480178]  [<ffffffff81082c39>] print_circular_bug+0xe9/0xf0
[  409.480178]  [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
[  409.480178]  [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]  [<ffffffff81067a80>] ? flush_workqueue+0x0/0x540
[  409.480178]  [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
[  409.480178]  [<ffffffff81067a80>] ? flush_workqueue+0x0/0x540
[  409.480178]  [<ffffffffa004f5db>] ext4_sync_fs+0x3b/0x100 [ext4]
[  409.480178]  [<ffffffff8113d68d>] ? writeback_inodes_sb+0x4d/0x60
[  409.480178]  [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
[  409.480178]  [<ffffffff81143132>] sync_filesystem+0x32/0x60
[  409.480178]  [<ffffffff8111a97f>] generic_shutdown_super+0x2f/0x100
[  409.480178]  [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
[  409.480178]  [<ffffffff8111b1e5>] deactivate_locked_super+0x45/0x60
[  409.480178]  [<ffffffff8111b415>] deactivate_super+0x45/0x60
[  409.480178]  [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
[  409.480178]  [<ffffffff811376a9>] sys_umount+0x79/0x3a0
[  409.480178]  [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Tue 23-11-10 19:07:58, Nick Piggin wrote:> On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote: > > On Fri 19-11-10 16:16:19, Nick Piggin wrote: > > > On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote: > > > > On Wed 17-11-10 22:28:34, Andrew Morton wrote: > > > > > The fact that a call to ->write_begin can randomly return with s_umount > > > > > held, to be randomly released at some random time in the future is a > > > > > bit ugly, isn''t it? write_begin is a pretty low-level, per-inode > > > > > thing. > > > > I guess you missed that writeback_inodes_sb_nr() (called from _if_idle > > > > variants) does: > > > > bdi_queue_work(sb->s_bdi, &work); > > > > wait_for_completion(&done); > > > > So we return only after all the IO has been submitted and unlock s_umount > > > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves > > > > because we are holding i_mutex and we need to get and put references > > > > to other inodes while doing writeback (those would be really horrible lock > > > > dependencies - writeback thread can put the last reference to an unlinked > > > > inode...). > > > > > > But if we''re waiting for it, with the lock held, then surely it can > > > deadlock just the same as if we submit it ourself? > > Yes, that''s what I realized as well and what needs fixing. It was an > > unintentional consequence of locking changes Christoph did to the writeback > > code to fix other races. > > > > > BTW. are you taking i_mutex inside writeback? I mutex can be held > > > while entering page reclaim, and thus writepage... so it could be a > > > bug too. > > No, writeback does not need i_mutex. > > I did in fact see i_mutex from writeback, which is how the lock order > reversal was noticed in the first place. I haven''t had much luck in > reproducing it yet. It did come from end_io_work, I believe. > > There is another deadlock in here, by the looks (although this is not > the one which triggers lockdep -- the workqueue coupling means it > defeats lockdep detection). > > Buffered write holds i_lock, and waits for inode writeback submission, > but the work queue seems to be blocked on i_mutex (ext4_end_io_work), > and so it deadlocks.Ah, I see. That''s a new thing introduced by Ted''s rewrite of ext4_writepages() - bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc. And indeed it introduced a deadlock as you describe... Honza> Note that this is an AA deadlock, different from ABBA one relating to > s_umount lock (but very similar call chains IIRC). > > > [ 748.406457] SysRq : Show Blocked State > [ 748.406685] task PC stack pid father > [ 748.406868] kworker/9:1 D 0000000000000000 6296 118 2 > 0x00000000 > [ 748.407173] ffff88012acb3c90 0000000000000046 0000000000000000 > ffff88012b1cec80 > [ 748.407686] 0000000000000002 ffff88012acb2000 ffff88012acb2000 > 000000000000d6c8 > [ 748.408200] ffff88012acb2010 ffff88012acb3fd8 ffff880129c7dd00 > ffff88012b1cec80 > [ 748.408711] Call Trace: > [ 748.408866] [<ffffffff81039431>] ? get_parent_ip+0x11/0x50 > [ 748.409033] [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480 > [ 748.409205] [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4] > [ 748.409379] [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4] > [ 748.409546] [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50 > [ 748.409717] [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4] > [ 748.409883] [<ffffffff81068378>] process_one_work+0x1b8/0x5a0 > [ 748.410046] [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0 > [ 748.410219] [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4] > [ 748.410386] [<ffffffff81069675>] worker_thread+0x175/0x3a0 > [ 748.410549] [<ffffffff81069500>] ? worker_thread+0x0/0x3a0 > [ 748.410712] [<ffffffff8106e246>] kthread+0x96/0xa0 > [ 748.410874] [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10 > [ 748.411038] [<ffffffff81039878>] ? finish_task_switch+0x78/0x110 > [ 748.411202] [<ffffffff816036c0>] ? restore_args+0x0/0x30 > [ 748.411364] [<ffffffff8106e1b0>] ? kthread+0x0/0xa0 > [ 748.411524] [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10 > [ 748.606853] dbench D 0000000000000000 2872 2916 1 > 0x00000004 > [ 748.607154] ffff880129ec58b8 0000000000000046 ffff880129ec5838 > ffffffff810806fd > [ 748.607665] ffff880129ec5898 ffff880129ec4000 ffff880129ec4000 > 000000000000d6c8 > [ 748.608176] ffff880129ec4010 ffff880129ec5fd8 ffff88010a2d0000 > ffff88011ff69f00 > [ 748.608686] Call Trace: > [ 748.608837] [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10 > [ 748.609001] [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350 > [ 748.609164] [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0 > [ 748.609328] [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60 > [ 748.609493] [<ffffffff81039431>] ? get_parent_ip+0x11/0x50 > [ 748.609656] [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0 > [ 748.609820] [<ffffffff815ffacd>] wait_for_common+0x10d/0x190 > [ 748.609984] [<ffffffff810426e0>] ? default_wake_function+0x0/0x10 > [ 748.610149] [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40 > [ 748.610314] [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20 > [ 748.610479] [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120 > [ 748.610646] [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60 > [ 748.610811] [<ffffffff8113d6d2>] ? > writeback_inodes_sb_if_idle+0x32/0x60 > [ 748.610978] [<ffffffff8113d6da>] > writeback_inodes_sb_if_idle+0x3a/0x60 > [ 748.611153] [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310 > [ext4] > [ 748.611321] [<ffffffff810cbbf4>] > generic_file_buffered_write+0x114/0x2a0 > > >-- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html