thr3ads.net - Btrfs devel - [patch] fs: fix deadlocks in writeback_if

If this information is useful, please help other people find it:
Share via:

Nick Piggin

2010-Nov-23 10:02 UTC

[patch] fs: fix deadlocks in writeback_if_idle

Taking s_umount lock inside i_mutex can result in an ABBA deadlock:

 ====================================================== [ INFO: possible
circular locking dependency detected ]
 2.6.37-rc3+ #26
 -------------------------------------------------------
 append_writer/12828 is trying to acquire lock:
  (&type->s_umount_key#24){+++++.}, at:
	[<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
 
 but task is already holding lock:
  (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
 
 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:
 
 -> #3 (&sb->s_type->i_mutex_key#14){+.+.+.}:
        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
        [<ffffffff81601329>] __mutex_lock_common+0x59/0x480
        [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
        [<ffffffffa003a147>] ext4_end_io_work+0x37/0xb0 [ext4]
        [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
        [<ffffffff81069675>] worker_thread+0x175/0x3a0
        [<ffffffff8106e246>] kthread+0x96/0xa0
        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
 
 -> #2 ((&io->work)){+.+...}:
        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
        [<ffffffff81068364>] process_one_work+0x1a4/0x5a0
        [<ffffffff81069675>] worker_thread+0x175/0x3a0
        [<ffffffff8106e246>] kthread+0x96/0xa0
        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
 
 -> #1 (ext4-dio-unwritten){+.+...}:
        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
        [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
        [<ffffffffa004761b>] ext4_sync_fs+0x3b/0x100 [ext4]
        [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
        [<ffffffff81143132>] sync_filesystem+0x32/0x60
        [<ffffffff8111a97f>] generic_shutdown_super+0x2f/0x100
        [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
        [<ffffffff8111b1e5>] deactivate_locked_super+0x45/0x60
        [<ffffffff8111b415>] deactivate_super+0x45/0x60
        [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
        [<ffffffff811376a9>] sys_umount+0x79/0x3a0
        [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
 
 -> #0 (&type->s_umount_key#24){+++++.}:
        [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
        [<ffffffff81601ba2>] down_read+0x42/0x60
        [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
        [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
        [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
        [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
        [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
        [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
        [<ffffffff81117702>] do_sync_write+0xd2/0x110
        [<ffffffff811179c8>] vfs_write+0xc8/0x190
        [<ffffffff811183ec>] sys_write+0x4c/0x80
        [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
 
 other info that might help us debug this:
 
 1 lock held by append_writer/12828:
  #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
 
 stack backtrace:
 Pid: 12828, comm: append_writer Not tainted 2.6.37-rc3+ #26
 Call Trace:
  [<ffffffff81082c39>] print_circular_bug+0xe9/0xf0
  [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
  [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
  [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
  [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
  [<ffffffff81601ba2>] down_read+0x42/0x60
  [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
  [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
  [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
  [<ffffffff81073dde>] ? up_read+0x1e/0x40
  [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
  [<ffffffff8104fa22>] ? current_fs_time+0x22/0x30
  [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
  [<ffffffff810cc863>] ? generic_file_aio_write+0x53/0xd0
  [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
  [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
  [<ffffffff81150db8>] ? fsnotify+0x88/0x5e0
  [<ffffffff81117702>] do_sync_write+0xd2/0x110
  [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
  [<ffffffff816027b9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
  [<ffffffff81022e9a>] ? smp_apic_timer_interrupt+0x6a/0xa0
  [<ffffffff811179c8>] vfs_write+0xc8/0x190
  [<ffffffff811183ec>] sys_write+0x4c/0x80
  [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b

Also, there can be an AA deadlock if the filesystem takes i_mutex in a
workqueue, if it also waits for work completion while holding i_mutex.

SysRq : Show Blocked State
task                        PC stack   pid father
kworker/9:1   D 0000000000000000  6296   118      2
Call Trace:
[<ffffffff81039431>] ? get_parent_ip+0x11/0x50
[<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480
[<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
[<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
[<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
[<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4]
[<ffffffff81068378>] process_one_work+0x1b8/0x5a0
[<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0
[<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4]
[<ffffffff81069675>] worker_thread+0x175/0x3a0
[<ffffffff81069500>] ? worker_thread+0x0/0x3a0
[<ffffffff8106e246>] kthread+0x96/0xa0
[<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[<ffffffff81039878>] ? finish_task_switch+0x78/0x110
[<ffffffff816036c0>] ? restore_args+0x0/0x30
[<ffffffff8106e1b0>] ? kthread+0x0/0xa0
[<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10

dbench        D 0000000000000000  2872  2916      1
Call Trace:
[<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350
[<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0
[<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60
[<ffffffff81039431>] ? get_parent_ip+0x11/0x50
[<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff815ffacd>] wait_for_common+0x10d/0x190
[<ffffffff810426e0>] ? default_wake_function+0x0/0x10
[<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40
[<ffffffff815ffbf8>] wait_for_completion+0x18/0x20
[<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120
[<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60
[<ffffffff8113d6d2>] ?  writeback_inodes_sb_if_idle+0x32/0x60
[<ffffffff8113d6da>] writeback_inodes_sb_if_idle+0x3a/0x60
[<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310
[<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0

Avoid both these issues by issuing completely asynchronous writeback request in
writeback_inodes_sb_if_idle. Don''t let that fool you into thinking
these
functions don''t suck any more.

ext4 now passes extensive stress testing with xfstests, fs_mark, dbench,
with a writeback_inodes_if_idle call added directly into ext4_da_write_begin
to trigger the path frequently. Previously it would spew lockdep stuff and
hang in a number of ways very quickly.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

---
 fs/fs-writeback.c |   32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/fs-writeback.c
==================================================================---
linux-2.6.orig/fs/fs-writeback.c	2010-11-23 20:57:23.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-23 20:59:10.000000000 +1100
@@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * Even if 1 is returned, writeback may not be started if memory allocation
+ * fails. This function makes no guarantees about anything.
  */
 int writeback_inodes_sb_if_idle(struct super_block *sb)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb(sb);
-		up_read(&sb->s_umount);
+		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
 		return 1;
-	} else
-		return 0;
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
 
@@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * Even if 1 is returned, writeback may not be started if memory allocation
+ * fails. This function makes no guarantees about anything.
  */
 int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
 				   unsigned long nr)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb_nr(sb, nr);
-		up_read(&sb->s_umount);
+		bdi_start_writeback(sb->s_bdi, nr);
 		return 1;
-	} else
-		return 0;
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-23 10:11 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 23, 2010 at 09:02:39PM +1100, Nick Piggin
wrote:>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb(sb);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>  
> @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>  				   unsigned long nr)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb_nr(sb, nr);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, nr);
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
So I changed btrfs''s function too -- I don''t think
it''s too sane to take
s_umount inside a function that advertises itself to be freely called by
filesystems and not having anything of its locking rules documented.

The issue of writeback_inodes_sb being synchronous so far as it has to
wait until the work has been dequeued is another subtlety. That is a
funny interface though, really. It has 3 callers, sync, quota, and
ubifs. From a quick look, quota and ubifs seem to think it is some kind
of synchronous writeout API. It also really sucks that it can get
deadlocked behind an unrelated item in a workqueue. I think it should
just get converted over to the async-submission style as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Boaz Harrosh

2010-Nov-23 10:26 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On 11/23/2010 12:02 PM, Nick Piggin wrote:> 
> Taking s_umount lock inside i_mutex can result in an ABBA deadlock:
> 
>  ======================================================>  [ INFO:
possible circular locking dependency detected ]
>  2.6.37-rc3+ #26
>  -------------------------------------------------------
>  append_writer/12828 is trying to acquire lock:
>   (&type->s_umount_key#24){+++++.}, at:
> 	[<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>  
>  but task is already holding lock:
>   (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
> 	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
>  
>  which lock already depends on the new lock.
>  
>  
>  the existing dependency chain (in reverse order) is:
>  
>  -> #3 (&sb->s_type->i_mutex_key#14){+.+.+.}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81601329>] __mutex_lock_common+0x59/0x480
>         [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
>         [<ffffffffa003a147>] ext4_end_io_work+0x37/0xb0 [ext4]
>         [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
>         [<ffffffff81069675>] worker_thread+0x175/0x3a0
>         [<ffffffff8106e246>] kthread+0x96/0xa0
>         [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
>  
>  -> #2 ((&io->work)){+.+...}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81068364>] process_one_work+0x1a4/0x5a0
>         [<ffffffff81069675>] worker_thread+0x175/0x3a0
>         [<ffffffff8106e246>] kthread+0x96/0xa0
>         [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
>  
>  -> #1 (ext4-dio-unwritten){+.+...}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
>         [<ffffffffa004761b>] ext4_sync_fs+0x3b/0x100 [ext4]
>         [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
>         [<ffffffff81143132>] sync_filesystem+0x32/0x60
>         [<ffffffff8111a97f>] generic_shutdown_super+0x2f/0x100
>         [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
>         [<ffffffff8111b1e5>] deactivate_locked_super+0x45/0x60
>         [<ffffffff8111b415>] deactivate_super+0x45/0x60
>         [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
>         [<ffffffff811376a9>] sys_umount+0x79/0x3a0
>         [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
>  
>  -> #0 (&type->s_umount_key#24){+++++.}:
>         [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81601ba2>] down_read+0x42/0x60
>         [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>         [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
>         [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
>         [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
>         [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
>         [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
>         [<ffffffff81117702>] do_sync_write+0xd2/0x110
>         [<ffffffff811179c8>] vfs_write+0xc8/0x190
>         [<ffffffff811183ec>] sys_write+0x4c/0x80
>         [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
>  
>  other info that might help us debug this:
>  
>  1 lock held by append_writer/12828:
>   #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
> 	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
>  
>  stack backtrace:
>  Pid: 12828, comm: append_writer Not tainted 2.6.37-rc3+ #26
>  Call Trace:
>   [<ffffffff81082c39>] print_circular_bug+0xe9/0xf0
>   [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
>   [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>   [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
>   [<ffffffff81601ba2>] down_read+0x42/0x60
>   [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
>   [<ffffffff81073dde>] ? up_read+0x1e/0x40
>   [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
>   [<ffffffff8104fa22>] ? current_fs_time+0x22/0x30
>   [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
>   [<ffffffff810cc863>] ? generic_file_aio_write+0x53/0xd0
>   [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
>   [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
>   [<ffffffff81150db8>] ? fsnotify+0x88/0x5e0
>   [<ffffffff81117702>] do_sync_write+0xd2/0x110
>   [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
>   [<ffffffff816027b9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>   [<ffffffff81022e9a>] ? smp_apic_timer_interrupt+0x6a/0xa0
>   [<ffffffff811179c8>] vfs_write+0xc8/0x190
>   [<ffffffff811183ec>] sys_write+0x4c/0x80
>   [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
> 
> Also, there can be an AA deadlock if the filesystem takes i_mutex in a
> workqueue, if it also waits for work completion while holding i_mutex.
> 
> SysRq : Show Blocked State
> task                        PC stack   pid father
> kworker/9:1   D 0000000000000000  6296   118      2
> Call Trace:
> [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480
> [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
> [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
> [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0
> [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4]
> [<ffffffff81069675>] worker_thread+0x175/0x3a0
> [<ffffffff81069500>] ? worker_thread+0x0/0x3a0
> [<ffffffff8106e246>] kthread+0x96/0xa0
> [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
> [<ffffffff81039878>] ? finish_task_switch+0x78/0x110
> [<ffffffff816036c0>] ? restore_args+0x0/0x30
> [<ffffffff8106e1b0>] ? kthread+0x0/0xa0
> [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10
> 
> dbench        D 0000000000000000  2872  2916      1
> Call Trace:
> [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350
> [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0
> [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60
> [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
> [<ffffffff815ffacd>] wait_for_common+0x10d/0x190
> [<ffffffff810426e0>] ? default_wake_function+0x0/0x10
> [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40
> [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20
> [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120
> [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60
> [<ffffffff8113d6d2>] ?  writeback_inodes_sb_if_idle+0x32/0x60
> [<ffffffff8113d6da>] writeback_inodes_sb_if_idle+0x3a/0x60
> [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310
> [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
> 
> Avoid both these issues by issuing completely asynchronous writeback
request in
> writeback_inodes_sb_if_idle. Don''t let that fool you into thinking
these
> functions don''t suck any more.
> 
> ext4 now passes extensive stress testing with xfstests, fs_mark, dbench,
> with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> to trigger the path frequently. Previously it would spew lockdep stuff and
> hang in a number of ways very quickly.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/fs/fs-writeback.c
> ==================================================================> ---
linux-2.6.orig/fs/fs-writeback.c	2010-11-23 20:57:23.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-11-23 20:59:10.000000000 +1100
> @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb(sb);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>  
> @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>  				   unsigned long nr)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb_nr(sb, nr);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, nr);
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>  
static inline int writeback_inodes_sb_if_idle(struct super_block *sb)
{
	return writeback_inodes_sb_nr_if_idle(sb, get_nr_dirty_pages());
}

In writeback.h, No?
But it has a single user so please just kill it.

Also writeback_inodes_sb_nr_if_idle() has a single user. Combined with above,
two users. Why not open code it in the two sites. It should be much
clearer to understand what the magic is all about?

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-23 10:54 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 23, 2010 at 12:26:31PM +0200, Boaz Harrosh
wrote:> >   *
> >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> >   * Returns 1 if writeback was started, 0 if not.
> > + *
> > + * Even if 1 is returned, writeback may not be started if memory
allocation
> > + * fails. This function makes no guarantees about anything.
> >   */
> >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> > -		down_read(&sb->s_umount);
> > -		writeback_inodes_sb(sb);
> > -		up_read(&sb->s_umount);
> > +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
> >  		return 1;
> > -	} else
> > -		return 0;
> > +	}
> > +	return 0;
> >  }
> >  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
> >  
> > @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
> >   *
> >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> >   * Returns 1 if writeback was started, 0 if not.
> > + *
> > + * Even if 1 is returned, writeback may not be started if memory
allocation
> > + * fails. This function makes no guarantees about anything.
> >   */
> >  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
> >  				   unsigned long nr)
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> > -		down_read(&sb->s_umount);
> > -		writeback_inodes_sb_nr(sb, nr);
> > -		up_read(&sb->s_umount);
> > +		bdi_start_writeback(sb->s_bdi, nr);
> >  		return 1;
> > -	} else
> > -		return 0;
> > +	}
> > +	return 0;
> >  }
> >  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
> >  
> 
> static inline int writeback_inodes_sb_if_idle(struct super_block *sb)
> {
> 	return writeback_inodes_sb_nr_if_idle(sb, get_nr_dirty_pages());
> }
> 
> In writeback.h, No?
I didn''t care enough to move it :P I don''t know if it matters.

> But it has a single user so please just kill it.
> 
> Also writeback_inodes_sb_nr_if_idle() has a single user. Combined with
above,
> two users. Why not open code it in the two sites. It should be much
> clearer to understand what the magic is all about?
The filesystem shouldn''t be aware of the details (the
"magic") of how to
kick writeback, so I think the abstraction is right as is.

Thanks,
Nick

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Boaz Harrosh

2010-Nov-23 12:00 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On 11/23/2010 12:54 PM, Nick Piggin wrote:> On Tue, Nov 23, 2010 at 12:26:31PM +0200, Boaz Harrosh wrote:
>>>   *
>>>   * Invoke writeback_inodes_sb if no writeback is currently
underway.
>>>   * Returns 1 if writeback was started, 0 if not.
>>> + *
>>> + * Even if 1 is returned, writeback may not be started if memory
allocation
>>> + * fails. This function makes no guarantees about anything.
>>>   */
>>>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>>>  {
>>>  	if (!writeback_in_progress(sb->s_bdi)) {
>>> -		down_read(&sb->s_umount);
>>> -		writeback_inodes_sb(sb);
>>> -		up_read(&sb->s_umount);
>>> +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
>>>  		return 1;
>>> -	} else
>>> -		return 0;
>>> +	}
>>> +	return 0;
>>>  }
>>>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>>>  
>>> @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>>>   *
>>>   * Invoke writeback_inodes_sb if no writeback is currently
underway.
>>>   * Returns 1 if writeback was started, 0 if not.
>>> + *
>>> + * Even if 1 is returned, writeback may not be started if memory
allocation
>>> + * fails. This function makes no guarantees about anything.
>>>   */
>>>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>>>  				   unsigned long nr)
>>>  {
>>>  	if (!writeback_in_progress(sb->s_bdi)) {
>>> -		down_read(&sb->s_umount);
>>> -		writeback_inodes_sb_nr(sb, nr);
>>> -		up_read(&sb->s_umount);
>>> +		bdi_start_writeback(sb->s_bdi, nr);
>>>  		return 1;
>>> -	} else
>>> -		return 0;
>>> +	}
>>> +	return 0;
>>>  }
>>>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>>>  
>>
>> static inline int writeback_inodes_sb_if_idle(struct super_block *sb)
>> {
>> 	return writeback_inodes_sb_nr_if_idle(sb, get_nr_dirty_pages());
>> }
>>
>> In writeback.h, No?
> 
> I didn''t care enough to move it :P I don''t know if it
matters.
> 
Than please just open code it in ext4 and completely remove it.
One stupid function is enough don''t you think?
> 
>> But it has a single user so please just kill it.
>>
>> Also writeback_inodes_sb_nr_if_idle() has a single user. Combined with
above,
>> two users. Why not open code it in the two sites. It should be much
>> clearer to understand what the magic is all about?
> 
> The filesystem shouldn''t be aware of the details (the
"magic") of how to
> kick writeback, so I think the abstraction is right as is.
> 
bdi_start_writeback() looks like a very clear abstraction to me but
if you have plans for it than sure, I''m convinced.
> Thanks,
> Nick
> 
Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Nov-23 12:34 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

Excerpts from Nick Piggin''s message of 2010-11-23 05:02:39 -0500:

[ ... ]
> 
> Avoid both these issues by issuing completely asynchronous writeback
request in
> writeback_inodes_sb_if_idle. Don''t let that fool you into thinking
these
> functions don''t suck any more.
> 
> ext4 now passes extensive stress testing with xfstests, fs_mark, dbench,
> with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> to trigger the path frequently. Previously it would spew lockdep stuff and
> hang in a number of ways very quickly.
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/fs/fs-writeback.c
> ==================================================================> ---
linux-2.6.orig/fs/fs-writeback.c    2010-11-23 20:57:23.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c    2010-11-23 20:59:10.000000000 +1100
> @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>      if (!writeback_in_progress(sb->s_bdi)) {
> -        down_read(&sb->s_umount);
> -        writeback_inodes_sb(sb);
> -        up_read(&sb->s_umount);
> +        bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
I''ll put on my skis and channel Christoph for a minute.  This
isn''t
quite the same as the original.  writeback_inodes_sb() writes inodes on a
specific super block, while bdi_start_writeback() writes them for any SB
on that bdi.

For btrfs there''s only one bdi per SB, but for most everyone else a
disk
with a bunch of partitions is going to have multiple filesystems on the
same bdi.

My original btrfs patch just exported the bdi_ funcs so that btrfs could
do the above internally.  But Christoph objected, and I think he''s
right.  We should either give everyone a bdi or make sure the writeback
func kicks only one filesystem.

-chris
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-23 12:52 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 23, 2010 at 07:34:07AM -0500, Chris Mason
wrote:> Excerpts from Nick Piggin''s message of 2010-11-23 05:02:39 -0500:
> 
> [ ... ]
> 
> > 
> > Avoid both these issues by issuing completely asynchronous writeback
request in
> > writeback_inodes_sb_if_idle. Don''t let that fool you into
thinking these
> > functions don''t suck any more.
> > 
> > ext4 now passes extensive stress testing with xfstests, fs_mark,
dbench,
> > with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> > to trigger the path frequently. Previously it would spew lockdep stuff
and
> > hang in a number of ways very quickly.
> > 
> > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > 
> > ---
> >  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
> >  1 file changed, 20 insertions(+), 12 deletions(-)
> > 
> > Index: linux-2.6/fs/fs-writeback.c
> > ==================================================================>
> --- linux-2.6.orig/fs/fs-writeback.c    2010-11-23 20:57:23.000000000 +1100
> > +++ linux-2.6/fs/fs-writeback.c    2010-11-23 20:59:10.000000000 +1100
> > @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> >   *
> >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> >   * Returns 1 if writeback was started, 0 if not.
> > + *
> > + * Even if 1 is returned, writeback may not be started if memory
allocation
> > + * fails. This function makes no guarantees about anything.
> >   */
> >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> >  {
> >      if (!writeback_in_progress(sb->s_bdi)) {
> > -        down_read(&sb->s_umount);
> > -        writeback_inodes_sb(sb);
> > -        up_read(&sb->s_umount);
> > +        bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
> 
> I''ll put on my skis and channel Christoph for a minute.  This
isn''t
> quite the same as the original.  writeback_inodes_sb() writes inodes on a
> specific super block, while bdi_start_writeback() writes them for any SB
> on that bdi.
Right.

 > For btrfs there''s only one bdi per SB, but for most everyone else
a disk
> with a bunch of partitions is going to have multiple filesystems on the
> same bdi.
> 
> My original btrfs patch just exported the bdi_ funcs so that btrfs could
> do the above internally.  But Christoph objected, and I think he''s
> right.  We should either give everyone a bdi or make sure the writeback
> func kicks only one filesystem.
Well it''s just kicking the writeback thread, and it will writeback
from that particular sb. It makes no further guarantees, and anyway
the sb has to compete for normal writeback within this bdi.

I think Christoph is right because filesystems should not really
know about how bdi writeback queueing works. But I don''t know if
it''s
worth doing anything more complex for this functionality?


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dmitry

2010-Nov-23 12:54 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, 23 Nov 2010 21:02:39 +1100, Nick Piggin <npiggin@kernel.dk>
wrote:> 
> Taking s_umount lock inside i_mutex can result in an ABBA deadlock:
> 
>  ======================================================>  [ INFO:
possible circular locking dependency detected ]
>  2.6.37-rc3+ #26
>  -------------------------------------------------------
>  append_writer/12828 is trying to acquire lock:
>   (&type->s_umount_key#24){+++++.}, at:
> 	[<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>  
>  but task is already holding lock:
>   (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
> 	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
>  
>  which lock already depends on the new lock.
>  
>  
>  the existing dependency chain (in reverse order) is:
>  
>  -> #3 (&sb->s_type->i_mutex_key#14){+.+.+.}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81601329>] __mutex_lock_common+0x59/0x480
>         [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
>         [<ffffffffa003a147>] ext4_end_io_work+0x37/0xb0 [ext4]
>         [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
>         [<ffffffff81069675>] worker_thread+0x175/0x3a0
>         [<ffffffff8106e246>] kthread+0x96/0xa0
>         [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
>  
>  -> #2 ((&io->work)){+.+...}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81068364>] process_one_work+0x1a4/0x5a0
>         [<ffffffff81069675>] worker_thread+0x175/0x3a0
>         [<ffffffff8106e246>] kthread+0x96/0xa0
>         [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
>  
>  -> #1 (ext4-dio-unwritten){+.+...}:
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
>         [<ffffffffa004761b>] ext4_sync_fs+0x3b/0x100 [ext4]
>         [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
>         [<ffffffff81143132>] sync_filesystem+0x32/0x60
>         [<ffffffff8111a97f>] generic_shutdown_super+0x2f/0x100
>         [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
>         [<ffffffff8111b1e5>] deactivate_locked_super+0x45/0x60
>         [<ffffffff8111b415>] deactivate_super+0x45/0x60
>         [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
>         [<ffffffff811376a9>] sys_umount+0x79/0x3a0
>         [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
>  
>  -> #0 (&type->s_umount_key#24){+++++.}:
>         [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
>         [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>         [<ffffffff81601ba2>] down_read+0x42/0x60
>         [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>         [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
>         [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
>         [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
>         [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
>         [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
>         [<ffffffff81117702>] do_sync_write+0xd2/0x110
>         [<ffffffff811179c8>] vfs_write+0xc8/0x190
>         [<ffffffff811183ec>] sys_write+0x4c/0x80
>         [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
>  
>  other info that might help us debug this:
>  
>  1 lock held by append_writer/12828:
>   #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at:
> 	[<ffffffff810cc863>] generic_file_aio_write+0x53/0xd0
>  
>  stack backtrace:
>  Pid: 12828, comm: append_writer Not tainted 2.6.37-rc3+ #26
>  Call Trace:
>   [<ffffffff81082c39>] print_circular_bug+0xe9/0xf0
>   [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
>   [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
>   [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
>   [<ffffffff81601ba2>] down_read+0x42/0x60
>   [<ffffffff8113d6d2>] ? writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffff8113d6d2>] writeback_inodes_sb_if_idle+0x32/0x60
>   [<ffffffffa0037efd>] ext4_da_write_begin+0x20d/0x310 [ext4]
>   [<ffffffff81073dde>] ? up_read+0x1e/0x40
>   [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
>   [<ffffffff8104fa22>] ? current_fs_time+0x22/0x30
>   [<ffffffff810cc5e0>] __generic_file_aio_write+0x240/0x470
>   [<ffffffff810cc863>] ? generic_file_aio_write+0x53/0xd0
>   [<ffffffff810cc876>] generic_file_aio_write+0x66/0xd0
>   [<ffffffffa002cfad>] ext4_file_write+0x3d/0xd0 [ext4]
>   [<ffffffff81150db8>] ? fsnotify+0x88/0x5e0
>   [<ffffffff81117702>] do_sync_write+0xd2/0x110
>   [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
>   [<ffffffff816027b9>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>   [<ffffffff81022e9a>] ? smp_apic_timer_interrupt+0x6a/0xa0
>   [<ffffffff811179c8>] vfs_write+0xc8/0x190
>   [<ffffffff811183ec>] sys_write+0x4c/0x80
>   [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b
> 
> Also, there can be an AA deadlock if the filesystem takes i_mutex in a
> workqueue, if it also waits for work completion while holding i_mutex.
> 
> SysRq : Show Blocked State
> task                        PC stack   pid father
> kworker/9:1   D 0000000000000000  6296   118      2
> Call Trace:
> [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480
> [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
> [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4]
> [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
> [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0
> [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4]
> [<ffffffff81069675>] worker_thread+0x175/0x3a0
> [<ffffffff81069500>] ? worker_thread+0x0/0x3a0
> [<ffffffff8106e246>] kthread+0x96/0xa0
> [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
> [<ffffffff81039878>] ? finish_task_switch+0x78/0x110
> [<ffffffff816036c0>] ? restore_args+0x0/0x30
> [<ffffffff8106e1b0>] ? kthread+0x0/0xa0
> [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10
> 
> dbench        D 0000000000000000  2872  2916      1
> Call Trace:
> [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350
> [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0
> [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60
> [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
> [<ffffffff815ffacd>] wait_for_common+0x10d/0x190
> [<ffffffff810426e0>] ? default_wake_function+0x0/0x10
> [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40
> [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20
> [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120
> [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60
> [<ffffffff8113d6d2>] ?  writeback_inodes_sb_if_idle+0x32/0x60
> [<ffffffff8113d6da>] writeback_inodes_sb_if_idle+0x3a/0x60
> [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310
> [<ffffffff810cbbf4>] generic_file_buffered_write+0x114/0x2a0
> 
> Avoid both these issues by issuing completely asynchronous writeback
request in
> writeback_inodes_sb_if_idle. Don''t let that fool you into thinking
these
> functions don''t suck any more.
> 
> ext4 now passes extensive stress testing with xfstests, fs_mark, dbench,
> with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> to trigger the path frequently. Previously it would spew lockdep stuff and
> hang in a number of ways very quickly.I''m still able to trigger strange lockups on fsstress, but in other
place. System hangs so quickly so i''m not succeed in grabbing logs
yet.> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> ---
>  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
>  1 file changed, 20 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/fs/fs-writeback.c
> ==================================================================> ---
linux-2.6.orig/fs/fs-writeback.c	2010-11-23 20:57:23.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-11-23 20:59:10.000000000 +1100
> @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb(sb);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>  
> @@ -1172,17 +1173,18 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * Even if 1 is returned, writeback may not be started if memory
allocation
> + * fails. This function makes no guarantees about anything.
>   */
>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>  				   unsigned long nr)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb_nr(sb, nr);
> -		up_read(&sb->s_umount);
> +		bdi_start_writeback(sb->s_bdi, nr);
>  		return 1;
> -	} else
> -		return 0;
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kara

2010-Nov-23 13:18 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue 23-11-10 21:11:49, Nick Piggin wrote:> The issue of writeback_inodes_sb being synchronous so far as it has to
> wait until the work has been dequeued is another subtlety. That is a
> funny interface though, really. It has 3 callers, sync, quota, and
> ubifs. From a quick look, quota and ubifs seem to think it is some kind
> of synchronous writeout API.  Yes, they expect it and it used to be the case before per-bdi flusher
threads existed (because the function submitted IO on its own). Then it
was changed to an async interface in per-bdi flusher thread patches and
then back again to a synchronous one... Sad history...
> It also really sucks that it can get deadlocked behind an unrelated item
> in a workqueue. I think it should just get converted over to the
> async-submission style as well.  Here I don''t quite agree. After my patches (currently in -mm tree)
all
work items have reasonably well defined lifetime so no livelocks should
occur. After all writeback thread is doing its best to do as much IO as
possible (and hopefully saturates the storage) so given all the IO work
items we cannot do much better. Where I see a room for improvement is
that work items usually try to achieve a common goal - for example when we
get two items "write all dirty pages", we have mostly fulfilled the
second
one after finishing the first one but we start from the beginning when
processing the second request. But it seems non-trivial to do this request
merging especially for different types of requests...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Nov-23 18:58 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

Excerpts from Nick Piggin''s message of 2010-11-23 07:52:23
-0500:> On Tue, Nov 23, 2010 at 07:34:07AM -0500, Chris Mason wrote:
> > Excerpts from Nick Piggin''s message of 2010-11-23 05:02:39
-0500:
> > 
> > [ ... ]
> > 
> > > 
> > > Avoid both these issues by issuing completely asynchronous
writeback request in
> > > writeback_inodes_sb_if_idle. Don''t let that fool you
into thinking these
> > > functions don''t suck any more.
> > > 
> > > ext4 now passes extensive stress testing with xfstests, fs_mark,
dbench,
> > > with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> > > to trigger the path frequently. Previously it would spew lockdep
stuff and
> > > hang in a number of ways very quickly.
> > > 
> > > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > > 
> > > ---
> > >  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
> > >  1 file changed, 20 insertions(+), 12 deletions(-)
> > > 
> > > Index: linux-2.6/fs/fs-writeback.c
> > >
==================================================================> > >
--- linux-2.6.orig/fs/fs-writeback.c    2010-11-23 20:57:23.000000000 +1100
> > > +++ linux-2.6/fs/fs-writeback.c    2010-11-23 20:59:10.000000000
+1100
> > > @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> > >   *
> > >   * Invoke writeback_inodes_sb if no writeback is currently
underway.
> > >   * Returns 1 if writeback was started, 0 if not.
> > > + *
> > > + * Even if 1 is returned, writeback may not be started if memory
allocation
> > > + * fails. This function makes no guarantees about anything.
> > >   */
> > >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> > >  {
> > >      if (!writeback_in_progress(sb->s_bdi)) {
> > > -        down_read(&sb->s_umount);
> > > -        writeback_inodes_sb(sb);
> > > -        up_read(&sb->s_umount);
> > > +        bdi_start_writeback(sb->s_bdi, get_nr_dirty_pages());
> > 
> > I''ll put on my skis and channel Christoph for a minute.  This
isn''t
> > quite the same as the original.  writeback_inodes_sb() writes inodes
on a
> > specific super block, while bdi_start_writeback() writes them for any
SB
> > on that bdi.
> 
> Right.
> 
> > For btrfs there''s only one bdi per SB, but for most everyone
else a disk
> > with a bunch of partitions is going to have multiple filesystems on
the
> > same bdi.
> > 
> > My original btrfs patch just exported the bdi_ funcs so that btrfs
could
> > do the above internally.  But Christoph objected, and I think
he''s
> > right.  We should either give everyone a bdi or make sure the
writeback
> > func kicks only one filesystem.
> 
> Well it''s just kicking the writeback thread, and it will writeback
> from that particular sb.
Hmmm?  It will writeback for all the SBs on that bdi.  In the current
form that ext4 uses, that gets pretty expensive if you have a bunch of
large partitions and you''re only running out of space on one of them.

For the _nr variant that btrfs uses, it''s worse for the filesystems
that don''t have a 1:1 bdi<->sb mapping.  It might not actually
write any
of the pages from the SB that is out of space.
> It makes no further guarantees, and anyway
> the sb has to compete for normal writeback within this bdi.
> 
> I think Christoph is right because filesystems should not really
> know about how bdi writeback queueing works. But I don''t know if
it''s
> worth doing anything more complex for this functionality?
I think we should make a writeback_inodes_sb_unlocked() that doesn''t
warn when the semaphore isn''t held.  That should be enough given where
btrfs and ext4 are calling it from.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-24 01:03 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 23, 2010 at 01:58:24PM -0500, Chris Mason
wrote:> Excerpts from Nick Piggin''s message of 2010-11-23 07:52:23 -0500:
> > On Tue, Nov 23, 2010 at 07:34:07AM -0500, Chris Mason wrote:
> > > Excerpts from Nick Piggin''s message of 2010-11-23
05:02:39 -0500:
> > > 
> > > [ ... ]
> > > 
> > > > 
> > > > Avoid both these issues by issuing completely asynchronous
writeback request in
> > > > writeback_inodes_sb_if_idle. Don''t let that fool
you into thinking these
> > > > functions don''t suck any more.
> > > > 
> > > > ext4 now passes extensive stress testing with xfstests,
fs_mark, dbench,
> > > > with a writeback_inodes_if_idle call added directly into
ext4_da_write_begin
> > > > to trigger the path frequently. Previously it would spew
lockdep stuff and
> > > > hang in a number of ways very quickly.
> > > > 
> > > > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > > > 
> > > > ---
> > > >  fs/fs-writeback.c |   32 ++++++++++++++++++++------------
> > > >  1 file changed, 20 insertions(+), 12 deletions(-)
> > > > 
> > > > Index: linux-2.6/fs/fs-writeback.c
> > > >
==================================================================> > >
> --- linux-2.6.orig/fs/fs-writeback.c    2010-11-23 20:57:23.000000000 +1100
> > > > +++ linux-2.6/fs/fs-writeback.c    2010-11-23
20:59:10.000000000 +1100
> > > > @@ -1152,16 +1152,17 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> > > >   *
> > > >   * Invoke writeback_inodes_sb if no writeback is currently
underway.
> > > >   * Returns 1 if writeback was started, 0 if not.
> > > > + *
> > > > + * Even if 1 is returned, writeback may not be started if
memory allocation
> > > > + * fails. This function makes no guarantees about anything.
> > > >   */
> > > >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> > > >  {
> > > >      if (!writeback_in_progress(sb->s_bdi)) {
> > > > -        down_read(&sb->s_umount);
> > > > -        writeback_inodes_sb(sb);
> > > > -        up_read(&sb->s_umount);
> > > > +        bdi_start_writeback(sb->s_bdi,
get_nr_dirty_pages());
> > > 
> > > I''ll put on my skis and channel Christoph for a minute. 
This isn''t
> > > quite the same as the original.  writeback_inodes_sb() writes
inodes on a
> > > specific super block, while bdi_start_writeback() writes them for
any SB
> > > on that bdi.
> > 
> > Right.
> > 
> > > For btrfs there''s only one bdi per SB, but for most
everyone else a disk
> > > with a bunch of partitions is going to have multiple filesystems
on the
> > > same bdi.
> > > 
> > > My original btrfs patch just exported the bdi_ funcs so that
btrfs could
> > > do the above internally.  But Christoph objected, and I think
he''s
> > > right.  We should either give everyone a bdi or make sure the
writeback
> > > func kicks only one filesystem.
> > 
> > Well it''s just kicking the writeback thread, and it will
writeback
> > from that particular sb.
> 
> Hmmm?  It will writeback for all the SBs on that bdi.  In the current
> form that ext4 uses, that gets pretty expensive if you have a bunch of
> large partitions and you''re only running out of space on one of
them.
Right. But if the bdi has writeback in progress (which would be most
of the time, on a busy filesystem), writeback_if_idle doesn''t do
anything, and it is happy just for the background writeback to
eventually get around to writing out for us.

 > For the _nr variant that btrfs uses, it''s worse for the
filesystems
> that don''t have a 1:1 bdi<->sb mapping.  It might not
actually write any
> of the pages from the SB that is out of space.
That''s true, but it might not write anything anyway (and after we
check whether writeout is in progress, the writeout thread could go
to sleep and not do anything anyway).

So it''s a pretty hacky interface anyway. If you want to do anything
deterministic, you obviously need real coupling between producer and
consumer. This should only be a performance tweak (or a workaround
hack in worst case).

 > > It makes no further guarantees, and anyway
> > the sb has to compete for normal writeback within this bdi.
> 
> > 
> > I think Christoph is right because filesystems should not really
> > know about how bdi writeback queueing works. But I don''t know
if it''s
> > worth doing anything more complex for this functionality?
> 
> I think we should make a writeback_inodes_sb_unlocked() that
doesn''t
> warn when the semaphore isn''t held.  That should be enough given
where
> btrfs and ext4 are calling it from.
It doesn''t solve the bugs -- calling and waiting for writeback is
still broken because completion requires i_mutex and it is called
from under i_mutex.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jan Kara

2010-Nov-24 13:10 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Wed 24-11-10 12:03:43, Nick Piggin wrote:> > For the _nr variant that btrfs uses, it''s worse for the
filesystems
> > that don''t have a 1:1 bdi<->sb mapping.  It might not
actually write any
> > of the pages from the SB that is out of space.
> 
> That''s true, but it might not write anything anyway (and after we
> check whether writeout is in progress, the writeout thread could go
> to sleep and not do anything anyway).
> 
> So it''s a pretty hacky interface anyway. If you want to do
anything
> deterministic, you obviously need real coupling between producer and
> consumer. This should only be a performance tweak (or a workaround
> hack in worst case).  Yes, the current interface is a band aid for the problem and better
interface is welcome. But it''s not trivial to do better...
> > > It makes no further guarantees, and anyway
> > > the sb has to compete for normal writeback within this bdi.
> > 
> > > 
> > > I think Christoph is right because filesystems should not really
> > > know about how bdi writeback queueing works. But I don''t
know if it''s
> > > worth doing anything more complex for this functionality?
> > 
> > I think we should make a writeback_inodes_sb_unlocked() that
doesn''t
> > warn when the semaphore isn''t held.  That should be enough
given where
> > btrfs and ext4 are calling it from.
> 
> It doesn''t solve the bugs -- calling and waiting for writeback is
> still broken because completion requires i_mutex and it is called
> from under i_mutex.  Well, as I wrote in my previous email, only ext4 has the problem with
i_mutex and I personally view it as a bug. But ultimately it''s
Ted''s call
to decide.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrew Morton

2010-Nov-24 22:47 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, 23 Nov 2010 07:34:07 -0500
Chris Mason <chris.mason@oracle.com> wrote:
> For btrfs there''s only one bdi per SB, but for most everyone else
a disk
> with a bunch of partitions is going to have multiple filesystems on the
> same bdi.
um, please explain why that wasn''t idiotic?  The BDI is a
representation of a backing device and it''s *supposed* to provide
visibility into what''s happening against other partitions on the same
device.  Creating a BDI per SB (it didn''t even occur to me to think
that a filesystem was even able to do this) breaks that.


--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrew Morton

2010-Nov-24 22:51 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Wed, 24 Nov 2010 12:03:43 +1100
Nick Piggin <npiggin@kernel.dk> wrote:
> On Tue, Nov 23, 2010 at 01:58:24PM -0500, Chris Mason wrote:
> > > > My original btrfs patch just exported the bdi_ funcs so that
btrfs could
> > > > do the above internally.  But Christoph objected, and I
think he''s
> > > > right.  We should either give everyone a bdi or make sure
the writeback
> > > > func kicks only one filesystem.
> > > 
> > > Well it''s just kicking the writeback thread, and it will
writeback
> > > from that particular sb.
> > 
> > Hmmm?  It will writeback for all the SBs on that bdi.  In the current
> > form that ext4 uses, that gets pretty expensive if you have a bunch of
> > large partitions and you''re only running out of space on one
of them.
> 
> Right. But if the bdi has writeback in progress (which would be most
> of the time, on a busy filesystem), writeback_if_idle doesn''t do
> anything, and it is happy just for the background writeback to
> eventually get around to writing out for us.
That doesn''t work if you''re running btfs (apparently short for
"busticated filesystem") because the bdi-per-sb thing carefully hid
the
information which you''re looking for.


We still don''t have a fix for this bug yet, it appears, btw.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-25 03:52 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 23, 2010 at 02:18:57PM +0100, Jan Kara
wrote:> On Tue 23-11-10 21:11:49, Nick Piggin wrote:
> > The issue of writeback_inodes_sb being synchronous so far as it has to
> > wait until the work has been dequeued is another subtlety. That is a
> > funny interface though, really. It has 3 callers, sync, quota, and
> > ubifs. From a quick look, quota and ubifs seem to think it is some
kind
> > of synchronous writeout API.
>   Yes, they expect it and it used to be the case before per-bdi flusher
> threads existed (because the function submitted IO on its own). Then it
> was changed to an async interface in per-bdi flusher thread patches and
> then back again to a synchronous one... Sad history...
> 
> > It also really sucks that it can get deadlocked behind an unrelated
item
> > in a workqueue. I think it should just get converted over to the
> > async-submission style as well.
>   Here I don''t quite agree. After my patches (currently in -mm
tree) all
> work items have reasonably well defined lifetime so no livelocks should
> occur.
It was actually a deadlock, and it was due to workqueue item for
writeback submission being held up behind item for completion, where
former is holding i_mutex and latter trying to acquire.

I haven''t studied the patches in -mm, do they solve this problem?
> After all writeback thread is doing its best to do as much IO as
> possible (and hopefully saturates the storage) so given all the IO work
> items we cannot do much better. Where I see a room for improvement is
> that work items usually try to achieve a common goal - for example when we
> get two items "write all dirty pages", we have mostly fulfilled
the second
> one after finishing the first one but we start from the beginning when
> processing the second request. But it seems non-trivial to do this request
> merging especially for different types of requests...
Well the request is still going via normal writeback submission. The
difference in my patch is that we just dynamically allocate it and
don''t
pass a pointer to sb, so we don''t have to wait.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-25 03:53 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Wed, Nov 24, 2010 at 02:10:28PM +0100, Jan Kara
wrote:> On Wed 24-11-10 12:03:43, Nick Piggin wrote:
> > > For the _nr variant that btrfs uses, it''s worse for the
filesystems
> > > that don''t have a 1:1 bdi<->sb mapping.  It might
not actually write any
> > > of the pages from the SB that is out of space.
> > 
> > That''s true, but it might not write anything anyway (and
after we
> > check whether writeout is in progress, the writeout thread could go
> > to sleep and not do anything anyway).
> > 
> > So it''s a pretty hacky interface anyway. If you want to do
anything
> > deterministic, you obviously need real coupling between producer and
> > consumer. This should only be a performance tweak (or a workaround
> > hack in worst case).
>   Yes, the current interface is a band aid for the problem and better
> interface is welcome. But it''s not trivial to do better...
> 
> > > > It makes no further guarantees, and anyway
> > > > the sb has to compete for normal writeback within this bdi.
> > > 
> > > > 
> > > > I think Christoph is right because filesystems should not
really
> > > > know about how bdi writeback queueing works. But I
don''t know if it''s
> > > > worth doing anything more complex for this functionality?
> > > 
> > > I think we should make a writeback_inodes_sb_unlocked() that
doesn''t
> > > warn when the semaphore isn''t held.  That should be
enough given where
> > > btrfs and ext4 are calling it from.
> > 
> > It doesn''t solve the bugs -- calling and waiting for
writeback is
> > still broken because completion requires i_mutex and it is called
> > from under i_mutex.
>   Well, as I wrote in my previous email, only ext4 has the problem with
> i_mutex and I personally view it as a bug. But ultimately it''s
Ted''s call
> to decide.
Well, for now, the easiest and simplest fix is my patch, I think. The
objection is that we may not write out anything for the specified sb,
but the current implementation provides no such guarantees at all
anyway, so I don''t think it''s a big issue.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-25 04:07 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Wed, Nov 24, 2010 at 02:51:57PM -0800, Andrew Morton
wrote:> On Wed, 24 Nov 2010 12:03:43 +1100
> Nick Piggin <npiggin@kernel.dk> wrote:
> 
> > On Tue, Nov 23, 2010 at 01:58:24PM -0500, Chris Mason wrote:
> > > > > My original btrfs patch just exported the bdi_ funcs so
that btrfs could
> > > > > do the above internally.  But Christoph objected, and I
think he''s
> > > > > right.  We should either give everyone a bdi or make
sure the writeback
> > > > > func kicks only one filesystem.
> > > > 
> > > > Well it''s just kicking the writeback thread, and it
will writeback
> > > > from that particular sb.
> > > 
> > > Hmmm?  It will writeback for all the SBs on that bdi.  In the
current
> > > form that ext4 uses, that gets pretty expensive if you have a
bunch of
> > > large partitions and you''re only running out of space on
one of them.
> > 
> > Right. But if the bdi has writeback in progress (which would be most
> > of the time, on a busy filesystem), writeback_if_idle doesn''t
do
> > anything, and it is happy just for the background writeback to
> > eventually get around to writing out for us.
> 
> That doesn''t work if you''re running btfs (apparently
short for
> "busticated filesystem") because the bdi-per-sb thing carefully
hid the
> information which you''re looking for.
> 
> 
> We still don''t have a fix for this bug yet, it appears, btw.
My last patch is a fix. It makes the writeback slightly less directed at
the sb (but there were no guarantees of that anyway). But this could be
improved with subsequent patches. We actually don''t need to refcount
the
sb in the work item submission, so long as we only compare sb pointers
and DTRT in the writeback thread if they match. So it can easily be
fixed, but for now both users (ext4 and btrfs) won''t care about that
detail.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Boaz Harrosh

2010-Nov-25 09:41 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On 11/25/2010 12:47 AM, Andrew Morton wrote:> On Tue, 23 Nov 2010 07:34:07 -0500
> Chris Mason <chris.mason@oracle.com> wrote:
> 
>> For btrfs there''s only one bdi per SB, but for most everyone
else a disk
>> with a bunch of partitions is going to have multiple filesystems on the
>> same bdi.
> 
> um, please explain why that wasn''t idiotic?  The BDI is a
> representation of a backing device and it''s *supposed* to provide
> visibility into what''s happening against other partitions on the
same
> device.  Creating a BDI per SB (it didn''t even occur to me to
think
> that a filesystem was even able to do this) breaks that.
> 
In btrfs an SB my span multiple partitions. How else can it be solved?

(I know... You didn''t like that from the begging ;-))

Boaz> --
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrew Morton

2010-Nov-25 20:30 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Thu, 25 Nov 2010 11:41:50 +0200 Boaz Harrosh <bharrosh@panasas.com>
wrote:
> On 11/25/2010 12:47 AM, Andrew Morton wrote:
> > On Tue, 23 Nov 2010 07:34:07 -0500
> > Chris Mason <chris.mason@oracle.com> wrote:
> > 
> >> For btrfs there''s only one bdi per SB, but for most
everyone else a disk
> >> with a bunch of partitions is going to have multiple filesystems
on the
> >> same bdi.
> > 
> > um, please explain why that wasn''t idiotic?  The BDI is a
> > representation of a backing device and it''s *supposed* to
provide
> > visibility into what''s happening against other partitions on
the same
> > device.  Creating a BDI per SB (it didn''t even occur to me to
think
> > that a filesystem was even able to do this) breaks that.
> > 
> 
> In btrfs an SB my span multiple partitions. How else can it be solved?
Associate a number of bdi''s with the superblock.  If necessary, convert
core kernel to operate on groups of BDI''s.  Which shouldn''t be
too hard
given that core kernel already does this, for MD.

ie: something which faithfully models what is actually going on, rather
than simply bending reality.
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Andrew Morton

2010-Nov-29 22:26 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Thu, 25 Nov 2010 14:53:56 +1100
Nick Piggin <npiggin@kernel.dk> wrote:
> On Wed, Nov 24, 2010 at 02:10:28PM +0100, Jan Kara wrote:
> > On Wed 24-11-10 12:03:43, Nick Piggin wrote:
> > > > For the _nr variant that btrfs uses, it''s worse for
the filesystems
> > > > that don''t have a 1:1 bdi<->sb mapping.  It
might not actually write any
> > > > of the pages from the SB that is out of space.
> > > 
> > > That''s true, but it might not write anything anyway (and
after we
> > > check whether writeout is in progress, the writeout thread could
go
> > > to sleep and not do anything anyway).
> > > 
> > > So it''s a pretty hacky interface anyway. If you want to
do anything
> > > deterministic, you obviously need real coupling between producer
and
> > > consumer. This should only be a performance tweak (or a
workaround
> > > hack in worst case).
> >   Yes, the current interface is a band aid for the problem and better
> > interface is welcome. But it''s not trivial to do better...
> > 
> > > > > It makes no further guarantees, and anyway
> > > > > the sb has to compete for normal writeback within this
bdi.
> > > > 
> > > > > 
> > > > > I think Christoph is right because filesystems should
not really
> > > > > know about how bdi writeback queueing works. But I
don''t know if it''s
> > > > > worth doing anything more complex for this
functionality?
> > > > 
> > > > I think we should make a writeback_inodes_sb_unlocked() that
doesn''t
> > > > warn when the semaphore isn''t held.  That should be
enough given where
> > > > btrfs and ext4 are calling it from.
> > > 
> > > It doesn''t solve the bugs -- calling and waiting for
writeback is
> > > still broken because completion requires i_mutex and it is called
> > > from under i_mutex.
> >   Well, as I wrote in my previous email, only ext4 has the problem
with
> > i_mutex and I personally view it as a bug. But ultimately
it''s Ted''s call
> > to decide.
> 
> Well, for now, the easiest and simplest fix is my patch, I think. The
> objection is that we may not write out anything for the specified sb,
> but the current implementation provides no such guarantees at all
> anyway, so I don''t think it''s a big issue.
Well yes.  We take something which will fail occasionally and with your
patch replace it with something which will fail a bit more often.  Why
don''t we go all the way and do something which will fail *even more
often*.  Namely, just delete the damn function in the hope that the
resulting failures will provoke the ext4 crew into doing something sane
this time?

Guys, this delalloc thing *sucks*.  And here we are just sticking new
bandaids on top of the old bandaids.  And the btrfs approach isn''t
exactly a thing of glory, either.

So...  nope.  I won''t be applying Nick''s patch.  Please fix
this thing
properly - you have a whole month!
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Nov-30 00:01 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Mon, Nov 29, 2010 at 02:26:03PM -0800, Andrew Morton
wrote:> 	On Thu, 25 Nov 2010 14:53:56 +1100
> Nick Piggin <npiggin@kernel.dk> wrote:
> > On Wed, Nov 24, 2010 at 02:10:28PM +0100, Jan Kara wrote:
> > Well, for now, the easiest and simplest fix is my patch, I think. The
> > objection is that we may not write out anything for the specified sb,
> > but the current implementation provides no such guarantees at all
> > anyway, so I don''t think it''s a big issue.
> 
> Well yes.  We take something which will fail occasionally and with your
> patch replace it with something which will fail a bit more often.  Why
> don''t we go all the way and do something which will fail *even
more
> often*.  Namely, just delete the damn function in the hope that the
> resulting failures will provoke the ext4 crew into doing something sane
> this time?
I just need it fixed because the deadlocks are constantly hanging my
tests and/or switching off lockdep.

 > Guys, this delalloc thing *sucks*.  And here we are just sticking new
> bandaids on top of the old bandaids.  And the btrfs approach isn''t
> exactly a thing of glory, either.
> 
> So...  nope.  I won''t be applying Nick''s patch.  Please
fix this thing
> properly - you have a whole month!
Testers have less. It would be better to fix it now and rip it out at
the start of the next merge window if you''re that way inclined :)

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Nov-30 00:50 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

Excerpts from Andrew Morton''s message of 2010-11-24 17:47:40
-0500:> On Tue, 23 Nov 2010 07:34:07 -0500
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > For btrfs there''s only one bdi per SB, but for most everyone
else a disk
> > with a bunch of partitions is going to have multiple filesystems on
the
> > same bdi.
> 
> um, please explain why that wasn''t idiotic?  The BDI is a
> representation of a backing device and it''s *supposed* to provide
> visibility into what''s happening against other partitions on the
same
> device.  Creating a BDI per SB (it didn''t even occur to me to
think
> that a filesystem was even able to do this) breaks that.
> 
We don''t really have visibility into the other partitions, we all just
pretend they aren''t there (this patch being the most recent example).
Yes, it does help prevent N flushers seeking around on the drive but it
does cause problems too.

How is the btrfs one-bdi-per-super different from device mapper''s one
bdi per logical volume?  We''re all idiots together I suppose.

As for having multiple bdis per FS, that was always a long term goal of
mine when Jens was setting up the new flushers.  I didn''t want to
confuse the initial implementation with it though.

-chris
--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Nick Piggin

2010-Dec-16 03:12 UTC

head link

Re: [patch] fs: fix deadlocks in writeback_if_idle

On Tue, Nov 30, 2010 at 11:01:18AM +1100, Nick Piggin
wrote:> On Mon, Nov 29, 2010 at 02:26:03PM -0800, Andrew Morton wrote:
> > 	On Thu, 25 Nov 2010 14:53:56 +1100
> > Nick Piggin <npiggin@kernel.dk> wrote:
> > > On Wed, Nov 24, 2010 at 02:10:28PM +0100, Jan Kara wrote:
> > > Well, for now, the easiest and simplest fix is my patch, I think.
The
> > > objection is that we may not write out anything for the specified
sb,
> > > but the current implementation provides no such guarantees at all
> > > anyway, so I don''t think it''s a big issue.
> > 
> > Well yes.  We take something which will fail occasionally and with
your
> > patch replace it with something which will fail a bit more often.  Why
> > don''t we go all the way and do something which will fail
*even more
> > often*.  Namely, just delete the damn function in the hope that the
> > resulting failures will provoke the ext4 crew into doing something
sane
> > this time?
> 
> I just need it fixed because the deadlocks are constantly hanging my
> tests and/or switching off lockdep.
> 
>  
> > Guys, this delalloc thing *sucks*.  And here we are just sticking new
> > bandaids on top of the old bandaids.  And the btrfs approach
isn''t
> > exactly a thing of glory, either.
> > 
> > So...  nope.  I won''t be applying Nick''s patch. 
Please fix this thing
> > properly - you have a whole month!
> 
> Testers have less. It would be better to fix it now and rip it out at
> the start of the next merge window if you''re that way inclined :)
Is this going to be fixed in time, or shall we merge my patch for
2.6.37?

--
To unsubscribe from this list: send the line "unsubscribe
linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Nov 2010 - [patch] fs: fix deadlocks in writeback_if_idle

[patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle

Re: [patch] fs: fix deadlocks in writeback_if_idle