Hello, guys.

I've been looking through btrfs code and found out that the locking
was quite interesting.  The distinction between blocking and
non-blocking locking is valid but is essentially attacking the same
problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in a generic manner.

It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it
better as it actually knows whether the lock owner is running or not,
so I did a few "dbench 50" runs with the complex locking removed.

The test machine is a dual core machine with 1 gig of memory.  The
filesystem is on an OCZ Vertex 64 gig SSD.  I'm attaching the kernel
config.  Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only
when lock debugging is disabled, but it will be enabled on virtually
any production configuration.

The machine was idle during the dbench runs.  CPU usages and context
switches are calculated from /proc/stat over the dbench runs; the
throughput is as reported by dbench.

             user    system  sirq    cxtsw    throughput
    before   14426   129332  345     1674004  171.7
    after    14274   129036  308     1183346  172.119

So, the extra code isn't really helping anything.  It's worse in every
way.  Are there some obscure reasons the custom locking should be kept
around?

One thing to note is that the patch makes btrfs_try_tree_lock and
btrfs_try_spin_lock identical.  It might be beneficial to apply the
owner spin logic to btrfs_try_spin_lock.  I'll test that one too and
see whether it makes any difference.

Thanks.
NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/Makefile    |    2
 fs/btrfs/ctree.c     |   16 +--
 fs/btrfs/extent_io.c |    3
 fs/btrfs/extent_io.h |   12 --
 fs/btrfs/locking.c   |  233 ---------------------------------------------------
 fs/btrfs/locking.h   |   43 +++++++--
 6 files changed, 48 insertions(+), 261 deletions(-)

Index: work/fs/btrfs/ctree.c
===================================================================
--- work.orig/fs/btrfs/ctree.c
+++ work/fs/btrfs/ctree.c
@@ -1074,7 +1074,7 @@ static noinline int balance_level(struct
 	left = read_node_slot(root, parent, pslot - 1);
 	if (left) {
-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left);
@@ -1085,7 +1085,7 @@ static noinline int balance_level(struct
 	}
 	right = read_node_slot(root, parent, pslot + 1);
 	if (right) {
-		btrfs_tree_lock(right);
+		btrfs_tree_lock_nested(right, 2);
 		btrfs_set_lock_blocking(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right);
@@ -1241,7 +1241,7 @@ static noinline int push_nodes_for_inser
 	if (left) {
 		u32 left_nr;

-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);

 		left_nr = btrfs_header_nritems(left);
@@ -1291,7 +1291,7 @@ static noinline int push_nodes_for_inser
 	if (right) {
 		u32 right_nr;

-		btrfs_tree_lock(right);
+		btrfs_tree_lock_nested(right, 1);
 		btrfs_set_lock_blocking(right);

 		right_nr = btrfs_header_nritems(right);
@@ -2519,7 +2519,7 @@ static int push_leaf_right(struct btrfs_
 	if (right == NULL)
 		return 1;

-	btrfs_tree_lock(right);
+	btrfs_tree_lock_nested(right, 1);
 	btrfs_set_lock_blocking(right);

 	free_space = btrfs_leaf_free_space(root, right);
@@ -2772,7 +2772,7 @@ static int push_leaf_left(struct btrfs_t
 	if (left == NULL)
 		return 1;

-	btrfs_tree_lock(left);
+	btrfs_tree_lock_nested(left, 1);
 	btrfs_set_lock_blocking(left);

 	free_space = btrfs_leaf_free_space(root, left);
@@ -4420,7 +4420,7 @@ again:
 		ret = btrfs_try_spin_lock(next);
 		if (!ret) {
 			btrfs_set_path_blocking(path);
-			btrfs_tree_lock(next);
+			btrfs_tree_lock_nested(next, 1);
 			if (!force_blocking)
 				btrfs_clear_path_blocking(path, next);
 		}
@@ -4460,7 +4460,7 @@ again:
 		ret = btrfs_try_spin_lock(next);
 		if (!ret) {
 			btrfs_set_path_blocking(path);
-			btrfs_tree_lock(next);
+			btrfs_tree_lock_nested(next, 1);
 			if (!force_blocking)
 				btrfs_clear_path_blocking(path, next);
 		}
Index: work/fs/btrfs/extent_io.c
===================================================================
--- work.orig/fs/btrfs/extent_io.c
+++ work/fs/btrfs/extent_io.c
@@ -3171,8 +3171,7 @@ static struct extent_buffer *__alloc_ext
 		return NULL;
 	eb->start = start;
 	eb->len = len;
-	spin_lock_init(&eb->lock);
-	init_waitqueue_head(&eb->lock_wq);
+	mutex_init(&eb->lock);
 	INIT_RCU_HEAD(&eb->rcu_head);

 #if LEAK_DEBUG
Index: work/fs/btrfs/extent_io.h
===================================================================
--- work.orig/fs/btrfs/extent_io.h
+++ work/fs/btrfs/extent_io.h
@@ -2,6 +2,7 @@
 #define __EXTENTIO__

 #include <linux/rbtree.h>
+#include <linux/mutex.h>

 /* bits for the extent state */
 #define EXTENT_DIRTY 1
@@ -29,7 +30,6 @@

 /* these are bit numbers for test/set bit */
 #define EXTENT_BUFFER_UPTODATE 0
-#define EXTENT_BUFFER_BLOCKING 1
 #define EXTENT_BUFFER_DIRTY 2

 /* these are flags for extent_clear_unlock_delalloc */
@@ -129,14 +129,8 @@ struct extent_buffer {
 	struct list_head leak_list;
 	struct rcu_head rcu_head;

-	/* the spinlock is used to protect most operations */
-	spinlock_t lock;
-
-	/*
-	 * when we keep the lock held while blocking, waiters go onto
-	 * the wq
-	 */
-	wait_queue_head_t lock_wq;
+	/* used to protect most operations */
+	struct mutex lock;
 };

 static inline void extent_set_compress_type(unsigned long *bio_flags,
Index: work/fs/btrfs/locking.c
===================================================================
--- work.orig/fs/btrfs/locking.c
+++ /dev/null
@@ -1,233 +0,0 @@
-/*
- * Copyright (C) 2008 Oracle.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
- */
-#include <linux/sched.h>
-#include <linux/pagemap.h>
-#include <linux/spinlock.h>
-#include <linux/page-flags.h>
-#include <asm/bug.h>
-#include "ctree.h"
-#include "extent_io.h"
-#include "locking.h"
-
-static inline void spin_nested(struct extent_buffer *eb)
-{
-	spin_lock(&eb->lock);
-}
-
-/*
- * Setting a lock to blocking will drop the spinlock and set the
- * flag that forces other procs who want the lock to wait.  After
- * this you can safely schedule with the lock held.
- */
-void btrfs_set_lock_blocking(struct extent_buffer *eb)
-{
-	if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-		set_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags);
-		spin_unlock(&eb->lock);
-	}
-	/* exit with the spin lock released and the bit set */
-}
-
-/*
- * clearing the blocking flag will take the spinlock again.
- * After this you can't safely schedule
- */
-void btrfs_clear_lock_blocking(struct extent_buffer *eb)
-{
-	if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-		spin_nested(eb);
-		clear_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags);
-		smp_mb__after_clear_bit();
-	}
-	/* exit with the spin lock held */
-}
-
-/*
- * unfortunately, many of the places that currently set a lock to blocking
- * don't end up blocking for very long, and often they don't block
- * at all.  For a dbench 50 run, if we don't spin on the blocking bit
- * at all, the context switch rate can jump up to 400,000/sec or more.
- *
- * So, we're still stuck with this crummy spin on the blocking bit,
- * at least until the most common causes of the short blocks
- * can be dealt with.
- */
-static int btrfs_spin_on_block(struct extent_buffer *eb)
-{
-	int i;
-
-	for (i = 0; i < 512; i++) {
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		if (need_resched())
-			break;
-		cpu_relax();
-	}
-	return 0;
-}
-
-/*
- * This is somewhat different from trylock.  It will take the
- * spinlock but if it finds the lock is set to blocking, it will
- * return without the lock held.
- *
- * returns 1 if it was able to take the lock and zero otherwise
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_try_spin_lock(struct extent_buffer *eb)
-{
-	int i;
-
-	if (btrfs_spin_on_block(eb)) {
-		spin_nested(eb);
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		spin_unlock(&eb->lock);
-	}
-	/* spin for a bit on the BLOCKING flag */
-	for (i = 0; i < 2; i++) {
-		cpu_relax();
-		if (!btrfs_spin_on_block(eb))
-			break;
-
-		spin_nested(eb);
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		spin_unlock(&eb->lock);
-	}
-	return 0;
-}
-
-/*
- * the autoremove wake function will return 0 if it tried to wake up
- * a process that was already awake, which means that process won't
- * count as an exclusive wakeup.  The waitq code will continue waking
- * procs until it finds one that was actually sleeping.
- *
- * For btrfs, this isn't quite what we want.  We want a single proc
- * to be notified that the lock is ready for taking.  If that proc
- * already happens to be awake, great, it will loop around and try for
- * the lock.
- *
- * So, btrfs_wake_function always returns 1, even when the proc that we
- * tried to wake up was already awake.
- */
-static int btrfs_wake_function(wait_queue_t *wait, unsigned mode,
-			       int sync, void *key)
-{
-	autoremove_wake_function(wait, mode, sync, key);
-	return 1;
-}
-
-/*
- * returns with the extent buffer spinlocked.
- *
- * This will spin and/or wait as required to take the lock, and then
- * return with the spinlock held.
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_tree_lock(struct extent_buffer *eb)
-{
-	DEFINE_WAIT(wait);
-	wait.func = btrfs_wake_function;
-
-	if (!btrfs_spin_on_block(eb))
-		goto sleep;
-
-	while(1) {
-		spin_nested(eb);
-
-		/* nobody is blocking, exit with the spinlock held */
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 0;
-
-		/*
-		 * we have the spinlock, but the real owner is blocking.
-		 * wait for them
-		 */
-		spin_unlock(&eb->lock);
-
-		/*
-		 * spin for a bit, and if the blocking flag goes away,
-		 * loop around
-		 */
-		cpu_relax();
-		if (btrfs_spin_on_block(eb))
-			continue;
-sleep:
-		prepare_to_wait_exclusive(&eb->lock_wq, &wait,
-					  TASK_UNINTERRUPTIBLE);
-
-		if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			schedule();
-
-		finish_wait(&eb->lock_wq, &wait);
-	}
-	return 0;
-}
-
-/*
- * Very quick trylock, this does not spin or schedule.  It returns
- * 1 with the spinlock held if it was able to take the lock, or it
- * returns zero if it was unable to take the lock.
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_try_tree_lock(struct extent_buffer *eb)
-{
-	if (spin_trylock(&eb->lock)) {
-		if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-			/*
-			 * we've got the spinlock, but the real owner is
-			 * blocking.  Drop the spinlock and return failure
-			 */
-			spin_unlock(&eb->lock);
-			return 0;
-		}
-		return 1;
-	}
-	/* someone else has the spinlock, give up */
-	return 0;
-}
-
-int btrfs_tree_unlock(struct extent_buffer *eb)
-{
-	/*
-	 * if we were a blocking owner, we don't have the spinlock held
-	 * just clear the bit and look for waiters
-	 */
-	if (test_and_clear_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-		smp_mb__after_clear_bit();
-	else
-		spin_unlock(&eb->lock);
-
-	if (waitqueue_active(&eb->lock_wq))
-		wake_up(&eb->lock_wq);
-	return 0;
-}
-
-void btrfs_assert_tree_locked(struct extent_buffer *eb)
-{
-	if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-		assert_spin_locked(&eb->lock);
-}
Index: work/fs/btrfs/locking.h
===================================================================
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -19,13 +19,40 @@
 #ifndef __BTRFS_LOCKING_
 #define __BTRFS_LOCKING_

-int btrfs_tree_lock(struct extent_buffer *eb);
-int btrfs_tree_unlock(struct extent_buffer *eb);
+#include <linux/bug.h>

-int btrfs_try_tree_lock(struct extent_buffer *eb);
-int btrfs_try_spin_lock(struct extent_buffer *eb);
+static inline bool btrfs_try_spin_lock(struct extent_buffer *eb)
+{
+	return mutex_trylock(&eb->lock);
+}

-void btrfs_set_lock_blocking(struct extent_buffer *eb);
-void btrfs_clear_lock_blocking(struct extent_buffer *eb);
-void btrfs_assert_tree_locked(struct extent_buffer *eb);
-#endif
+static inline void btrfs_tree_lock(struct extent_buffer *eb)
+{
+	mutex_lock(&eb->lock);
+}
+
+static inline void btrfs_tree_lock_nested(struct extent_buffer *eb,
+					  unsigned int subclass)
+{
+	mutex_lock_nested(&eb->lock, subclass);
+}
+
+static inline void btrfs_tree_unlock(struct extent_buffer *eb)
+{
+	mutex_unlock(&eb->lock);
+}
+
+static inline int btrfs_try_tree_lock(struct extent_buffer *eb)
+{
+	return mutex_trylock(&eb->lock);
+}
+
+static inline void btrfs_set_lock_blocking(struct extent_buffer *eb) { }
+static inline void btrfs_clear_lock_blocking(struct extent_buffer *eb) { }
+
+static inline void btrfs_assert_tree_locked(struct extent_buffer *eb)
+{
+	BUG_ON(!mutex_is_locked(&eb->lock));
+}
+
+#endif	/* __BTRFS_LOCKING_ */
Index: work/fs/btrfs/Makefile
===================================================================
--- work.orig/fs/btrfs/Makefile
+++ work/fs/btrfs/Makefile
@@ -5,6 +5,6 @@ btrfs-y += super.o ctree.o extent-tree.o
 	   file-item.o inode-item.o inode-map.o disk-io.o \
 	   transaction.o inode.o file.o tree-defrag.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
-	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
+	   extent_io.o volumes.o async-thread.o ioctl.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
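[For readers skimming the patch: the core trick in the removed
locking.c is to spin a bounded number of times on the "blocking" bit
before falling back to sleeping.  A rough userspace analogue of that
bounded spin, using C11 atomics purely as an illustration (this is not
kernel code, and the names are my own):]

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LOOPS 512

/* Analogue of btrfs_spin_on_block(): poll a "blocking" flag a bounded
 * number of times before the caller falls back to sleeping on a wait
 * queue.  Returns true if the flag cleared while we were spinning,
 * i.e. the would-be sleep was avoided. */
static bool spin_on_flag(atomic_bool *blocking)
{
	for (int i = 0; i < SPIN_LOOPS; i++) {
		if (!atomic_load_explicit(blocking, memory_order_acquire))
			return true;
		/* the kernel version also bails out early on
		 * need_resched() and executes cpu_relax() here */
	}
	return false;
}
```

The point of the bound is exactly what the deleted comment says: most
holders release quickly, so a short spin avoids the sleep/wakeup
context-switch pair that otherwise dominates under dbench.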
So, here's the patch to implement and use mutex_tryspin(), which
applies the same owner spin logic to try locking.  The result looks
pretty good.

I re-ran all three.  DFL is the current custom locking.  SIMPLE is
with only the previous patch applied.  SPIN is with both the previous
patch and this one applied.

            USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
    DFL     14484   129368  390     1669102  171.955
    SIMPLE  14483   128902  318     1187031  171.512
    SPIN    14311   129222  347     1198166  174.904

DFL/SIMPLE results are more or less consistent with the previous run.
SPIN seems to consume a bit more CPU than SIMPLE but shows discernibly
better throughput.  I'm running SPIN again just in case but the result
seems pretty consistent.

Thanks.

NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/locking.h    |    2 -
 include/linux/mutex.h |    1
 kernel/mutex.c        |   58 ++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 44 insertions(+), 17 deletions(-)

Index: work/fs/btrfs/locking.h
===================================================================
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -23,7 +23,7 @@

 static inline bool btrfs_try_spin_lock(struct extent_buffer *eb)
 {
-	return mutex_trylock(&eb->lock);
+	return mutex_tryspin(&eb->lock);
 }

 static inline void btrfs_tree_lock(struct extent_buffer *eb)
Index: work/include/linux/mutex.h
===================================================================
--- work.orig/include/linux/mutex.h
+++ work/include/linux/mutex.h
@@ -157,6 +157,7 @@ extern int __must_check mutex_lock_killa
  * Returns 1 if the mutex has been acquired successfully, and 0 on contention.
  */
 extern int mutex_trylock(struct mutex *lock);
+extern int mutex_tryspin(struct mutex *lock);
 extern void mutex_unlock(struct mutex *lock);
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
Index: work/kernel/mutex.c
===================================================================
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,20 +126,8 @@ void __sched mutex_unlock(struct mutex *
 EXPORT_SYMBOL(mutex_unlock);

-/*
- * Lock a mutex (possibly interruptible), slowpath:
- */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-		    unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
 	 * Optimistic spinning.
@@ -158,7 +146,6 @@ __mutex_lock_common(struct mutex *lock,
 	 * We can't do this for DEBUG_MUTEXES because that relies on wait_lock
 	 * to serialize everything.
 	 */
-
 	for (;;) {
 		struct thread_info *owner;

@@ -181,7 +168,7 @@ __mutex_lock_common(struct mutex *lock,
 			lock_acquired(&lock->dep_map, ip);
 			mutex_set_owner(lock);
 			preempt_enable();
-			return 0;
+			return true;
 		}

 		/*
@@ -190,7 +177,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we're an RT task that will live-lock because we won't let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;

 		/*
@@ -202,6 +189,26 @@ __mutex_lock_common(struct mutex *lock,
 		cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+		    unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock))
+		return 0;
+
 	spin_lock_mutex(&lock->wait_lock, flags);

 	debug_mutex_lock_common(lock, &waiter);
@@ -473,6 +480,25 @@ int __sched mutex_trylock(struct mutex *
 }
 EXPORT_SYMBOL(mutex_trylock);

+static inline int __mutex_tryspin_slowpath(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	return __mutex_trylock_slowpath(lock_count) || mutex_spin(lock);
+}
+
+int __sched mutex_tryspin(struct mutex *lock)
+{
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_tryspin_slowpath);
+	if (ret)
+		mutex_set_owner(lock);
+
+	return ret;
+}
+EXPORT_SYMBOL(mutex_tryspin);
+
 /**
  * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
  * @cnt: the atomic which we are to dec
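[To make the "owner spin on trylock" idea concrete outside the kernel:
the sketch below is a rough pthreads analogue of mutex_tryspin(),
written by me as an illustration only.  On contention it retries the
trylock a bounded number of times instead of failing immediately.  The
real patch is smarter: it keeps spinning only while the lock owner is
actually running on a CPU (CONFIG_MUTEX_SPIN_ON_OWNER), which has no
direct pthreads equivalent.]

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

#define TRYSPIN_LOOPS 512

/* Userspace analogue of mutex_tryspin(): try the lock, and on
 * contention spin a bounded number of times retrying before giving
 * up, rather than either failing at once (trylock) or sleeping
 * (lock). */
static bool tryspin_like(pthread_mutex_t *lock)
{
	for (int i = 0; i < TRYSPIN_LOOPS; i++) {
		if (pthread_mutex_trylock(lock) == 0)
			return true;
		sched_yield();	/* stand-in for cpu_relax() */
	}
	return false;
}
```

The payoff shown in the numbers above is that short hold times are
absorbed by the spin instead of turning into a sleep/wakeup pair,
which is where the context-switch reduction comes from.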
On Sun, Mar 20, 2011 at 08:56:52PM +0100, Tejun Heo wrote:
> So, here's the patch to implement and use mutex_tryspin(), which
> applies the same owner spin logic to try locking.  The result looks
> pretty good.
>
> I re-ran all three.  DFL is the current custom locking.  SIMPLE is
> with only the previous patch applied.  SPIN is with both the previous
> patch and this one applied.
>
>             USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
>     DFL     14484   129368  390     1669102  171.955
>     SIMPLE  14483   128902  318     1187031  171.512
>     SPIN    14311   129222  347     1198166  174.904
>
> DFL/SIMPLE results are more or less consistent with the previous run.
> SPIN seems to consume a bit more CPU than SIMPLE but shows discernibly
> better throughput.  I'm running SPIN again just in case but the result
> seems pretty consistent.

Here's another run.

    SPIN    14377   129092  335     1189817  172.724

It's not as high as the last run, but given the other runs I've been
seeing, I think a meaningful (ie. not measurement error) throughput
advantage exists.  That said, the difference definitely seems minor.

Thanks.

--
tejun
Excerpts from Tejun Heo's message of 2011-03-20 13:44:33 -0400:
> Hello, guys.
>
> I've been looking through btrfs code and found out that the locking
> was quite interesting.  The distinction between blocking and
> non-blocking locking is valid but is essentially attacking the same
> problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in a generic manner.
>
> It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it
> better as it actually knows whether the lock owner is running or not,
> so I did a few "dbench 50" runs with the complex locking removed.
>
> The test machine is a dual core machine with 1 gig of memory.  The
> filesystem is on an OCZ Vertex 64 gig SSD.  I'm attaching the kernel
> config.  Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only
> when lock debugging is disabled, but it will be enabled on virtually
> any production configuration.
>
> The machine was idle during the dbench runs.  CPU usages and context
> switches are calculated from /proc/stat over the dbench runs; the
> throughput is as reported by dbench.
>
>              user    system  sirq    cxtsw    throughput
>     before   14426   129332  345     1674004  171.7
>     after    14274   129036  308     1183346  172.119
>
> So, the extra code isn't really helping anything.  It's worse in every
> way.  Are there some obscure reasons the custom locking should be kept
> around?

Hi Tejun,

I went through a number of benchmarks with the explicit
blocking/spinning code, and back then it was still significantly
faster than the adaptive spin.  But it is definitely worth doing these
again; how many dbench procs did you use?

The biggest benefit to explicit spinning is that mutex_lock starts
with might_sleep(), so we skip the cond_resched().  Do you have
voluntary preempt on?

The long term goal was always to get rid of the blocking locks.  I had
a lot of code to track the top blockers and we had gotten rid of about
90% of them.

Thanks for looking at this

-chris
Hello, Chris.

On Sun, Mar 20, 2011 at 08:10:51PM -0400, Chris Mason wrote:
> I went through a number of benchmarks with the explicit
> blocking/spinning code, and back then it was still significantly
> faster than the adaptive spin.  But it is definitely worth doing
> these again; how many dbench procs did you use?

It was dbench 50.

> The biggest benefit to explicit spinning is that mutex_lock starts
> with might_sleep(), so we skip the cond_resched().  Do you have
> voluntary preempt on?

Ah, right, I of course forgot to actually attach the .config.  I had
CONFIG_PREEMPT, not CONFIG_PREEMPT_VOLUNTARY.  I'll re-run with
VOLUNTARY and see how its behavior changes.

Thanks.

--
tejun
Hello,

Here are the results with voluntary preemption.  I've moved to a
beefier machine for testing.  It's a dual Opteron 2347, so dual
socket, eight cores.  The memory is limited to 1GiB to force IOs and
the disk is the same OCZ Vertex 60 gig SSD.  /proc/stat is captured
before and after "dbench 50".

I ran the following four setups.

    DFL     The current custom locking implementation.
    SIMPLE  Simple mutex conversion.  The first patch in this thread.
    SPIN    SIMPLE + mutex_tryspin().  The second patch in this thread.
    SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock().  Patch below.

SPIN2 should alleviate the voluntary preemption triggered by
might_sleep() in mutex_lock().

            USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
    DFL     49427   458210  1433    7683488  642.947
    SIMPLE  52836   471398  1427    3055384  705.369
    SPIN    52267   473115  1467    3005603  705.369
    SPIN2   52063   470453  1446    3092091  701.826

I'm running DFL again just in case, but SIMPLE or SPIN seems to be a
much better choice.

Thanks.

NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/locking.h |    2 ++
 1 file changed, 2 insertions(+)

Index: work/fs/btrfs/locking.h
===================================================================
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s

 static inline void btrfs_tree_lock(struct extent_buffer *eb)
 {
+	if (mutex_tryspin(&eb->lock))
+		return;
 	mutex_lock(&eb->lock);
 }
On Mon, Mar 21, 2011 at 05:59:55PM +0100, Tejun Heo wrote:
> I'm running DFL again just in case, but SIMPLE or SPIN seems to be a
> much better choice.

Got 644.176 MB/sec, so yeah, the custom locking is definitely worse
than just using a mutex.

Thanks.

--
tejun
Excerpts from Tejun Heo's message of 2011-03-21 12:59:55 -0400:
> Hello,
>
> Here are the results with voluntary preemption.  I've moved to a
> beefier machine for testing.  It's a dual Opteron 2347, so dual
> socket, eight cores.  The memory is limited to 1GiB to force IOs and
> the disk is the same OCZ Vertex 60 gig SSD.  /proc/stat is captured
> before and after "dbench 50".
>
> I ran the following four setups.
>
>     DFL     The current custom locking implementation.
>     SIMPLE  Simple mutex conversion.  The first patch in this thread.
>     SPIN    SIMPLE + mutex_tryspin().  The second patch in this thread.
>     SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock().  Patch below.
>
> SPIN2 should alleviate the voluntary preemption triggered by
> might_sleep() in mutex_lock().
>
>             USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
>     DFL     49427   458210  1433    7683488  642.947
>     SIMPLE  52836   471398  1427    3055384  705.369
>     SPIN    52267   473115  1467    3005603  705.369
>     SPIN2   52063   470453  1446    3092091  701.826
>
> I'm running DFL again just in case, but SIMPLE or SPIN seems to be a
> much better choice.

Very interesting.  Ok, I'll definitely rerun my benchmarks as well.  I
used dbench extensively during the initial tuning, but you're forcing
the memory low in order to force IO.

This case doesn't really hammer on the locks, it hammers on the
transition from spinning to blocking.  We also want to compare dbench
entirely in RAM, which will hammer on the spinning portion.

-chris

> Thanks.
>
> NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  fs/btrfs/locking.h |    2 ++
>  1 file changed, 2 insertions(+)
>
> Index: work/fs/btrfs/locking.h
> ===================================================================
> --- work.orig/fs/btrfs/locking.h
> +++ work/fs/btrfs/locking.h
> @@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s
>
>  static inline void btrfs_tree_lock(struct extent_buffer *eb)
>  {
> +	if (mutex_tryspin(&eb->lock))
> +		return;
>  	mutex_lock(&eb->lock);
>  }
Hello,

On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote:
> Very interesting.  Ok, I'll definitely rerun my benchmarks as well.  I
> used dbench extensively during the initial tuning, but you're forcing
> the memory low in order to force IO.
>
> This case doesn't really hammer on the locks, it hammers on the
> transition from spinning to blocking.  We also want to compare dbench
> entirely in RAM, which will hammer on the spinning portion.

Here's a re-run of DFL and SIMPLE with the memory restriction lifted.
Memory is 4GiB and the disk remains mostly idle with all CPUs running
full.

            USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
    DFL     59898   504517  377     6814245  782.295
    SIMPLE  61090   493441  457     1631688  827.751

So, about the same picture.

Thanks.

--
tejun
Excerpts from Tejun Heo's message of 2011-03-21 14:11:24 -0400:
> Hello,
>
> On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote:
> > Very interesting.  Ok, I'll definitely rerun my benchmarks as well.  I
> > used dbench extensively during the initial tuning, but you're forcing
> > the memory low in order to force IO.
> >
> > This case doesn't really hammer on the locks, it hammers on the
> > transition from spinning to blocking.  We also want to compare dbench
> > entirely in RAM, which will hammer on the spinning portion.
>
> Here's a re-run of DFL and SIMPLE with the memory restriction lifted.
> Memory is 4GiB and the disk remains mostly idle with all CPUs running
> full.
>
>             USER    SYSTEM  SIRQ    CXTSW    THROUGHPUT
>     DFL     59898   504517  377     6814245  782.295
>     SIMPLE  61090   493441  457     1631688  827.751
>
> So, about the same picture.

Ok, the impact of this is really interesting.  If we have very short
waits where there is no IO at all, this patch tends to lose.  I ran
with dbench 10 and got about 20% lower throughput.

But if we do any IO at all, it wins by at least that much or more.  I
think we should take this patch and just work on getting rid of the
scheduling with the mutex held where possible.

Tejun, could you please send the mutex_tryspin stuff in?  If we can
get a Signed-off-by for that, I can send the whole thing.

-chris
Hello, Chris.

On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
> Ok, the impact of this is really interesting.  If we have very short
> waits where there is no IO at all, this patch tends to lose.  I ran
> with dbench 10 and got about 20% lower throughput.
>
> But if we do any IO at all, it wins by at least that much or more.  I
> think we should take this patch and just work on getting rid of the
> scheduling with the mutex held where possible.

I see.

> Tejun, could you please send the mutex_tryspin stuff in?  If we can
> get a Signed-off-by for that, I can send the whole thing.

I'm not sure whether mutex_tryspin() is justified at this point, and,
even if so, how to proceed with it.  Maybe we want to make
mutex_trylock() perform owner spin by default without introducing a
new API.

Given that the difference between SIMPLE and SPIN is small, I think it
would be best to simply use mutex_trylock() for now.  It's not gonna
make much difference either way.

How do you want to proceed?  I can prep patches doing the following.

- Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.

- Drop locking.c and make the lock functions simple wrappers around
  mutex operations.  This makes blocking/unblocking noops.

- Remove all blocking/unblocking calls along with the API.

- Remove the locking wrappers and use the mutex API directly.

What do you think?

Thanks.

--
tejun
Excerpts from Tejun Heo's message of 2011-03-23 06:46:14 -0400:
> Hello, Chris.
>
> On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
> > Ok, the impact of this is really interesting.  If we have very short
> > waits where there is no IO at all, this patch tends to lose.  I ran
> > with dbench 10 and got about 20% lower throughput.
> >
> > But if we do any IO at all, it wins by at least that much or more.  I
> > think we should take this patch and just work on getting rid of the
> > scheduling with the mutex held where possible.
>
> I see.
>
> > Tejun, could you please send the mutex_tryspin stuff in?  If we can
> > get a Signed-off-by for that, I can send the whole thing.
>
> I'm not sure whether mutex_tryspin() is justified at this point, and,
> even if so, how to proceed with it.  Maybe we want to make
> mutex_trylock() perform owner spin by default without introducing a
> new API.

I'll benchmark without it, but I think the cond_resched is going to
have a pretty big impact.  I'm digging up the related benchmarks I
used during the initial adaptive spin work.

> Given that the difference between SIMPLE and SPIN is small, I think it
> would be best to simply use mutex_trylock() for now.  It's not gonna
> make much difference either way.

mutex_trylock is a good start.

> How do you want to proceed?  I can prep patches doing the following.
>
> - Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.
>
> - Drop locking.c and make the lock functions simple wrappers around
>   mutex operations.  This makes blocking/unblocking noops.
>
> - Remove all blocking/unblocking calls along with the API.

I'd like to keep the blocking/unblocking calls for one release.  I'd
like to finally finish off my patches that do concurrent reads.

> - Remove the locking wrappers and use the mutex API directly.

I'd also like to keep the wrappers until the concurrent reader locking
is done.

> What do you think?

Thanks for all the work.

-chris