Hello, guys.
I''ve been looking through btrfs code and found out that the locking
was quite interesting.  The distinction between blocking and
non-blocking locking is valid but is essentially attacing the same
problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in generic manner.
It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it
better as it actually knows whether the lock owner is running or not,
so I did a few "dbench 50" runs with the complex locking removed.
The test machine is a dual core machine with 1 gig of memory.  The
filesystem is on OCZ vertex 64 gig SSD.  I''m attaching the kernel
config.  Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only
when lock debugging is disabled, but it will be enabled on virtually
any production configuration.
The machine was idle during the dbench runs and CPU usages and context
switches are calculated from /proc/stat over the dbench runs.  The
throughput is as reported by dbench.
         user  system  sirq    cxtsw throughput
before  14426  129332   345  1674004    171.7
after   14274  129036   308  1183346    172.119
So, the extra code isn''t really helping anything.  It''s worse
in every
way.  Are there some obscure reasons the custom locking should be kept
around?
One thing to note is that the patch makes try_tree_lock and
try_spin_lock identical.  It might be benefical to apply owner spin
logic to try_spin_lock.  I''ll test that one too and see whether it
makes any difference.
Thanks.
NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/Makefile    |    2 
 fs/btrfs/ctree.c     |   16 +--
 fs/btrfs/extent_io.c |    3 
 fs/btrfs/extent_io.h |   12 --
 fs/btrfs/locking.c   |  233 ---------------------------------------------------
 fs/btrfs/locking.h   |   43 +++++++--
 6 files changed, 48 insertions(+), 261 deletions(-)
Index: work/fs/btrfs/ctree.c
==================================================================---
work.orig/fs/btrfs/ctree.c
+++ work/fs/btrfs/ctree.c
@@ -1074,7 +1074,7 @@ static noinline int balance_level(struct
 
 	left = read_node_slot(root, parent, pslot - 1);
 	if (left) {
-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left);
@@ -1085,7 +1085,7 @@ static noinline int balance_level(struct
 	}
 	right = read_node_slot(root, parent, pslot + 1);
 	if (right) {
-		btrfs_tree_lock(right);
+		btrfs_tree_lock_nested(right, 2);
 		btrfs_set_lock_blocking(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right);
@@ -1241,7 +1241,7 @@ static noinline int push_nodes_for_inser
 	if (left) {
 		u32 left_nr;
 
-		btrfs_tree_lock(left);
+		btrfs_tree_lock_nested(left, 1);
 		btrfs_set_lock_blocking(left);
 
 		left_nr = btrfs_header_nritems(left);
@@ -1291,7 +1291,7 @@ static noinline int push_nodes_for_inser
 	if (right) {
 		u32 right_nr;
 
-		btrfs_tree_lock(right);
+		btrfs_tree_lock_nested(right, 1);
 		btrfs_set_lock_blocking(right);
 
 		right_nr = btrfs_header_nritems(right);
@@ -2519,7 +2519,7 @@ static int push_leaf_right(struct btrfs_
 	if (right == NULL)
 		return 1;
 
-	btrfs_tree_lock(right);
+	btrfs_tree_lock_nested(right, 1);
 	btrfs_set_lock_blocking(right);
 
 	free_space = btrfs_leaf_free_space(root, right);
@@ -2772,7 +2772,7 @@ static int push_leaf_left(struct btrfs_t
 	if (left == NULL)
 		return 1;
 
-	btrfs_tree_lock(left);
+	btrfs_tree_lock_nested(left, 1);
 	btrfs_set_lock_blocking(left);
 
 	free_space = btrfs_leaf_free_space(root, left);
@@ -4420,7 +4420,7 @@ again:
 			ret = btrfs_try_spin_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_lock(next);
+				btrfs_tree_lock_nested(next, 1);
 				if (!force_blocking)
 					btrfs_clear_path_blocking(path, next);
 			}
@@ -4460,7 +4460,7 @@ again:
 			ret = btrfs_try_spin_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_lock(next);
+				btrfs_tree_lock_nested(next, 1);
 				if (!force_blocking)
 					btrfs_clear_path_blocking(path, next);
 			}
Index: work/fs/btrfs/extent_io.c
==================================================================---
work.orig/fs/btrfs/extent_io.c
+++ work/fs/btrfs/extent_io.c
@@ -3171,8 +3171,7 @@ static struct extent_buffer *__alloc_ext
 		return NULL;
 	eb->start = start;
 	eb->len = len;
-	spin_lock_init(&eb->lock);
-	init_waitqueue_head(&eb->lock_wq);
+	mutex_init(&eb->lock);
 	INIT_RCU_HEAD(&eb->rcu_head);
 
 #if LEAK_DEBUG
Index: work/fs/btrfs/extent_io.h
==================================================================---
work.orig/fs/btrfs/extent_io.h
+++ work/fs/btrfs/extent_io.h
@@ -2,6 +2,7 @@
 #define __EXTENTIO__
 
 #include <linux/rbtree.h>
+#include <linux/mutex.h>
 
 /* bits for the extent state */
 #define EXTENT_DIRTY 1
@@ -29,7 +30,6 @@
 
 /* these are bit numbers for test/set bit */
 #define EXTENT_BUFFER_UPTODATE 0
-#define EXTENT_BUFFER_BLOCKING 1
 #define EXTENT_BUFFER_DIRTY 2
 
 /* these are flags for extent_clear_unlock_delalloc */
@@ -129,14 +129,8 @@ struct extent_buffer {
 	struct list_head leak_list;
 	struct rcu_head rcu_head;
 
-	/* the spinlock is used to protect most operations */
-	spinlock_t lock;
-
-	/*
-	 * when we keep the lock held while blocking, waiters go onto
-	 * the wq
-	 */
-	wait_queue_head_t lock_wq;
+	/* used to protect most operations */
+	struct mutex lock;
 };
 
 static inline void extent_set_compress_type(unsigned long *bio_flags,
Index: work/fs/btrfs/locking.c
==================================================================---
work.orig/fs/btrfs/locking.c
+++ /dev/null
@@ -1,233 +0,0 @@
-/*
- * Copyright (C) 2008 Oracle.  All rights reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public
- * License v2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public
- * License along with this program; if not, write to the
- * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
- * Boston, MA 021110-1307, USA.
- */
-#include <linux/sched.h>
-#include <linux/pagemap.h>
-#include <linux/spinlock.h>
-#include <linux/page-flags.h>
-#include <asm/bug.h>
-#include "ctree.h"
-#include "extent_io.h"
-#include "locking.h"
-
-static inline void spin_nested(struct extent_buffer *eb)
-{
-	spin_lock(&eb->lock);
-}
-
-/*
- * Setting a lock to blocking will drop the spinlock and set the
- * flag that forces other procs who want the lock to wait.  After
- * this you can safely schedule with the lock held.
- */
-void btrfs_set_lock_blocking(struct extent_buffer *eb)
-{
-	if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-		set_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags);
-		spin_unlock(&eb->lock);
-	}
-	/* exit with the spin lock released and the bit set */
-}
-
-/*
- * clearing the blocking flag will take the spinlock again.
- * After this you can''t safely schedule
- */
-void btrfs_clear_lock_blocking(struct extent_buffer *eb)
-{
-	if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-		spin_nested(eb);
-		clear_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags);
-		smp_mb__after_clear_bit();
-	}
-	/* exit with the spin lock held */
-}
-
-/*
- * unfortunately, many of the places that currently set a lock to blocking
- * don''t end up blocking for very long, and often they don''t
block
- * at all.  For a dbench 50 run, if we don''t spin on the blocking bit
- * at all, the context switch rate can jump up to 400,000/sec or more.
- *
- * So, we''re still stuck with this crummy spin on the blocking bit,
- * at least until the most common causes of the short blocks
- * can be dealt with.
- */
-static int btrfs_spin_on_block(struct extent_buffer *eb)
-{
-	int i;
-
-	for (i = 0; i < 512; i++) {
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		if (need_resched())
-			break;
-		cpu_relax();
-	}
-	return 0;
-}
-
-/*
- * This is somewhat different from trylock.  It will take the
- * spinlock but if it finds the lock is set to blocking, it will
- * return without the lock held.
- *
- * returns 1 if it was able to take the lock and zero otherwise
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_try_spin_lock(struct extent_buffer *eb)
-{
-	int i;
-
-	if (btrfs_spin_on_block(eb)) {
-		spin_nested(eb);
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		spin_unlock(&eb->lock);
-	}
-	/* spin for a bit on the BLOCKING flag */
-	for (i = 0; i < 2; i++) {
-		cpu_relax();
-		if (!btrfs_spin_on_block(eb))
-			break;
-
-		spin_nested(eb);
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 1;
-		spin_unlock(&eb->lock);
-	}
-	return 0;
-}
-
-/*
- * the autoremove wake function will return 0 if it tried to wake up
- * a process that was already awake, which means that process won''t
- * count as an exclusive wakeup.  The waitq code will continue waking
- * procs until it finds one that was actually sleeping.
- *
- * For btrfs, this isn''t quite what we want.  We want a single proc
- * to be notified that the lock is ready for taking.  If that proc
- * already happen to be awake, great, it will loop around and try for
- * the lock.
- *
- * So, btrfs_wake_function always returns 1, even when the proc that we
- * tried to wake up was already awake.
- */
-static int btrfs_wake_function(wait_queue_t *wait, unsigned mode,
-			       int sync, void *key)
-{
-	autoremove_wake_function(wait, mode, sync, key);
-	return 1;
-}
-
-/*
- * returns with the extent buffer spinlocked.
- *
- * This will spin and/or wait as required to take the lock, and then
- * return with the spinlock held.
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_tree_lock(struct extent_buffer *eb)
-{
-	DEFINE_WAIT(wait);
-	wait.func = btrfs_wake_function;
-
-	if (!btrfs_spin_on_block(eb))
-		goto sleep;
-
-	while(1) {
-		spin_nested(eb);
-
-		/* nobody is blocking, exit with the spinlock held */
-		if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			return 0;
-
-		/*
-		 * we have the spinlock, but the real owner is blocking.
-		 * wait for them
-		 */
-		spin_unlock(&eb->lock);
-
-		/*
-		 * spin for a bit, and if the blocking flag goes away,
-		 * loop around
-		 */
-		cpu_relax();
-		if (btrfs_spin_on_block(eb))
-			continue;
-sleep:
-		prepare_to_wait_exclusive(&eb->lock_wq, &wait,
-					  TASK_UNINTERRUPTIBLE);
-
-		if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-			schedule();
-
-		finish_wait(&eb->lock_wq, &wait);
-	}
-	return 0;
-}
-
-/*
- * Very quick trylock, this does not spin or schedule.  It returns
- * 1 with the spinlock held if it was able to take the lock, or it
- * returns zero if it was unable to take the lock.
- *
- * After this call, scheduling is not safe without first calling
- * btrfs_set_lock_blocking()
- */
-int btrfs_try_tree_lock(struct extent_buffer *eb)
-{
-	if (spin_trylock(&eb->lock)) {
-		if (test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags)) {
-			/*
-			 * we''ve got the spinlock, but the real owner is
-			 * blocking.  Drop the spinlock and return failure
-			 */
-			spin_unlock(&eb->lock);
-			return 0;
-		}
-		return 1;
-	}
-	/* someone else has the spinlock giveup */
-	return 0;
-}
-
-int btrfs_tree_unlock(struct extent_buffer *eb)
-{
-	/*
-	 * if we were a blocking owner, we don''t have the spinlock held
-	 * just clear the bit and look for waiters
-	 */
-	if (test_and_clear_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-		smp_mb__after_clear_bit();
-	else
-		spin_unlock(&eb->lock);
-
-	if (waitqueue_active(&eb->lock_wq))
-		wake_up(&eb->lock_wq);
-	return 0;
-}
-
-void btrfs_assert_tree_locked(struct extent_buffer *eb)
-{
-	if (!test_bit(EXTENT_BUFFER_BLOCKING, &eb->bflags))
-		assert_spin_locked(&eb->lock);
-}
Index: work/fs/btrfs/locking.h
==================================================================---
work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -19,13 +19,40 @@
 #ifndef __BTRFS_LOCKING_
 #define __BTRFS_LOCKING_
 
-int btrfs_tree_lock(struct extent_buffer *eb);
-int btrfs_tree_unlock(struct extent_buffer *eb);
+#include <linux/bug.h>
 
-int btrfs_try_tree_lock(struct extent_buffer *eb);
-int btrfs_try_spin_lock(struct extent_buffer *eb);
+static inline bool btrfs_try_spin_lock(struct extent_buffer *eb)
+{
+	return mutex_trylock(&eb->lock);
+}
 
-void btrfs_set_lock_blocking(struct extent_buffer *eb);
-void btrfs_clear_lock_blocking(struct extent_buffer *eb);
-void btrfs_assert_tree_locked(struct extent_buffer *eb);
-#endif
+static inline void btrfs_tree_lock(struct extent_buffer *eb)
+{
+	mutex_lock(&eb->lock);
+}
+
+static inline void btrfs_tree_lock_nested(struct extent_buffer *eb,
+					  unsigned int subclass)
+{
+	mutex_lock_nested(&eb->lock, subclass);
+}
+
+static inline void btrfs_tree_unlock(struct extent_buffer *eb)
+{
+	mutex_unlock(&eb->lock);
+}
+
+static inline int btrfs_try_tree_lock(struct extent_buffer *eb)
+{
+	return mutex_trylock(&eb->lock);
+}
+
+static inline void btrfs_set_lock_blocking(struct extent_buffer *eb)	{ }
+static inline void btrfs_clear_lock_blocking(struct extent_buffer *eb)	{ }
+
+static inline void btrfs_assert_tree_locked(struct extent_buffer *eb)
+{
+	BUG_ON(!mutex_is_locked(&eb->lock));
+}
+
+#endif	/* __BTRFS_LOCKING_ */
Index: work/fs/btrfs/Makefile
==================================================================---
work.orig/fs/btrfs/Makefile
+++ work/fs/btrfs/Makefile
@@ -5,6 +5,6 @@ btrfs-y += super.o ctree.o extent-tree.o
 	   file-item.o inode-item.o inode-map.o disk-io.o \
 	   transaction.o inode.o file.o tree-defrag.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
-	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
+	   extent_io.o volumes.o async-thread.o ioctl.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
So, here''s the patch to implement and use mutex_try_spin(), which
applies the same owner spin logic to try locking.  The result looks
pretty good.
I re-ran all three.  DFL is the current custom locking.  SIMPLE is
with only the previous patch applied.  SPIN is both the previous and
this patches applied.
       USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
DFL    14484  129368    390  1669102     171.955
SIMPLE 14483  128902    318  1187031     171.512
SPIN   14311  129222    347  1198166     174.904
DFL/SIMPLE results are more or less consistent with the previous run.
SPIN seems to consume a bit more cpu than SIMPLE but shows discernably
better throughput.  I''m running SPIN again just in case but the result
seems pretty consistent.
Thanks.
NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/locking.h    |    2 -
 include/linux/mutex.h |    1 
 kernel/mutex.c        |   58 ++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 44 insertions(+), 17 deletions(-)
Index: work/fs/btrfs/locking.h
==================================================================---
work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -23,7 +23,7 @@
 
 static inline bool btrfs_try_spin_lock(struct extent_buffer *eb)
 {
-	return mutex_trylock(&eb->lock);
+	return mutex_tryspin(&eb->lock);
 }
 
 static inline void btrfs_tree_lock(struct extent_buffer *eb)
Index: work/include/linux/mutex.h
==================================================================---
work.orig/include/linux/mutex.h
+++ work/include/linux/mutex.h
@@ -157,6 +157,7 @@ extern int __must_check mutex_lock_killa
  * Returns 1 if the mutex has been acquired successfully, and 0 on contention.
  */
 extern int mutex_trylock(struct mutex *lock);
+extern int mutex_tryspin(struct mutex *lock);
 extern void mutex_unlock(struct mutex *lock);
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
 
Index: work/kernel/mutex.c
==================================================================---
work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,20 +126,8 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
- */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 	/*
 	 * Optimistic spinning.
@@ -158,7 +146,6 @@ __mutex_lock_common(struct mutex *lock,
 	 * We can''t do this for DEBUG_MUTEXES because that relies on
wait_lock
 	 * to serialize everything.
 	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -181,7 +168,7 @@ __mutex_lock_common(struct mutex *lock,
 			lock_acquired(&lock->dep_map, ip);
 			mutex_set_owner(lock);
 			preempt_enable();
-			return 0;
+			return true;
 		}
 
 		/*
@@ -190,7 +177,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we''re an RT task that will live-lock because we won''t
let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +189,26 @@ __mutex_lock_common(struct mutex *lock,
 		cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+	       	unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock))
+		return 0;
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -473,6 +480,25 @@ int __sched mutex_trylock(struct mutex *
 }
 EXPORT_SYMBOL(mutex_trylock);
 
+static inline int __mutex_tryspin_slowpath(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	return __mutex_trylock_slowpath(lock_count) || mutex_spin(lock);
+}
+
+int __sched mutex_tryspin(struct mutex *lock)
+{
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_tryspin_slowpath);
+	if (ret)
+		mutex_set_owner(lock);
+
+	return ret;
+}
+EXPORT_SYMBOL(mutex_tryspin);
+
 /**
  * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
  * @cnt: the atomic which we are to dec
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Sun, Mar 20, 2011 at 08:56:52PM +0100, Tejun Heo wrote:> So, here''s the patch to implement and use mutex_try_spin(), which > applies the same owner spin logic to try locking. The result looks > pretty good. > > I re-ran all three. DFL is the current custom locking. SIMPLE is > with only the previous patch applied. SPIN is both the previous and > this patches applied. > > USER SYSTEM SIRQ CXTSW THROUGHPUT > DFL 14484 129368 390 1669102 171.955 > SIMPLE 14483 128902 318 1187031 171.512 > SPIN 14311 129222 347 1198166 174.904 > > DFL/SIMPLE results are more or less consistent with the previous run. > SPIN seems to consume a bit more cpu than SIMPLE but shows discernably > better throughput. I''m running SPIN again just in case but the result > seems pretty consistent.Here''s another run. SPIN 14377 129092 335 1189817 172.724 It''s not as high as the last run, but given other runs I''ve been seeing, I think meaningful (ie. not measurement error) throughput advantage exists. That said, the difference definitely seems minor. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Tejun Heo''s message of 2011-03-20 13:44:33 -0400:> Hello, guys. > > I''ve been looking through btrfs code and found out that the locking > was quite interesting. The distinction between blocking and > non-blocking locking is valid but is essentially attacing the same > problem that CONFIG_MUTEX_SPIN_ON_OWNER addresses in generic manner. > > It seemed that CONFIG_MUTEX_SPIN_ON_OWNER should be able to do it > better as it actually knows whether the lock owner is running or not, > so I did a few "dbench 50" runs with the complex locking removed. > > The test machine is a dual core machine with 1 gig of memory. The > filesystem is on OCZ vertex 64 gig SSD. I''m attaching the kernel > config. Please note that CONFIG_MUTEX_SPIN_ON_OWNER is enabled only > when lock debugging is disabled, but it will be enabled on virtually > any production configuration. > > The machine was idle during the dbench runs and CPU usages and context > switches are calculated from /proc/stat over the dbench runs. The > throughput is as reported by dbench. > > user system sirq cxtsw throughput > before 14426 129332 345 1674004 171.7 > after 14274 129036 308 1183346 172.119 > > So, the extra code isn''t really helping anything. It''s worse in every > way. Are there some obscure reasons the custom locking should be kept > around?Hi Tejun, I went through a number of benchmarks with the explicit blocking/spinning code and back then it was still significantly faster than the adaptive spin. But, it is definitely worth doing these again, how many dbench procs did you use? The biggest benefit to explicit spinning is that mutex_lock starts with might_sleep(), so we skip the cond_resched(). Do you have voluntary preempt on? The long term goal was always to get rid of the blocking locks, I had a lot of code to track the top blockers and we had gotten rid of about 90% of them. Thanks for looking at this -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Chris. On Sun, Mar 20, 2011 at 08:10:51PM -0400, Chris Mason wrote:> I went through a number of benchmarks with the explicit > blocking/spinning code and back then it was still significantly faster > than the adaptive spin. But, it is definitely worth doing these again, > how many dbench procs did you use?It was dbench 50.> The biggest benefit to explicit spinning is that mutex_lock starts with > might_sleep(), so we skip the cond_resched(). Do you have voluntary > preempt on?Ah, right, I of course forgot to actually attach the .config. I had CONFIG_PREEMPT, not CONFIG_PREEMPT_VOLUNTARY. I''ll re-run with VOLUNTARY and see how its behavior changes. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello,
Here are the results with voluntary preemption.  I''ve moved to a
beefier machine for testing.  It''s dual Opteron 2347, so dual socket,
eight core.  The memory is limited to 1GiB to force IOs and the disk
is the same OCZ Vertex 60gig SSD.  /proc/stat is captured before and
after "dbench 50".
I ran the following four setups.
DFL	The current custom locking implementation.
SIMPLE	Simple mutex conversion.  The first patch in this thread.
SPIN	SIMPLE + mutex_tryspin().  The second patch in this thread.
SPIN2	SPIN + mutex_tryspin() in btrfs_tree_lock().  Patch below.
SPIN2 should alleviate the voluntary preemption by might_sleep() in
mutex_lock().
       USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
DFL    49427  458210   1433  7683488     642.947
SIMPLE 52836  471398   1427  3055384     705.369
SPIN   52267  473115   1467  3005603     705.369
SPIN2  52063  470453   1446  3092091     701.826
I''m running DFL again just in case but SIMPLE or SPIN seems to be a
much better choice.
Thanks.
NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/btrfs/locking.h |    2 ++
 1 file changed, 2 insertions(+)
Index: work/fs/btrfs/locking.h
==================================================================---
work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s
 
 static inline void btrfs_tree_lock(struct extent_buffer *eb)
 {
+	if (mutex_tryspin(&eb->lock))
+		return;
 	mutex_lock(&eb->lock);
 }
 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Mon, Mar 21, 2011 at 05:59:55PM +0100, Tejun Heo wrote:> I''m running DFL again just in case but SIMPLE or SPIN seems to be a > much better choice.Got 644.176 MB/sec, so yeah the custom locking is definitely worse than just using mutex. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Tejun Heo''s message of 2011-03-21 12:59:55 -0400:> Hello, > > Here are the results with voluntary preemption. I''ve moved to a > beefier machine for testing. It''s dual Opteron 2347, so dual socket, > eight core. The memory is limited to 1GiB to force IOs and the disk > is the same OCZ Vertex 60gig SSD. /proc/stat is captured before and > after "dbench 50". > > I ran the following four setups. > > DFL The current custom locking implementation. > SIMPLE Simple mutex conversion. The first patch in this thread. > SPIN SIMPLE + mutex_tryspin(). The second patch in this thread. > SPIN2 SPIN + mutex_tryspin() in btrfs_tree_lock(). Patch below. > > SPIN2 should alleviate the voluntary preemption by might_sleep() in > mutex_lock(). > > USER SYSTEM SIRQ CXTSW THROUGHPUT > DFL 49427 458210 1433 7683488 642.947 > SIMPLE 52836 471398 1427 3055384 705.369 > SPIN 52267 473115 1467 3005603 705.369 > SPIN2 52063 470453 1446 3092091 701.826 > > I''m running DFL again just in case but SIMPLE or SPIN seems to be a > much better choice.Very interesting. Ok, I''ll definitely rerun my benchmarks as well. I used dbench extensively during the initial tuning, but you''re forcing the memory low in order to force IO. This case doesn''t really hammer on the locks, it hammers on the transition from spinning to blocking. We want also want to compare dbench entirely in ram, which will hammer on the spinning portion. -chris> > Thanks. > > NOT-Signed-off-by: Tejun Heo <tj@kernel.org> > --- > fs/btrfs/locking.h | 2 ++ > 1 file changed, 2 insertions(+) > > Index: work/fs/btrfs/locking.h > ==================================================================> --- work.orig/fs/btrfs/locking.h > +++ work/fs/btrfs/locking.h > @@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s > > static inline void btrfs_tree_lock(struct extent_buffer *eb) > { > + if (mutex_tryspin(&eb->lock)) > + return; > mutex_lock(&eb->lock); > } >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote:> Very interesting. Ok, I''ll definitely rerun my benchmarks as well. I > used dbench extensively during the initial tuning, but you''re forcing > the memory low in order to force IO. > > This case doesn''t really hammer on the locks, it hammers on the > transition from spinning to blocking. We want also want to compare > dbench entirely in ram, which will hammer on the spinning portion.Here''s re-run of DFL and SIMPLE with the memory restriction lifted. Memory is 4GiB and disk remains mostly idle with all CPUs running full. USER SYSTEM SIRQ CXTSW THROUGHPUT DFL 59898 504517 377 6814245 782.295 SIMPLE 61090 493441 457 1631688 827.751 So, about the same picture. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Tejun Heo''s message of 2011-03-21 14:11:24 -0400:> Hello, > > On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote: > > Very interesting. Ok, I''ll definitely rerun my benchmarks as well. I > > used dbench extensively during the initial tuning, but you''re forcing > > the memory low in order to force IO. > > > > This case doesn''t really hammer on the locks, it hammers on the > > transition from spinning to blocking. We want also want to compare > > dbench entirely in ram, which will hammer on the spinning portion. > > Here''s re-run of DFL and SIMPLE with the memory restriction lifted. > Memory is 4GiB and disk remains mostly idle with all CPUs running > full. > > USER SYSTEM SIRQ CXTSW THROUGHPUT > DFL 59898 504517 377 6814245 782.295 > SIMPLE 61090 493441 457 1631688 827.751 > > So, about the same picture.Ok, this impact of this is really interesting. If we have very short waits where there is no IO at all, this patch tends to lose. I ran with dbench 10 and got about 20% slower tput. But, if we do any IO at all it wins by at least that much or more. I think we should take this patch and just work on getting rid of the scheduling with the mutex held where possible. Tejun, could you please send the mutex_tryspin stuff in? If we can get a sob for that I can send the whole thing. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello, Chris. On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:> Ok, this impact of this is really interesting. If we have very short > waits where there is no IO at all, this patch tends to lose. I ran with > dbench 10 and got about 20% slower tput. > > But, if we do any IO at all it wins by at least that much or more. I > think we should take this patch and just work on getting rid of the > scheduling with the mutex held where possible.I see.> Tejun, could you please send the mutex_tryspin stuff in? If we can get > a sob for that I can send the whole thing.I''m not sure whetehr mutex_tryspin() is justified at this point, and, even if so, how to proceed with it. Maybe we want to make mutex_trylock() perform owner spin by default without introducing a new API. Given that the difference between SIMPLE and SPIN is small, I think it would be best to simply use mutex_trylock() for now. It''s not gonna make much difference either way. How do you want to proceed? I can prep patches doing the following. - Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP. - Drop locking.c and make the lock function simple wrapper around mutex operations. This makes blocking/unblocking noops. - Remove all blocking/unblocking calls along with the API. - Remove locking wrappers and use mutex API directly. What do you think? Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Tejun Heo''s message of 2011-03-23 06:46:14 -0400:> Hello, Chris. > > On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote: > > Ok, this impact of this is really interesting. If we have very short > > waits where there is no IO at all, this patch tends to lose. I ran with > > dbench 10 and got about 20% slower tput. > > > > But, if we do any IO at all it wins by at least that much or more. I > > think we should take this patch and just work on getting rid of the > > scheduling with the mutex held where possible. > > I see. > > > Tejun, could you please send the mutex_tryspin stuff in? If we can get > > a sob for that I can send the whole thing. > > I''m not sure whetehr mutex_tryspin() is justified at this point, and, > even if so, how to proceed with it. Maybe we want to make > mutex_trylock() perform owner spin by default without introducing a > new API.I''ll benchmark without it, but I think the cond_resched is going to have a pretty big impact. I''m digging up the related benchmarks I used during the initial adaptive spin work.> > Given that the difference between SIMPLE and SPIN is small, I think it > would be best to simply use mutex_trylock() for now. It''s not gonna > make much difference either way.mutex_trylock is a good start.> > How do you want to proceed? I can prep patches doing the following. > > - Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP. > > - Drop locking.c and make the lock function simple wrapper around > mutex operations. This makes blocking/unblocking noops. > > - Remove all blocking/unblocking calls along with the API.I''d like to keep the blocking/unblocking calls for one release. I''d like to finally finish off my patches that do concurrent reads.> > - Remove locking wrappers and use mutex API directly.I''d also like to keep the wrappers until the concurrent reader locking is done.> > What do you think?Thanks for all the work. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html