Tejun Heo
2011-Mar-23  15:37 UTC
[RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
Hello, guys.
I''ve been playing with locking in btrfs which has developed custom
locking to avoid excessive context switches in its btree
implementation.
Generally, doing away with the custom implementation and just using
the mutex adaptive owner spinning seems better; however, there''s an
interesting distinction in the custom implemention of trylock.  It
distinguishes between simple trylock and tryspin, where the former
just tries once and then fail while the latter does some spinning
before giving up.
Currently, mutex_trylock() doesn''t use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, at least for
btrfs anyway.
The following results are from "dbench 50" run on an opteron two
socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
During the run, disk stays mostly idle and all CPUs are fully occupied
and the difference in locking performance becomes quite visible.
IMPLE is with the locking simplification patch[1] applied.  i.e. it
basically just uses mutex.  SPIN is with the patch at the end of this
message applied on top - mutex_trylock() uses adaptive spinning.
        USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec
I''ve been running various different configs from yesterday and the
adaptive spinning trylock consistently posts higher throughput.  The
amount of difference varies but it outperforms consistently.
In general, I think using adaptive spinning on trylock makes sense as
trylock failure usually leads to costly unlock-relock sequence.
Also, contrary to the comment, the adaptive spinning doesn''t seem to
check whether there are pending waiters or not.  Is this intended or
the test got lost somehow?
Thanks.
NOT-Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/mutex.c |   98 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 61 insertions(+), 37 deletions(-)
Index: work/kernel/mutex.c
==================================================================---
work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,33 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * we find that there are no pending waiters and the lock owner is
+ * currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
doesn''t
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can''t do this for DEBUG_MUTEXES because that relies on wait_lock
to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn''t track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can''t do this for DEBUG_MUTEXES because that relies on
wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +171,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there''s no owner, we might have preempted between the
@@ -190,7 +180,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we''re an RT task that will live-lock because we won''t
let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +192,30 @@ __mutex_lock_common(struct mutex *lock,
 		cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+	       	unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
@@ -430,6 +444,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -443,6 +466,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }
Tejun Heo
2011-Mar-23  15:40 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
Oops, sorry, forgot the link to the simplification patch and attaching .config. On Wed, Mar 23, 2011 at 04:37:27PM +0100, Tejun Heo wrote:> SIMPLE is with the locking simplification patch[1] applied. i.e. it > basically just uses mutex. SPIN is with the patch at the end of this > message applied on top - mutex_trylock() uses adaptive spinning.[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658 -- tejun
Linus Torvalds
2011-Mar-23  15:48 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote:> > Currently, mutex_trylock() doesn''t use adaptive spinning. It tries > just once. I got curious whether using adaptive spinning on > mutex_trylock() would be beneficial and it seems so, at least for > btrfs anyway.Hmm. Seems reasonable to me. The patch looks clean, although part of that is just the mutex_spin() cleanup that is independent of actually using it in trylock. So no objections from me. Linus
Tejun Heo
2011-Mar-23  15:52 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 08:48:01AM -0700, Linus Torvalds wrote:> On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote: > > > > Currently, mutex_trylock() doesn''t use adaptive spinning. It tries > > just once. I got curious whether using adaptive spinning on > > mutex_trylock() would be beneficial and it seems so, at least for > > btrfs anyway. > > Hmm. Seems reasonable to me. The patch looks clean, although part of > that is just the mutex_spin() cleanup that is independent of actually > using it in trylock.Oh, I have two split patches. Posted the combined one for comments.> So no objections from me.Awesome. Peter, what do you think? Are there some other tests which can be useful? Thanks. -- tejun
Andrey Kuzmin
2011-Mar-23  19:46 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:> On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo <tj@kernel.org> wrote: >> >> Currently, mutex_trylock() doesn''t use adaptive spinning. It tries >> just once. I got curious whether using adaptive spinning on >> mutex_trylock() would be beneficial and it seems so, at least for >> btrfs anyway. > > Hmm. Seems reasonable to me.TAS/spin with exponential back-off has been preferred locking approach in Postgres (and I believe other DBMSes) for years, at least since ''04 when I had last touched Postgres code. Even with ''false negative'' cost in user-space being much higher than in the kernel, it''s still just a question of scale (no wonder measurable improvement here is reported from dbench on SSD capable of few dozen thousand IOPS). Regards, Andrey> The patch looks clean, although part of that is just the mutex_spin() > cleanup that is independent of actually using it in trylock. > > So no objections from me. > > Linus > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Ingo Molnar
2011-Mar-24  08:18 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
* Tejun Heo <tj@kernel.org> wrote:> NOT-Signed-off-by: Tejun Heo <tj@kernel.org>s/NOT-// ? Ingo
Separate out mutex_spin() out of __mutex_lock_common().  The fat
comment is converted to docbook function description.
While at it, drop the part of comment which explains that adaptive
spinning considers whether there are pending waiters, which doesn''t
match the code.
This patch is to prepare for using adaptive spinning in
mutex_trylock() and doesn''t cause any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
---
Here are split patches with SOB.  Ingo, it''s probably best to route
this through -tip, I suppose?
Thanks.
 kernel/mutex.c |   87 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 50 insertions(+), 37 deletions(-)
Index: work/kernel/mutex.c
==================================================================---
work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,32 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * the lock owner is currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation
doesn''t
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can''t do this for DEBUG_MUTEXES because that relies on wait_lock
to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-	       	unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-	struct task_struct *task = current;
-	struct mutex_waiter waiter;
-	unsigned long flags;
-
-	preempt_disable();
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-	/*
-	 * Optimistic spinning.
-	 *
-	 * We try to spin for acquisition when we find that there are no
-	 * pending waiters and the lock owner is currently running on a
-	 * (different) CPU.
-	 *
-	 * The rationale is that if the lock owner is running, it is likely to
-	 * release the lock soon.
-	 *
-	 * Since this needs the lock owner, and this mutex implementation
-	 * doesn''t track the owner atomically in the lock field, we need to
-	 * track it non-atomically.
-	 *
-	 * We can''t do this for DEBUG_MUTEXES because that relies on
wait_lock
-	 * to serialize everything.
-	 */
-
 	for (;;) {
 		struct thread_info *owner;
 
@@ -177,12 +170,8 @@ __mutex_lock_common(struct mutex *lock,
 		if (owner && !mutex_spin_on_owner(lock, owner))
 			break;
 
-		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-			lock_acquired(&lock->dep_map, ip);
-			mutex_set_owner(lock);
-			preempt_enable();
-			return 0;
-		}
+		if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+			return true;
 
 		/*
 		 * When there''s no owner, we might have preempted between the
@@ -190,7 +179,7 @@ __mutex_lock_common(struct mutex *lock,
 		 * we''re an RT task that will live-lock because we won''t
let
 		 * the owner complete.
 		 */
-		if (!owner && (need_resched() || rt_task(task)))
+		if (!owner && (need_resched() || rt_task(current)))
 			break;
 
 		/*
@@ -202,6 +191,30 @@ __mutex_lock_common(struct mutex *lock,
 		arch_mutex_cpu_relax();
 	}
 #endif
+	return false;
+}
+
+/*
+ * Lock a mutex (possibly interruptible), slowpath:
+ */
+static inline int __sched
+__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
+	       	unsigned long ip)
+{
+	struct task_struct *task = current;
+	struct mutex_waiter waiter;
+	unsigned long flags;
+
+	preempt_disable();
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+	if (mutex_spin(lock)) {
+		lock_acquired(&lock->dep_map, ip);
+		mutex_set_owner(lock);
+		preempt_enable();
+		return 0;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
Tejun Heo
2011-Mar-24  09:41 UTC
[PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Adaptive owner spinning used to be applied only to mutex_lock().  This
patch applies it also to mutex_trylock().
btrfs has developed custom locking to avoid excessive context switches
in its btree implementation.  Generally, doing away with the custom
implementation and just using the mutex shows better behavior;
however, there''s an interesting distinction in the custom implemention
of trylock.  It distinguishes between simple trylock and tryspin,
where the former just tries once and then fail while the latter does
some spinning before giving up.
Currently, mutex_trylock() doesn''t use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, for btrfs anyway.
The following results are from "dbench 50" run on an opteron two
socket eight core machine with 4GiB of memory and an OCZ vertex SSD.
During the run, disk stays mostly idle and all CPUs are fully occupied
and the difference in locking performance becomes quite visible.
SIMPLE is with the locking simplification patch[1] applied.  i.e. it
basically just uses mutex.  SPIN is with this patch applied on top -
mutex_trylock() uses adaptive spinning.
        USER   SYSTEM   SIRQ    CXTSW  THROUGHPUT
 SIMPLE 61107  354977    217  8099529  845.100 MB/sec
 SPIN   63140  364888    214  6840527  879.077 MB/sec
On various runs, the adaptive spinning trylock consistently posts
higher throughput.  The amount of difference varies but it outperforms
consistently.
In general, using adaptive spinning on trylock makes sense as trylock
failure usually leads to costly unlock-relock sequence.
[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20110323153727.GB12003@htj.dyndns.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
---
 kernel/mutex.c |   10 ++++++++++
 1 file changed, 10 insertions(+)
Index: work/kernel/mutex.c
==================================================================---
work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -443,6 +443,15 @@ static inline int __mutex_trylock_slowpa
 	unsigned long flags;
 	int prev;
 
+	preempt_disable();
+
+	if (mutex_spin(lock)) {
+		mutex_set_owner(lock);
+		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		preempt_enable();
+		return 1;
+	}
+
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	prev = atomic_xchg(&lock->count, -1);
@@ -456,6 +465,7 @@ static inline int __mutex_trylock_slowpa
 		atomic_set(&lock->count, 0);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
+	preempt_enable();
 
 	return prev == 1;
 }
Ugh... Please drop the extra "Subject: " from subject before applying. Thanks. -- tejun
Steven Rostedt
2011-Mar-25  03:24 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote:> > * Tejun Heo <tj@kernel.org> wrote: > > > NOT-Signed-off-by: Tejun Heo <tj@kernel.org> > > s/NOT-// ? >Perhaps because it is still in RFC context? -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2011-Mar-25  03:39 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:> Adaptive owner spinning used to be applied only to mutex_lock(). This > patch applies it also to mutex_trylock(). > > btrfs has developed custom locking to avoid excessive context switches > in its btree implementation. Generally, doing away with the custom > implementation and just using the mutex shows better behavior; > however, there''s an interesting distinction in the custom implemention > of trylock. It distinguishes between simple trylock and tryspin, > where the former just tries once and then fail while the latter does > some spinning before giving up. > > Currently, mutex_trylock() doesn''t use adaptive spinning. It tries > just once. I got curious whether using adaptive spinning on > mutex_trylock() would be beneficial and it seems so, for btrfs anyway. > > The following results are from "dbench 50" run on an opteron two > socket eight core machine with 4GiB of memory and an OCZ vertex SSD. > During the run, disk stays mostly idle and all CPUs are fully occupied > and the difference in locking performance becomes quite visible. > > SIMPLE is with the locking simplification patch[1] applied. i.e. it > basically just uses mutex. SPIN is with this patch applied on top - > mutex_trylock() uses adaptive spinning. > > USER SYSTEM SIRQ CXTSW THROUGHPUT > SIMPLE 61107 354977 217 8099529 845.100 MB/sec > SPIN 63140 364888 214 6840527 879.077 MB/sec > > On various runs, the adaptive spinning trylock consistently posts > higher throughput. The amount of difference varies but it outperforms > consistently. > > In general, using adaptive spinning on trylock makes sense as trylock > failure usually leads to costly unlock-relock sequence. > > [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658 > > Signed-off-by: Tejun Heo <tj@kernel.org>I''m curious about the effects that this has on those places that do: again: mutex_lock(A); if (mutex_trylock(B)) { mutex_unlock(A); goto again; Where the normal locking order is: B -> A If another location does: mutex_lock(B); [...] mutex_lock(A); But another process has A already, and is running, it may spin waiting for A as A''s owner is still running. But now, mutex_trylock(B) becomes a spinner too, and since the B''s owner is running (spinning on A) it will spin as well waiting for A''s owner to release it. Unfortunately, A''s owner is also spinning waiting for B to release it. If both A and B''s owners are real time tasks, then boom! deadlock. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds
2011-Mar-25  04:38 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote:> > But now, mutex_trylock(B) becomes a spinner too, and since the B''s owner > is running (spinning on A) it will spin as well waiting for A''s owner to > release it. Unfortunately, A''s owner is also spinning waiting for B to > release it. > > If both A and B''s owners are real time tasks, then boom! deadlock.Hmm. I think you''re right. And it looks pretty fundamental - I don''t see any reasonable approach to avoid it. I think the RT issue is a red herring too - afaik, you can get a deadlock with two perfectly normal processes too. Of course, for non-RT tasks, any other process will eventually disturb the situation and you''d get kicked out due to need_resched(), but even that might be avoided for a long time if there are other CPU''s - leading to tons of wasted CPU time. Linus
Tejun Heo
2011-Mar-25  06:53 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello, Steven, Linus. On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote:> On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > But now, mutex_trylock(B) becomes a spinner too, and since the B''s owner > > is running (spinning on A) it will spin as well waiting for A''s owner to > > release it. Unfortunately, A''s owner is also spinning waiting for B to > > release it. > > > > If both A and B''s owners are real time tasks, then boom! deadlock. > > Hmm. I think you''re right. And it looks pretty fundamental - I don''t > see any reasonable approach to avoid it.Hmmm... I have an idea. Will play with it a bit and post if it works out okay.> I think the RT issue is a red herring too - afaik, you can get a > deadlock with two perfectly normal processes too. Of course, for > non-RT tasks, any other process will eventually disturb the situation > and you''d get kicked out due to need_resched(), but even that might be > avoided for a long time if there are other CPU''s - leading to tons of > wasted CPU time.Yeap, need_resched() currently is the only thing which limits the duration of spinning when the owner continues to run. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo
2011-Mar-25  10:12 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello, On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote:> USER SYSTEM SIRQ CXTSW THROUGHPUT > SIMPLE 61107 354977 217 8099529 845.100 MB/sec > SPIN 63140 364888 214 6840527 879.077 MB/sec > > On various runs, the adaptive spinning trylock consistently posts > higher throughput. The amount of difference varies but it outperforms > consistently.I''ve been running more of these tests and am having doubts about the consistency. It seems that, even on a fresh filesystem, some random initial condition seems to have persistent effect on the whole run. I''ll run more tests and report back. Thanks. -- tejun
Ingo Molnar
2011-Mar-25  10:29 UTC
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
* Steven Rostedt <rostedt@goodmis.org> wrote:> On Thu, Mar 24, 2011 at 09:18:16AM +0100, Ingo Molnar wrote: > > > > * Tejun Heo <tj@kernel.org> wrote: > > > > > NOT-Signed-off-by: Tejun Heo <tj@kernel.org> > > > > s/NOT-// ? > > > > Perhaps because it is still in RFC context?Ok, i guess i was a bit too cryptic about it: the discussion was in support of the changes so i asked Tejun above to send a SOB version and did that in a geeky way. And paid my price for the shortcut: had two write two other emails about it already ;-) Thanks, Ingo
Ingo Molnar
2011-Mar-25  10:31 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
* Tejun Heo <tj@kernel.org> wrote:> Hello, > > On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote: > > USER SYSTEM SIRQ CXTSW THROUGHPUT > > SIMPLE 61107 354977 217 8099529 845.100 MB/sec > > SPIN 63140 364888 214 6840527 879.077 MB/sec > > > > On various runs, the adaptive spinning trylock consistently posts > > higher throughput. The amount of difference varies but it outperforms > > consistently. > > I''ve been running more of these tests and am having doubts about the > consistency. It seems that, even on a fresh filesystem, some random > initial condition seems to have persistent effect on the whole run. > I''ll run more tests and report back.Ok, and there''s the deadlock issue as well which Steve noticed. I''ll zap the patches from tip:core/urgent and lets do this via tip:core/locking with a .40 timeframe and plenty of time of testing. Thanks, Ingo
Andrey Kuzmin
2011-Mar-25  11:13 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt <rostedt@goodmis.org> wrote:> On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote: >> Adaptive owner spinning used to be applied only to mutex_lock(). This >> patch applies it also to mutex_trylock(). >> >> btrfs has developed custom locking to avoid excessive context switches >> in its btree implementation. Generally, doing away with the custom >> implementation and just using the mutex shows better behavior; >> however, there''s an interesting distinction in the custom implemention >> of trylock. It distinguishes between simple trylock and tryspin, >> where the former just tries once and then fail while the latter does >> some spinning before giving up. >> >> Currently, mutex_trylock() doesn''t use adaptive spinning. It tries >> just once. I got curious whether using adaptive spinning on >> mutex_trylock() would be beneficial and it seems so, for btrfs anyway. >> >> The following results are from "dbench 50" run on an opteron two >> socket eight core machine with 4GiB of memory and an OCZ vertex SSD. >> During the run, disk stays mostly idle and all CPUs are fully occupied >> and the difference in locking performance becomes quite visible. >> >> SIMPLE is with the locking simplification patch[1] applied. i.e. it >> basically just uses mutex. SPIN is with this patch applied on top - >> mutex_trylock() uses adaptive spinning. >> >> USER SYSTEM SIRQ CXTSW THROUGHPUT >> SIMPLE 61107 354977 217 8099529 845.100 MB/sec >> SPIN 63140 364888 214 6840527 879.077 MB/sec >> >> On various runs, the adaptive spinning trylock consistently posts >> higher throughput. The amount of difference varies but it outperforms >> consistently. >> >> In general, using adaptive spinning on trylock makes sense as trylock >> failure usually leads to costly unlock-relock sequence. >> >> [1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658 >> >> Signed-off-by: Tejun Heo <tj@kernel.org> > > I''m curious about the effects that this has on those places that do: > > again: > mutex_lock(A); > if (mutex_trylock(B)) { > mutex_unlock(A); > goto again; > > > Where the normal locking order is: > B -> A > > If another location does: > > mutex_lock(B); > [...] > mutex_lock(A); > > But another process has A already, and is running, it may spin waiting > for A as A''s owner is still running. > > But now, mutex_trylock(B) becomes a spinner too, and since the B''s owner > is running (spinning on A) it will spin as well waiting for A''s owner to > release it. Unfortunately, A''s owner is also spinning waiting for B to > release it. > > If both A and B''s owners are real time tasks, then boom! deadlock.Turning try_lock into indefinitely spinning one breaks its semantics, so deadlock is to be expected. But what''s wrong in this scenario if try_lock spins a bit before giving up? Regards, Andrey> > -- Steve > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Steven Rostedt
2011-Mar-25  13:10 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 07:53 +0100, Tejun Heo wrote:> Hello, Steven, Linus. > > On Thu, Mar 24, 2011 at 09:38:58PM -0700, Linus Torvalds wrote: > > On Thu, Mar 24, 2011 at 8:39 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > But now, mutex_trylock(B) becomes a spinner too, and since the B''s owner > > > is running (spinning on A) it will spin as well waiting for A''s owner to > > > release it. Unfortunately, A''s owner is also spinning waiting for B to > > > release it. > > > > > > If both A and B''s owners are real time tasks, then boom! deadlock. > > > > Hmm. I think you''re right. And it looks pretty fundamental - I don''t > > see any reasonable approach to avoid it. > > Hmmm... I have an idea. Will play with it a bit and post if it works > out okay.One solution is to have this be only done on explicit trylocks. Perhaps introduce a mutex_trylock_spin()? Then when the developer knows that this scenario does not exist, they can convert mutex_trylocks() into this spinning version.> > > I think the RT issue is a red herring too - afaik, you can get a > > deadlock with two perfectly normal processes too. Of course, for > > non-RT tasks, any other process will eventually disturb the situation > > and you''d get kicked out due to need_resched(), but even that might be > > avoided for a long time if there are other CPU''s - leading to tons of > > wasted CPU time. > > Yeap, need_resched() currently is the only thing which limits the > duration of spinning when the owner continues to run.Yeah, I was about to complain about the long latencies that this could cause, then I realized that RT tasks would in fact deadlock the system here, which I thought was a bigger problem, and focused on that issue. -- Steve
Steven Rostedt
2011-Mar-25  13:12 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote:> Turning try_lock into indefinitely spinning one breaks its semantics, > so deadlock is to be expected. But what''s wrong in this scenario if > try_lock spins a bit before giving up?Because that will cause this scenario to spin that "little longer" always, and introduce latencies that did not exist before. Either the solution does not break this scenario, or it should not go in. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2011-Mar-25  13:29 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 09:10 -0400, Steven Rostedt wrote:> One solution is to have this be only done on explicit trylocks. Perhaps > introduce a mutex_trylock_spin()? Then when the developer knows that > this scenario does not exist, they can convert mutex_trylocks() into > this spinning version. >I''m not sure this is even worth it, as I''m looking at the btfs/extend-tree.c code, this is the main reason to use mutex_trylock(). I guess what you see in your benchmarks is that trylock contention happens mostly in the non-deadlock scenario. But I bet you have latencies when it does happen, but the benefit seems to out weigh it in the results. I wonder what happens if you run dbench as an RT task. -- Steve
Andrey Kuzmin
2011-Mar-25  13:50 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote:> On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote: >> Turning try_lock into indefinitely spinning one breaks its semantics, >> so deadlock is to be expected. But what''s wrong in this scenario if >> try_lock spins a bit before giving up? > > Because that will cause this scenario to spin that "little longer" > always, and introduce latencies that did not exist before. Either the > solution does not break this scenario, or it should not go in.Broken semantics and extra latency are two separate issues. If the former is fixed, the latter is easily handled by introducing new mutex_trylock_spin call that lets one either stick to existing behavior (try/fail) or choose a new one where latency penalty is justified by locking patterns. Regards, Andrey> > -- Steve > > >
Steven Rostedt
2011-Mar-25  14:05 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote:> On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote: > >> Turning try_lock into indefinitely spinning one breaks its semantics, > >> so deadlock is to be expected. But what''s wrong in this scenario if > >> try_lock spins a bit before giving up? > > > > Because that will cause this scenario to spin that "little longer" > > always, and introduce latencies that did not exist before. Either the > > solution does not break this scenario, or it should not go in. > > Broken semantics and extra latency are two separate issues. If the > former is fixed, the latter is easily handled by introducing new > mutex_trylock_spin call that lets one either stick to existing > behavior (try/fail) or choose a new one where latency penalty is > justified by locking patterns. >For those wanting a more RT deterministic OS, I will argue against latency penalties. -- Steve
Ingo Molnar
2011-Mar-25  19:51 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
* Steven Rostedt <rostedt@goodmis.org> wrote:> On Fri, 2011-03-25 at 16:50 +0300, Andrey Kuzmin wrote: > > On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > > On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote: > > >> Turning try_lock into indefinitely spinning one breaks its semantics, > > >> so deadlock is to be expected. But what''s wrong in this scenario if > > >> try_lock spins a bit before giving up? > > > > > > Because that will cause this scenario to spin that "little longer" > > > always, and introduce latencies that did not exist before. Either the > > > solution does not break this scenario, or it should not go in. > > > > Broken semantics and extra latency are two separate issues. If the > > former is fixed, the latter is easily handled by introducing new > > mutex_trylock_spin call that lets one either stick to existing > > behavior (try/fail) or choose a new one where latency penalty is > > justified by locking patterns. > > > > For those wanting a more RT deterministic OS, I will argue against > latency penalties.Later mails from Tejun suggest that the benchmark results are varied, and that it''s not a clear win after all. It''s possible that if useless spinning is introduced then that might explain such workload variations. I.e. it''s not really ''latencies'' but ''unnecessary overhead spent spinning'' - and also ''extra non-deterministic noise'' - none of which help consistent performance. Thanks, Ingo
Tejun Heo
2011-Mar-29  16:37 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hello, guys.
I''ve been running dbench 50 for a few days now and the result is,
well, I don''t know how to call it.
The problem was that the original patch didn''t do anything because x86
fastpath code didn''t call into the generic slowpath at all.
  static inline int __mutex_fastpath_trylock(atomic_t *count,
					     int (*fail_fn)(atomic_t *))
  {
	  if (likely(atomic_cmpxchg(count, 1, 0) == 1))
		  return 1;
	  else
		  return 0;
  }												
So, I thought that I probably was doing unconscious data selection
while I was running the test before sending out the patches.  Maybe I
was seeing what I wanted to see, so I ran tests in larger scale more
methodologically.
I first started with ten consecutive runs and then doubled it with
intervening reboot and then basically ended up doing that twice for
four configuration (I didn''t do two runs of simple and refactor but
just averaged the two).
The hardware is mostly the same except that I switched to a hard drive
instead of SSD as hard drives tend to be slower but more consistent in
performance numbers.  On each run, the filesystem is recreated and the
system was rebooted after every ten runs.  The numbers are the
reported throughput in MiB/s at the end of each run.
 
https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en
Here are the descriptions of the eight columns.
  simple	only with patch to make btrfs use mutex
  refactor	mutex_spin() factored out
  spin		mutex_spin() applied to the unused trylock slowpath
  spin-1	ditto
  spin-fixed	x86 trylock fastpath updated to use generic slowpath
  spin-fixed-1	ditto
  code-layout	refactor + dummy function added to mutex.c
  code-layout-1	ditto
After running the simple, refactor and spin ones, I was convinced that
there definitely was something which was causing the difference.  The
averages were apart by more than 1.5 sigma, but I couldn''t explain
what could have caused such difference.
The code-layout runs were my desparate attempts to find explanation on
what''s going on.  Addition of mutex_spin to the unused trylock generic
path makes gcc arrange functions differently.  Without it, trylock
functions end up inbetween lock and unlock funcitons; with it, they
are located at the end.  I commented out the unused trylock slowpath
function and added a dummy function at the end to make gcc generate
similar assembly layout.
At this point, the only conclusions I can draw are,
* Using adaptive spinning on mutex_trylock() doesn''t seem to buy
  anything according to btrfs dbench 50 runs.
and much more importantly,
* btrfs dbench 50 runs are probably not good for measuring subtle
  mutex performance differences.  Maybe it''s too macro and there are
  larger scale tendencies which skew the result unless the number of
  runs are vastly increased (but 40 runs are already over eight
  hours).
If anyone can provide an explanation on what''s going on, I''ll
be super
happy.  Otherwise, for now, I''ll just leave it alone.  :-(
Thanks.
-- 
tejun
Tejun Heo
2011-Mar-29  17:09 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Here''s the combined patch I was planning on testing but didn''t
get to
(yet).  It implements two things - hard limit on spin duration and
early break if the owner also is spinning on a mutex.
Thanks.
Index: work1/include/linux/sched.h
==================================================================---
work1.orig/include/linux/sched.h
+++ work1/include/linux/sched.h
@@ -359,7 +359,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: work1/kernel/sched.c
==================================================================---
work1.orig/kernel/sched.c
+++ work1/kernel/sched.c
@@ -536,6 +536,10 @@ struct rq {
 	struct hrtimer hrtick_timer;
 #endif
 
+#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
+	bool spinning_on_mutex;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	/* latency stats */
 	struct sched_info rq_sched_info;
@@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
 /*
- * Look out! "owner" is an entirely speculative pointer
- * access and not reliable.
+ * Maximum mutex owner spin duration in nsecs.  Don''t spin more then
+ * DEF_TIMESLICE.
+ */
+#define MAX_MUTEX_SPIN_NS	(DEF_TIMESLICE * 1000000000LLU / HZ)
+
+/**
+ * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex
+ * @lock: the mutex to spin on
+ * @owner: the current owner (speculative pointer)
+ *
+ * The caller is trying to acquire @lock held by @owner.  If @owner is
+ * currently running, it might get unlocked soon and spinning on it can
+ * save the overhead of sleeping and waking up.
+ *
+ * Note that @owner is completely speculative and may be completely
+ * invalid.  It should be accessed very carefully.
+ *
+ * Forward progress is guaranteed regardless of locking ordering by never
+ * spinning longer than MAX_MUTEX_SPIN_NS.  This is necessary because
+ * mutex_trylock(), which doesn''t have to follow the usual locking
+ * ordering, also uses this function.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if the lock was released and the caller should retry locking.
+ * %false if the caller better go sleeping.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 {
+	unsigned long start;
 	unsigned int cpu;
 	struct rq *rq;
+	bool ret = true;
 
 	if (!sched_feat(OWNER_SPIN))
-		return 0;
+		return false;
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 	/*
@@ -4039,7 +4071,7 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the mutex owner just released it and exited.
 	 */
 	if (probe_kernel_address(&owner->cpu, cpu))
-		return 0;
+		return false;
 #else
 	cpu = owner->cpu;
 #endif
@@ -4049,15 +4081,17 @@ int mutex_spin_on_owner(struct mutex *lo
 	 * the cpu field may no longer be valid.
 	 */
 	if (cpu >= nr_cpumask_bits)
-		return 0;
+		return false;
 
 	/*
 	 * We need to validate that we can do a
 	 * get_cpu() and that we have the percpu area.
 	 */
 	if (!cpu_online(cpu))
-		return 0;
+		return false;
 
+	this_rq()->spinning_on_mutex = true;
+	start = local_clock();
 	rq = cpu_rq(cpu);
 
 	for (;;) {
@@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo
 			 * we likely have heavy contention. Return 0 to quit
 			 * optimistic spinning and not contend further:
 			 */
-			if (lock->owner)
-				return 0;
+			ret = !lock->owner;
 			break;
 		}
 
 		/*
-		 * Is that owner really running on that cpu?
+		 * Quit spinning if any of the followings is true.
+		 *
+		 * - The owner isn''t running on that cpu.
+		 * - The owner also is spinning on a mutex.
+		 * - Someone else wants to use this cpu.
+		 * - We''ve been spinning for too long.
 		 */
-		if (task_thread_info(rq->curr) != owner || need_resched())
-			return 0;
+		if (task_thread_info(rq->curr) != owner ||
+		    rq->spinning_on_mutex || need_resched() ||
+		    local_clock() > start + MAX_MUTEX_SPIN_NS) {
+			ret = false;
+			break;
+		}
 
 		arch_mutex_cpu_relax();
 	}
 
-	return 1;
+	this_rq()->spinning_on_mutex = false;
+	return ret;
 }
 #endif
Peter Zijlstra
2011-Mar-29  17:37 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote:> Here''s the combined patch I was planning on testing but didn''t get to > (yet). It implements two things - hard limit on spin duration and > early break if the owner also is spinning on a mutex.This is going to give massive conflicts with https://lkml.org/lkml/2011/3/2/286 https://lkml.org/lkml/2011/3/2/282 which I was planning to stuff into .40> @@ -4021,16 +4025,44 @@ EXPORT_SYMBOL(schedule); > > #ifdef CONFIG_MUTEX_SPIN_ON_OWNER > /* > + * Maximum mutex owner spin duration in nsecs. Don''t spin more then > + * DEF_TIMESLICE. > + */ > +#define MAX_MUTEX_SPIN_NS (DEF_TIMESLICE * 1000000000LLU / HZ)DEF_TIMESLICE is SCHED_RR only, so its use here is dubious at best, also I bet we have something like NSEC_PER_SEC to avoid counting ''0''s.> + > +/** > + * mutex_spin_on_owner - optimistic adaptive spinning on locked mutex > + * @lock: the mutex to spin on > + * @owner: the current owner (speculative pointer) > + * > + * The caller is trying to acquire @lock held by @owner. If @owner is > + * currently running, it might get unlocked soon and spinning on it can > + * save the overhead of sleeping and waking up. > + * > + * Note that @owner is completely speculative and may be completely > + * invalid. It should be accessed very carefully. > + * > + * Forward progress is guaranteed regardless of locking ordering by never > + * spinning longer than MAX_MUTEX_SPIN_NS. This is necessary because > + * mutex_trylock(), which doesn''t have to follow the usual locking > + * ordering, also uses this function.While that puts a limit on things it''ll still waste time. I''d much rather pass an trylock argument to mutex_spin_on_owner() and then bail on owner also spinning.> + * CONTEXT: > + * Preemption disabled. > + * > + * RETURNS: > + * %true if the lock was released and the caller should retry locking. > + * %false if the caller better go sleeping. > */ > -int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) > +bool mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner) > {> @@ -4070,21 +4104,30 @@ int mutex_spin_on_owner(struct mutex *lo > * we likely have heavy contention. Return 0 to quit > * optimistic spinning and not contend further: > */ > + ret = !lock->owner; > break; > } > > /* > - * Is that owner really running on that cpu? > + * Quit spinning if any of the followings is true. > + * > + * - The owner isn''t running on that cpu. > + * - The owner also is spinning on a mutex. > + * - Someone else wants to use this cpu. > + * - We''ve been spinning for too long. > */ > + if (task_thread_info(rq->curr) != owner || > + rq->spinning_on_mutex || need_resched() || > + local_clock() > start + MAX_MUTEX_SPIN_NS) {While we did our best with making local_clock() cheap, I''m still fairly uncomfortable with putting it in such a tight loop.> + ret = false; > + break; > + } > > arch_mutex_cpu_relax(); > } > > + this_rq()->spinning_on_mutex = false; > + return ret; > } > #endif >
Tejun Heo
2011-Mar-30  08:17 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Hey, Peter. On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote:> On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote: > > Here''s the combined patch I was planning on testing but didn''t get to > > (yet). It implements two things - hard limit on spin duration and > > early break if the owner also is spinning on a mutex. > > This is going to give massive conflicts with > > https://lkml.org/lkml/2011/3/2/286 > https://lkml.org/lkml/2011/3/2/282 > > which I was planning to stuff into .40I see. Adapting shouldn''t be hard. The patch is in proof-of-concept stage anyway.> > + * Forward progress is guaranteed regardless of locking ordering by never > > + * spinning longer than MAX_MUTEX_SPIN_NS. This is necessary because > > + * mutex_trylock(), which doesn''t have to follow the usual locking > > + * ordering, also uses this function. > > While that puts a limit on things it''ll still waste time. I''d much > rather pass an trylock argument to mutex_spin_on_owner() and then bail > on owner also spinning.Do we guarantee or enforce that the lock ownership can''t be transferred to a different task? If we do, the recursive spinning detection is enough to guarantee forward progress.> > + if (task_thread_info(rq->curr) != owner || > > + rq->spinning_on_mutex || need_resched() || > > + local_clock() > start + MAX_MUTEX_SPIN_NS) { > > While we did our best with making local_clock() cheap, I''m still fairly > uncomfortable with putting it in such a tight loop.That''s one thing I didn''t really understand. It seems the spinning code tried to be light on CPU cycle usage, but we''re wasting CPU cycles there anyway. If the spinning can become smarter using some CPU cycles, isn''t that a gain? Why is the spinning code optimizing for less CPU cycles? Also, it would be great if you can share what you''re using for locking performance measurements. My attempts with dbench didn''t end too well. :-( Thanks. -- tejun
Peter Zijlstra
2011-Mar-30  11:29 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, 2011-03-30 at 10:17 +0200, Tejun Heo wrote:> Hey, Peter. > > On Tue, Mar 29, 2011 at 07:37:33PM +0200, Peter Zijlstra wrote: > > On Tue, 2011-03-29 at 19:09 +0200, Tejun Heo wrote: > > > Here''s the combined patch I was planning on testing but didn''t get to > > > (yet). It implements two things - hard limit on spin duration and > > > early break if the owner also is spinning on a mutex. > > > > This is going to give massive conflicts with > > > > https://lkml.org/lkml/2011/3/2/286 > > https://lkml.org/lkml/2011/3/2/282 > > > > which I was planning to stuff into .40 > > I see. Adapting shouldn''t be hard. The patch is in proof-of-concept > stage anyway. > > > > + * Forward progress is guaranteed regardless of locking ordering by never > > > + * spinning longer than MAX_MUTEX_SPIN_NS. This is necessary because > > > + * mutex_trylock(), which doesn''t have to follow the usual locking > > > + * ordering, also uses this function. > > > > While that puts a limit on things it''ll still waste time. I''d much > > rather pass an trylock argument to mutex_spin_on_owner() and then bail > > on owner also spinning. > > Do we guarantee or enforce that the lock ownership can''t be > transferred to a different task? If we do, the recursive spinning > detection is enough to guarantee forward progress.The only way to switch owner is for the current owner to release and a new owner to acquire the lock. Also we already bail the spin loop when owner changes.> > > + if (task_thread_info(rq->curr) != owner || > > > + rq->spinning_on_mutex || need_resched() || > > > + local_clock() > start + MAX_MUTEX_SPIN_NS) { > > > > While we did our best with making local_clock() cheap, I''m still fairly > > uncomfortable with putting it in such a tight loop. > > That''s one thing I didn''t really understand. It seems the spinning > code tried to be light on CPU cycle usage, but we''re wasting CPU > cycles there anyway. If the spinning can become smarter using some > CPU cycles, isn''t that a gain? Why is the spinning code optimizing > for less CPU cycles?Loop exit latency mostly, the lighter the loop, the faster you''re through the less time you waste once the condition is true, but yeah, what you say is true too, its a balance game as always.> Also, it would be great if you can share what you''re using for locking > performance measurements. My attempts with dbench didn''t end too > well. :-(The thing I used for the initial implementation is mentioned in the changelog (0d66bf6d3): Testing with Ingo''s test-mutex application (http://lkml.org/lkml/2006/1/8/50) gave a 345% boost for VFS scalability on my testbox: # ./test-mutex-shm V 16 10 | grep "^avg ops" avg ops/sec: 296604 # ./test-mutex-shm V 16 10 | grep "^avg ops" avg ops/sec: 85870 I''ve no idea how heavy that is on trylock though, you''d have to look at that. Chris did some improvements using dbench in ac6e60ee4 but clearly that isn''t working for you.
Chris Mason
2011-Mar-30  11:46 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Excerpts from Tejun Heo''s message of 2011-03-29 12:37:02 -0400:> Hello, guys. > > I''ve been running dbench 50 for a few days now and the result is, > well, I don''t know how to call it. > > The problem was that the original patch didn''t do anything because x86 > fastpath code didn''t call into the generic slowpath at all. > > static inline int __mutex_fastpath_trylock(atomic_t *count, > int (*fail_fn)(atomic_t *)) > { > if (likely(atomic_cmpxchg(count, 1, 0) == 1)) > return 1; > else > return 0; > } > > So, I thought that I probably was doing unconscious data selection > while I was running the test before sending out the patches. Maybe I > was seeing what I wanted to see, so I ran tests in larger scale more > methodologically. > > I first started with ten consecutive runs and then doubled it with > intervening reboot and then basically ended up doing that twice for > four configuration (I didn''t do two runs of simple and refactor but > just averaged the two). > > The hardware is mostly the same except that I switched to a hard drive > instead of SSD as hard drives tend to be slower but more consistent in > performance numbers. On each run, the filesystem is recreated and the > system was rebooted after every ten runs. The numbers are the > reported throughput in MiB/s at the end of each run. > > https://spreadsheets.google.com/ccc?key=0AsbaQh2SFt66dDdxOGZWVVlIbEdIOWRQLURVVUNYSXc&hl=en > > Here are the descriptions of the eight columns. > > simple only with patch to make btrfs use mutex > refactor mutex_spin() factored out > spin mutex_spin() applied to the unused trylock slowpath > spin-1 ditto > spin-fixed x86 trylock fastpath updated to use generic slowpath > spin-fixed-1 ditto > code-layout refactor + dummy function added to mutex.c > code-layout-1 ditto > > After running the simple, refactor and spin ones, I was convinced that > there definitely was something which was causing the difference. The > averages were apart by more than 1.5 sigma, but I couldn''t explain > what could have caused such difference.I have another workload that is interesting for this, basically N (100 or so) procs all doing stats on a bunch of files in the same directory. This hammers very hard on the btree lookups. During the first set of mutex spinning patches, doing any kind of breakout on the owner spin (when the owner hasn''t changed) made it much slower. You might want to try again with the 2.6.39-rc1 btrfs (or pull my for-linus-unmerged tree into .38) because this gets rid of an unneeded spinlock for the root node. All your work convinced me to dig out my seqlock patches for read only tree block operations. I think we can take your current btrfs patches and combine the seqlock stuff for a huge benefit. In this case, the only thing we''re really missing is a way to mutex_lock without the cond_resched() The biggest trylock user in btrfs is only there so we can keep the spinning lock instead of the blocking lock. Since you''re getting rid of the whole spin vs block setup, we can switch directly to a lock. -chris
Peter Zijlstra
2011-Mar-30  11:52 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote:> > In this case, the only thing we''re really missing is a way to mutex_lock > without the cond_resched()So you''re trying to explicitly avoid a voluntary preemption point? Seems like a bad idea, normally people add those :-)
Chris Mason
2011-Mar-30  11:59 UTC
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
Excerpts from Peter Zijlstra''s message of 2011-03-30 07:52:04 -0400:> On Wed, 2011-03-30 at 07:46 -0400, Chris Mason wrote: > > > > In this case, the only thing we''re really missing is a way to mutex_lock > > without the cond_resched() > > So you''re trying to explicitly avoid a voluntary preemption point? Seems > like a bad idea, normally people add those :-)Yeah, but the btrfs fast path (when we''re able to spin today) looks like this: spin_lock(parent) binary search, select slot pull block for that slot out of cache spin_lock(child) spin_unlock(parent) If we switch it all to mutexes: mutex_lock(parent) binary search, select slot pull block for that slot out of cache mutex_lock(child) mutex_unlock(parent) Most of the spinning vs blocking benefits in btrfs came from doing special things (like dropping the parent lock) when we know we''re going to block with the parent lock held. Surprise blocking in mutex_lock isn''t great. It would probably be enough to just move the cond_resched() after the spinning portion of the mutex_lock() -chris