thr3ads.net - search: "contended"

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

2014 Mar 12

0

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of loads - standalone (lock and

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 26

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code path. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with 2 different types of loads - standalone (lock and protected

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 27

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code path. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with 2 different types of loads - standalone (lock and protected

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/28/2014 04:29 AM, Peter Zijlstra wrote: > On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote: >>>> + old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED); >>>> + >>>> + if (old == 0) { >>>> + /* >>>> + * Got the lock, can clear the waiting bit now >>>> + */ >>>> +

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote: > >>+ old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED); > >>+ > >>+ if (old == 0) { > >>+ /* > >>+ * Got the lock, can clear the waiting bit now > >>+ */ > >>+ smp_u8_store_release(&qlock->wait, 0); > > > >So we just did an

[PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014 Jun 16

4

[PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

...nel/locking/mcs_spinlock.h > +++ linux-2.6/kernel/locking/mcs_spinlock.h > @@ -17,6 +17,7 @@ > struct mcs_spinlock { > struct mcs_spinlock *next; > int locked; /* 1 if lock acquired */ > + int count; This could use a comment. > }; > > #ifndef arch_mcs_spin_lock_contended > Index: linux-2.6/kernel/locking/qspinlock.c > =================================================================== > --- /dev/null > +++ linux-2.6/kernel/locking/qspinlock.c > @@ -0,0 +1,197 @@ > +/* > + * Queue spinlock > + * > + * This program is free software; you can...

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 27

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/26/2014 11:20 AM, Peter Zijlstra wrote: > You don't happen to have a proper state diagram for this thing do you? > > I suppose I'm going to have to make one; this is all getting a bit > unwieldy, and those xchg() + fixup things are hard to read. I don't have a state diagram on hand, but I will add more comments to describe the 4 possible cases and how to handle

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz at infradead.org> wrote: > > At low contention the cmpxchg won't have to be retried (much) so using > it won't be a problem and you get to have arbitrary atomic ops. Peter, the difference between an atomic op and *no* atomic op is huge. And Waiman posted numbers for the optimization. Why do you argue with

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 02

1

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/26, Waiman Long wrote: > > @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock) > int qlcode = atomic_read(lock->qlcode); > > if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode, > - qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode)) > + qlcode, code|_QSPINLOCK_LOCKED) == qlcode)) Hmm.
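
The trylock pattern quoted in the snippet above (read the lock word, give up if the locked bit is already set, otherwise try to set it with a single compare-and-exchange) can be sketched in userspace with C11 atomics. This is only an illustrative sketch, not the kernel code: the names `qspinlock_sketch`, `queue_spin_setlock_sketch`, and `QSPINLOCK_LOCKED_SKETCH`, as well as the bit value 1, are assumptions made for the example, and C11 `stdatomic.h` stands in for the kernel's `atomic_read()`/`atomic_cmpxchg()`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit value; the patch's real lock-word layout is not shown in the excerpt. */
#define QSPINLOCK_LOCKED_SKETCH 1

/* Hypothetical stand-in for the lock structure: a single atomic lock word. */
struct qspinlock_sketch {
    atomic_int qlcode;
};

/*
 * Try to take the lock once, without spinning.
 * Returns 1 if the locked bit was set by this call, 0 otherwise.
 */
static int queue_spin_setlock_sketch(struct qspinlock_sketch *lock)
{
    int qlcode = atomic_load_explicit(&lock->qlcode, memory_order_relaxed);

    /* Someone already holds the lock: fail without issuing an atomic RMW. */
    if (qlcode & QSPINLOCK_LOCKED_SKETCH)
        return 0;

    /* Set the locked bit only if the word is still what we read. */
    return atomic_compare_exchange_strong_explicit(
        &lock->qlcode, &qlcode, qlcode | QSPINLOCK_LOCKED_SKETCH,
        memory_order_acquire, memory_order_relaxed);
}
```

The plain load before the compare-and-exchange means a caller that sees the lock already held never issues the more expensive atomic read-modify-write.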

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Updated version, this includes numbers for my SNB desktop and Waiman's variant. Curiously Waiman's version seems consistently slower on 2 cross node CPUs. Whereas my version seems to have a problem on SNB with 2 CPUs. There's something weird with the ticket lock numbers; when I compile the code with: gcc (Debian 4.7.2-5) 4.7.2 I get the first set; when I compile with: gcc

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Peter, I was trying to implement the generic queue code exchange code using cmpxchg as suggested by you. However, when I gathered the performance data, the code performed worse than I expected at a higher contention level. Below are the execution times of the benchmark tool that I sent you: [xchg] [cmpxchg] # of tasks Ticket lock Queue lock Queue Lock

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

1

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote: > Peter, > > I was trying to implement the generic queue code exchange code using > cmpxchg as suggested by you. However, when I gathered the performance > data, the code performed worse than I expected at a higher contention > level. Below are the execution times of the benchmark tool that I sent > you: > >

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

2014 Mar 13

0

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

On Wed, Mar 12, 2014 at 03:08:24PM -0400, Waiman Long wrote: > On 03/12/2014 02:54 PM, Waiman Long wrote: > >+ /* > >+ * Set the lock bit & clear the waiting bit simultaneously > >+ * It is assumed that there is no lock stealing with this > >+ * quick path active. > >+ * > >+ * A direct memory store of _QSPINLOCK_LOCKED into the > >+ *

Creating a contended section of bandwidth with HTB and IMQ

2007 Feb 27

2

Creating a contended section of bandwidth with HTB and IMQ

Hi All, I'm trying to create a contended section of bandwidth using IMQ. I have the imq0 device up and running, with traffic passing through it. Firstly, I need to throttle the entire device imq0 to 2mbit/s. I would then like to add throttle rules for individual IP addresses, allowing them to pass up to 512kbit/s each, as long as imq0...

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 03

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

...360.083940 2 - nodes: 2 - nodes: 2: 1509.193824 2: 1209.090219 4: 48154.495998 4: 48547.242379 8: 137946.787244 8: 141381.498125 --- There are a few curious facts I found (assuming my test code is sane). - Intel seems to be an order of magnitude faster on uncontended LOCKed ops compared to AMD - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower than the uncontended ticket xadd -- although both are plenty fast when compared to AMD. - In general, replacing cmpxchg loops with unconditional atomic ops doesn't seem to matter a w...
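
The snippet above contrasts cmpxchg loops with unconditional atomic ops. As a rough illustration of that distinction (not the benchmark code from the thread), the two styles of setting a bit in a lock word might look like this in userspace C11. The function names and the `SKETCH_LOCKED` bit are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit used only for this illustration. */
#define SKETCH_LOCKED 1

/*
 * cmpxchg-loop style: read the word, compute the new value, and retry
 * until no other thread changed the word in between.
 * Returns the value observed just before the bit was set.
 */
static int set_locked_cmpxchg_loop(atomic_int *word)
{
    int old = atomic_load_explicit(word, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(
               word, &old, old | SKETCH_LOCKED,
               memory_order_acquire, memory_order_relaxed)) {
        /* On failure, 'old' has been refreshed with the current value. */
    }
    return old;
}

/*
 * Unconditional style: a single atomic OR that cannot fail or retry.
 * Returns the value observed just before the bit was set.
 */
static int set_locked_unconditional(atomic_int *word)
{
    return atomic_fetch_or_explicit(word, SKETCH_LOCKED, memory_order_acquire);
}
```

Both functions leave the word in the same final state. They differ in that the loop may retry under contention, while the fetch-or always completes in one atomic operation.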
