thr3ads.net - search: "contender"

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

2014 Mar 12

0

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX x86-64 CPU with 2 different types of loads - standalone (lock and

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 26

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code path. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with 2 different types of loads - standalone (lock and protected

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 27

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code path. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westmere-EX CPU with 2 different types of loads - standalone (lock and protected

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/28/2014 04:29 AM, Peter Zijlstra wrote: > On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote: >>>> + old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED); >>>> + >>>> + if (old == 0) { >>>> + /* >>>> + * Got the lock, can clear the waiting bit now >>>> + */ >>>> +

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote: > >>+ old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED); > >>+ > >>+ if (old == 0) { > >>+ /* > >>+ * Got the lock, can clear the waiting bit now > >>+ */ > >>+ smp_u8_store_release(&qlock->wait, 0); > > > >So we just did an
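
The quoted fragment swaps a combined lock/wait word and treats an old value of 0 as "lock acquired". A minimal userspace sketch of that idea, assuming C11 atomics, might look like the following. The names `demo_qlock`, `demo_quick_lock`, `DEMO_LOCKED` and `DEMO_WAITING` and the bit layout are hypothetical, and an atomic AND is used here in place of the patch's byte-wide `smp_u8_store_release()`. The cases where the old value is non-zero are left to the caller and are not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: low byte = lock, high byte = wait.
 * These names and values are illustrative, not the patch's. */
#define DEMO_LOCKED   0x0001u
#define DEMO_WAITING  0x0100u

struct demo_qlock {
    _Atomic uint16_t lock_wait;
};

/*
 * Unconditionally swap in WAITING|LOCKED. If the old value was 0 the
 * lock was free and uncontended, so the caller owns it and only has
 * to drop the waiting bit again. The old value is returned so the
 * caller can handle the remaining cases (not shown here).
 */
static uint16_t demo_quick_lock(struct demo_qlock *q, int *acquired)
{
    uint16_t old = atomic_exchange(&q->lock_wait,
                                   (uint16_t)(DEMO_WAITING | DEMO_LOCKED));

    *acquired = (old == 0);
    if (*acquired) {
        /* Got the lock; clear the waiting bit, keep the lock bit. */
        atomic_fetch_and_explicit(&q->lock_wait,
                                  (uint16_t)~DEMO_WAITING,
                                  memory_order_release);
    }
    return old;
}
```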

[PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014 Jun 16

4

[PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

On Sun, Jun 15, 2014 at 02:46:58PM +0200, Peter Zijlstra wrote: > From: Waiman Long <Waiman.Long at hp.com> > > This patch introduces a new generic queue spinlock implementation that > can serve as an alternative to the default ticket spinlock. Compared > with the ticket spinlock, this queue spinlock should be almost as fair > as the ticket spinlock. It has about the same

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 27

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/26/2014 11:20 AM, Peter Zijlstra wrote: > You don't happen to have a proper state diagram for this thing do you? > > I suppose I'm going to have to make one; this is all getting a bit > unwieldy, and those xchg() + fixup things are hard to read. I don't have a state diagram on hand, but I will add more comments to describe the 4 possible cases and how to handle

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz at infradead.org> wrote: > > At low contention the cmpxchg won't have to be retried (much) so using > it won't be a problem and you get to have arbitrary atomic ops. Peter, the difference between an atomic op and *no* atomic op is huge. And Waiman posted numbers for the optimization. Why do you argue with

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 02

1

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On 02/26, Waiman Long wrote: > > @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock) > int qlcode = atomic_read(lock->qlcode); > > if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode, > - qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode)) > + qlcode, code|_QSPINLOCK_LOCKED) == qlcode)) Hmm.
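
The read-then-cmpxchg pattern in the quoted hunk can be sketched in userspace C11 as below. This is only an illustration of the general technique: `demo_setlock` and `DEMO_LOCKED` are hypothetical names, and the new value is built from the value just read, which corresponds to the line the hunk removes rather than the one it adds.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_LOCKED 1  /* hypothetical lock bit */

/* Try to take the lock once; any other bits in the word are preserved. */
static bool demo_setlock(atomic_int *qlcode)
{
    int old = atomic_load_explicit(qlcode, memory_order_relaxed);

    if (old & DEMO_LOCKED)
        return false;           /* already held */

    /* Succeeds only if the word still equals the value we read. */
    return atomic_compare_exchange_strong_explicit(
        qlcode, &old, old | DEMO_LOCKED,
        memory_order_acquire, memory_order_relaxed);
}
```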

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Updated version, this includes numbers for my SNB desktop and Waiman's variant. Curiously Waiman's version seems consistently slower on 2 cross node CPUs. Whereas my version seems to have a problem on SNB with 2 CPUs. There's something weird with the ticket lock numbers; when I compile the code with: gcc (Debian 4.7.2-5) 4.7.2 I get the first set; when I compile with: gcc

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Peter, I was trying to implement the generic queue code exchange code using cmpxchg as suggested by you. However, when I gathered the performance data, the code performed worse than I expected at a higher contention level. Below are the execution times of the benchmark tool that I sent you: [xchg] [cmpxchg] # of tasks Ticket lock Queue lock Queue Lock

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

1

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote: > Peter, > > I was trying to implement the generic queue code exchange code using > cmpxchg as suggested by you. However, when I gathered the performance > data, the code performed worse than I expected at a higher contention > level. Below are the execution times of the benchmark tool that I sent > you: > >

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

2014 Mar 13

0

[PATCH v6 04/11] qspinlock: Optimized code path for 2 contending tasks

On Wed, Mar 12, 2014 at 03:08:24PM -0400, Waiman Long wrote: > On 03/12/2014 02:54 PM, Waiman Long wrote: > >+ /* > >+ * Set the lock bit & clear the waiting bit simultaneously > >+ * It is assumed that there is no lock stealing with this > >+ * quick path active. > >+ * > >+ * A direct memory store of _QSPINLOCK_LOCKED into the > >+ *

Creating a contended section of bandwidth with HTB and IMQ

2007 Feb 27

2

Creating a contended section of bandwidth with HTB and IMQ

Hi All, I'm trying to create a contended section of bandwidth using IMQ. I have the imq0 device up and running, with traffic passing through it. Firstly, I need to throttle the entire device imq0 to 2mbit/s. I would then like to add throttle rules for individual IP addresses, allowing them to pass up to 512kbit/s each, as long as imq0 has not reached its 2mbit/s. The configuration

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 03

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Hi, Here are some numbers for my version -- also attached is the test code. I found that booting big machines is tediously slow so I lifted the whole lot to userspace. I measure the cycles spent in arch_spin_lock() + arch_spin_unlock(). The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node (2 socket) Intel Westmere-EP. AMD (ticket) AMD (qspinlock + pending + opt) Local:
