thr3ads.net - similar to: "[PATCH V11 04/17] locking/qspinlock: Improve xchg

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014 Jun 15

0

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

From: Peter Zijlstra <peterz at infradead.org> When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16 bits. This means we can use xchg16 for the tail part and do away with all the repeated cmpxchg() operations. This in turn allows us to unconditionally acquire; the

[PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

2014 Apr 17

0

[PATCH v9 05/19] qspinlock: Optimize for smaller NR_CPUS

When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16 bits. This means we can use xchg16 for the tail part and do away with all the repeated cmpxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change.
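
The single-exchange idea described above can be sketched in portable C11. This is an illustrative sketch under stated assumptions, not the kernel code: `struct qsl_sketch`, `encode_tail()` and `xchg_tail16()` are hypothetical names, and the field layout (a locked byte, a pending byte, and a 16-bit tail holding a 2-bit node index plus a 14-bit CPU number + 1) is assumed here to match the NR_CPUS < 2^14 limit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: every field sits in its own naturally aligned slot,
 * so the tail can be swapped without touching locked or pending. */
struct qsl_sketch {
    _Atomic uint8_t  locked;   /* lock byte */
    _Atomic uint8_t  pending;  /* pending bit grown to a full byte */
    _Atomic uint16_t tail;     /* 2-bit node index + 14-bit (cpu + 1) */
};

/* Pack a queue tail. 0 is reserved for "queue empty", hence cpu + 1,
 * which is why cpu + 1 must fit in 14 bits. */
static uint16_t encode_tail(unsigned cpu, unsigned idx)
{
    assert(cpu + 1 < (1u << 14) && idx < 4);
    return (uint16_t)(((cpu + 1) << 2) | idx);
}

/* Publish a new tail and return the previous one with a single
 * unconditional exchange; no compare-and-swap retry loop is needed. */
static uint16_t xchg_tail16(struct qsl_sketch *lock, uint16_t tail)
{
    return atomic_exchange(&lock->tail, tail);
}
```

Because the exchange only touches the 16-bit tail, a concurrent change to the locked or pending byte cannot make it fail, which is the property the patch description relies on.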

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014 Jun 18

1

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote: > From: Peter Zijlstra <peterz at infradead.org> > > When we allow for a max NR_CPUS < 2^14 we can optimize the pending > wait-acquire and the xchg_tail() operations. > > By growing the pending bit to a byte, we reduce the tail to 16bit. > This means we can use xchg16 for the tail part and do away with

[LLVMdev] [x86] Prefetch intrinsics and prefetchw

2015 Jul 30

0

[LLVMdev] [x86] Prefetch intrinsics and prefetchw

Hi, I am looking at how the PREFETCHW instruction is matched to the IR prefetch intrinsic (and __builtin_prefetch). Consider this C program: char foo[100]; int bar(void) { __builtin_prefetch(foo, 0, 0); __builtin_prefetch(foo, 0, 1); __builtin_prefetch(foo, 0, 2); __builtin_prefetch(foo, 0, 3); __builtin_prefetch(foo, 1, 0); __builtin_prefetch(foo, 1, 1);

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014 Jun 15

0

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

From: Waiman Long <Waiman.Long at hp.com> This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. Signed-off-by: Waiman Long <Waiman.Long at hp.com> Signed-off-by: Peter Zijlstra <peterz at infradead.org> --- include/asm-generic/qspinlock_types.h | 2 +

[PATCH v9 04/19] qspinlock: Extract out the exchange of tail code word

2014 Apr 17

0

[PATCH v9 04/19] qspinlock: Extract out the exchange of tail code word

This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- include/asm-generic/qspinlock_types.h | 2 + kernel/locking/qspinlock.c | 61 +++++++++++++++++++++------------ 2 files changed, 41 insertions(+), 22 deletions(-)
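
For illustration, a generic tail exchange on a single 32-bit lock word can be written as a compare-and-swap loop. This is a minimal sketch and not the code from the patch: `xchg_tail_cmpxchg()` and `TAIL_MASK_SKETCH` are hypothetical names, and placing the tail in the upper 16 bits is an assumption made only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed for this sketch: the tail code word lives in the upper 16 bits. */
#define TAIL_MASK_SKETCH 0xffff0000u

static_assert((TAIL_MASK_SKETCH & 0xffffu) == 0,
              "tail must not overlap the low (locked/pending) bits");

/* Replace the tail bits of *val with 'tail' (already shifted into place)
 * and return the previous tail bits. The low bits are preserved. */
static uint32_t xchg_tail_cmpxchg(_Atomic uint32_t *val, uint32_t tail)
{
    uint32_t old = atomic_load(val);

    /* On failure the compare-exchange reloads 'old', so each retry
     * recomputes the new word from the latest observed value. */
    while (!atomic_compare_exchange_weak(val, &old,
                                         (old & ~TAIL_MASK_SKETCH) | tail))
        ;
    return old & TAIL_MASK_SKETCH;
}
```

The loop has to retry whenever any other bit of the word changes underneath it, which is the cost a later optimization of this helper could aim to remove.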

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014 Jun 18

0

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On 17/06/2014 22:55, Konrad Rzeszutek Wilk wrote: > On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: >> From: Waiman Long <Waiman.Long at hp.com> >> >> This patch extracts the logic for the exchange of new and previous tail >> code words into a new xchg_tail() function which can be optimized in a >> later patch. > > And also adds a

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014 Jun 17

3

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: > From: Waiman Long <Waiman.Long at hp.com> > > This patch extracts the logic for the exchange of new and previous tail > code words into a new xchg_tail() function which can be optimized in a > later patch. And also adds a third try on acquiring the lock. That I think should be a separate patch. And instead of

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014 Jul 07

0

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

On Mon, Jul 07, 2014 at 05:08:17PM +0200, Paolo Bonzini wrote: > On 07/07/2014 16:35, Peter Zijlstra wrote: > >On Wed, Jun 18, 2014 at 01:39:52PM +0200, Paolo Bonzini wrote: > >>On 15/06/2014 14:47, Peter Zijlstra wrote: > >>> > >>>- for (;;) { > >>>- new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; > >>>- >

[RFC 08/07] qspinlock: integrate pending bit into queue

2014 May 21

0

[RFC 08/07] qspinlock: integrate pending bit into queue

2014-05-21 18:49+0200, Radim Krčmář: > 2014-05-19 16:17-0400, Waiman Long: > > As for now, I will focus on just having one pending bit. > > I'll throw some ideas at it, One of the ideas follows; it seems sound, but I haven't benchmarked it thoroughly. (Wasted a lot of time by writing/playing with various tools and loads.) Dbench on ext4 ramdisk, hackbench and ebizzy

[PATCH] finish processor.h integration

2007 Dec 18

3

[PATCH] finish processor.h integration

What's left in processor_32.h and processor_64.h cannot be cleanly integrated. However, it's just a couple of definitions. They are moved to processor.h around ifdefs, and the original files are deleted. Note that there are far fewer headers included in the final version. Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com> --- include/asm-x86/processor.h | 140

[PATCH v9 06/19] qspinlock: prolong the stay in the pending bit path

2014 Apr 17

2

[PATCH v9 06/19] qspinlock: prolong the stay in the pending bit path

On Thu, Apr 17, 2014 at 11:03:58AM -0400, Waiman Long wrote: > There is a problem in the current trylock_pending() function. When the > lock is free, but the pending bit holder hasn't grabbed the lock & > cleared the pending bit yet, the trylock_pending() function will fail. I remember seeing some of this.. > It can be seen that the queue spinlock is slower than the ticket

[PATCH v9 06/19] qspinlock: prolong the stay in the pending bit path

2014 Apr 18

0

[PATCH v9 06/19] qspinlock: prolong the stay in the pending bit path

On 04/17/2014 12:36 PM, Peter Zijlstra wrote: > On Thu, Apr 17, 2014 at 11:03:58AM -0400, Waiman Long wrote: >> There is a problem in the current trylock_pending() function. When the >> lock is free, but the pending bit holder hasn't grabbed the lock & >> cleared the pending bit yet, the trylock_pending() function will fail. > I remember seeing some of this.. >

[PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable

2014 Jun 15

0

[PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable

From: Waiman Long <Waiman.Long at hp.com> Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that

[PATCH v9 07/19] qspinlock: Use a simple write to grab the lock, if applicable

2014 Apr 17

0

[PATCH v9 07/19] qspinlock: Use a simple write to grab the lock, if applicable

Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The

[PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable

2014 May 07

0

[PATCH v10 07/19] qspinlock: Use a simple write to grab the lock, if applicable

Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the queue code word. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The

similar to: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k