Peter Zijlstra
2014-Feb-28 09:29 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
> >> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> >> +
> >> +	if (old == 0) {
> >> +		/*
> >> +		 * Got the lock, can clear the waiting bit now
> >> +		 */
> >> +		smp_u8_store_release(&qlock->wait, 0);
> >
> > So we just did an atomic op, and now you're trying to optimize this
> > write. Why do you need a whole byte for that?
> >
> > Surely a cmpxchg loop with the right atomic op can't be _that_ much
> > slower? Its far more readable and likely avoids that steal fail below as
> > well.
>
> At low contention level, atomic operations that require a lock prefix are
> the major contributor to the total execution time. I saw an estimate online
> that executing a lock-prefixed instruction can easily take 50X longer than a
> regular instruction that can be pipelined. That is why I try to do it with
> as few lock-prefixed instructions as possible. If I have to do an atomic
> cmpxchg, it probably won't be faster than the regular qspinlock slowpath.

At low contention the cmpxchg won't have to be retried (much), so using
it won't be a problem and you get to have arbitrary atomic ops.

> Given that speed at low contention level, which is the common case, is
> important to get this patch accepted, I have to do what I can to make it run
> as fast as possible for this 2-contending-task case.

What I'm saying is that you can do the whole thing with a single
cmpxchg. No extra ops needed. And at that point you don't need a whole
byte, you can use a single bit.

That removes the whole NR_CPUS dependent logic.

> >> +	/*
> >> +	 * Someone has stolen the lock, so wait again
> >> +	 */
> >> +	goto try_again;
> >
> > That's just a fail.. steals should not ever be allowed. It's a fair lock
> > after all.
>
> The code is unfair, but this unfairness helps it run faster than the ticket
> spinlock in this particular case. And the regular qspinlock slowpath is
> fair. A little bit of unfairness in this particular case helps its speed.

*groan*, no, unfairness is not cool. The ticket lock is absolutely fair; we
should preserve this.

BTW; can you share your benchmark thingy?
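For reference, the single-cmpxchg scheme Peter is arguing for amounts to something like the sketch below. The trylock_or_pending() name is made up here; _QSPINLOCK_LOCKED and _QSPINLOCK_PENDING are assumed to be single bits in one 32-bit lock word, matching the code Peter posts later in the thread. Treat it as an illustration, not either series' actual patch:

static inline bool trylock_or_pending(struct qspinlock *lock)
{
	u32 val, old, new;

	val = atomic_read(&lock->val);
	for (;;) {
		/* anything beyond the locked bit means real contention: go queue */
		if (val & ~_QSPINLOCK_LOCKED)
			return false;

		/* 0,0,0 -> 0,0,1 (trylock)  or  0,0,1 -> 0,1,1 (become pending) */
		new = val ? (val | _QSPINLOCK_PENDING) : _QSPINLOCK_LOCKED;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			return true;	/* single atomic op, nothing to fix up */
		val = old;
	}
}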
Linus Torvalds
2014-Feb-28 16:25 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz at infradead.org> wrote:
>
> At low contention the cmpxchg won't have to be retried (much) so using
> it won't be a problem and you get to have arbitrary atomic ops.

Peter, the difference between an atomic op and *no* atomic op is huge.

And Waiman posted numbers for the optimization. Why do you argue with
handwaving and against numbers?

      Linus
Waiman Long
2014-Feb-28 16:38 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 02/28/2014 04:29 AM, Peter Zijlstra wrote:
> On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
>>>> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
>>>> +
>>>> +	if (old == 0) {
>>>> +		/*
>>>> +		 * Got the lock, can clear the waiting bit now
>>>> +		 */
>>>> +		smp_u8_store_release(&qlock->wait, 0);
>>> So we just did an atomic op, and now you're trying to optimize this
>>> write. Why do you need a whole byte for that?
>>>
>>> Surely a cmpxchg loop with the right atomic op can't be _that_ much
>>> slower? Its far more readable and likely avoids that steal fail below as
>>> well.
>> At low contention level, atomic operations that require a lock prefix are
>> the major contributor to the total execution time. I saw an estimate online
>> that executing a lock-prefixed instruction can easily take 50X longer than a
>> regular instruction that can be pipelined. That is why I try to do it with
>> as few lock-prefixed instructions as possible. If I have to do an atomic
>> cmpxchg, it probably won't be faster than the regular qspinlock slowpath.
> At low contention the cmpxchg won't have to be retried (much), so using
> it won't be a problem and you get to have arbitrary atomic ops.
>
>> Given that speed at low contention level, which is the common case, is
>> important to get this patch accepted, I have to do what I can to make it run
>> as fast as possible for this 2-contending-task case.
> What I'm saying is that you can do the whole thing with a single
> cmpxchg. No extra ops needed. And at that point you don't need a whole
> byte, you can use a single bit.
>
> That removes the whole NR_CPUS dependent logic.

After modifying it to do a deterministic cmpxchg, the test run time of 2
contending tasks jumps up from 600ms (best case) to about 1700ms, which is
worse than the original qspinlock's 1300-1500ms. It is the opportunistic
nature of the xchg() code that can potentially combine multiple steps of the
deterministic atomic sequence into one, which saves time. Without that, I
would rather go back to the basic qspinlock queuing sequence for 2
contending tasks.

Please take a look at the performance data in my patch 3 to see if the
slowdowns at 2 and 3 contending tasks are acceptable or not.

The reason why I need a whole byte for the lock bit is the simple unlock
code of assigning 0 to the lock byte by the lock holder. Utilizing other
bits in the low byte for other purposes will complicate the unlock path and
slow down the no-contention case.

>>>> +	/*
>>>> +	 * Someone has stolen the lock, so wait again
>>>> +	 */
>>>> +	goto try_again;
>>> That's just a fail.. steals should not ever be allowed. It's a fair lock
>>> after all.
>> The code is unfair, but this unfairness helps it run faster than the ticket
>> spinlock in this particular case. And the regular qspinlock slowpath is
>> fair. A little bit of unfairness in this particular case helps its speed.
> *groan*, no, unfairness is not cool. The ticket lock is absolutely fair; we
> should preserve this.

We can preserve that by removing patch 3.

> BTW; can you share your benchmark thingy?

I have attached the test program that I used to generate the timing data
for patch 3.

-Longman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: locktest.tar.gz
Type: application/x-gzip
Size: 5244 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140228/a8dc1c7a/attachment.gz>
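For illustration, the byte-wide lock Waiman is describing keeps the unlock path down to a single store. The struct layout and field names below are assumptions sketched after the fields visible in the quoted patch, and smp_u8_store_release() is the helper from that patch:

struct qspinlock_sketch {
	union {
		atomic_t val;
		struct {
			u8  locked;	/* whole byte owned by the lock holder */
			u8  wait;	/* whole byte for the waiting bit */
			u16 qcode;	/* queue tail code */
		};
	};
};

static inline void sketch_unlock(struct qspinlock_sketch *lock)
{
	/* plain release store of the whole byte; no lock-prefixed instruction */
	smp_u8_store_release(&lock->locked, 0);
}

If other state shared that low byte, the unlock would have to clear only its own bits with an atomic read-modify-write, which is exactly the cost being avoided on the uncontended path.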
Peter Zijlstra
2014-Feb-28 17:37 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Fri, Feb 28, 2014 at 08:25:24AM -0800, Linus Torvalds wrote:
> On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <peterz at infradead.org> wrote:
> >
> > At low contention the cmpxchg won't have to be retried (much) so using
> > it won't be a problem and you get to have arbitrary atomic ops.
>
> Peter, the difference between an atomic op and *no* atomic op is huge.

I know. I'm just asking what the difference is between the xchg() -- an
atomic op -- and a cmpxchg(), also an atomic op.

The xchg() makes the entire thing somewhat difficult: we need to fix up
all kinds of state if we guessed wrong about what was in the variables.

> And Waiman posted numbers for the optimization. Why do you argue with
> handwaving and against numbers?

I've asked for his benchmark..
Peter Zijlstra
2014-Feb-28 17:56 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
> After modifying it to do a deterministic cmpxchg, the test run time of 2
> contending tasks jumps up from 600ms (best case) to about 1700ms, which is
> worse than the original qspinlock's 1300-1500ms. It is the opportunistic
> nature of the xchg() code that can potentially combine multiple steps of the
> deterministic atomic sequence into one, which saves time. Without that, I
> would rather go back to the basic qspinlock queuing sequence for 2
> contending tasks.
>
> Please take a look at the performance data in my patch 3 to see if the
> slowdowns at 2 and 3 contending tasks are acceptable or not.

Right; so I've gone back to a simple version (~200 lines) that's fairly
easy to comprehend (to me, since I wrote it). And will now try to see if
I can find the same state transitions in your code.

I find your code somewhat hard to follow; mostly due to those xchg() +
fixup thingies. But I'll get there.

> The reason why I need a whole byte for the lock bit is the simple unlock
> code of assigning 0 to the lock byte by the lock holder. Utilizing other
> bits in the low byte for other purposes will complicate the unlock path and
> slow down the no-contention case.

Yeah, I get why you need a whole byte for the lock part; I was asking if
we really need another whole byte for the pending part.

So in my version I think I see an optimization case where this is indeed
useful and I can trade an atomic op for a write barrier, which should be
a big win. It just wasn't at all clear (to me) from your code.

(I also think the optimization isn't x86 specific)

> >> The code is unfair, but this unfairness helps it run faster than the ticket
> >> spinlock in this particular case. And the regular qspinlock slowpath is
> >> fair. A little bit of unfairness in this particular case helps its speed.
>
> > *groan*, no, unfairness is not cool. The ticket lock is absolutely fair; we
> > should preserve this.
>
> We can preserve that by removing patch 3.

I've got a version that does the pending thing and still is entirely
fair. I don't think the concept of the extra spinner is incompatible
with fairness.

> > BTW; can you share your benchmark thingy?
>
> I have attached the test program that I used to generate the timing data
> for patch 3.

Thanks, I'll have a play.
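Concretely, the atomic-op-for-write-barrier trade shows up as the OPT variant in the code Peter posts later in the thread. A minimal sketch, assuming the locked byte lives at offset 0 of the lock word and the pending byte at offset 1:

	/* instead of an atomic cmpxchg that sets locked and clears pending: */
	((u8 *)lock)[0] = 1;	/* take the lock */
	smp_wmb();		/* order the lock store before dropping pending */
	((u8 *)lock)[1] = 0;	/* free the pending slot for the next contender */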
Peter Zijlstra
2014-Mar-03 17:43 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)			AMD (qspinlock + pending + opt)

Local:				Local:

 1:    324.425530		 1:    324.102142
 2:  17141.324050		 2:    620.185930
 3:  52212.232343		 3:  25242.574661
 4:  93136.458314		 4:  47982.037866
 6: 167967.455965		 6:  95345.011864
 8: 245402.534869		 8: 142412.451438

2 - nodes:			2 - nodes:

 2:  12763.640956		 2:   1879.460823
 4:  94423.027123		 4:  48278.719130
 6: 167903.698361		 6:  96747.767310
 8: 257243.508294		 8: 144672.846317

4 - nodes:			4 - nodes:

 4:  82408.853603		 4:  49820.323075
 8: 260492.952355		 8: 143538.264724
16: 630099.031148		16: 337796.553795

Intel (ticket)			Intel (qspinlock + pending + opt)

Local:				Local:

 1:     19.002249		 1:     29.002844
 2:   5093.275530		 2:   1282.209519
 3:  22300.859761		 3:  22127.477388
 4:  44929.922325		 4:  44493.881832
 6:  86338.755247		 6:  86360.083940

2 - nodes:			2 - nodes:

 2:   1509.193824		 2:   1209.090219
 4:  48154.495998		 4:  48547.242379
 8: 137946.787244		 8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane):

 - Intel seems to be an order of magnitude faster on uncontended LOCKed
   ops compared to AMD.

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD.

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions). I'll try and slot your
version in tomorrow.

---

/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
	u32 code;

	code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	code |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

	return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
	int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

	return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING	(1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK		(_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 * fast                   :    slow                                  :    unlock
 *                        :                                          :
 * uncontended  (0,0,0) --:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                        :       | ^--------.------.             /  :
 *                        :       v           \      \            |  :
 * pending                :    (0,1,1) +--> (0,1,0)   \           |  :
 *                        :       | ^--'              |           |  :
 *                        :       v                   |           |  :
 * uncontended            :    (n,x,y) +--> (n,0,0) --'           |  :
 *   queue                :       | ^--'                          |  :
 *                        :       v                               |  :
 * contended              :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
 *   queue                :                                          :
 *
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 val, new, old, code;
	int idx;

#if PENDING
	/*
	 * trylock || pending
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		/*
		 * If we observe any contention; queue.
		 */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _QSPINLOCK_LOCKED;
		if (val == new)
			new |= _QSPINLOCK_PENDING;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock
	 */
	if (new == _QSPINLOCK_LOCKED)
		return;
#else
	/*
	 * we can ignore the (unlikely) trylock case and have a fall-through on
	 * the wait below.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;

	if (xchg(&(((u8 *)lock)[1]), 1))
		goto queue;

//	could not observe a significant difference
//	between the one (xchg) and the other (bts) unconditional
//	LOCKed op
//
//	if (atomic_test_and_set_bit(_Q_PENDING_OFFSET, &lock->val))
//		goto queue;
#endif

	/*
	 * we're pending, wait for the owner to go away.
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_LOCKED)
		cpu_relax();

	/*
	 * take ownership and clear the pending bit.
	 */
#if !OPT
	for (;;) {
		new = (val & ~_QSPINLOCK_PENDING) | _QSPINLOCK_LOCKED;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
#else
	((u8 *)lock)[0] = 1; /* locked */
	smp_wmb();
	((u8 *)lock)[1] = 0; /* pending */

//	there is a big difference between an atomic and
//	no atomic op.
//
//	smp_mb__before_atomic_inc();
//	atomic_clear_bit(_Q_PENDING_OFFSET, &lock->val);
#endif
	return;

queue:
#endif
	node = this_cpu_ptr(&mcs_nodes[0]);
	idx = node->count++;
	code = encode_tail(smp_processor_id(), idx);

	node += idx;
	node->locked = 0;
	node->next = NULL;

	/*
	 * we already touched the queueing cacheline; don't bother with pending
	 * stuff.
	 *
	 * trylock || xchg(lock, node)
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * p,y,x -> n,y,x ; prev = xchg(lock, node)
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val)
			new = code | (val & _QSPINLOCK_MASK);

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock; forget about queueing.
	 */
	if (new == _QSPINLOCK_LOCKED)
		goto release;
#else
	/*
	 * Like with the pending case; we can ignore the unlikely trylock case
	 * and have a fall-through on the wait.
	 */
	old = xchg(&((u16 *)lock)[1], code >> 16) << 16;
#endif

	/*
	 * if there was a previous node; link it and wait.
	 */
	if (old & ~_QSPINLOCK_MASK) {
		prev = decode_tail(old);
		ACCESS_ONCE(prev->next) = node;

		arch_mcs_spin_lock_contended(&node->locked);
	}

	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_MASK)
		cpu_relax();

	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,0,0 -> *,0,1 : lock, contended
	 */
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val != code)
			new |= val;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * contended path; wait for next, release.
	 */
	if (new != _QSPINLOCK_LOCKED) {
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();

		arch_mcs_spin_unlock_contended(&next->locked);
	}

release:
	/*
	 * release the node
	 */
	this_cpu_ptr(&mcs_nodes[0])->count--;
//	this_cpu_dec(mcs_nodes[0].count);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlocks.tar.bz2
Type: application/octet-stream
Size: 10164 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140303/03dc8269/attachment.obj>
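For completeness, the uncontended fast path referred to in the numbers above (the cmpxchg that is compared against the ticket xadd) is not part of the posted slowpath. It would look something like the sketch below, assuming the same lock-word layout with the locked byte at offset 0; this is an illustration, not the exact code from either series, and the unlock relies on x86 store ordering plus a compiler barrier:

static __always_inline void queue_spin_lock(struct qspinlock *lock)
{
	/* uncontended: one cmpxchg claims the lock */
	if (likely(atomic_cmpxchg(&lock->val, 0, _QSPINLOCK_LOCKED) == 0))
		return;
	queue_spin_lock_slowpath(lock);
}

static __always_inline void queue_spin_unlock(struct qspinlock *lock)
{
	barrier();
	ACCESS_ONCE(((u8 *)lock)[0]) = 0;	/* byte-wide unlock, no atomic RMW */
}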