thr3ads.net - Linux Virtualization - [PATCH 04/11] qspinlock: Extract out the exchange of tail code word [Jun 2014]

If this information is useful, please help other people find it:
Share via:

Konrad Rzeszutek Wilk

2014-Jun-18 13:50 UTC

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini
wrote:> Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
> >On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
> >>From: Waiman Long <Waiman.Long at hp.com>
> >>
> >>This patch extracts the logic for the exchange of new and previous
tail
> >>code words into a new xchg_tail() function which can be optimized
in a
> >>later patch.
> >
> >And also adds a third try on acquiring the lock. That I think should
> >be a seperate patch.
> 
> It doesn't really add a new try, the old code is:
> 
> 
> -	for (;;) {
> -		new = _Q_LOCKED_VAL;
> -		if (val)
> -			new = tail | (val & _Q_LOCKED_PENDING_MASK);
> -
> -		old = atomic_cmpxchg(&lock->val, val, new);
> -		if (old == val)
> -			break;
> -
> -		val = old;
> -	}
> 
>  	/*
> -	 * we won the trylock; forget about queueing.
>  	 */
> -	if (new == _Q_LOCKED_VAL)
> -		goto release;
> 
> The trylock happens if the "if (val)" hits the else branch.
> 
> What the patch does is change it from attempting two transition with a
> single cmpxchg:
> 
> -	 * 0,0,0 -> 0,0,1 ; trylock
> -	 * p,y,x -> n,y,x ; prev = xchg(lock, node)
> 
> to first doing the trylock, then the xchg.  If the trylock passes and the
> xchg returns prev=0,0,0, the next step of the algorithm goes to the
> locked/uncontended state
> 
> +	/*
> +	 * claim the lock:
> +	 *
> +	 * n,0 -> 0,1 : lock, uncontended
> 
> Similar to your suggestion of patch 3, it's expected that the xchg will
> *not* return prev=0,0,0 after a failed trylock.
I do like your explanation. I hope that Peter will put it in the
description as it explains the change quite well.
> 
> However, I *do* agree with you that it's simpler to just squash this
patch
> into 01/11.
Uh, did I say that? Oh I said why don't make it right the first time!

I meant in terms of seperating the slowpath (aka the bytelock on the pending
bit) from the queue (MCS code). Or renaming the function to be called
'complex' instead of 'slowpath' as it is getting quite hairy.

The #1 patch is nice by itself - as it lays out the foundation of the
MCS-similar code - and if Ingo decides he does not want this pending
byte-lock bit business - it can be easily reverted or dropped.

In terms of squashing this in #1 - I would advocate against that.

Thanks!

Waiman Long

2014-Jun-18 15:46 UTC

head link

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On 06/18/2014 09:50 AM, Konrad Rzeszutek Wilk wrote:> On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
>> Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
>>> On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
>>>> From: Waiman Long<Waiman.Long at hp.com>
>>>>
>>>> This patch extracts the logic for the exchange of new and
previous tail
>>>> code words into a new xchg_tail() function which can be
optimized in a
>>>> later patch.
>>> And also adds a third try on acquiring the lock. That I think
should
>>> be a seperate patch.
>> It doesn't really add a new try, the old code is:
>>
>>
>> -	for (;;) {
>> -		new = _Q_LOCKED_VAL;
>> -		if (val)
>> -			new = tail | (val&  _Q_LOCKED_PENDING_MASK);
>> -
>> -		old = atomic_cmpxchg(&lock->val, val, new);
>> -		if (old == val)
>> -			break;
>> -
>> -		val = old;
>> -	}
>>
>>   	/*
>> -	 * we won the trylock; forget about queueing.
>>   	 */
>> -	if (new == _Q_LOCKED_VAL)
>> -		goto release;
>>
>> The trylock happens if the "if (val)" hits the else branch.
>>
>> What the patch does is change it from attempting two transition with a
>> single cmpxchg:
>>
>> -	 * 0,0,0 ->  0,0,1 ; trylock
>> -	 * p,y,x ->  n,y,x ; prev = xchg(lock, node)
>>
>> to first doing the trylock, then the xchg.  If the trylock passes and
the
>> xchg returns prev=0,0,0, the next step of the algorithm goes to the
>> locked/uncontended state
>>
>> +	/*
>> +	 * claim the lock:
>> +	 *
>> +	 * n,0 ->  0,1 : lock, uncontended
>>
>> Similar to your suggestion of patch 3, it's expected that the xchg
will
>> *not* return prev=0,0,0 after a failed trylock.
> I do like your explanation. I hope that Peter will put it in the
> description as it explains the change quite well.
>
>> However, I *do* agree with you that it's simpler to just squash
this patch
>> into 01/11.
> Uh, did I say that? Oh I said why don't make it right the first time!
>
> I meant in terms of seperating the slowpath (aka the bytelock on the
pending
> bit) from the queue (MCS code). Or renaming the function to be called
> 'complex' instead of 'slowpath' as it is getting quite
hairy.
>
> The #1 patch is nice by itself - as it lays out the foundation of the
> MCS-similar code - and if Ingo decides he does not want this pending
> byte-lock bit business - it can be easily reverted or dropped.
The pending bit code is needed for performance parity with ticket 
spinlock for light load. My own measurement indicates that the queuing 
overhead will cause the queue spinlock to be slower than ticket spinlock 
with 2-4 contending tasks. The pending bit solves the performance 
problem with 2 contending tasks, leave only the 3-4 tasks cases being a 
bit slower than the ticket spinlock which should be more than 
compensated by its superior performance with heavy contention and 
slightly better performance with no contention.

-Longman

Paolo Bonzini

2014-Jun-18 15:49 UTC

head link

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

Il 18/06/2014 17:46, Waiman Long ha scritto:>>
>>
>> The #1 patch is nice by itself - as it lays out the foundation of the
>> MCS-similar code - and if Ingo decides he does not want this pending
>> byte-lock bit business - it can be easily reverted or dropped.
>
> The pending bit code is needed for performance parity with ticket
> spinlock for light load. My own measurement indicates that the queuing
> overhead will cause the queue spinlock to be slower than ticket spinlock
> with 2-4 contending tasks. The pending bit solves the performance
> problem with 2 contending tasks, leave only the 3-4 tasks cases being a
> bit slower than the ticket spinlock which should be more than
> compensated by its superior performance with heavy contention and
> slightly better performance with no contention.
Note that this patch is not related to the pending bit, only to the 
trylock bit which is already in patch 1.  It serializes two 
previously-parallel checks for transitions.  This is why I thought it 
could already belong in patch 1.

Paolo

Konrad Rzeszutek Wilk

2014-Jun-18 16:02 UTC

head link

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

> >>However, I *do* agree with you that it's simpler to just squash
this patch
> >>into 01/11.
> >Uh, did I say that? Oh I said why don't make it right the first
time!
> >
> >I meant in terms of seperating the slowpath (aka the bytelock on the
pending
> >bit) from the queue (MCS code). Or renaming the function to be called
> >'complex' instead of 'slowpath' as it is getting quite
hairy.
> >
> >The #1 patch is nice by itself - as it lays out the foundation of the
> >MCS-similar code - and if Ingo decides he does not want this pending
> >byte-lock bit business - it can be easily reverted or dropped.
> 
> The pending bit code is needed for performance parity with ticket spinlock
> for light load. My own measurement indicates that the queuing overhead will
> cause the queue spinlock to be slower than ticket spinlock with 2-4
> contending tasks. The pending bit solves the performance problem with 2
Aha!
> contending tasks, leave only the 3-4 tasks cases being a bit slower than
the
> ticket spinlock which should be more than compensated by its superior
> performance with heavy contention and slightly better performance with no
> contention.
That should be mentioned in the commit description as the rationale for
the patch "qspinlock: Add pending bit" and also in the code.

Thank you!

Apparently Analagous Threads

Search for more apparently analagous threads

Linux Virtualization - Jun 2014 - [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

Apparently Analagous Threads