Peter Zijlstra
2014-Mar-03 17:43 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)                    AMD (qspinlock + pending + opt)

Local:                          Local:

1:    324.425530                1:    324.102142
2:  17141.324050                2:    620.185930
3:  52212.232343                3:  25242.574661
4:  93136.458314                4:  47982.037866
6: 167967.455965                6:  95345.011864
8: 245402.534869                8: 142412.451438

2 - nodes:                      2 - nodes:

2:  12763.640956                2:   1879.460823
4:  94423.027123                4:  48278.719130
6: 167903.698361                6:  96747.767310
8: 257243.508294                8: 144672.846317

4 - nodes:                      4 - nodes:

 4:  82408.853603                4:  49820.323075
 8: 260492.952355                8: 143538.264724
16: 630099.031148               16: 337796.553795


Intel (ticket)                  Intel (qspinlock + pending + opt)

Local:                          Local:

1:    19.002249                 1:    29.002844
2:  5093.275530                 2:  1282.209519
3: 22300.859761                 3: 22127.477388
4: 44929.922325                 4: 44493.881832
6: 86338.755247                 6: 86360.083940

2 - nodes:                      2 - nodes:

2:   1509.193824                2:   1209.090219
4:  48154.495998                4:  48547.242379
8: 137946.787244                8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane).

 - Intel seems to be an order of magnitude faster on uncontended LOCKed
   ops compared to AMD

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD.

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions).

I'll try and slot your version in tomorrow.

---

/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
	u32 code;

	code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	code |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

	return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
	int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

	return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING	(1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK		(_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 *              fast      :    slow                                  :    unlock
 *                        :                                          :
 * uncontended  (0,0,0) --:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                        :       | ^--------.------.             /  :
 *                        :       v           \      \            |  :
 * pending                :    (0,1,1) +--> (0,1,0)   \           |  :
 *                        :       | ^--'              |           |  :
 *                        :       v                   |           |  :
 * uncontended            :    (n,x,y) +--> (n,0,0) --'           |  :
 *   queue                :       | ^--'                          |  :
 *                        :       v                               |  :
 * contended              :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
 *   queue                :                                          :
 *
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 val, new, old, code;
	int idx;

#if PENDING
	/*
	 * trylock || pending
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		/*
		 * If we observe any contention; queue.
		 */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _QSPINLOCK_LOCKED;
		if (val == new)
			new |= _QSPINLOCK_PENDING;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock
	 */
	if (new == _QSPINLOCK_LOCKED)
		return;
#else
	/*
	 * we can ignore the (unlikely) trylock case and have a fall-through on
	 * the wait below.
	 */
	if (val & ~_Q_LOCKED_MASK)
		goto queue;

	if (xchg(&(((u8 *)lock)[1]), 1))
		goto queue;

//	could not observe a significant difference
//	between the one (xchg) and the other (bts) unconditional
//	LOCKed op
//
//	if (atomic_test_and_set_bit(_Q_PENDING_OFFSET, &lock->val))
//		goto queue;
#endif

	/*
	 * we're pending, wait for the owner to go away.
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_LOCKED)
		cpu_relax();

	/*
	 * take ownership and clear the pending bit.
	 */
#if !OPT
	for (;;) {
		new = (val & ~_QSPINLOCK_PENDING) | _QSPINLOCK_LOCKED;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
#else
	((u8 *)lock)[0] = 1; /* locked */
	smp_wmb();
	((u8 *)lock)[1] = 0; /* pending */

//	there is a big difference between an atomic and
//	no atomic op.
//
//	smp_mb__before_atomic_inc();
//	atomic_clear_bit(_Q_PENDING_OFFSET, &lock->val);
#endif
	return;

queue:
#endif
	node = this_cpu_ptr(&mcs_nodes[0]);
	idx = node->count++;
	code = encode_tail(smp_processor_id(), idx);

	node += idx;
	node->locked = 0;
	node->next = NULL;

	/*
	 * we already touched the queueing cacheline; don't bother with pending
	 * stuff.
	 *
	 * trylock || xchg(lock, node)
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * p,y,x -> n,y,x ; prev = xchg(lock, node)
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val)
			new = code | (val & _QSPINLOCK_MASK);

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * we won the trylock; forget about queueing.
	 */
	if (new == _QSPINLOCK_LOCKED)
		goto release;
#else
	/*
	 * Like with the pending case; we can ignore the unlikely trylock case
	 * and have a fall-through on the wait.
	 */
	old = xchg(&((u16 *)lock)[1], code >> 16) << 16;
#endif

	/*
	 * if there was a previous node; link it and wait.
	 */
	if (old & ~_QSPINLOCK_MASK) {
		prev = decode_tail(old);
		ACCESS_ONCE(prev->next) = node;

		arch_mcs_spin_lock_contended(&node->locked);
	}

	/*
	 * we're at the head of the waitqueue, wait for the owner & pending to
	 * go away.
	 *
	 * *,x,y -> *,0,0
	 */
	while ((val = atomic_read(&lock->val)) & _QSPINLOCK_MASK)
		cpu_relax();

	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,0,0 -> *,0,1 : lock, contended
	 */
	for (;;) {
		new = _QSPINLOCK_LOCKED;
		if (val != code)
			new |= val;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/*
	 * contended path; wait for next, release.
	 */
	if (new != _QSPINLOCK_LOCKED) {
		while (!(next = ACCESS_ONCE(node->next)))
			arch_mutex_cpu_relax();

		arch_mcs_spin_unlock_contended(&next->locked);
	}

release:
	/*
	 * release the node
	 */
	this_cpu_ptr(&mcs_nodes[0])->count--;
//	this_cpu_dec(mcs_nodes[0].count);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlocks.tar.bz2
Type: application/octet-stream
Size: 10164 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140303/03dc8269/attachment.obj>
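A minimal, self-contained sketch of the kind of userspace measurement
described above -- total cycles spent in a lock()+unlock() pair, counted
with the TSC. The actual harness is in the scrubbed spinlocks.tar.bz2
attachment; the trivial test-and-set lock and every name below (tas_lock,
iters) are illustrative stand-ins, not the code that produced these
numbers.

/* Build: gcc -O2 -o cycles cycles.c (x86 only; single-threaded here,
 * the real tests run one pinned thread per contending CPU). */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static volatile int test_lock;		/* stand-in for the lock under test */

static void tas_lock(void)
{
	/* spin on a simple test-and-set lock */
	while (__sync_lock_test_and_set(&test_lock, 1))
		while (test_lock)
			__builtin_ia32_pause();
}

static void tas_unlock(void)
{
	__sync_lock_release(&test_lock);
}

int main(void)
{
	const unsigned long iters = 1000000;
	uint64_t total = 0;

	for (unsigned long i = 0; i < iters; i++) {
		uint64_t t0 = __rdtsc();	/* timestamp before lock() */
		tas_lock();
		tas_unlock();
		total += __rdtsc() - t0;	/* accumulate lock+unlock cost */
	}
	printf("avg cycles per lock+unlock: %.6f\n", (double)total / iters);
	return 0;
}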
Waiman Long
2014-Mar-04 15:27 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 03/03/2014 12:43 PM, Peter Zijlstra wrote:
> Hi,
>
> Here are some numbers for my version -- also attached is the test code.
> I found that booting big machines is tediously slow so I lifted the
> whole lot to userspace.
>
> I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().
>
> The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
> (2 socket) Intel Westmere-EP.
>
> [benchmark tables quoted in full above]
>
> There are a few curious facts I found (assuming my test code is sane).
>
>  - Intel seems to be an order of magnitude faster on uncontended LOCKed
>    ops compared to AMD
>
>  - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
>    than the uncontended ticket xadd -- although both are plenty fast
>    when compared to AMD.
>
>  - In general, replacing cmpxchg loops with unconditional atomic ops
>    doesn't seem to matter a whole lot when the thing is contended.
>
> Below is the (rather messy) qspinlock slow path code (the only thing
> that really differs between our versions).
>
> I'll try and slot your version in tomorrow.
>
> [slow path code quoted in full above]

It is curious to see that the qspinlock code offers a big benefit on AMD
machines, but not so much on Intel. Anyway, I am working on a revised
version of the patch that includes some of your comments. I will also try
to see if I can get an AMD machine to run tests on.

-Longman
Peter Zijlstra
2014-Mar-04 16:58 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Updated version, this includes numbers for my SNB desktop and Waiman's
variant.

Curiously, Waiman's version seems consistently slower on 2 cross-node
CPUs, whereas my version seems to have a problem on SNB with 2 CPUs.

There's something weird with the ticket lock numbers; when I compile
the code with:

  gcc (Debian 4.7.2-5) 4.7.2

I get the first set; when I compile with:

  gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3

I get the second set; afaict the other locks don't seem to have this
problem, but I only just noticed.

---

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node
(2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.

(ticket)                 (qspinlock + all)       (waiman)

AMD Interlagos

Local:

 1:    324.425530         1:    324.102142        1:    323.857834
 2:  17141.324050         2:    620.185930        2:    618.737681
 3:  52212.232343         3:  25242.574661        3:  24888.154161
 4:  93136.458314         4:  47982.037866        4:  48227.610976
 6: 167967.455965         6:  95345.011864        6:  94372.276116
 8: 245402.534869         8: 142412.451438        8: 140516.525791

 1:    324.071515
 2:    981.778516
 3:  24414.144262
 4:  50868.376667
 6:  99536.890504
 8: 151616.395779

2 - nodes:

 2:  12763.640956         2:   1879.460823        2:   2023.594014
 4:  94423.027123         4:  48278.719130        4:  48621.724929
 6: 167903.698361         6:  96747.767310        6:  95815.242461
 8: 257243.508294         8: 144672.846317        8: 143282.222038

 2:   1875.637053
 4:  50082.796058
 6: 107780.361523
 8: 163166.728218

4 - nodes:

 4:  82408.853603         4:  49820.323075        4:  50566.036473
 8: 260492.952355         8: 143538.264724        8: 143485.584624
16: 630099.031148        16: 337796.553795       16: 333419.421338

 4:  55409.896671
 8: 167340.905593
16: 443195.057052


Intel WSM-EP

Local:

 1:    19.002249          1:    29.002844         1:    28.979917
 2:  5093.275530          2:  1282.209519         2:  1236.624785
 3: 22300.859761          3: 22127.477388         3: 22336.241120
 4: 44929.922325          4: 44493.881832         4: 44832.450598
 6: 86338.755247          6: 86360.083940         6: 85808.603491

 1:    20.009974
 2:  1206.419074
 3: 22071.535000
 4: 44606.831373
 6: 86498.760774

2 - nodes:

 2:   1527.466159         2:   1227.051993        2:   1434.204666
 4:  46004.232179         4:  46450.787234        4:  46999.356429
 6:  89226.472057         6:  90124.984324        6:  90110.423115
 8: 137225.472406         8: 137909.184358        8: 137988.290401


Intel SNB

Local:

 1:    15.276759          1:    25.336807         1:    25.353041
 2:   714.621152          2:   843.240641         2:   711.281211
 3: 11339.104267          3: 11751.159167         3: 11684.286334
 4: 22648.387376          4: 23454.798068         4: 22903.498910

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlocks.tar.bz2
Type: application/octet-stream
Size: 14659 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140304/82db38f9/attachment-0001.obj>
Waiman Long
2014-Mar-04 17:48 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Peter,

I was trying to implement the generic queue code exchange code using
cmpxchg as suggested by you. However, when I gathered the performance
data, the code performed worse than I expected at a higher contention
level. Below are the execution times of the benchmark tool that I sent
you:

                                [xchg]          [cmpxchg]
  # of tasks    Ticket lock     Queue lock      Queue Lock
  ----------    -----------     -----------     ----------
       1            135             135             135
       2            732            1315            1102
       3           1827            2372            2681
       4           2689            2934            5392
       5           3736            3658            7696
       6           4942            4434            9876
       7           6304            5176           11901
       8           7736            5955           14551

Below is the code that I used:

static inline u32 queue_code_xchg(struct qspinlock *lock, u32 *ocode,
				  u32 ncode)
{
	while (true) {
		u32 qlcode = atomic_read(&lock->qlcode);

		if (qlcode == 0) {
			/*
			 * Try to get the lock
			 */
			if (atomic_cmpxchg(&lock->qlcode, 0,
					   _QSPINLOCK_LOCKED) == 0)
				return 1;
		} else if (qlcode & _QSPINLOCK_LOCKED) {
			*ocode = atomic_cmpxchg(&lock->qlcode, qlcode,
						ncode | _QSPINLOCK_LOCKED);
			if (*ocode == qlcode) {
				/* Clear lock bit before return */
				*ocode &= ~_QSPINLOCK_LOCKED;
				return 0;
			}
		}
		/*
		 * Wait if atomic_cmpxchg() fails or lock is temporarily free.
		 */
		arch_mutex_cpu_relax();
	}
}

My cmpxchg code is not optimal, and I can probably tune the code to make
it perform better. Given the trend that I was seeing, however, I think I
will keep the current xchg code, but I will package it in an inline
function.

-Longman
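For reference, a sketch of what the "current xchg code" mentioned above
might look like, assuming the same 32-bit qlcode layout as the cmpxchg
version (lock byte in the low byte, queue code in the upper 16 bits). It
is modeled on the unconditional 16-bit xchg in Peter's OPT2 path; it is
not Waiman's actual implementation, and the exact return-value semantics
are an assumption.

static inline u32 queue_code_xchg(struct qspinlock *lock, u32 *ocode,
				  u32 ncode)
{
	/*
	 * Publish our tail code unconditionally; the previous tail (if any)
	 * comes back in a single atomic op, with no cmpxchg retry loop to
	 * lose under contention.
	 */
	*ocode = (u32)xchg(&((u16 *)lock)[1], (u16)(ncode >> 16)) << 16;

	if (*ocode == 0) {
		/*
		 * No previous tail code: we are first in line.  Try to take
		 * the lock byte directly; if it is still held, fall back to
		 * queueing as usual.
		 */
		if (cmpxchg(&((u8 *)lock)[0], 0, _QSPINLOCK_LOCKED) == 0)
			return 1;	/* lock acquired, no queueing */
	}
	return 0;			/* caller queues behind *ocode */
}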
Peter Zijlstra
2014-Mar-04 18:09 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 05:58:00PM +0100, Peter Zijlstra wrote:
>  2:  17141.324050         2:    620.185930        2:    618.737681

So I forgot that AMD has compute units that share L2:

root@interlagos:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
982.938839
1325.702905
root@interlagos:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
630.684313
999.119087
root@interlagos:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
620.562791
1644.700639

Doing the same for Intel SMT, which shares L1:

root@westmere:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
45.765302
1292.721827
root@westmere:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
54.536890
1260.467527
root@westmere:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
65.944794
1230.522895
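The "$LOCK 0 1" / "$LOCK 0 2" invocations above presumably pin the two
contending threads to the CPUs named on the command line, so that the
pair does or does not share a compute-unit L2 (AMD) or an SMT L1 (Intel).
A minimal sketch of that kind of pinning, as an assumption about the
attached harness rather than its actual code:

/* Build: gcc -O2 -pthread -o pin pin.c ; run: ./pin 0 2 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void *contender(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	/* bind this thread to the requested CPU before hammering the lock */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* ... lock()/unlock() measurement loop would go here ... */
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t[2];

	if (argc < 3) {
		fprintf(stderr, "usage: %s <cpu0> <cpu1>\n", argv[0]);
		return 1;
	}

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, contender,
			       (void *)atol(argv[i + 1]));
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	return 0;
}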
Peter Zijlstra
2014-Mar-04 22:40 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> Peter,
>
> I was trying to implement the generic queue code exchange code using
> cmpxchg as suggested by you. However, when I gathered the performance
> data, the code performed worse than I expected at a higher contention
> level. Below are the execution times of the benchmark tool that I sent
> you:
>
>                                 [xchg]          [cmpxchg]
>   # of tasks    Ticket lock     Queue lock      Queue Lock
>   ----------    -----------     -----------     ----------
>        1            135             135             135
>        2            732            1315            1102
>        3           1827            2372            2681
>        4           2689            2934            5392
>        5           3736            3658            7696
>        6           4942            4434            9876
>        7           6304            5176           11901
>        8           7736            5955           14551

I'm just not seeing that; with test-4 modified to take the AMD compute
units into account:

root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh
 4:  50783.509653
 8: 146295.875715
16: 332942.964709
 4:  51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg
loops with unconditional ops (xchg8 and xchg16).

And I'd think that 4 CPUs x 4 nodes would be heavy contention.

I'll have another poke tomorrow, including verifying the asm; need to go
sleep now.