Peter Zijlstra
2014-Mar-03 17:43 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow, so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

 AMD (ticket)                  AMD (qspinlock + pending + opt)

Local:                         Local:

 1:    324.425530               1:    324.102142
 2:  17141.324050               2:    620.185930
 3:  52212.232343               3:  25242.574661
 4:  93136.458314               4:  47982.037866
 6: 167967.455965               6:  95345.011864
 8: 245402.534869               8: 142412.451438

2 - nodes:                     2 - nodes:

 2:  12763.640956               2:   1879.460823
 4:  94423.027123               4:  48278.719130
 6: 167903.698361               6:  96747.767310
 8: 257243.508294               8: 144672.846317

4 - nodes:                     4 - nodes:

 4:  82408.853603               4:  49820.323075
 8: 260492.952355               8: 143538.264724
16: 630099.031148              16: 337796.553795


 Intel (ticket)                Intel (qspinlock + pending + opt)

Local:                         Local:

 1:     19.002249               1:     29.002844
 2:   5093.275530               2:   1282.209519
 3:  22300.859761               3:  22127.477388
 4:  44929.922325               4:  44493.881832
 6:  86338.755247               6:  86360.083940

2 - nodes:                     2 - nodes:

 2:   1509.193824               2:   1209.090219
 4:  48154.495998               4:  48547.242379
 8: 137946.787244               8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane).

 - Intel seems to be an order of magnitude faster on uncontended LOCKed
   ops compared to AMD.

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD.

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions).

I'll try and slot your version in tomorrow.

---

/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
        u32 code;

        code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
        code |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

        return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
        int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
        int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

        return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING      (1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK         (_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 * fast        :    slow                                   :    unlock
 *             :                                           :
 * uncontended (0,0,0) --:--> (0,0,1) ---------------------:--> (*,*,0)
 *             :            | ^--------.------.         /  :
 *             :            v           \      \        |  :
 * pending     :         (0,1,1) +--> (0,1,0)   \       |  :
 *             :            | ^--'              |       |  :
 *             :            v                   |       |  :
 * uncontended :         (n,x,y) +--> (n,0,0) --'       |  :
 *   queue     :            | ^--'                      |  :
 *             :            v                           |  :
 * contended   :         (*,x,y) +--> (*,0,0) ---> (*,0,1) -'
 *   queue     :                                            :
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
        struct mcs_spinlock *prev, *next, *node;
        u32 val, new, old, code;
        int idx;

#if PENDING
        /*
         * trylock || pending
         *
         * 0,0,0 -> 0,0,1 ; trylock
         * 0,0,1 -> 0,1,1 ; pending
         */
        val = atomic_read(&lock->val);
#if !OPT2
        for (;;) {
                /*
                 * If we observe any contention; queue.
                 */
                if (val & ~_Q_LOCKED_MASK)
                        goto queue;

                new = _QSPINLOCK_LOCKED;
                if (val == new)
                        new |= _QSPINLOCK_PENDING;

                old = atomic_cmpxchg(&lock->val, val, new);
                if (old == val)
                        break;

                val = old;
        }

        /*
         * we won the trylock
         */
        if (new == _QSPINLOCK_LOCKED)
                return;
#else
        /*
         * we can ignore the (unlikely) trylock case and have a fall-through on
         * the wait below.
         */
        if (val & ~_Q_LOCKED_MASK)
                goto queue;

        if (xchg(&(((u8 *)lock)[1]), 1))
                goto queue;

        // could not observe a significant difference
        // between the one (xchg) and the other (bts) unconditional
        // LOCKed op
        //
        // if (atomic_test_and_set_bit(_Q_PENDING_OFFSET, &lock->val))
        //         goto queue;
#endif

        /*
         * we're pending, wait for the owner to go away.
         */
        while ((val = atomic_read(&lock->val)) & _QSPINLOCK_LOCKED)
                cpu_relax();

        /*
         * take ownership and clear the pending bit.
         */
#if !OPT
        for (;;) {
                new = (val & ~_QSPINLOCK_PENDING) | _QSPINLOCK_LOCKED;

                old = atomic_cmpxchg(&lock->val, val, new);
                if (old == val)
                        break;

                val = old;
        }
#else
        ((u8 *)lock)[0] = 1; /* locked */
        smp_wmb();
        ((u8 *)lock)[1] = 0; /* pending */

        // there is a big difference between an atomic and
        // no atomic op.
        //
        // smp_mb__before_atomic_inc();
        // atomic_clear_bit(_Q_PENDING_OFFSET, &lock->val);
#endif
        return;

queue:
#endif
        node = this_cpu_ptr(&mcs_nodes[0]);
        idx = node->count++;
        code = encode_tail(smp_processor_id(), idx);

        node += idx;
        node->locked = 0;
        node->next = NULL;

        /*
         * we already touched the queueing cacheline; don't bother with pending
         * stuff.
         *
         * trylock || xchg(lock, node)
         *
         * 0,0,0 -> 0,0,1 ; trylock
         * p,y,x -> n,y,x ; prev = xchg(lock, node)
         */
        val = atomic_read(&lock->val);
#if !OPT2
        for (;;) {
                new = _QSPINLOCK_LOCKED;
                if (val)
                        new = code | (val & _QSPINLOCK_MASK);

                old = atomic_cmpxchg(&lock->val, val, new);
                if (old == val)
                        break;

                val = old;
        }

        /*
         * we won the trylock; forget about queueing.
         */
        if (new == _QSPINLOCK_LOCKED)
                goto release;
#else
        /*
         * Like with the pending case; we can ignore the unlikely trylock case
         * and have a fall-through on the wait.
         */
        old = xchg(&((u16 *)lock)[1], code >> 16) << 16;
#endif

        /*
         * if there was a previous node; link it and wait.
         */
        if (old & ~_QSPINLOCK_MASK) {
                prev = decode_tail(old);
                ACCESS_ONCE(prev->next) = node;

                arch_mcs_spin_lock_contended(&node->locked);
        }

        /*
         * we're at the head of the waitqueue, wait for the owner & pending to
         * go away.
         *
         * *,x,y -> *,0,0
         */
        while ((val = atomic_read(&lock->val)) & _QSPINLOCK_MASK)
                cpu_relax();

        /*
         * claim the lock:
         *
         * n,0,0 -> 0,0,1 : lock, uncontended
         * *,0,0 -> *,0,1 : lock, contended
         */
        for (;;) {
                new = _QSPINLOCK_LOCKED;
                if (val != code)
                        new |= val;

                old = atomic_cmpxchg(&lock->val, val, new);
                if (old == val)
                        break;

                val = old;
        }

        /*
         * contended path; wait for next, release.
         */
        if (new != _QSPINLOCK_LOCKED) {
                while (!(next = ACCESS_ONCE(node->next)))
                        arch_mutex_cpu_relax();

                arch_mcs_spin_unlock_contended(&next->locked);
        }

release:
        /*
         * release the node
         */
        this_cpu_ptr(&mcs_nodes[0])->count--;
//      this_cpu_dec(mcs_nodes[0].count);
}
EXPORT_SYMBOL(queue_spin_lock_slowpath);

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlocks.tar.bz2
Type: application/octet-stream
Size: 10164 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140303/03dc8269/attachment.obj>
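For reference, the measurement described above boils down to timing each
arch_spin_lock() + arch_spin_unlock() pair with the TSC and summing the
cycles; a minimal userspace sketch of that idea (hypothetical names, not
the attached test code) looks like this:

/*
 * Illustrative only: a rough userspace harness in the spirit of the
 * measurement described above, NOT the attached spinlocks.tar.bz2.
 * arch_spin_lock()/arch_spin_unlock() stand in for whichever lock
 * implementation is compiled in.
 */
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        /* rdtsc is good enough for a rough per-iteration cycle count on x86 */
        asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

extern void arch_spin_lock(void *lock);         /* provided by the lock under test */
extern void arch_spin_unlock(void *lock);

static uint64_t measure(void *lock, unsigned long loops)
{
        uint64_t total = 0;

        for (unsigned long i = 0; i < loops; i++) {
                uint64_t t0 = rdtsc();

                arch_spin_lock(lock);
                arch_spin_unlock(lock);

                total += rdtsc() - t0;
        }

        return total;
}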
Waiman Long
2014-Mar-04 15:27 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 03/03/2014 12:43 PM, Peter Zijlstra wrote:
> Hi,
>
> Here are some numbers for my version -- also attached is the test code.
> I found that booting big machines is tediously slow, so I lifted the
> whole lot to userspace.
>
> I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().
>
> The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
> (2 socket) Intel Westmere-EP.
>
>  AMD (ticket)                  AMD (qspinlock + pending + opt)
>
> Local:                         Local:
>
>  1:    324.425530               1:    324.102142
>  2:  17141.324050               2:    620.185930
>  3:  52212.232343               3:  25242.574661
>  4:  93136.458314               4:  47982.037866
>  6: 167967.455965               6:  95345.011864
>  8: 245402.534869               8: 142412.451438
>
> 2 - nodes:                     2 - nodes:
>
>  2:  12763.640956               2:   1879.460823
>  4:  94423.027123               4:  48278.719130
>  6: 167903.698361               6:  96747.767310
>  8: 257243.508294               8: 144672.846317
>
> 4 - nodes:                     4 - nodes:
>
>  4:  82408.853603               4:  49820.323075
>  8: 260492.952355               8: 143538.264724
> 16: 630099.031148              16: 337796.553795
>
>
>  Intel (ticket)                Intel (qspinlock + pending + opt)
>
> Local:                         Local:
>
>  1:     19.002249               1:     29.002844
>  2:   5093.275530               2:   1282.209519
>  3:  22300.859761               3:  22127.477388
>  4:  44929.922325               4:  44493.881832
>  6:  86338.755247               6:  86360.083940
>
> 2 - nodes:                     2 - nodes:
>
>  2:   1509.193824               2:   1209.090219
>  4:  48154.495998               4:  48547.242379
>  8: 137946.787244               8: 141381.498125
>
> ---
>
> There are a few curious facts I found (assuming my test code is sane).
>
>  - Intel seems to be an order of magnitude faster on uncontended LOCKed
>    ops compared to AMD.
>
>  - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
>    than the uncontended ticket xadd -- although both are plenty fast
>    when compared to AMD.
>
>  - In general, replacing cmpxchg loops with unconditional atomic ops
>    doesn't seem to matter a whole lot when the thing is contended.
>
> Below is the (rather messy) qspinlock slow path code (the only thing
> that really differs between our versions).
>
> I'll try and slot your version in tomorrow.
>
> ---

It is curious to see that the qspinlock code offers a big benefit on AMD
machines, but not so much on Intel.

Anyway, I am working on a revised version of the patch that includes
some of your comments. I will also try to see if I can get an AMD
machine to run tests on.

-Longman
Peter Zijlstra
2014-Mar-04 16:58 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Updated version; this includes numbers for my SNB desktop and Waiman's
variant.

Curiously, Waiman's version seems consistently slower with 2 cross-node
CPUs, whereas my version seems to have a problem on SNB with 2 CPUs.

There's something weird with the ticket lock numbers; when I compile
the code with:

  gcc (Debian 4.7.2-5) 4.7.2

I get the first set; when I compile with:

  gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3

I get the second set. AFAICT the other locks don't seem to have this
problem, but I only just noticed.

---

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node
(2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.

  (ticket)                   (qspinlock + all)          (waiman)

AMD Interlagos

Local:

 1:    324.425530            1:    324.102142            1:    323.857834
 2:  17141.324050            2:    620.185930            2:    618.737681
 3:  52212.232343            3:  25242.574661            3:  24888.154161
 4:  93136.458314            4:  47982.037866            4:  48227.610976
 6: 167967.455965            6:  95345.011864            6:  94372.276116
 8: 245402.534869            8: 142412.451438            8: 140516.525791

 1:    324.071515
 2:    981.778516
 3:  24414.144262
 4:  50868.376667
 6:  99536.890504
 8: 151616.395779

2 - nodes:

 2:  12763.640956            2:   1879.460823            2:   2023.594014
 4:  94423.027123            4:  48278.719130            4:  48621.724929
 6: 167903.698361            6:  96747.767310            6:  95815.242461
 8: 257243.508294            8: 144672.846317            8: 143282.222038

 2:   1875.637053
 4:  50082.796058
 6: 107780.361523
 8: 163166.728218

4 - nodes:

 4:  82408.853603            4:  49820.323075            4:  50566.036473
 8: 260492.952355            8: 143538.264724            8: 143485.584624
16: 630099.031148           16: 337796.553795           16: 333419.421338

 4:  55409.896671
 8: 167340.905593
16: 443195.057052

Intel WSM-EP

Local:

 1:     19.002249            1:     29.002844            1:     28.979917
 2:   5093.275530            2:   1282.209519            2:   1236.624785
 3:  22300.859761            3:  22127.477388            3:  22336.241120
 4:  44929.922325            4:  44493.881832            4:  44832.450598
 6:  86338.755247            6:  86360.083940            6:  85808.603491

 1:     20.009974
 2:   1206.419074
 3:  22071.535000
 4:  44606.831373
 6:  86498.760774

2 - nodes:

 2:   1527.466159            2:   1227.051993            2:   1434.204666
 4:  46004.232179            4:  46450.787234            4:  46999.356429
 6:  89226.472057            6:  90124.984324            6:  90110.423115
 8: 137225.472406            8: 137909.184358            8: 137988.290401

Intel SNB

Local:

 1:     15.276759            1:     25.336807            1:     25.353041
 2:    714.621152            2:    843.240641            2:    711.281211
 3:  11339.104267            3:  11751.159167            3:  11684.286334
 4:  22648.387376            4:  23454.798068            4:  22903.498910

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlocks.tar.bz2
Type: application/octet-stream
Size: 14659 bytes
Desc: not available
URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20140304/82db38f9/attachment-0001.obj>
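For context, the "ticket" column above measures the classic xadd-based
ticket lock; a rough userspace rendition (an illustration only, not the
code in the attached tarball) looks like this:

/*
 * Sketch of an xadd-based ticket lock; illustrative only, not the
 * implementation in the attached tarball.
 */
#include <stdatomic.h>

struct ticket_lock {
        atomic_ushort head;     /* now-serving */
        atomic_ushort tail;     /* next free ticket */
};

static inline void ticket_lock(struct ticket_lock *lock)
{
        /* the xadd: atomically take a ticket number */
        unsigned short ticket = atomic_fetch_add_explicit(&lock->tail, 1,
                                                          memory_order_relaxed);

        /* spin until our ticket is being served */
        while (atomic_load_explicit(&lock->head, memory_order_acquire) != ticket)
                __builtin_ia32_pause();         /* cpu_relax() */
}

static inline void ticket_unlock(struct ticket_lock *lock)
{
        /* only the owner advances head, so a simple release increment suffices */
        atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}

The single fetch_add (xadd) is the uncontended fast path referred to
earlier in the thread; under contention every waiter keeps hammering the
same cacheline, which is where the queueing variants win.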
Waiman Long
2014-Mar-04 17:48 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Peter,

I was trying to implement the generic queue code exchange code using
cmpxchg as you suggested. However, when I gathered the performance
data, the code performed worse than I expected at a higher contention
level. Below are the execution times of the benchmark tool that I sent
you:

                        [xchg]          [cmpxchg]
# of tasks  Ticket lock  Queue lock     Queue Lock
----------  -----------  -----------    ----------
     1           135          135            135
     2           732         1315           1102
     3          1827         2372           2681
     4          2689         2934           5392
     5          3736         3658           7696
     6          4942         4434           9876
     7          6304         5176          11901
     8          7736         5955          14551

Below is the code that I used:

static inline u32
queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
{
        while (true) {
                u32 qlcode = atomic_read(&lock->qlcode);

                if (qlcode == 0) {
                        /*
                         * Try to get the lock
                         */
                        if (atomic_cmpxchg(&lock->qlcode, 0,
                                           _QSPINLOCK_LOCKED) == 0)
                                return 1;
                } else if (qlcode & _QSPINLOCK_LOCKED) {
                        *ocode = atomic_cmpxchg(&lock->qlcode, qlcode,
                                                ncode | _QSPINLOCK_LOCKED);
                        if (*ocode == qlcode) {
                                /* Clear lock bit before return */
                                *ocode &= ~_QSPINLOCK_LOCKED;
                                return 0;
                        }
                }
                /*
                 * Wait if atomic_cmpxchg() fails or lock is temporarily free.
                 */
                arch_mutex_cpu_relax();
        }
}

My cmpxchg code is not optimal, and I can probably tune the code to make
it perform better. Given the trend that I was seeing, however, I think I
will keep the current xchg code, but I will package it in an inline
function.

-Longman
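For contrast, the xchg-based exchange being kept avoids the retry loop
entirely; a sketch in the spirit of the OPT2 code earlier in the thread
(not the actual patch, and the helper name here is made up) amounts to:

/*
 * Sketch only: modelled on the OPT2 code earlier in the thread, not on
 * the actual patch.  Assumes the queue code occupies the upper 16 bits
 * of the lock word, as in the ((u16 *)lock)[1] accesses above.
 */
static inline u32
queue_code_xchg_sketch(struct qspinlock *lock, u32 *ocode, u32 ncode)
{
        /*
         * Unconditionally publish our tail code with a single LOCKed op;
         * the previous tail (if any) comes back in the old value.
         */
        *ocode = (u32)xchg(&((u16 *)lock)[1], (u16)(ncode >> 16)) << 16;

        return 0;       /* never takes the lock directly */
}

Like the OPT2 path above, this gives up on the embedded trylock and
simply falls through to waiting, trading an occasional unnecessary
queue for one unconditional LOCKed op instead of a cmpxchg loop.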
Peter Zijlstra
2014-Mar-04 18:09 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 05:58:00PM +0100, Peter Zijlstra wrote:
>  2:  17141.324050            2:    620.185930            2:    618.737681

So I forgot that AMD has compute units that share L2:

root@interlagos:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
982.938839
1325.702905
root@interlagos:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
630.684313
999.119087
root@interlagos:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
620.562791
1644.700639

Doing the same for Intel SMT, which shares L1:

root@westmere:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
45.765302
1292.721827
root@westmere:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
54.536890
1260.467527
root@westmere:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
65.944794
1230.522895
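The "$LOCK 0 1" vs "$LOCK 0 2" runs above pin the two contending threads
to the given CPUs, so the pair either shares a cache level (compute-unit
or SMT sibling) or not. A hypothetical sketch of that kind of pinning
(the attached test code may do this differently):

/*
 * Illustration only: bind the calling thread to one CPU from the command
 * line, so two threads can be placed on a chosen pair of CPUs.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        /* bind the calling thread to the requested CPU */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}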
Peter Zijlstra
2014-Mar-04 22:40 UTC
[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> Peter,
>
> I was trying to implement the generic queue code exchange code using
> cmpxchg as you suggested. However, when I gathered the performance
> data, the code performed worse than I expected at a higher contention
> level. Below are the execution times of the benchmark tool that I sent
> you:
>
>                         [xchg]          [cmpxchg]
> # of tasks  Ticket lock  Queue lock     Queue Lock
> ----------  -----------  -----------    ----------
>      1           135          135            135
>      2           732         1315           1102
>      3          1827         2372           2681
>      4          2689         2934           5392
>      5          3736         3658           7696
>      6          4942         4434           9876
>      7          6304         5176          11901
>      8          7736         5955          14551
>

I'm just not seeing that; with test-4 modified to take the AMD compute
units into account:

root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh
 4:  50783.509653
 8: 146295.875715
16: 332942.964709
 4:  51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg
loops with unconditional ops (xchg8 and xchg16).

And I'd think that 4 CPUs x 4 nodes would be heavy contention.

I'll have another poke tomorrow, including verifying the asm; need to
go sleep now.
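For reference, the ((u8 *)lock)[0], ((u8 *)lock)[1] and ((u16 *)lock)[1]
accesses that make those unconditional ops possible assume a lock-word
layout roughly like the following (a sketch of what the code earlier in
the thread implies on little-endian x86, not a definition lifted from
the patch set):

/*
 * Sketch of the lock-word layout implied by the byte/short accesses in
 * the opt/opt2 code above (little-endian x86); illustrative only.
 */
struct qspinlock_layout {
        union {
                unsigned int val;               /* the whole atomic word */
                struct {
                        unsigned char  locked;  /* ((u8 *)lock)[0]  - lock byte */
                        unsigned char  pending; /* ((u8 *)lock)[1]  - pending byte (OPT) */
                        unsigned short tail;    /* ((u16 *)lock)[1] - MCS tail code (OPT2 xchg16) */
                };
        };
};

With that layout, setting the locked byte, clearing the pending byte or
swapping the tail can each be a single byte/word store or xchg rather
than a cmpxchg loop on the whole word, which is exactly the opt/opt2
difference being measured above.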