Nicholas Piggin
2020-Jul-07 05:57 UTC
[PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
Excerpts from Waiman Long's message of July 7, 2020 4:39 am:> On 7/6/20 12:35 AM, Nicholas Piggin wrote: >> v3 is updated to use __pv_queued_spin_unlock, noticed by Waiman (thank you). >> >> Thanks, >> Nick >> >> Nicholas Piggin (6): >> powerpc/powernv: must include hvcall.h to get PAPR defines >> powerpc/pseries: move some PAPR paravirt functions to their own file >> powerpc: move spinlock implementation to simple_spinlock >> powerpc/64s: implement queued spinlocks and rwlocks >> powerpc/pseries: implement paravirt qspinlocks for SPLPAR >> powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the >> lock hint >> >> arch/powerpc/Kconfig | 13 + >> arch/powerpc/include/asm/Kbuild | 2 + >> arch/powerpc/include/asm/atomic.h | 28 ++ >> arch/powerpc/include/asm/paravirt.h | 89 +++++ >> arch/powerpc/include/asm/qspinlock.h | 91 ++++++ >> arch/powerpc/include/asm/qspinlock_paravirt.h | 7 + >> arch/powerpc/include/asm/simple_spinlock.h | 292 +++++++++++++++++ >> .../include/asm/simple_spinlock_types.h | 21 ++ >> arch/powerpc/include/asm/spinlock.h | 308 +----------------- >> arch/powerpc/include/asm/spinlock_types.h | 17 +- >> arch/powerpc/lib/Makefile | 3 + >> arch/powerpc/lib/locks.c | 12 +- >> arch/powerpc/platforms/powernv/pci-ioda-tce.c | 1 + >> arch/powerpc/platforms/pseries/Kconfig | 5 + >> arch/powerpc/platforms/pseries/setup.c | 6 +- >> include/asm-generic/qspinlock.h | 4 + >> 16 files changed, 577 insertions(+), 322 deletions(-) >> create mode 100644 arch/powerpc/include/asm/paravirt.h >> create mode 100644 arch/powerpc/include/asm/qspinlock.h >> create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h >> create mode 100644 arch/powerpc/include/asm/simple_spinlock.h >> create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h >> > This patch looks OK to me.Thanks for reviewing and testing.> I had run some microbenchmark on powerpc system with or w/o the patch. > > On a 2-socket 160-thread SMT4 POWER9 system (not virtualized): > > 5.8.0-rc4 > ========> > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 160, Min/Mean/Max = 77,665/90,153/106,895 > Threads = 160, Total Rate = 1,441,759 op/s; Percpu Rate = 9,011 op/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 160, Min/Mean/Max = 47,879/53,807/63,689 > Threads = 160, Total Rate = 860,192 op/s; Percpu Rate = 5,376 op/s > > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 80, Min/Mean/Max = 242,907/319,514/463,161 > Threads = 80, Total Rate = 2,555 kop/s; Percpu Rate = 32 kop/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 80, Min/Mean/Max = 146,161/187,474/259,270 > Threads = 80, Total Rate = 1,498 kop/s; Percpu Rate = 19 kop/s > > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 40, Min/Mean/Max = 646,639/1,000,817/1,455,205 > Threads = 40, Total Rate = 4,001 kop/s; Percpu Rate = 100 kop/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 40, Min/Mean/Max = 402,165/597,132/814,555 > Threads = 40, Total Rate = 2,388 kop/s; Percpu Rate = 60 kop/s > > 5.8.0-rc4-qlock+ > ===============> > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 160, Min/Mean/Max = 123,835/124,580/124,587 > Threads = 160, Total Rate = 1,992 kop/s; Percpu Rate = 12 kop/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 160, Min/Mean/Max = 254,210/264,714/276,784 > Threads = 160, Total Rate = 4,231 kop/s; Percpu Rate = 26 kop/s > > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 80, Min/Mean/Max = 599,715/603,397/603,450 > Threads = 80, Total Rate = 4,825 kop/s; Percpu Rate = 60 kop/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 80, Min/Mean/Max = 492,687/525,224/567,456 > Threads = 80, Total Rate = 4,199 kop/s; Percpu Rate = 52 kop/s > > Running locktest with spinlock [runtime = 10s, load = 1] > Threads = 40, Min/Mean/Max = 1,325,623/1,325,628/1,325,636 > Threads = 40, Total Rate = 5,299 kop/s; Percpu Rate = 132 kop/s > > Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1] > Threads = 40, Min/Mean/Max = 1,249,731/1,292,977/1,342,815 > Threads = 40, Total Rate = 5,168 kop/s; Percpu Rate = 129 kop/s > > On systems on large number of cpus, qspinlock lock is faster and more fair. > > With some tuning, we may be able to squeeze out more performance.Yes, powerpc could certainly get more performance out of the slow paths, and then there are a few parameters to tune. We don't have a good alternate patching for function calls yet, but that would be something to do for native vs pv. And then there seem to be one or two tunable parameters we could experiment with. The paravirt locks may need a bit more tuning. Some simple testing under KVM shows we might be a bit slower in some cases. Whether this is fairness or something else I'm not sure. The current simple pv spinlock code can do a directed yield to the lock holder CPU, whereas the pv qspl here just does a general yield. I think we might actually be able to change that to also support directed yield. Though I'm not sure if this is actually the cause of the slowdown yet. Thanks, Nick
On 7/7/20 1:57 AM, Nicholas Piggin wrote:> Yes, powerpc could certainly get more performance out of the slow > paths, and then there are a few parameters to tune. > > We don't have a good alternate patching for function calls yet, but > that would be something to do for native vs pv. > > And then there seem to be one or two tunable parameters we could > experiment with. > > The paravirt locks may need a bit more tuning. Some simple testing > under KVM shows we might be a bit slower in some cases. Whether this > is fairness or something else I'm not sure. The current simple pv > spinlock code can do a directed yield to the lock holder CPU, whereas > the pv qspl here just does a general yield. I think we might actually > be able to change that to also support directed yield. Though I'm > not sure if this is actually the cause of the slowdown yet.Regarding the paravirt lock, I have taken a further look into the current PPC spinlock code. There is an equivalent of pv_wait() but no pv_kick(). Maybe PPC doesn't really need that. Attached are two additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE option to not require pv_kick(). There is also a fixup patch to be applied after your patchset. I don't have access to a PPC LPAR with shared processor at the moment, so I can't test the performance of the paravirt code. Would you mind adding my patches and do some performance test on your end to see if it gives better result? Thanks, Longman -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-locking-pvqspinlock-Code-relocation-and-extraction.patch Type: text/x-patch Size: 11939 bytes Desc: not available URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20200707/a022922c/attachment-0003.bin> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-locking-pvqspinlock-Introduce-CONFIG_PARAVIRT_QSPINL.patch Type: text/x-patch Size: 6167 bytes Desc: not available URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20200707/a022922c/attachment-0004.bin> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0009-powerpc-pseries-Fixup.patch Type: text/x-patch Size: 3087 bytes Desc: not available URL: <http://lists.linuxfoundation.org/pipermail/virtualization/attachments/20200707/a022922c/attachment-0005.bin>
Nicholas Piggin
2020-Jul-08 05:10 UTC
[PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:> On 7/7/20 1:57 AM, Nicholas Piggin wrote: >> Yes, powerpc could certainly get more performance out of the slow >> paths, and then there are a few parameters to tune. >> >> We don't have a good alternate patching for function calls yet, but >> that would be something to do for native vs pv. >> >> And then there seem to be one or two tunable parameters we could >> experiment with. >> >> The paravirt locks may need a bit more tuning. Some simple testing >> under KVM shows we might be a bit slower in some cases. Whether this >> is fairness or something else I'm not sure. The current simple pv >> spinlock code can do a directed yield to the lock holder CPU, whereas >> the pv qspl here just does a general yield. I think we might actually >> be able to change that to also support directed yield. Though I'm >> not sure if this is actually the cause of the slowdown yet. > > Regarding the paravirt lock, I have taken a further look into the > current PPC spinlock code. There is an equivalent of pv_wait() but no > pv_kick(). Maybe PPC doesn't really need that.So powerpc has two types of wait, either undirected "all processors" or directed to a specific processor which has been preempted by the hypervisor. The simple spinlock code does a directed wait, because it knows the CPU which is holding the lock. In this case, there is a sequence that is used to ensure we don't wait if the condition has become true, and the target CPU does not need to kick the waiter it will happen automatically (see splpar_spin_yield). This is preferable because we only wait as needed and don't require the kick operation. The pv spinlock code I did uses the undirected wait, because we don't know the CPU number which we are waiting on. This is undesirable because it's higher overhead and the wait is not so accurate. I think perhaps we could change things so we wait on the correct CPU when queued, which might be good enough (we could also put the lock owner CPU in the spinlock word, if we add another format).> Attached are two > additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE > option to not require pv_kick(). There is also a fixup patch to be > applied after your patchset. > > I don't have access to a PPC LPAR with shared processor at the moment, > so I can't test the performance of the paravirt code. Would you mind > adding my patches and do some performance test on your end to see if it > gives better result?Great, I'll do some tests. Any suggestions for what to try? Thanks, Nick
On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote:> From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001 > From: Waiman Long <longman at redhat.com> > Date: Tue, 7 Jul 2020 22:29:16 -0400 > Subject: [PATCH 2/9] locking/pvqspinlock: Introduce > CONFIG_PARAVIRT_QSPINLOCKS_LITE > > Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows > architectures to use the PV qspinlock code without the need to use or > implement a pv_kick() function, thus eliminating the atomic unlock > overhead. The non-atomic queued_spin_unlock() can be used instead. > The pv_wait() function will still be needed, but it can be a dummy > function. > > With that option set, the hybrid PV queued/unfair locking code should > still be able to make it performant enough in a paravirtualizedHow is this supposed to work? If there is no kick, you have no control over who wakes up and fairness goes out the window entirely. You don't even begin to explain...
On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:> Yes, powerpc could certainly get more performance out of the slow > paths, and then there are a few parameters to tune.Can you clarify? The slow path is already in use on ARM64 which is weak, so I doubt there's superfluous serialization present. And Will spend a fair amount of time on making that thing guarantee forward progressm, so there just isn't too much room to play.> We don't have a good alternate patching for function calls yet, but > that would be something to do for native vs pv.Going by your jump_label implementation, support for static_call should be fairly straight forward too, no? https://lkml.kernel.org/r/20200624153024.794671356 at infradead.org
On 7/8/20 4:41 AM, Peter Zijlstra wrote:> On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote: >> Yes, powerpc could certainly get more performance out of the slow >> paths, and then there are a few parameters to tune. > Can you clarify? The slow path is already in use on ARM64 which is weak, > so I doubt there's superfluous serialization present. And Will spend a > fair amount of time on making that thing guarantee forward progressm, so > there just isn't too much room to play. > >> We don't have a good alternate patching for function calls yet, but >> that would be something to do for native vs pv. > Going by your jump_label implementation, support for static_call should > be fairly straight forward too, no? > > https://lkml.kernel.org/r/20200624153024.794671356 at infradead.org >Speaking of static_call, I am also looking forward to it. Do you have an idea when that will be merged? Cheers, Longman
Nicholas Piggin
2020-Jul-21 11:08 UTC
[PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
Excerpts from Peter Zijlstra's message of July 8, 2020 6:41 pm:> On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote: >> Yes, powerpc could certainly get more performance out of the slow >> paths, and then there are a few parameters to tune. >Sorry for the delay, got bogged down and distracted by other things :(> Can you clarify? The slow path is already in use on ARM64 which is weak, > so I doubt there's superfluous serialization present. And Will spend a > fair amount of time on making that thing guarantee forward progressm, so > there just isn't too much room to play.Sure, the way the pending not-queued slowpath (which I guess is the medium-path) is implemented is just poorly structured for LL/SC. It has one more atomic than necessary (queued_fetch_set_pending_acquire), and a lot of branches in suboptimal order. Attached patch (completely untested just compiled and looked at asm so far) is a way we can fix this on powerpc I think. It's actually very little generic code change which is good, duplicated medium-path logic unfortunately but that's no worse than something like x86 really.>> We don't have a good alternate patching for function calls yet, but >> that would be something to do for native vs pv. > > Going by your jump_label implementation, support for static_call should > be fairly straight forward too, no? > > https://lkml.kernel.org/r/20200624153024.794671356 at infradead.orgNice, yeah it should be. I've wanted this for ages! powerpc is kind of annoying to implement that with limited call range, Hmm, not sure if we'd need a new linker feature to support it. We'd provide call site patch space for indirect branches for those out of range of direct call, so that should work fine. The trick would be patching in the TOC lookup for the function... should be doable somehow. Thanks, Nick --- diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index b752d34517b3..26d8766a1106 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock *lock) #else extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock); #endif static __always_inline void queued_spin_lock(struct qspinlock *lock) { - u32 val = 0; - - if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL))) + atomic_t *a = &lock->val; + u32 val; + +again: + asm volatile( +"1:\t" PPC_LWARX(%0,0,%1,1) " # queued_spin_lock \n" + : "=&r" (val) + : "r" (&a->counter) + : "memory"); + + if (likely(val == 0)) { + asm_volatile_goto( + " stwcx. %0,0,%1 \n" + " bne- %l[again] \n" + "\t" PPC_ACQUIRE_BARRIER " \n" + : + : "r"(_Q_LOCKED_VAL), "r" (&a->counter) + : "cr0", "memory" + : again ); return; - - queued_spin_lock_slowpath(lock, val); + } + + if (likely(val == _Q_LOCKED_VAL)) { + asm_volatile_goto( + " stwcx. %0,0,%1 \n" + " bne- %l[again] \n" + : + : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter) + : "cr0", "memory" + : again ); + + atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK)); +// clear_pending_set_locked(lock); + WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL); +// lockevent_inc(lock_pending); + return; + } + + if (val == _Q_PENDING_VAL) { + int cnt = _Q_PENDING_LOOPS; + val = atomic_cond_read_relaxed(a, + (VAL != _Q_PENDING_VAL) || !cnt--); + if (!(val & ~_Q_LOCKED_MASK)) + goto again; + } + queued_spin_lock_slowpath_queue(lock); } #define queued_spin_lock queued_spin_lock diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b9515fcc9b29..ebcc6f5d99d5 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -287,10 +287,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock, #ifdef CONFIG_PARAVIRT_SPINLOCKS #define queued_spin_lock_slowpath native_queued_spin_lock_slowpath +#define queued_spin_lock_slowpath_queue native_queued_spin_lock_slowpath_queue #endif #endif /* _GEN_PV_LOCK_SLOWPATH */ +void queued_spin_lock_slowpath_queue(struct qspinlock *lock); +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock); + /** * queued_spin_lock_slowpath - acquire the queued spinlock * @lock: Pointer to queued spinlock structure @@ -314,12 +318,6 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock, */ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) { - struct mcs_spinlock *prev, *next, *node; - u32 old, tail; - int idx; - - BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); - if (pv_enabled()) goto pv_queue; @@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) queue: lockevent_inc(lock_slowpath); pv_queue: + __queued_spin_lock_slowpath_queue(lock); +} +EXPORT_SYMBOL(queued_spin_lock_slowpath); + +void queued_spin_lock_slowpath_queue(struct qspinlock *lock) +{ + lockevent_inc(lock_slowpath); + __queued_spin_lock_slowpath_queue(lock); +} +EXPORT_SYMBOL(queued_spin_lock_slowpath_queue); + +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock) +{ + struct mcs_spinlock *prev, *next, *node; + u32 old, tail; + u32 val; + int idx; + + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); + node = this_cpu_ptr(&qnodes[0].mcs); idx = node->count++; tail = encode_tail(smp_processor_id(), idx); @@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ __this_cpu_dec(qnodes[0].mcs.count); } -EXPORT_SYMBOL(queued_spin_lock_slowpath); /* * Generate the paravirt code for queued_spin_unlock_slowpath().
Possibly Parallel Threads
- [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
- [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
- [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
- [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
- [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks