On Fri, Jun 03, 2016 at 02:33:47PM +1000, Benjamin Herrenschmidt wrote:
>  - For the above, can you show (or describe) where the qspinlock
>    improves things compared to our current locks.

So currently PPC has a fairly straight forward test-and-set spinlock
IIRC. You have this because LPAR/virt muck and lock holder preemption
issues etc..

qspinlock is 1) a fair lock (like ticket locks) and 2) provides
out-of-word spinning, reducing cacheline pressure.

Esp. on multi-socket x86 we saw the out-of-word spinning being a big
win over our ticket locks.

And fairness, brought to us by the ticket locks a long time ago,
eliminated starvation issues we had, where a spinner local to the
holder would 'always' win from a spinner further away. So under heavy
enough local contention, the spinners on 'remote' CPUs would 'never'
get to own the lock.

pv-qspinlock tries to preserve the fairness while allowing limited
lock stealing and explicitly managing which vcpus to wake.

> While there's theory and to some extent practice on x86, it would be
> nice to validate the effects on POWER.

Right; so that will have to be from benchmarks which I cannot help you
with ;-)
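To make the "out-of-word spinning" point concrete, here is a minimal
user-space sketch in C11 atomics. It is illustrative only, not the
kernel's qspinlock (which additionally packs the queue tail into the
32-bit lock word and handles nesting contexts); all names below are
made up. With a test-and-set lock every waiter spins on the one shared
lock word, so that cacheline bounces among all spinners; with an
MCS-style queue each waiter spins on its own node, and only the
hand-off touches another CPU's cacheline.

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	/*
	 * Test-and-set: every waiter spins on the same 'locked' word,
	 * so the lock's cacheline ping-pongs between all spinners.
	 */
	typedef struct { atomic_flag locked; } tas_lock_t;

	static void tas_lock(tas_lock_t *l)
	{
		while (atomic_flag_test_and_set_explicit(&l->locked,
							 memory_order_acquire))
			;	/* spin on the shared word */
	}

	static void tas_unlock(tas_lock_t *l)
	{
		atomic_flag_clear_explicit(&l->locked, memory_order_release);
	}

	/*
	 * MCS-style queue: each waiter spins on its *own* node, so the
	 * only cross-CPU traffic is the hand-off to the successor.
	 */
	struct mcs_node {
		struct mcs_node *_Atomic next;
		atomic_bool locked;
	};
	typedef struct { struct mcs_node *_Atomic tail; } mcs_lock_t;

	static void mcs_lock(mcs_lock_t *l, struct mcs_node *me)
	{
		struct mcs_node *prev;

		me->next = NULL;
		atomic_store(&me->locked, true);
		prev = atomic_exchange(&l->tail, me);	/* join the queue */
		if (prev) {
			atomic_store(&prev->next, me);
			while (atomic_load_explicit(&me->locked,
						    memory_order_acquire))
				;	/* spin on my own cacheline only */
		}
	}

	static void mcs_unlock(mcs_lock_t *l, struct mcs_node *me)
	{
		struct mcs_node *next = atomic_load(&me->next);

		if (!next) {
			struct mcs_node *expected = me;

			/* no successor visible: try to reset the tail */
			if (atomic_compare_exchange_strong(&l->tail, &expected,
							   (struct mcs_node *)NULL))
				return;
			while (!(next = atomic_load(&me->next)))
				;	/* successor is still linking itself in */
		}
		atomic_store_explicit(&next->locked, false, memory_order_release);
	}

Each locker passes its own mcs_node (in the kernel the queue nodes live
in a small per-CPU array), which is what makes the spinning
"out of word": contended waiters do not touch the lock word at all
while they wait.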
Benjamin Herrenschmidt
2016-Jun-06 21:41 UTC
[PATCH v5 1/6] qspinlock: powerpc support qspinlock
On Mon, 2016-06-06 at 17:59 +0200, Peter Zijlstra wrote:
> On Fri, Jun 03, 2016 at 02:33:47PM +1000, Benjamin Herrenschmidt wrote:
> >
> >  - For the above, can you show (or describe) where the qspinlock
> >    improves things compared to our current locks.
>
> So currently PPC has a fairly straight forward test-and-set spinlock
> IIRC. You have this because LPAR/virt muck and lock holder preemption
> issues etc..
>
> qspinlock is 1) a fair lock (like ticket locks) and 2) provides
> out-of-word spinning, reducing cacheline pressure.

Thanks Peter. I think I understand the theory, but I'd like to see it
translate into real numbers.

> Esp. on multi-socket x86 we saw the out-of-word spinning being a big
> win over our ticket locks.
>
> And fairness, brought to us by the ticket locks a long time ago,
> eliminated starvation issues we had, where a spinner local to the
> holder would 'always' win from a spinner further away. So under heavy
> enough local contention, the spinners on 'remote' CPUs would 'never'
> get to own the lock.

I think our HW has tweaks to avoid that from happening with the simple
locks in the underlying ll/sc implementation. In any case, what I'm
asking for is actual tests to verify it works as expected for us.

> pv-qspinlock tries to preserve the fairness while allowing limited
> lock stealing and explicitly managing which vcpus to wake.

Right.

> > While there's theory and to some extent practice on x86, it would be
> > nice to validate the effects on POWER.
>
> Right; so that will have to be from benchmarks which I cannot help you
> with ;-)

Precisely :-) This is what I was asking for ;-)

Cheers,
Ben.
On 2016/06/07 05:41, Benjamin Herrenschmidt wrote:
> On Mon, 2016-06-06 at 17:59 +0200, Peter Zijlstra wrote:
>> On Fri, Jun 03, 2016 at 02:33:47PM +1000, Benjamin Herrenschmidt wrote:
>>>
>>>  - For the above, can you show (or describe) where the qspinlock
>>>    improves things compared to our current locks.
>>
>> So currently PPC has a fairly straight forward test-and-set spinlock
>> IIRC. You have this because LPAR/virt muck and lock holder preemption
>> issues etc..
>>
>> qspinlock is 1) a fair lock (like ticket locks) and 2) provides
>> out-of-word spinning, reducing cacheline pressure.
>
> Thanks Peter. I think I understand the theory, but I'd like to see it
> translate into real numbers.
>
>> Esp. on multi-socket x86 we saw the out-of-word spinning being a big
>> win over our ticket locks.
>>
>> And fairness, brought to us by the ticket locks a long time ago,
>> eliminated starvation issues we had, where a spinner local to the
>> holder would 'always' win from a spinner further away. So under heavy
>> enough local contention, the spinners on 'remote' CPUs would 'never'
>> get to own the lock.
>
> I think our HW has tweaks to avoid that from happening with the simple
> locks in the underlying ll/sc implementation. In any case, what I'm
> asking for is actual tests to verify it works as expected for us.
>

If the HW has such tweaks, then there must still be a performance drop
as the total number of CPUs grows, and I see hints of that in one
simple benchmark. It measures how many spin_lock/spin_unlock pairs can
be done within 15 seconds across all CPUs, roughly:

	while (!done) {
		spin_lock(&lk);
		this_cpu_inc(loops);
		spin_unlock(&lk);
	}

I ran the test on two machines, one running powerKVM and the other
running pHyp. The table below shows the sum of 'loops' at the end, in
thousands (K):

  cpu count     | pv-qspinlock | test-set spinlock
  --------------+--------------+------------------
   8 (powerKVM) |    62830K    |      67340K
   8 (pHyp)     |    49800K    |      59330K
  32 (pHyp)     |    87580K    |      20990K

As the CPU count grows, the lock/unlock throughput of the test-and-set
spinlock drops sharply, because of cacheline bouncing between different
physical CPUs.

So, to verify how each spinlock affects the data cache, here is another
simple benchmark. The code looks like:

	struct _x {
		spinlock_t lk;
		unsigned long x;
	} x;

	while (!this_cpu_read(stop)) {
		int i = 0xff;

		spin_lock(&x.lk);
		this_cpu_inc(loops);
		while (i--)
			READ_ONCE(x.x);
		spin_unlock(&x.lk);
	}

Again, the table shows the sum of 'loops' at the end, in thousands (K):

  cpu count  | pv-qspinlock | test-set spinlock
  -----------+--------------+------------------
   8 (pHyp)  |    13240K    |       9780K
  32 (pHyp)  |    25790K    |       9700K

Obviously pv-qspinlock is more cache-friendly, and has better
performance than the test-and-set spinlock.

More tests are ongoing; I will send out a new patch set with the
results, hopefully *within* this week (unixbench really takes a long
time).

thanks
xinhui

>> pv-qspinlock tries to preserve the fairness while allowing limited
>> lock stealing and explicitly managing which vcpus to wake.
>
> Right.
>
>>> While there's theory and to some extent practice on x86, it would be
>>> nice to validate the effects on POWER.
>>
>> Right; so that will have to be from benchmarks which I cannot help you
>> with ;-)
>
> Precisely :-) This is what I was asking for ;-)
>
> Cheers,
> Ben.
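The actual harness behind the numbers above is not shown in the thread.
For anyone wanting to reproduce the first experiment, a rough sketch of
how the lock/unlock loop could be wrapped in a small kernel module might
look like the following; bench_thread, bench_lock, the "lockbench" name
and the crude one-second grace period are illustrative assumptions, and
proper thread cleanup and watchdog considerations are omitted.

	/*
	 * Sketch only: one kthread pinned per online CPU hammers
	 * lock/unlock on a shared spinlock and bumps a per-cpu counter;
	 * after 15 seconds the counters are summed and printed.
	 */
	#include <linux/module.h>
	#include <linux/kthread.h>
	#include <linux/spinlock.h>
	#include <linux/percpu.h>
	#include <linux/delay.h>
	#include <linux/err.h>

	static DEFINE_SPINLOCK(bench_lock);
	static DEFINE_PER_CPU(unsigned long, loops);
	static bool done;

	static int bench_thread(void *unused)
	{
		while (!READ_ONCE(done)) {
			spin_lock(&bench_lock);
			this_cpu_inc(loops);
			spin_unlock(&bench_lock);
		}
		return 0;
	}

	static int __init bench_init(void)
	{
		unsigned long total = 0;
		int cpu;

		for_each_online_cpu(cpu) {
			struct task_struct *t;

			t = kthread_create(bench_thread, NULL,
					   "lockbench/%d", cpu);
			if (IS_ERR(t))
				continue;
			kthread_bind(t, cpu);	/* one spinner per CPU */
			wake_up_process(t);
		}

		ssleep(15);
		WRITE_ONCE(done, true);
		ssleep(1);	/* crude: let the spinners notice 'done' and exit */

		for_each_online_cpu(cpu)
			total += per_cpu(loops, cpu);
		pr_info("lockbench: %lu lock/unlock pairs in 15s\n", total);

		return 0;
	}
	module_init(bench_init);

	MODULE_LICENSE("GPL");

The two columns of the tables would then presumably come from running
the same module on kernels built with the test-and-set spinlock and
with pv-qspinlock respectively.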