Raghavendra K T
2014-Jun-28 09:17 UTC
[RFC PATCH v2] Implement Batched (group) ticket lock
In virtualized environment there are mainly three problems related to spinlocks that affects performance. 1. LHP (lock holder preemption) 2. Lock Waiter Preemption (LWP) 3. Starvation/fairness Though Ticketlocks solve fairness problem it worsens LWP, LHP problems. Though pv-ticketlocks tried to address these problems we can further improve at the cost of relaxed fairness. The following patch tries to achieve that by grouping (batched) ticketlocks. Here we form a batch of eligible lock holders and we serve the eligible (to hold the lock) batches in FIFO, but the lock-waiters within eligible batch are served in an unfair manner. This increases probability of any eligible lock-holder being in running state (to an average of (batch_size/2)-1) and also it provides needed bounded starvation since any lock requester can not acquire more than batch_size times repeatedly during contention. On the negetive side we would need an extra cmpxchg. The patch has the batch size of 8. (As we know increasing batch size means we are closer to unfair locks and batch size of 1 = ticketlock). V2 implementation has been optimized for baremetal : (1) Making !CONFIG_PARAVIRT_SPINLOCK case equivalent to older ticketlock. (2) Removing extra overhead of cmpxchg for CONFIG_PARAVIRT_SPINLOCK case by replacing with add_smp call. Result: Test system: 32cpu 2node machine w/ 64GB each (16 pcpu machine +ht). Guests: 8GB 16vcpu guests (average of 8 iterations) A: batched ticket with batch_size = 4 B: batched ticket with batch_size = 8 C: qspinlock V11 of Waiman D: qspinlock V11 of Waiman with VIRT_UNFAIR=y % Improvements with kvm guests A B C D ebizzy_0.5x -0.20 1.071 -0.07 1.39 ebizzy_1.0x 0.83 8.89 7.89 6.08 ebizzy_1.5x 24.11 38.39 5.42 42.56 ebizzy_2.0x 47.77 78.49 11.94 108.14 Observations: With increased batch_size batch ticketlock has closer performanace to unfair locks and very promising for kvm guests that expects high performance of unfair locks with bounded starvation property of fair locks. Baremetal: No significant performnce difference even for CONFIG_PARAVIRT_SPINLOCK enabled on baremetal Signed-off-by: Raghavendra K T <raghavendra.kt at linux.vnet.ibm.com> --- arch/x86/include/asm/spinlock.h | 71 +++++++++++++++++++++++++++++------ arch/x86/include/asm/spinlock_types.h | 46 ++++++++++++++++++++--- arch/x86/kernel/kvm.c | 6 ++- arch/x86/xen/spinlock.c | 6 ++- 4 files changed, 107 insertions(+), 22 deletions(-) Note: I tried different implementation for CONFIG_PRAVIRT enabled performance without luck for some time that added delay for V2 :(. (could not test with PeterZ's latest qspinlock which I reported) Changes since V1: - Make sure no extra cmpxchg overhead for !CONFIG_PARAVIRT (it works like previous ticketlock now since batch size is set to 1). - Change TICKET_LOCK_INC_SHIFT back to 0 in !CONFIG_PARAVIRT case (Rik). - Add build check to make sure TICKET_BATCH is power of 2 (Rik). - Replace extra cmpxchg with add_smp for CONFIG_PARAVIRT enabled case in host (Waiman, Linus.. had concernes). - Correct TICKET_LOCK_BATCH_MASK to suit all TAIL_INC (expression given by Waiman). - Add comment on LOCK bit (Waiman). - Add Xen support (Completely untested). TODO: (But none of the below imply shortcoming of current patch) - we can further add dynamically changing batch_size (to 1 and MAX size) implementation (inspiration and hint by Paul McKenney) as necessary. - I have found that increasing batch size gives excellent improvements for overcommitted guests. As Rik pointed we may want to further increase batch_size but we need its (side) effect on bigger machine for undercommit cases. - Should we enable it for baremetal too? (As Rik pointed, exploiting NUMA may give more performance) - Should we have config option for batch_size? I would like to thank Rik, Waiman for their review and valuable feedback for V1. Please provide your suggestion and comments. diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 54f1c80..cfb78e6 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -8,6 +8,7 @@ #include <linux/compiler.h> #include <asm/paravirt.h> #include <asm/bitops.h> +#include <linux/bug.h> /* * Your basic SMP spinlocks, allowing only a single CPU anywhere @@ -49,7 +50,42 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) set_bit(0, (volatile unsigned long *)&lock->tickets.tail); } +static int __ticket_lock_get_batch_mask(void) +{ + if (static_key_false(¶virt_ticketlocks_enabled)) + return TICKET_BATCH_MASK; + else + return TICKET_BATCH_MASK_NATIVE; +} + +static void __ticket_lock_batch_spin(arch_spinlock_t *lock, __ticket_t ticket) +{ + if (static_key_false(¶virt_ticketlocks_enabled)) { + register struct __raw_tickets inc, new; + + inc.head = ACCESS_ONCE(lock->tickets.head); + barrier(); + for (;;) { + if (!(inc.head & TICKET_LOCK_LOCK_INC)) { + new.head = inc.head | TICKET_LOCK_LOCK_INC; + if (cmpxchg(&lock->tickets.head, inc.head, + new.head) == inc.head) + break; + } + cpu_relax(); + inc.head = ACCESS_ONCE(lock->tickets.head); + } + } else { + add_smp(&lock->tickets.head, TICKET_LOCK_UNLOCK_INC); + } +} + #else /* !CONFIG_PARAVIRT_SPINLOCKS */ +static inline void __ticket_lock_batch_spin(arch_spinlock_t *lock, + __ticket_t ticket) +{ +} + static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock, __ticket_t ticket) { @@ -59,6 +95,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock, { } +static inline int __ticket_lock_get_batch_mask(void) +{ + return TICKET_BATCH_MASK_NATIVE; +} #endif /* CONFIG_PARAVIRT_SPINLOCKS */ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) @@ -81,24 +121,28 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) */ static __always_inline void arch_spin_lock(arch_spinlock_t *lock) { - register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC }; + register struct __raw_tickets inc = { .tail = TICKET_LOCK_TAIL_INC }; + check_sane_mask(); inc = xadd(&lock->tickets, inc); - if (likely(inc.head == inc.tail)) - goto out; inc.tail &= ~TICKET_SLOWPATH_FLAG; for (;;) { unsigned count = SPIN_THRESHOLD; + unsigned batchmask = __ticket_lock_get_batch_mask(); do { - if (ACCESS_ONCE(lock->tickets.head) == inc.tail) - goto out; + inc.head = ACCESS_ONCE(lock->tickets.head); + if ((inc.head & batchmask) == (inc.tail & batchmask)) + goto spin; cpu_relax(); } while (--count); __ticket_lock_spinning(lock, inc.tail); } -out: barrier(); /* make sure nothing creeps before the lock is taken */ +spin: + __ticket_lock_batch_spin(lock, inc.tail); + + barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) @@ -109,10 +153,12 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG)) return 0; - new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT); + new.head_tail = old.head_tail + (TICKET_LOCK_TAIL_INC << TICKET_SHIFT); + new.head_tail |= TICKET_LOCK_LOCK_INC; /* cmpxchg is a full barrier, so nothing can move before it */ - return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail; + return cmpxchg(&lock->head_tail, old.head_tail, + new.head_tail) == old.head_tail; } static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, @@ -123,7 +169,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS); /* Perform the unlock on the "before" copy */ - old.tickets.head += TICKET_LOCK_INC; + old.tickets.head += TICKET_LOCK_UNLOCK_INC; /* Clear the slowpath flag */ new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT); @@ -150,14 +196,15 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) arch_spinlock_t prev; prev = *lock; - add_smp(&lock->tickets.head, TICKET_LOCK_INC); + add_smp(&lock->tickets.head, TICKET_LOCK_UNLOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); } else - __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX); + __add(&lock->tickets.head, TICKET_LOCK_UNLOCK_INC, + UNLOCK_LOCK_PREFIX); } static inline int arch_spin_is_locked(arch_spinlock_t *lock) @@ -171,7 +218,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets); - return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC; + return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_UNLOCK_INC; } #define arch_spin_is_contended arch_spin_is_contended diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 73c4c00..339ddf8 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -3,15 +3,16 @@ #include <linux/types.h> + #ifdef CONFIG_PARAVIRT_SPINLOCKS -#define __TICKET_LOCK_INC 2 -#define TICKET_SLOWPATH_FLAG ((__ticket_t)1) +#define TICKET_LOCK_INC_SHIFT 1 +#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT) #else -#define __TICKET_LOCK_INC 1 -#define TICKET_SLOWPATH_FLAG ((__ticket_t)0) +#define TICKET_LOCK_INC_SHIFT 0 +#define __TICKET_LOCK_TAIL_INC (1<<TICKET_LOCK_INC_SHIFT) #endif -#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC)) +#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_TAIL_INC)) typedef u8 __ticket_t; typedef u16 __ticketpair_t; #else @@ -19,7 +20,40 @@ typedef u16 __ticket_t; typedef u32 __ticketpair_t; #endif -#define TICKET_LOCK_INC ((__ticket_t)__TICKET_LOCK_INC) +#ifdef CONFIG_PARAVIRT_SPINLOCKS +/* + * Batched ticketlock: 2n bits in lock is divided as follows: + * --------------------------------------------------------------- + * | tail | head | + * --------------------------------------------------------------- + * | TAIL (n-1) | SLOWPATH_FLAG (1bit) | HEAD (n-1) | LOCK (1bit)| + * --------------------------------------------------------------- + * + * batch_size determines maximum number of cpus contending for lock at any time. + * (n-1)-log(batch_size) bits of head and tail determine eligibility for + * contending for lock. + * SLOWPATH_FLAG in tail indicates contention that is used for spin optimization + * in guests. + * LOCK bit in head determines whether lock is available. + */ +#define TICKET_SLOWPATH_FLAG ((__ticket_t)1) +#define TICKET_LOCK_LOCK_INC ((__ticket_t)1) +#define TICKET_LOCK_UNLOCK_INC ((__ticket_t)1) +#define TICKET_BATCH_MASK_NATIVE (~(TICKET_BATCH_NATIVE<<TICKET_LOCK_INC_SHIFT)+1) +#else +#define TICKET_SLOWPATH_FLAG ((__ticket_t)0) +#define TICKET_LOCK_LOCK_INC ((__ticket_t)0) +#define TICKET_LOCK_UNLOCK_INC ((__ticket_t)1) +#define TICKET_BATCH_MASK_NATIVE (~(TICKET_BATCH_NATIVE<<TICKET_LOCK_INC_SHIFT)+1) +#endif + +#define TICKET_LOCK_TAIL_INC ((__ticket_t)__TICKET_LOCK_TAIL_INC) +#define TICKET_BATCH 8 /* 8 waiters can contend simultaneously */ +#define TICKET_BATCH_NATIVE 1 +#define TICKET_BATCH_MASK (~(TICKET_BATCH<<TICKET_LOCK_INC_SHIFT) + 1) + +/* Ensure batch_size is always power of two */ +#define check_sane_mask() BUILD_BUG_ON((TICKET_BATCH & (TICKET_BATCH-1)) > 1) #define TICKET_SHIFT (sizeof(__ticket_t) * 8) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 3dd8e2c..28e4492 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -757,7 +757,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want) * check again make sure it didn't become free while * we weren't looking. */ - if (ACCESS_ONCE(lock->tickets.head) == want) { + if ((ACCESS_ONCE(lock->tickets.head) & TICKET_BATCH_MASK) =+ (want & TICKET_BATCH_MASK)) { add_stats(TAKEN_SLOW_PICKUP, 1); goto out; } @@ -789,7 +790,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) for_each_cpu(cpu, &waiting_cpus) { const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu); if (ACCESS_ONCE(w->lock) == lock && - ACCESS_ONCE(w->want) == ticket) { + (ACCESS_ONCE(w->want) & TICKET_BATCH_MASK) =+ (ticket & TICKET_BATCH_MASK)) { add_stats(RELEASED_SLOW_KICKED, 1); kvm_kick_cpu(cpu); break; diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 0ba5f3b..8d83510 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -163,7 +163,8 @@ __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) * check again make sure it didn't become free while * we weren't looking */ - if (ACCESS_ONCE(lock->tickets.head) == want) { + if ((ACCESS_ONCE(lock->tickets.head) & TICKET_BATCH_MASK) =+ (want & TICKET_BATCH_MASK)) { add_stats(TAKEN_SLOW_PICKUP, 1); goto out; } @@ -205,7 +206,8 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) /* Make sure we read lock before want */ if (ACCESS_ONCE(w->lock) == lock && - ACCESS_ONCE(w->want) == next) { + ((ACCESS_ONCE(w->want) & TICKET_BATCH_MASK) =+ (next & TICKET_BATCH_MASK))) { add_stats(RELEASED_SLOW_KICKED, 1); xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); break; -- 1.7.11.7
On Sat, Jun 28, 2014 at 02:47:04PM +0530, Raghavendra K T wrote:> In virtualized environment there are mainly three problems > related to spinlocks that affects performance. > 1. LHP (lock holder preemption) > 2. Lock Waiter Preemption (LWP) > 3. Starvation/fairness > > Though Ticketlocks solve fairness problem it worsens LWP, LHP problems. Though > pv-ticketlocks tried to address these problems we can further improve at the > cost of relaxed fairness. The following patch tries to achieve that by grouping > (batched) ticketlocks.And here I stop reading and ignore this patch, right? Why should I look at this?
Raghavendra K T
2014-Jul-01 09:44 UTC
[RFC PATCH v2] Implement Batched (group) ticket lock
On 07/01/2014 01:35 PM, Peter Zijlstra wrote:> On Sat, Jun 28, 2014 at 02:47:04PM +0530, Raghavendra K T wrote: >> In virtualized environment there are mainly three problems >> related to spinlocks that affects performance. >> 1. LHP (lock holder preemption) >> 2. Lock Waiter Preemption (LWP) >> 3. Starvation/fairness >> >> Though Ticketlocks solve fairness problem it worsens LWP, LHP problems. Though >> pv-ticketlocks tried to address these problems we can further improve at the >> cost of relaxed fairness. The following patch tries to achieve that by grouping >> (batched) ticketlocks. > > And here I stop reading and ignore this patch, right? > > Why should I look at this? >For baremetal we continue to have 'fully fair ticketlock' with this patch series. Good thing about the patch is we do have upper bound on starvation and at the same time we solve Lock Waiter Preemption horror of strict serialization of ticketlocks in guest and combine definite advantages of unfair locks for guest.