Waiman Long
2014-Mar-19 20:13 UTC
[PATCH v7 00/11] qspinlock: a 4-byte queue spinlock with PV support
v6->v7:
 - Remove an atomic operation from the 2-task contending code.
 - Shorten the names of some macros.
 - Make the queue waiter attempt to steal the lock when unfair lock is
   enabled.
 - Remove the lock holder kick from the PV code and fix a race condition.
 - Run the unfair lock & PV code on overcommitted KVM guests to collect
   performance data.

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV support
   code.

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3->v4:
 - Remove debugging code and fix a configuration error.
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better.
 - Add an x86 version of asm/qspinlock.h for holding x86-specific
   optimizations.
 - Add an optimized x86 code path for 2 contending tasks to improve low
   contention performance.

v2->v3:
 - Simplify the code by using numerous mode only without an unfair
   option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs.
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-4: Introduces a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the size
    of the spinlock data structure.  As a result, critical kernel data
    structures that embed spinlocks won't increase in size and break
    data alignments.
 2) Patches 5-7: Enables the use of unfair queue spinlock in a
    para-virtualized execution environment.  This can resolve some of
    the locking-related performance issues due to the fact that the
    next CPU to get the lock may have been scheduled out for a period
    of time.
 3) Patches 8-11: Enable qspinlock para-virtualization support by
    halting the waiting CPUs after spinning for a certain amount of
    time.  The unlock code will detect a sleeping waiter and wake it
    up.  This is essentially the same logic as in the PV ticketlock
    code.

Patches 1-8 are fully tested and ready for production.  Patches 9-10
are also tested on KVM guests.  Patch 11, however, has not been tested
on a XEN guest yet.  Anyone who would like to test it out is welcome
to do so.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case.  Its performance can be much better
with moderate to heavy contention.  This patch set has the potential
of improving the performance of all workloads that have moderate to
heavy spinlock contention.
The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets, though a noticeable performance benefit probably won't
show up on machines with fewer than 4 sockets.

The purpose of this patch set is not to solve any particular spinlock
contention problem.  Those need to be solved by refactoring the code to
make more efficient use of the lock or by using finer-grained locks.
The main purpose is to make the lock contention problems more tolerable
until someone can spend the time and effort to fix them.

Waiman Long (11):
  qspinlock: A generic 4-byte queue spinlock implementation
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: More optimized code for smaller NR_CPUS
  qspinlock: Optimized code path for 2 contending tasks
  pvqspinlock, x86: Allow unfair spinlock in a PV guest
  pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
  pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
  pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
  pvqspinlock, x86: Add qspinlock para-virtualization support
  pvqspinlock, x86: Enable qspinlock PV support for KVM
  pvqspinlock, x86: Enable qspinlock PV support for XEN

 arch/x86/Kconfig                      |   12 +
 arch/x86/include/asm/paravirt.h       |   12 +-
 arch/x86/include/asm/paravirt_types.h |    5 +
 arch/x86/include/asm/pvqspinlock.h    |  249 +++++++++
 arch/x86/include/asm/qspinlock.h      |  161 ++++++
 arch/x86/include/asm/spinlock.h       |    9 +-
 arch/x86/include/asm/spinlock_types.h |    4 +
 arch/x86/kernel/Makefile              |    1 +
 arch/x86/kernel/kvm.c                 |  101 ++++-
 arch/x86/kernel/paravirt-spinlocks.c  |   16 +-
 arch/x86/xen/setup.c                  |   19 +
 arch/x86/xen/spinlock.c               |   92 +++-
 include/asm-generic/qspinlock.h       |  122 +++++
 include/asm-generic/qspinlock_types.h |   62 +++
 kernel/Kconfig.locks                  |    7 +
 kernel/locking/Makefile               |    1 +
 kernel/locking/qspinlock.c            |  938 +++++++++++++++++++++++++++++++++
 17 files changed, 1799 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h
 create mode 100644 arch/x86/include/asm/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c
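As background for the series: the queueing principle referred to above
is the classic MCS lock, where each waiter spins on its own per-CPU
node and is handed the lock by its predecessor.  The stand-alone
user-space sketch below (using GCC __atomic builtins) only illustrates
that principle; the names here are made up, and the actual patches
encode the queue tail into the 32-bit lock word instead of storing a
node pointer.  Build with "gcc -O2 -pthread".

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct mcs_node {
	struct mcs_node *next;
	bool locked;			/* set when this waiter owns the lock */
};

struct mcs_lock {
	struct mcs_node *tail;		/* last waiter, NULL when unlocked */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	node->next = NULL;
	node->locked = false;

	/* Swing the tail to ourselves; the old tail is our predecessor. */
	prev = __atomic_exchange_n(&lock->tail, node, __ATOMIC_ACQ_REL);
	if (!prev)
		return;			/* queue was empty: lock acquired */

	/* Link behind the predecessor and spin on our own node only. */
	__atomic_store_n(&prev->next, node, __ATOMIC_RELEASE);
	while (!__atomic_load_n(&node->locked, __ATOMIC_ACQUIRE))
		;			/* cpu_relax() in kernel code */
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE);

	if (!next) {
		struct mcs_node *expected = node;

		/* No known successor: try to reset the tail to NULL. */
		if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL,
						false, __ATOMIC_ACQ_REL,
						__ATOMIC_ACQUIRE))
			return;
		/* A successor is enqueueing; wait for it to link itself. */
		while (!(next = __atomic_load_n(&node->next, __ATOMIC_ACQUIRE)))
			;
	}
	/* Hand the lock directly to the next waiter. */
	__atomic_store_n(&next->locked, true, __ATOMIC_RELEASE);
}

static struct mcs_lock test_lock;
static long counter;

static void *worker(void *arg)
{
	struct mcs_node node;		/* a per-CPU array in the kernel */
	long i;

	(void)arg;
	for (i = 0; i < 100000; i++) {
		mcs_acquire(&test_lock, &node);
		counter++;
		mcs_release(&test_lock, &node);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("counter = %ld (expect 400000)\n", counter);
	return 0;
}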
Waiman Long
2014-Mar-19 20:13 UTC
[PATCH v7 01/11] qspinlock: A generic 4-byte queue spinlock implementation
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock.  Compared
with the ticket spinlock, this queue spinlock should be almost as fair.
It has about the same speed in the single-threaded case and it can be
much faster under high contention, especially when the spinlock is
embedded within the data structure to be protected.  Only in light to
moderate contention, where the average queue depth is around 1-3, may
this queue spinlock be a bit slower due to the higher slowpath
overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
in those machines.  The cost of contention is also higher because of
slower inter-node memory traffic.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled.  In other words, the process
will not be migrated to another CPU while it is trying to get a
spinlock.  Ignoring interrupt handling, a CPU can only be contending
on one spinlock at any one time.  Of course, an interrupt handler can
try to acquire one spinlock while the interrupted user process is in
the process of getting another spinlock.

By allocating a set of per-cpu queue nodes and using them to form a
waiting queue, we can encode the queue node address into a much
smaller queue code.  Together with the 1-byte locked byte, this queue
spinlock implementation only needs 4 bytes to hold all the information
that it needs.  The current queue node address encoding of the 4-byte
word is as follows:
Bits 0-7  : the locked byte
Bits 8-9  : queue node index in the per-cpu array (4 entries)
Bits 10-31: cpu number + 1 (max cpus = 4M - 1)

For single-thread performance (no contention), a 256K lock/unlock loop
was run on a 2.4 GHz Westmere x86-64 CPU.  The following table shows
the average time (in ns) for a single lock/unlock sequence (including
the looping and timing overhead):

  Lock Type		Time (ns)
  ---------		---------
  Ticket spinlock	   14.1
  Queue spinlock	    8.8

So the queue spinlock is much faster than the ticket spinlock, even
though the overhead of locking and unlocking should be pretty small
when there is no contention.  The performance advantage is mainly due
to the fact that the ticket spinlock does a read-modify-write (add)
instruction in unlock whereas the queue spinlock only does a simple
write in unlock, which can be much faster in a pipelined CPU.

The AIM7 benchmark was run on an 8-socket 80-core DL980 with Westmere
x86-64 CPUs, with an XFS filesystem on a ramdisk and HT off, to
evaluate the performance impact of this patch on a 3.13 kernel.
  +------------+----------+-----------------+---------+
  |   Kernel   | 3.13 JPM |    3.13 with    | %Change |
  |            |          | qspinlock patch |         |
  +------------+----------+-----------------+---------+
  |                   10-100 users                    |
  +------------+----------+-----------------+---------+
  |custom      |   357459 |     363109      |  +1.58% |
  |dbase       |   496847 |     498801      |  +0.39% |
  |disk        |  2925312 |    2771387      |  -5.26% |
  |five_sec    |   166612 |     169215      |  +1.56% |
  |fserver     |   382129 |     383279      |  +0.30% |
  |high_systime|    16356 |      16380      |  +0.15% |
  |short       |  4521978 |    4257363      |  -5.85% |
  +------------+----------+-----------------+---------+
  |                  200-1000 users                   |
  +------------+----------+-----------------+---------+
  |custom      |   449070 |     447711      |  -0.30% |
  |dbase       |   845029 |     853362      |  +0.99% |
  |disk        |  2725249 |    4892907      | +79.54% |
  |five_sec    |   169410 |     170638      |  +0.72% |
  |fserver     |   489662 |     491828      |  +0.44% |
  |high_systime|   142823 |     143790      |  +0.68% |
  |short       |  7435288 |    9016171      | +21.26% |
  +------------+----------+-----------------+---------+
  |                 1100-2000 users                   |
  +------------+----------+-----------------+---------+
  |custom      |   432470 |     432570      |  +0.02% |
  |dbase       |   889289 |     890026      |  +0.08% |
  |disk        |  2565138 |    5008732      | +95.26% |
  |five_sec    |   169141 |     170034      |  +0.53% |
  |fserver     |   498569 |     500701      |  +0.43% |
  |high_systime|   229913 |     245866      |  +6.94% |
  |short       |  8496794 |    8281918      |  -2.53% |
  +------------+----------+-----------------+---------+

The workload with the most gain was the disk workload.  Without the
patch, the perf profile at 1500 users looked like:

 26.19%    reaim  [kernel.kallsyms]  [k] _raw_spin_lock
              |--47.28%-- evict
              |--46.87%-- inode_sb_list_add
              |--1.24%-- xlog_cil_insert_items
              |--0.68%-- __remove_inode_hash
              |--0.67%-- inode_wait_for_writeback
               --3.26%-- [...]
 22.96%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  5.56%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  4.87%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.04%    reaim  [kernel.kallsyms]  [k] mspin_lock
  1.30%    reaim  [kernel.kallsyms]  [k] memcpy
  1.08%    reaim  [unknown]          [.] 0x0000003c52009447

There was pretty high spinlock contention on the inode_sb_list_lock
and maybe the inode's i_lock.

With the patch, the perf profile at 1500 users became:

 26.82%  swapper  [kernel.kallsyms]  [k] cpu_idle_loop
  4.66%    reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
  3.97%    reaim  [kernel.kallsyms]  [k] update_cfs_rq_blocked_load
  2.40%    reaim  [kernel.kallsyms]  [k] queue_spin_lock_slowpath
              |--88.31%-- _raw_spin_lock
              |          |--36.02%-- inode_sb_list_add
              |          |--35.09%-- evict
              |          |--16.89%-- xlog_cil_insert_items
              |          |--6.30%-- try_to_wake_up
              |          |--2.20%-- _xfs_buf_find
              |          |--0.75%-- __remove_inode_hash
              |          |--0.72%-- __mutex_lock_slowpath
              |          |--0.53%-- load_balance
              |--6.02%-- _raw_spin_lock_irqsave
              |          |--74.75%-- down_trylock
              |          |--9.69%-- rcu_check_quiescent_state
              |          |--7.47%-- down
              |          |--3.57%-- up
              |          |--1.67%-- rwsem_wake
              |          |--1.00%-- remove_wait_queue
              |          |--0.56%-- pagevec_lru_move_fn
              |--5.39%-- _raw_spin_lock_irq
              |          |--82.05%-- rwsem_down_read_failed
              |          |--10.48%-- rwsem_down_write_failed
              |          |--4.24%-- __down
              |          |--2.74%-- __schedule
               --0.28%-- [...]
  2.20%    reaim  [kernel.kallsyms]  [k] memcpy
  1.84%    reaim  [unknown]          [.]
0x000000000041517b 1.77% reaim [kernel.kallsyms] [k] _raw_spin_lock |--21.08%-- xlog_cil_insert_items |--10.14%-- xfs_icsb_modify_counters |--7.20%-- xfs_iget_cache_hit |--6.56%-- inode_sb_list_add |--5.49%-- _xfs_buf_find |--5.25%-- evict |--5.03%-- __remove_inode_hash |--4.64%-- __mutex_lock_slowpath |--3.78%-- selinux_inode_free_security |--2.95%-- xfs_inode_is_filestream |--2.35%-- try_to_wake_up |--2.07%-- xfs_inode_set_reclaim_tag |--1.52%-- list_lru_add |--1.16%-- xfs_inode_clear_eofblocks_tag : 1.30% reaim [kernel.kallsyms] [k] effective_load 1.27% reaim [kernel.kallsyms] [k] mspin_lock 1.10% reaim [kernel.kallsyms] [k] security_compute_sid On the ext4 filesystem, the disk workload improved from 416281 JPM to 899101 JPM (+116%) with the patch. In this case, the contended spinlock is the mb_cache_spinlock. Signed-off-by: Waiman Long <Waiman.Long at hp.com> Acked-by: Rik van Riel <riel at redhat.com> --- include/asm-generic/qspinlock.h | 122 +++++++++++ include/asm-generic/qspinlock_types.h | 49 +++++ kernel/Kconfig.locks | 7 + kernel/locking/Makefile | 1 + kernel/locking/qspinlock.c | 375 +++++++++++++++++++++++++++++++++ 5 files changed, 554 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 0000000..8525931 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,122 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long at hp.com> + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include <asm-generic/qspinlock_types.h> + +/* + * External function declarations + */ +extern void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval); + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(&lock->qlcode) & _QLOCK_LOCK_MASK; +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? 
+ * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !(atomic_read(&lock.qlcode) & _QLOCK_LOCK_MASK); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(&lock->qlcode) & ~_QLOCK_LOCK_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(&lock->qlcode) && + (atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED) == 0)) + return 1; + return 0; +} + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + int qsval; + + /* + * To reduce memory access to only once for the cold cache case, + * a direct cmpxchg() is performed in the fastpath to optimize the + * uncontended case. The contended performance, however, may suffer + * a bit because of that. + */ + qsval = atomic_cmpxchg(&lock->qlcode, 0, _QLOCK_LOCKED); + if (likely(qsval == 0)) + return; + queue_spin_lock_slowpath(lock, qsval); +} + +#ifndef queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_unlock(struct qspinlock *lock) +{ + /* + * Use an atomic subtraction to clear the lock bit. + */ + smp_mb__before_atomic_dec(); + atomic_sub(_QLOCK_LOCKED, &lock->qlcode); +} +#endif + +/* + * Initializier + */ +#define __ARCH_SPIN_LOCK_UNLOCKED { ATOMIC_INIT(0) } + +/* + * Remapping spinlock architecture specific functions to the corresponding + * queue spinlock functions. + */ +#define arch_spin_is_locked(l) queue_spin_is_locked(l) +#define arch_spin_is_contended(l) queue_spin_is_contended(l) +#define arch_spin_value_unlocked(l) queue_spin_value_unlocked(l) +#define arch_spin_lock(l) queue_spin_lock(l) +#define arch_spin_trylock(l) queue_spin_trylock(l) +#define arch_spin_unlock(l) queue_spin_unlock(l) +#define arch_spin_lock_flags(l, f) queue_spin_lock(l) + +#endif /* __ASM_GENERIC_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h new file mode 100644 index 0000000..fbfe898 --- /dev/null +++ b/include/asm-generic/qspinlock_types.h @@ -0,0 +1,49 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long at hp.com> + */ +#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H +#define __ASM_GENERIC_QSPINLOCK_TYPES_H + +/* + * Including atomic.h with PARAVIRT on will cause compilation errors because + * of recursive header file incluson via paravirt_types.h. 
A workaround is + * to include paravirt_types.h here in this case. + */ +#ifdef CONFIG_PARAVIRT +# include <asm/paravirt_types.h> +#else +# include <linux/types.h> +# include <linux/atomic.h> +#endif + +/* + * The queue spinlock data structure - a 32-bit word + * + * The bits assignment are: + * Bit 0 : Set if locked + * Bits 1-7 : Not used + * Bits 8-31: Queue code + */ +typedef struct qspinlock { + atomic_t qlcode; /* Lock + queue code */ +} arch_spinlock_t; + +#define _QCODE_OFFSET 8 +#define _QLOCK_LOCKED 1U +#define _QLOCK_LOCK_MASK 0xff + +#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index d2b32ac..f185584 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -223,3 +223,10 @@ endif config MUTEX_SPIN_ON_OWNER def_bool y depends on SMP && !DEBUG_MUTEXES + +config ARCH_USE_QUEUE_SPINLOCK + bool + +config QUEUE_SPINLOCK + def_bool y if ARCH_USE_QUEUE_SPINLOCK + depends on SMP && !PARAVIRT_SPINLOCKS diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile index baab8e5..e3b3293 100644 --- a/kernel/locking/Makefile +++ b/kernel/locking/Makefile @@ -15,6 +15,7 @@ obj-$(CONFIG_LOCKDEP) += lockdep_proc.o endif obj-$(CONFIG_SMP) += spinlock.o obj-$(CONFIG_PROVE_LOCKING) += spinlock.o +obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o obj-$(CONFIG_RT_MUTEXES) += rtmutex.o obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c new file mode 100644 index 0000000..a3b9ed3 --- /dev/null +++ b/kernel/locking/qspinlock.c @@ -0,0 +1,375 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long at hp.com> + */ +#include <linux/smp.h> +#include <linux/bug.h> +#include <linux/cpumask.h> +#include <linux/percpu.h> +#include <linux/hardirq.h> +#include <linux/mutex.h> +#include <linux/spinlock.h> + +/* + * The basic principle of a queue-based spinlock can best be understood + * by studying a classic queue-based spinlock implementation called the + * MCS lock. The paper below provides a good description for this kind + * of lock. + * + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf + * + * This queue spinlock implementation is based on the MCS lock with twists + * to make it fit the following constraints: + * 1. A max spinlock size of 4 bytes + * 2. Good fastpath performance + * 3. No change in the locking APIs + * + * The queue spinlock fastpath is as simple as it can get, all the heavy + * lifting is done in the lock slowpath. The main idea behind this queue + * spinlock implementation is to keep the spinlock size at 4 bytes while + * at the same time implement a queue structure to queue up the waiting + * lock spinners. + * + * Since preemption is disabled before getting the lock, a given CPU will + * only need to use one queue node structure in a non-interrupt context. 
+ * A percpu queue node structure will be allocated for this purpose and the + * cpu number will be put into the queue spinlock structure to indicate the + * tail of the queue. + * + * To handle spinlock acquisition at interrupt context (softirq or hardirq), + * the queue node structure is actually an array for supporting nested spin + * locking operations in interrupt handlers. If all the entries in the + * array are used up, a warning message will be printed (as that shouldn't + * happen in normal circumstances) and the lock spinner will fall back to + * busy spinning instead of waiting in a queue. + */ + +/* + * The 24-bit queue node code is divided into the following 2 fields: + * Bits 0-1 : queue node index (4 nodes) + * Bits 2-23: CPU number + 1 (4M - 1 CPUs) + * + * A queue node code of 0 indicates that no one is waiting for the lock. + * As the value 0 cannot be used as a valid CPU number. We need to add + * 1 to it before putting it into the queue code. + */ +#define MAX_QNODES 4 +#ifndef _QCODE_VAL_OFFSET +#define _QCODE_VAL_OFFSET _QCODE_OFFSET +#endif + +/* + * Function exit status + */ +enum exitval { + NORMAL_EXIT = 0, + NOTIFY_NEXT , /* Notify the next waiting node CPU */ + RELEASE_NODE /* Release current node directly */ +}; + +/* + * The queue node structure + * + * This structure is essentially the same as the mcs_spinlock structure + * in mcs_spinlock.h file. It is retained for future extension where new + * fields may be added. + */ +struct qnode { + u32 qhead; /* Queue head flag */ + struct qnode *next; /* Next queue node addr */ +}; + +struct qnode_set { + struct qnode nodes[MAX_QNODES]; + int node_idx; /* Current node to use */ +}; + +/* + * Per-CPU queue node structures + */ +static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; + +/** + *_queue_spin_setlock - try to acquire the lock by setting the lock bit + * @lock: Pointer to queue spinlock structure + * Return: 1 if lock bit set successfully, 0 if failed + */ +static __always_inline int queue_spin_setlock(struct qspinlock *lock) +{ + int qlcode = atomic_read(&lock->qlcode); + + if (!(qlcode & _QLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode, + qlcode, qlcode|_QLOCK_LOCKED) == qlcode)) + return 1; + return 0; +} + +/* + ************************************************************************ + * Inline functions used by the queue_spin_lock_slowpath() function * + * that may get superseded by a more optimized version. 
* + ************************************************************************ + */ + +#ifndef queue_get_lock_qcode +/** + * queue_get_lock_qcode - get the lock & qcode values + * @lock : Pointer to queue spinlock structure + * @qcode : Pointer to the returned qcode value + * Return : != 0 if lock is not available, = 0 if lock is free + */ +static inline int +queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode) +{ + int qlcode = atomic_read(&lock->qlcode); + + *qcode = qlcode; + return qlcode & _QLOCK_LOCK_MASK; +} +#endif /* queue_get_lock_qcode */ + +#ifndef queue_spin_trylock_and_clr_qcode +/** + * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously + * @lock : Pointer to queue spinlock structure + * @qcode: The supposedly current qcode value + * Return: true if successful, false otherwise + */ +static inline int +queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode) +{ + return atomic_cmpxchg(&lock->qlcode, qcode, _QLOCK_LOCKED) == qcode; +} +#endif /* queue_spin_trylock_and_clr_qcode */ + +#ifndef queue_encode_qcode +/** + * queue_encode_qcode - Encode the CPU number & node index into a qnode code + * @cpu_nr: CPU number + * @qn_idx: Queue node index + * Return : A qnode code that can be saved into the qspinlock structure + */ +static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx) +{ + return ((cpu_nr + 1) << (_QCODE_VAL_OFFSET + 2)) | + (qn_idx << _QCODE_VAL_OFFSET); +} +#endif /* queue_encode_qcode */ + +#ifndef queue_code_xchg +/** + * queue_code_xchg - exchange a queue code value + * @lock : Pointer to queue spinlock structure + * @ocode: Old queue code in the lock [OUT] + * @ncode: New queue code to be exchanged + * Return: An enum exitval value + */ +static inline enum exitval +queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode) +{ + ncode |= _QLOCK_LOCKED; /* Set lock bit */ + + /* + * Exchange current copy of the queue node code + */ + *ocode = atomic_xchg(&lock->qlcode, ncode); + + if (likely(*ocode & _QLOCK_LOCKED)) { + *ocode &= ~_QLOCK_LOCKED; /* Clear the lock bit */ + return NORMAL_EXIT; + } + /* + * It is possible that we may accidentally steal the lock during + * the unlock-lock transition. If this is the case, we need to either + * release it if not the head of the queue or get the lock and be + * done with it. + */ + if (*ocode == 0) { + u32 qcode; + + /* + * Got the lock since it is at the head of the queue + * Now try to atomically clear the queue code. + */ + qcode = atomic_cmpxchg(&lock->qlcode, ncode, _QLOCK_LOCKED); + /* + * The cmpxchg fails only if one or more tasks are added to + * the queue. In this case, NOTIFY_NEXT is returned instead + * of RELEASE_NODE. + */ + return (qcode != ncode) ? NOTIFY_NEXT : RELEASE_NODE; + } + /* + * Accidentally steal the lock, release the lock and + * let the queue head get it. + */ + queue_spin_unlock(lock); + return NORMAL_EXIT; +} +#endif /* queue_code_xchg */ + +/* + ************************************************************************ + * Other inline functions needed by the queue_spin_lock_slowpath() * + * function. 
* + ************************************************************************ + */ + +/** + * xlate_qcode - translate the queue code into the queue node address + * @qcode: Queue code to be translated + * Return: The corresponding queue node address + */ +static inline struct qnode *xlate_qcode(u32 qcode) +{ + u32 cpu_nr = (qcode >> (_QCODE_VAL_OFFSET + 2)) - 1; + u8 qn_idx = (qcode >> _QCODE_VAL_OFFSET) & 3; + + return per_cpu_ptr(&qnset.nodes[qn_idx], cpu_nr); +} + +/** + * get_qnode - Get a queue node address + * @qn_idx: Pointer to queue node index [out] + * Return : queue node address & queue node index in qn_idx, or NULL if + * no free queue node available. + */ +static inline struct qnode *get_qnode(unsigned int *qn_idx) +{ + struct qnode_set *qset = this_cpu_ptr(&qnset); + int i; + + if (unlikely(qset->node_idx >= MAX_QNODES)) + return NULL; + i = qset->node_idx++; + *qn_idx = i; + return &qset->nodes[i]; +} + +/** + * put_qnode - Return a queue node to the pool + */ +static inline void put_qnode(void) +{ + this_cpu_dec(qnset.node_idx); +} + +/** + * queue_spin_lock_slowpath - acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * @qsval: Current value of the queue spinlock 32-bit word + */ +void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) +{ + unsigned int cpu_nr, qn_idx; + struct qnode *node, *next; + u32 prev_qcode, my_qcode; + enum exitval exitval; + + /* + * Get the queue node + */ + cpu_nr = smp_processor_id(); + node = get_qnode(&qn_idx); + + /* + * It should never happen that all the queue nodes are being used. + */ + BUG_ON(!node); + + /* + * Set up the new cpu code to be exchanged + */ + my_qcode = queue_encode_qcode(cpu_nr, qn_idx); + + /* + * Initialize the queue node + */ + node->qhead = false; + node->next = NULL; + + /* + * The lock may be available at this point, try again if no task was + * waiting in the queue. + */ + if (!(qsval >> _QCODE_OFFSET) && queue_spin_trylock(lock)) + goto release_node; + + /* + * Exchange current copy of the queue node code + */ + exitval = queue_code_xchg(lock, &prev_qcode, my_qcode); + if (unlikely(exitval == NOTIFY_NEXT)) + goto notify_next; + else if (unlikely(exitval == RELEASE_NODE)) + goto release_node; + + if (prev_qcode) { + /* + * Not at the queue head, get the address of the previous node + * and set up the "next" fields of the that node. + */ + struct qnode *prev = xlate_qcode(prev_qcode); + + ACCESS_ONCE(prev->next) = node; + /* + * Wait until the queue head flag is on + */ + while (!smp_load_acquire(&node->qhead)) + arch_mutex_cpu_relax(); + } + + /* + * At the head of the wait queue now + */ + while (true) { + u32 qcode; + + if (queue_get_lock_qcode(lock, &qcode)) + ; /* Lock not available yet */ + else if (qcode != my_qcode) { + /* + * Just get the lock with other spinners waiting + * in the queue. + */ + if (queue_spin_setlock(lock)) + goto notify_next; + } else { + /* + * Get the lock & clear the queue code simultaneously + */ + if (queue_spin_trylock_and_clr_qcode(lock, qcode)) + /* No need to notify the next one */ + goto release_node; + } + arch_mutex_cpu_relax(); + } + +notify_next: + /* + * Wait, if needed, until the next one in queue set up the next field + */ + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + /* + * The next one in queue is now at the head + */ + smp_store_release(&next->qhead, true); + +release_node: + put_qnode(); +} +EXPORT_SYMBOL(queue_spin_lock_slowpath); -- 1.7.1
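As a quick sanity check of the queue code layout described in the
patch description (bits 0-7: lock byte, bits 8-9: per-cpu node index,
bits 10-31: cpu number + 1), the stand-alone user-space program below
mirrors the arithmetic of queue_encode_qcode() and xlate_qcode() above.
It is illustrative only and is not part of the patch.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define _QCODE_OFFSET	8	/* mirrors the generic layout above */
#define MAX_QNODES	4

static uint32_t encode_qcode(uint32_t cpu_nr, uint8_t qn_idx)
{
	return ((cpu_nr + 1) << (_QCODE_OFFSET + 2)) |
	       ((uint32_t)qn_idx << _QCODE_OFFSET);
}

static void decode_qcode(uint32_t qcode, uint32_t *cpu_nr, uint8_t *qn_idx)
{
	*cpu_nr = (qcode >> (_QCODE_OFFSET + 2)) - 1;
	*qn_idx = (qcode >> _QCODE_OFFSET) & (MAX_QNODES - 1);
}

int main(void)
{
	uint32_t cpu, cpu_out;
	uint8_t idx, idx_out;

	for (cpu = 0; cpu < 1024; cpu++) {
		for (idx = 0; idx < MAX_QNODES; idx++) {
			uint32_t qcode = encode_qcode(cpu, idx);

			/* The low 8 bits stay free for the lock byte. */
			assert((qcode & 0xff) == 0);
			decode_qcode(qcode, &cpu_out, &idx_out);
			assert(cpu_out == cpu && idx_out == idx);
		}
	}
	printf("qcode encode/decode round-trip OK\n");
	return 0;
}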
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long <Waiman.Long at hp.com> Acked-by: Rik van Riel <riel at redhat.com> --- arch/x86/Kconfig | 1 + arch/x86/include/asm/qspinlock.h | 41 +++++++++++++++++++++++++++++++++ arch/x86/include/asm/spinlock.h | 5 ++++ arch/x86/include/asm/spinlock_types.h | 4 +++ 4 files changed, 51 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 0af5250..de573f9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -17,6 +17,7 @@ config X86_64 depends on 64BIT select X86_DEV_DMA_OPS select ARCH_USE_CMPXCHG_LOCKREF + select ARCH_USE_QUEUE_SPINLOCK ### Arch settings config X86 diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 0000000..44cefee --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,41 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include <asm-generic/qspinlock_types.h> + +#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) + +#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS + +/* + * x86-64 specific queue spinlock union structure + */ +union arch_qspinlock { + struct qspinlock slock; + u8 lock; /* Lock bit */ +}; + +#define queue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * No special memory barrier other than a compiler one is needed for the + * x86 architecture. A compiler barrier is added at the end to make sure + * that the clearing the lock bit is done ASAP without artificial delay + * due to compiler optimization. 
+ */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + barrier(); + ACCESS_ONCE(qlock->lock) = 0; + barrier(); +} + +#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */ + +#include <asm-generic/qspinlock.h> + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 0f62f54..958d20f 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include <asm/qspinlock.h> +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, { arch_spin_lock(lock); } +#endif /* CONFIG_QUEUE_SPINLOCK */ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 4f1bea1..7960268 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include <asm-generic/qspinlock_types.h> +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include <asm/rwlock.h> -- 1.7.1
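The byte-store unlock above relies on the lock byte being the
low-order byte of the 32-bit lock word, which holds on little-endian
x86.  The following stand-alone program only illustrates that union
aliasing; the union name and the values used are made up, and a
little-endian machine is assumed.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

union demo_qspinlock {
	uint32_t qlcode;	/* whole 32-bit lock word */
	uint8_t  lock;		/* lock byte (bits 0-7 of qlcode) */
};

int main(void)
{
	union demo_qspinlock l;

	l.qlcode = 0xabcd0101;	/* locked, with a queue code present */
	assert(l.lock == 0x01);	/* low byte is the lock byte on x86 */

	l.lock = 0;		/* byte-store unlock, queue code untouched */
	assert(l.qlcode == 0xabcd0100);

	printf("byte-aliased unlock leaves the queue code intact\n");
	return 0;
}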
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 03/11] qspinlock: More optimized code for smaller NR_CPUS
For architectures that support atomic operations on smaller 8 or 16 bits data types. It is possible to simplify the code and produce slightly better optimized code at the expense of smaller number of supported CPUs. The qspinlock code can support up to a maximum of 4M-1 CPUs. With less than 16K CPUs, it is possible to squeeze the queue code into a 2-byte short word which can be accessed directly as a 16-bit short data type. This enables the simplification of the queue code exchange portion of the slowpath code. This patch introduces a new macro _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS which can now be defined in an architecture specific qspinlock.h header file to indicate its support for smaller atomic operation data types. This macro triggers the replacement of some of the generic functions by more optimized versions. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/include/asm/qspinlock.h | 14 ++++- include/asm-generic/qspinlock_types.h | 19 +++++- kernel/locking/qspinlock.c | 100 +++++++++++++++++++++++++++++++++ 3 files changed, 129 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 44cefee..acbe155 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -8,11 +8,23 @@ #define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS /* + * As the qcode will be accessed as a 16-bit word, no offset is needed + */ +#define _QCODE_VAL_OFFSET 0 + +/* * x86-64 specific queue spinlock union structure + * Besides the slock and lock fields, the other fields are only + * valid with less than 16K CPUs. */ union arch_qspinlock { struct qspinlock slock; - u8 lock; /* Lock bit */ + struct { + u8 lock; /* Lock bit */ + u8 reserved; + u16 qcode; /* Queue code */ + }; + u32 qlcode; /* Complete lock word */ }; #define queue_spin_unlock queue_spin_unlock diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index fbfe898..5df2f53 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -33,17 +33,30 @@ /* * The queue spinlock data structure - a 32-bit word * - * The bits assignment are: + * For NR_CPUS >= 16K, the bits assignment are: * Bit 0 : Set if locked * Bits 1-7 : Not used * Bits 8-31: Queue code + * + * For NR_CPUS < 16K, the bits assignment are: + * Bit 0 : Set if locked + * Bits 1-7 : Not used + * Bits 8-15: Reserved for architecture specific optimization + * Bits 16-31: Queue code */ typedef struct qspinlock { atomic_t qlcode; /* Lock + queue code */ } arch_spinlock_t; -#define _QCODE_OFFSET 8 +#if CONFIG_NR_CPUS >= (1 << 14) +# define _Q_MANY_CPUS +# define _QCODE_OFFSET 8 +# define _QLOCK_LOCK_MASK 0xff +#else +# define _QCODE_OFFSET 16 +# define _QLOCK_LOCK_MASK 0xffff +#endif + #define _QLOCK_LOCKED 1U -#define _QLOCK_LOCK_MASK 0xff #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index a3b9ed3..b093a97 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -62,6 +62,10 @@ * Bits 0-1 : queue node index (4 nodes) * Bits 2-23: CPU number + 1 (4M - 1 CPUs) * + * The 16-bit queue node code is divided into the following 2 fields: + * Bits 0-1 : queue node index (4 nodes) + * Bits 2-15: CPU number + 1 (16K - 1 CPUs) + * * A queue node code of 0 indicates that no one is waiting for the lock. * As the value 0 cannot be used as a valid CPU number. We need to add * 1 to it before putting it into the queue code. 
@@ -102,6 +106,101 @@ struct qnode_set { */ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; +/* + ************************************************************************ + * The following optimized codes are for architectures that support: * + * 1) Atomic byte and short data write * + * 2) Byte and short data exchange and compare-exchange instructions * + * * + * For those architectures, their asm/qspinlock.h header file should * + * define the followings in order to use the optimized codes. * + * 1) The _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS macro * + * 2) A "union arch_qspinlock" structure that include the individual * + * fields of the qspinlock structure, including: * + * o slock - the qspinlock structure * + * o lock - the lock byte * + * o qcode - the queue node code * + * o qlcode - the 32-bit qspinlock word * + * * + ************************************************************************ + */ +#ifdef _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS +#ifndef _Q_MANY_CPUS +/* + * With less than 16K CPUs, the following optimizations are possible with + * architectures that allows atomic 8/16 bit operations: + * 1) The 16-bit queue code can be accessed or modified directly as a + * 16-bit short value without disturbing the first 2 bytes. + */ +#define queue_encode_qcode(cpu, idx) (((cpu) + 1) << 2 | (idx)) + +#define queue_code_xchg queue_code_xchg +/** + * queue_code_xchg - exchange a queue code value + * @lock : Pointer to queue spinlock structure + * @ocode: Old queue code in the lock [OUT] + * @ncode: New queue code to be exchanged + * Return: NORMAL_EXIT is always returned + */ +static inline enum exitval +queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + *ocode = xchg(&qlock->qcode, (u16)ncode); + return NORMAL_EXIT; +} + +#define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode +/** + * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously + * @lock : Pointer to queue spinlock structure + * @qcode: The supposedly current qcode value + * Return: true if successful, false otherwise + */ +static inline int +queue_spin_trylock_and_clr_qcode(struct qspinlock *lock, u32 qcode) +{ + qcode <<= _QCODE_OFFSET; + return atomic_cmpxchg(&lock->qlcode, qcode, _QLOCK_LOCKED) == qcode; +} + +#define queue_get_lock_qcode queue_get_lock_qcode +/** + * queue_get_lock_qcode - get the lock & qcode values + * @lock : Pointer to queue spinlock structure + * @qcode : Pointer to the returned qcode value + * Return : != 0 if lock is not available + * = 0 if lock is free + * + * It is considered locked when either the lock bit or the wait bit is set. + */ +static inline int +queue_get_lock_qcode(struct qspinlock *lock, u32 *qcode) +{ + u32 qlcode = (u32)atomic_read(&lock->qlcode); + + *qcode = qlcode >> _QCODE_OFFSET; + return qlcode & _QLOCK_LOCK_MASK; +} +#endif /* _Q_MANY_CPUS */ + +/** + * queue_spin_setlock - try to acquire the lock by setting the lock bit + * @lock: Pointer to queue spinlock structure + * Return: 1 if lock bit set successfully, 0 if failed + */ +static __always_inline int queue_spin_setlock(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0; +} +#else /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */ +/* + * Generic functions for architectures that do not support atomic + * byte or short data types. 
+ */ /** *_queue_spin_setlock - try to acquire the lock by setting the lock bit * @lock: Pointer to queue spinlock structure @@ -116,6 +215,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock) return 1; return 0; } +#endif /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */ /* ************************************************************************ -- 1.7.1
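To illustrate the <16K-CPUs layout introduced by this patch, the
stand-alone sketch below mirrors the bit assignment (lock byte,
reserved byte, 16-bit queue code in the upper half) and the 16-bit
encode formula.  The union name is made up and a little-endian layout
is assumed, as on x86; only the bit layout and the formula come from
the patch.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

union demo_qspinlock {
	uint32_t qlcode;
	struct {
		uint8_t  lock;		/* bit 0: locked */
		uint8_t  reserved;	/* bits 8-15: arch-specific use */
		uint16_t qcode;		/* bits 16-31: queue code */
	};
};

/* 16-bit qcode: bits 0-1 node index, bits 2-15 cpu number + 1 */
static uint16_t encode_qcode16(uint32_t cpu_nr, uint8_t qn_idx)
{
	return (uint16_t)(((cpu_nr + 1) << 2) | qn_idx);
}

int main(void)
{
	union demo_qspinlock l = { .qlcode = 0 };

	l.lock = 1;				/* lock held */
	l.qcode = encode_qcode16(5, 2);		/* CPU 5, node index 2 */

	assert((l.qlcode & 0xff) == 1);		/* lock byte untouched */
	assert((l.qlcode >> 16) == (((5 + 1) << 2) | 2));

	/* 14 bits for cpu+1 cap the supported CPU count at 16K - 1. */
	printf("max CPUs with a 16-bit qcode: %u\n", (1u << 14) - 1);
	return 0;
}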
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 04/11] qspinlock: Optimized code path for 2 contending tasks
A major problem with the queue spinlock patch is its performance at low contention level (2-4 contending tasks) where it is slower than the corresponding ticket spinlock code. The following table shows the execution time (in ms) of a micro-benchmark where 5M iterations of the lock/unlock cycles were run on a 10-core Westere-EX x86-64 CPU with 2 different types loads - standalone (lock and protected data in different cachelines) and embedded (lock and protected data in the same cacheline). [Standalone/Embedded] # of tasks Ticket lock Queue lock %Change ---------- ----------- ---------- ------- 1 135/111 135/102 0%/-8% 2 1045/950 1943/2022 +86%/+113% 3 1827/1783 2372/2428 +30%/+36% 4 2689/2725 2934/2934 +9%/+8% 5 3736/3748 3658/3652 -2%/-3% 6 4942/4984 4434/4428 -10%/-11% 7 6304/6319 5176/5163 -18%/-18% 8 7736/7629 5955/5944 -23%/-22% It can be seen that the performance degradation is particular bad with 2 and 3 contending tasks. To reduce that performance deficit at low contention level, a special specific optimized code path for 2 contending tasks was added. This special code path can only be activated with less than 16K of configured CPUs because it uses a byte in the 32-bit lock word to hold a waiting bit for the 2nd contending tasks instead of queuing the waiting task in the queue. With the change, the performance data became: [Standalone/Embedded] # of tasks Ticket lock Queue lock %Change ---------- ----------- ---------- ------- 2 1045/950 1070/1061 +2%/+12% In a multi-socketed server, the optimized code path also seems to produce a pretty good performance improvement in cross-node contention traffic at low contention level. The table below show the performance with 1 contending task per node: [Standalone] # of nodes Ticket lock Queue lock %Change ---------- ----------- ---------- ------- 1 135 135 0% 2 4452 1483 -67% 3 10767 13432 +25% 4 20835 10796 -48% Except some drop in performance at the 3 contending tasks level, the queue spinlock performs much better than the ticket spinlock at 2 and 4 contending tasks level. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/include/asm/qspinlock.h | 3 +- kernel/locking/qspinlock.c | 149 ++++++++++++++++++++++++++++++++++++-- 2 files changed, 146 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index acbe155..7f3129c 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -21,9 +21,10 @@ union arch_qspinlock { struct qspinlock slock; struct { u8 lock; /* Lock bit */ - u8 reserved; + u8 wait; /* Waiting bit */ u16 qcode; /* Queue code */ }; + u16 lock_wait; /* Lock and wait bits */ u32 qlcode; /* Complete lock word */ }; diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b093a97..d2da0c0 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -121,6 +121,8 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; * o lock - the lock byte * * o qcode - the queue node code * * o qlcode - the 32-bit qspinlock word * + * o wait - the waiting byte * + * o lock_wait - the combined lock and waiting bytes * * * ************************************************************************ */ @@ -131,8 +133,109 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; * architectures that allows atomic 8/16 bit operations: * 1) The 16-bit queue code can be accessed or modified directly as a * 16-bit short value without disturbing the first 2 bytes. 
+ * 2) The 2nd byte of the 32-bit lock word can be used as a pending bit + * for waiting lock acquirer so that it won't need to go through the + * MCS style locking queuing which has a higher overhead. */ + +/* + * Masks for the lock and wait bits + */ +#define _QLOCK_WAIT_SHIFT 8 /* Waiting bit position */ +#define _QLOCK_WAITING (1 << _QLOCK_WAIT_SHIFT) +#define _QLOCK_LW_MASK (_QLOCK_WAITING | _QLOCK_LOCKED) + #define queue_encode_qcode(cpu, idx) (((cpu) + 1) << 2 | (idx)) +#define queue_get_qcode(lock) (atomic_read(&(lock)->qlcode) >> _QCODE_OFFSET) + +#define queue_spin_trylock_quick queue_spin_trylock_quick +/** + * queue_spin_trylock_quick - quick spinning on the queue spinlock + * @lock : Pointer to queue spinlock structure + * @qsval: Old queue spinlock value + * Return: 1 if lock acquired, 0 if failed + * + * This is an optimized contention path for 2 contending tasks. It + * should only be entered if no task is waiting in the queue. + * + * The lock and wait bits can be in one of following 4 states: + * + * State lock wait + * ----- --------- + * [0] 0 0 Lock is free and no one is waiting + * [1] 1 0 Lock is not available, but no one is waiting + * [2] 0 1 Lock is free, but a waiter is present + * [3] 1 1 Lock is not available and a waiter is present + * + * A task entering the quick path will set the wait bit and be in either + * either states 2 or 3. The allowable transitions are: + * + * [3] => [2] => [1] => [0] + * ^ | + * +-------------+ + */ +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + /* + * Fall into the quick spinning code path only if no task is waiting + * in the queue. + */ + while (likely(!(qsval >> _QCODE_OFFSET))) { + if ((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK) { + /* + * Both the lock and wait bits are set, wait a while + * to see if that changes. It not, quit the quick path. + */ + arch_mutex_cpu_relax(); + cpu_relax(); + qsval = atomic_read(&lock->qlcode); + if ((qsval >> _QCODE_OFFSET) || + ((qsval & _QLOCK_LW_MASK) == _QLOCK_LW_MASK)) + return 0; + } + + /* + * Try to set the corresponding waiting bit + */ + if (xchg(&qlock->wait, _QLOCK_WAITING >> 8)) { + /* + * Wait bit was set already, try again after some delay + * as the waiter will probably get the lock & clear + * the wait bit. + * + * There are 2 cpu_relax() calls to make sure that + * the wait is longer than that of the + * smp_load_acquire() loop below. + */ + arch_mutex_cpu_relax(); + cpu_relax(); + qsval = atomic_read(&lock->qlcode); + continue; + } + + /* + * Now wait until the lock bit is cleared + */ + while (smp_load_acquire(&qlock->qlcode) & _QLOCK_LOCKED) + arch_mutex_cpu_relax(); + + /* + * Set the lock bit & clear the waiting bit simultaneously + * No lock stealing is allowed when this quick path is active. + */ + ACCESS_ONCE(qlock->lock_wait) = _QLOCK_LOCKED; + return 1; + } + return 0; +} + +/* + * With the qspinlock quickpath logic activated, disable the trylock logic + * in the slowpath as it will be redundant. 
+ */ +#define queue_spin_trylock(lock) (0) #define queue_code_xchg queue_code_xchg /** @@ -140,14 +243,40 @@ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; * @lock : Pointer to queue spinlock structure * @ocode: Old queue code in the lock [OUT] * @ncode: New queue code to be exchanged - * Return: NORMAL_EXIT is always returned + * @qsval: Old queue spinlock value + * Return: Either NORMAL_EXIT or RELEASE_NODE */ static inline enum exitval -queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode) +queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode, int qsval) { union arch_qspinlock *qlock = (union arch_qspinlock *)lock; *ocode = xchg(&qlock->qcode, (u16)ncode); + if ((*ocode == 0) && ((qsval & _QLOCK_LW_MASK) != _QLOCK_LW_MASK)) { + /* + * When no one is waiting in the queue before, try to fall + * back into the optimized 2-task contending code path. + */ + u32 qlcode = ACCESS_ONCE(qlock->qlcode); + + if ((qlcode != ((ncode << _QCODE_OFFSET)|_QLOCK_LOCKED)) || + (cmpxchg(&qlock->qlcode, qlcode, + _QLOCK_LOCKED|_QLOCK_WAITING) != qlcode)) + return NORMAL_EXIT; + /* + * Successfully fall back to the optimized code path. + * Now wait until the lock byte is cleared + */ + while (smp_load_acquire(&qlock->qlcode) & _QLOCK_LOCKED) + arch_mutex_cpu_relax(); + + /* + * Set the lock bit & clear the waiting bit simultaneously + * No lock stealing is allowed when this quick path is active. + */ + ACCESS_ONCE(qlock->lock_wait) = _QLOCK_LOCKED; + return RELEASE_NODE; + } return NORMAL_EXIT; } @@ -194,7 +323,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock) { union arch_qspinlock *qlock = (union arch_qspinlock *)lock; - return cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0; + return cmpxchg(&qlock->lock_wait, 0, _QLOCK_LOCKED) == 0; } #else /* _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS */ /* @@ -223,6 +352,10 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock) * that may get superseded by a more optimized version. * ************************************************************************ */ +#ifndef queue_spin_trylock_quick +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval) +{ return 0; } +#endif #ifndef queue_get_lock_qcode /** @@ -275,10 +408,11 @@ static inline u32 queue_encode_qcode(u32 cpu_nr, u8 qn_idx) * @lock : Pointer to queue spinlock structure * @ocode: Old queue code in the lock [OUT] * @ncode: New queue code to be exchanged + * @qsval: Old queue spinlock value (not used) * Return: An enum exitval value */ static inline enum exitval -queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode) +queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode, int qsval) { ncode |= _QLOCK_LOCKED; /* Set lock bit */ @@ -380,6 +514,11 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) enum exitval exitval; /* + * Try the quick spinning code path + */ + if (queue_spin_trylock_quick(lock, qsval)) + return; + /* * Get the queue node */ cpu_nr = smp_processor_id(); @@ -411,7 +550,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) /* * Exchange current copy of the queue node code */ - exitval = queue_code_xchg(lock, &prev_qcode, my_qcode); + exitval = queue_code_xchg(lock, &prev_qcode, my_qcode, qsval); if (unlikely(exitval == NOTIFY_NEXT)) goto notify_next; else if (unlikely(exitval == RELEASE_NODE)) -- 1.7.1
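For reference, the execution times quoted in this and the earlier
patches come from a lock/unlock micro-benchmark of the kind sketched
below.  This is not the author's tool: it uses a pthread spinlock
purely as a stand-in lock so the harness is self-contained, and the
number of contending tasks is taken from the command line.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS	5000000L	/* 5M lock/unlock cycles per task */
#define MAX_TASKS	64

static pthread_spinlock_t lock;		/* stand-in for the kernel lock */
static volatile long protected_data;

static void *worker(void *arg)
{
	long i;

	(void)arg;
	for (i = 0; i < ITERATIONS; i++) {
		pthread_spin_lock(&lock);
		protected_data++;	/* tiny critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int ntasks = (argc > 1) ? atoi(argv[1]) : 2;
	pthread_t tid[MAX_TASKS];
	struct timespec start, end;
	int i;

	if (ntasks < 1)
		ntasks = 1;
	if (ntasks > MAX_TASKS)
		ntasks = MAX_TASKS;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ntasks; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < ntasks; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%d tasks: %.0f ms\n", ntasks,
	       (end.tv_sec - start.tv_sec) * 1e3 +
	       (end.tv_nsec - start.tv_nsec) / 1e6);
	return 0;
}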
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 05/11] pvqspinlock, x86: Allow unfair spinlock in a PV guest
Locking is always an issue in a virtualized environment as the virtual CPU that is waiting on a lock may get scheduled out and hence block any progress in lock acquisition even when the lock has been freed. One solution to this problem is to allow unfair lock in a para-virtualized environment. In this case, a new lock acquirer can come and steal the lock if the next-in-line CPU to get the lock is scheduled out. Other CPUs queued behind the scheduled-out CPU may also attempt to steal the lock, though at a lower frequency (1/16) to give the queue head a higher chance of getting the lock. An alternative is to use a simple unfair byte lock which can, in some cases, provide a bit better performance than the queue unfair lock mentioned above. However, the chance of lock starvation problem is also likely to be higher. Unfair lock in a native environment is generally not a good idea as there is a possibility of lock starvation for a heavily contended lock. This patch add a new configuration option for the x86 architecture to enable the use of unfair queue spinlock (PARAVIRT_UNFAIR_LOCKS) in a para-virtualized guest. A jump label (paravirt_unfairlocks_enabled) is used to switch between a fair and an unfair version of the spinlock code. This jump label will only be enabled in a PV guest. Enabling this configuration feature causes a slight decrease the performance of an uncontended lock-unlock operation by about 1-2% mainly due to the use of a static key. However, uncontended lock-unlock operation are really just a tiny percentage of a real workload. So there should no noticeable change in application performance. With the unfair locking activated on bare metal 4-socket Westmere-EX box, the execution times (in ms) of a spinlock micro-benchmark were as follows: # of Ticket Fair Unfair Unfair tasks lock queue lock queue lock byte lock ------ ------- ---------- ---------- --------- 1 135 135 137 137 2 1045 1120 462 462 3 1827 2345 794 963 4 2689 2934 1323 1706 5 3736 3658 1902 2127 6 4942 4434 2733 2980 7 6304 5176 3416 3491 8 7736 5955 4342 3955 Executing one task per node, the performance data were: # of Ticket Fair Unfair Unfair nodes lock queue lock queue lock byte lock ------ ------- ---------- ---------- --------- 1 135 135 137 137 2 4452 1736 715 710 3 10767 13432 1508 1468 4 20835 10796 2697 2582 The performance of a simple unfair byte lock is pretty close to the unfair queue lock. Of course there are pretty big variation in the execution times of each individual task. For the 4 nodes case above, the standard deviation was 785ms. In general, the shorter the critical section, the better the performance benefit of an unfair lock. For large critical section, however, there may not be much benefit. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/Kconfig | 11 ++ arch/x86/include/asm/qspinlock.h | 72 +++++++++++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/paravirt-spinlocks.c | 7 + kernel/locking/qspinlock.c | 227 +++++++++++++++++++++++++++++++++- 5 files changed, 312 insertions(+), 6 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index de573f9..010abc4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -629,6 +629,17 @@ config PARAVIRT_SPINLOCKS If you are unsure how to answer this question, answer Y. 
+config PARAVIRT_UNFAIR_LOCKS + bool "Enable unfair locks in a para-virtualized guest" + depends on PARAVIRT && SMP && QUEUE_SPINLOCK + depends on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE + ---help--- + This changes the kernel to use unfair locks in a + para-virtualized guest. This will help performance in most + cases. However, there is a possibility of lock starvation + on a heavily contended lock especially in a large guest + with many virtual CPUs. + source "arch/x86/xen/Kconfig" config KVM_GUEST diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 7f3129c..3fdc0e2 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -51,4 +51,76 @@ static inline void queue_spin_unlock(struct qspinlock *lock) #include <asm-generic/qspinlock.h> +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +/** + * queue_spin_lock_unfair - acquire a queue spinlock unfairly + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + if (likely(cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0)) + return; + /* + * Since the lock is now unfair, we should not activate the 2-task + * quick spinning code path which disallows lock stealing. + */ + queue_spin_lock_slowpath(lock, -1); +} + +/** + * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + if (!qlock->lock && (cmpxchg(&qlock->lock, 0, _QLOCK_LOCKED) == 0)) + return 1; + return 0; +} + +/* + * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will + * jump to the unfair versions if the static key paravirt_unfairlocks_enabled + * is true. 
+ */ +#undef arch_spin_lock +#undef arch_spin_trylock +#undef arch_spin_lock_flags + +extern struct static_key paravirt_unfairlocks_enabled; + +/** + * arch_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static inline void arch_spin_lock(struct qspinlock *lock) +{ + if (static_key_false(¶virt_unfairlocks_enabled)) + queue_spin_lock_unfair(lock); + else + queue_spin_lock(lock); +} + +/** + * arch_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static inline int arch_spin_trylock(struct qspinlock *lock) +{ + if (static_key_false(¶virt_unfairlocks_enabled)) + return queue_spin_trylock_unfair(lock); + else + return queue_spin_trylock(lock); +} + +#define arch_spin_lock_flags(l, f) arch_spin_lock(l) + +#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ + #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index cb648c8..1107a20 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -88,6 +88,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o obj-$(CONFIG_KVM_GUEST) += kvm.o kvmclock.o obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o +obj-$(CONFIG_PARAVIRT_UNFAIR_LOCKS)+= paravirt-spinlocks.o obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index bbb6c73..a50032a 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -8,6 +8,7 @@ #include <asm/paravirt.h> +#ifdef CONFIG_PARAVIRT_SPINLOCKS struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), @@ -18,3 +19,9 @@ EXPORT_SYMBOL(pv_lock_ops); struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +#endif + +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +struct static_key paravirt_unfairlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_unfairlocks_enabled); +#endif diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index d2da0c0..309b0d6 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -86,13 +86,13 @@ enum exitval { /* * The queue node structure - * - * This structure is essentially the same as the mcs_spinlock structure - * in mcs_spinlock.h file. It is retained for future extension where new - * fields may be added. 
*/ struct qnode { u32 qhead; /* Queue head flag */ +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS + u32 prev_qcode; /* Queue code of previous node */ + struct qnode *qprev; /* Previous queue node addr */ +#endif struct qnode *next; /* Next queue node addr */ }; @@ -280,6 +280,22 @@ queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode, int qsval) return NORMAL_EXIT; } +#define cmpxchg_queue_code cmpxchg_queue_code +/** + * cmpxchg_queue_code - compare and exchange a queue code value in the lock + * @lock : Pointer to queue spinlock structure + * @ocode: Old queue code value + * @ncode: New queue code value + * Return: true if compare and exchange successful, false otherwise + */ +static inline int +cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + return cmpxchg(&qlock->qcode, (u16)ocode, (u16)ncode) == (u16)ocode; +} + #define queue_spin_trylock_and_clr_qcode queue_spin_trylock_and_clr_qcode /** * queue_spin_trylock_and_clr_qcode - Try to lock & clear qcode simultaneously @@ -455,6 +471,184 @@ queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode, int qsval) } #endif /* queue_code_xchg */ +#ifndef cmpxchg_queue_code +/** + * cmpxchg_queue_code - compare and exchange a queue code value in the lock + * @lock : Pointer to queue spinlock structure + * @ocode: Old queue code value + * @ncode: New queue code value + * Return: true if compare and exchange successful, false otherwise + */ +static inline int +cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode) +{ + u32 lockval = atomic_read(lock->qlcode) & _QLOCK_LOCK_MASK; + + ocode |= lockval; + ncode |= lockval; + return atomic_cmpxchg(&lock->qlcode, ocode, ncode) == ocode; +} +#endif /* cmpxchg_queue_code */ + +/* + ************************************************************************ + * Inline functions for supporting unfair queue lock * + ************************************************************************ + */ +/* + * Unfair lock support in a paravirtualized guest + * + * With unfair lock enabled, lock stealing is allowed in the following + * places: + * 1) In the spin_lock and spin_trylock functiopns + * 2) When spinning in the waiter queue before becoming the queue head + * + * A lock acquirer has only one chance of stealing the lock in the spin_lock + * and spin_trylock function. If the attempt fails for spin_lock, the task + * will be queued in the wait queue. In there, the task will still attempt + * to steal the lock periodically at a lesser frequency to give the queue + * head a higher chance of getting the lock. This is to ensure that there + * will be some forward progress to avoid as much of the lock starvation + * problem as possible as well as reducing the load on the lock cacheline. 
+ */ +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +#define DEF_LOOP_VAR(c) int c = 0 +#define INC_LOOP_VAR(c) (c)++ +#define LOOP_VAR(c) c +#define LSTEAL_FREQ (1 << 4) +#define LSTEAL_MASK (LSTEAL_FREQ - 1) + +/** + * unfair_init_vars - initialize unfair relevant fields in queue node structure + * @node: Current queue node address + */ +static void unfair_init_vars(struct qnode *node) +{ + node->qprev = NULL; + node->prev_qcode = 0; +} + +/** + * unfair_set_vars - set unfair related fields in the queue node structure + * @node : Current queue node address + * @prev : Previous queue node address + * @prev_qcode: Previous qcode value + */ +static void +unfair_set_vars(struct qnode *node, struct qnode *prev, u32 prev_qcode) +{ + node->qprev = prev; + node->prev_qcode = prev_qcode; + /* Make sure the new fields are visible to others */ + smp_wmb(); +} + +/** + * unfair_check_qcode - check the current to see if it is the last one + * and need to be cleared. + * @lock : Pointer to queue spinlock structure + * @my_qcode: My queue code value to be checked again + * Return : Either RELEASE_NODE or NOTIFY_NEXT + */ +static enum exitval unfair_check_qcode(struct qspinlock *lock, u32 my_qcode) +{ + u32 qcode; + + if (!static_key_false(¶virt_unfairlocks_enabled)) + return NOTIFY_NEXT; + + (void)queue_get_lock_qcode(lock, &qcode); + /* + * Try to clear the current queue code if it match my_qcode + */ + if ((qcode == my_qcode) && cmpxchg_queue_code(lock, my_qcode, 0)) + return RELEASE_NODE; + return NOTIFY_NEXT; +} + +/** + * unfair_get_lock - try to steal the lock periodically + * @lock : Pointer to queue spinlock structure + * @node : Current queue node address + * @my_qcode: My queue code value + * @count : Loop count + * Return : NORMAL_EXIT, RELEASE_NODE or NOTIFY_NEXT + */ +static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node, + u32 my_qcode, int count) +{ + u32 qcode, pqcode; + int qhead; + struct qnode *next; + + if (!static_key_false(¶virt_unfairlocks_enabled)) + return NORMAL_EXIT; + + if (((count & LSTEAL_MASK) != LSTEAL_MASK) || + !queue_spin_trylock_unfair(lock)) + return NORMAL_EXIT; + + /* + * Have stolen the lock, need to remove itself from the wait queue + * 1) If it is at the end of the queue, change the qcode in the lock + * word to the one before it. + * 2) Change the next pointer in the previous queue node to point + * to the next one in queue or NULL if it is at the end of queue. + * 3) If a next node is present, copy the prev_qcode and qprev values + * to the next node. + */ + (void)queue_get_lock_qcode(lock, &qcode); + qhead = ACCESS_ONCE(node->qhead); + pqcode = qhead ? 0 : node->prev_qcode; + + if ((qcode == my_qcode) && cmpxchg_queue_code(lock, my_qcode, pqcode)) { + /* + * Successfully change the qcode back to the previous one. + * Now need to clear the next pointer in the previous node + * only if it contains my queue node address. Whether the + * cmpxchg() call below fails or succeeds doesn't really + * matter. + */ + (void)cmpxchg(&node->qprev->next, node, NULL); + return RELEASE_NODE; + } + + /* Wait until the next pointer is set */ + while (!(next = ACCESS_ONCE(node->next))) + arch_mutex_cpu_relax(); + + if (!qhead) { + /* + * Change the node data only if it is not the queue head + */ + ACCESS_ONCE(node->qprev->next) = next; + ACCESS_ONCE(next->qprev) = node->qprev; + ACCESS_ONCE(next->prev_qcode) = node->prev_qcode; + + /* + * Make sure all the new node information are visible + * before proceeding. 
+ */ + smp_wmb(); + return RELEASE_NODE; + } + return NOTIFY_NEXT; +} + +#else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ +#define DEF_LOOP_VAR(c) +#define INC_LOOP_VAR(c) +#define LOOP_VAR(c) 0 + +static void unfair_init_vars(struct qnode *node) {} +static void unfair_set_vars(struct qnode *node, struct qnode *prev, + u32 prev_qcode) {} +static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node, + u32 my_qcode, int count) { return NORMAL_EXIT; } +static enum exitval unfair_check_qcode(struct qspinlock *lock, u32 my_qcode) + { return NORMAL_EXIT; } +#endif /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ + /* ************************************************************************ * Other inline functions needed by the queue_spin_lock_slowpath() * @@ -539,6 +733,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) */ node->qhead = false; node->next = NULL; + unfair_init_vars(node); /* * The lock may be available at this point, try again if no task was @@ -562,13 +757,23 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) * and set up the "next" fields of the that node. */ struct qnode *prev = xlate_qcode(prev_qcode); + DEF_LOOP_VAR(cnt); + unfair_set_vars(node, prev, prev_qcode); ACCESS_ONCE(prev->next) = node; /* * Wait until the queue head flag is on */ - while (!smp_load_acquire(&node->qhead)) + while (!smp_load_acquire(&node->qhead)) { + INC_LOOP_VAR(cnt); arch_mutex_cpu_relax(); + exitval = unfair_get_lock(lock, node, my_qcode, + LOOP_VAR(cnt)); + if (unlikely(exitval == RELEASE_NODE)) + goto release_node; + else if (unlikely(exitval == NOTIFY_NEXT)) + goto notify_next; + } } /* @@ -584,8 +789,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) * Just get the lock with other spinners waiting * in the queue. */ - if (queue_spin_setlock(lock)) + if (queue_spin_setlock(lock)) { + /* + * The unfair lock stealing code may have the + * next one get the lock first and thus don't + * need to be notify. + * Need to recheck the qcode value in the lock + */ + if (unlikely(unfair_check_qcode(lock, my_qcode) + == RELEASE_NODE)) + goto release_node; goto notify_next; + } } else { /* * Get the lock & clear the queue code simultaneously -- 1.7.1
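Condensed, the waiter-side stealing policy implemented by unfair_get_lock()
above amounts to the loop sketched below. This is a simplified illustration
only, not code from the patch: the unqueueing details (qcode and next/qprev
fixup) are omitted, though the identifiers are the ones the patch uses.

	/*
	 * Sketch: a queued waiter spins for the qhead flag and, once every
	 * LSTEAL_FREQ (16) iterations, attempts to steal the lock with a
	 * cmpxchg on the lock byte. On success it must also unlink its
	 * node from the wait queue, as unfair_get_lock() does.
	 */
	while (!smp_load_acquire(&node->qhead)) {
		arch_mutex_cpu_relax();
		if (((++cnt & LSTEAL_MASK) != LSTEAL_MASK) ||
		    !queue_spin_trylock_unfair(lock))
			continue;
		break;	/* lock stolen; unlink the node and return */
	}

This keeps the common case (waiting for the qhead flag) cheap while still
letting queued waiters make progress if the queue head is scheduled out.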
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
This patch adds a KVM init function to activate the unfair queue spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config option is selected. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/kernel/kvm.c | 17 +++++++++++++++++ 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 713f1b3..a489140 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void) early_initcall(kvm_spinlock_init_jump); #endif /* CONFIG_PARAVIRT_SPINLOCKS */ + +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +/* + * Enable unfair lock if running in a real para-virtualized environment + */ +static __init int kvm_unfair_locks_init_jump(void) +{ + if (!kvm_para_available()) + return 0; + + static_key_slow_inc(¶virt_unfairlocks_enabled); + printk(KERN_INFO "KVM setup unfair spinlock\n"); + + return 0; +} +early_initcall(kvm_unfair_locks_init_jump); +#endif -- 1.7.1
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
This patch adds a XEN init function to activate the unfair queue spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config option is selected. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/xen/setup.c | 19 +++++++++++++++++++ 1 files changed, 19 insertions(+), 0 deletions(-) diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 0982233..66bb6f5 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -625,3 +625,22 @@ void __init xen_arch_setup(void) numa_off = 1; #endif } + +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS +/* + * Enable unfair lock if running in a Xen guest + */ +static __init int xen_unfair_locks_init_jump(void) +{ + /* + * Disable unfair lock if not running in a PV domain + */ + if (!xen_pv_domain()) + return 0; + + static_key_slow_inc(¶virt_unfairlocks_enabled); + + return 0; +} +early_initcall(xen_unfair_locks_init_jump); +#endif -- 1.7.1
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 08/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a more generic paravirt_spinlocks_enabled name. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/include/asm/spinlock.h | 4 ++-- arch/x86/kernel/kvm.c | 2 +- arch/x86/kernel/paravirt-spinlocks.c | 4 ++-- arch/x86/xen/spinlock.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 958d20f..428d0d1 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -39,7 +39,7 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 << 15) -extern struct static_key paravirt_ticketlocks_enabled; +extern struct static_key paravirt_spinlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); #ifdef CONFIG_QUEUE_SPINLOCK @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { if (TICKET_SLOWPATH_FLAG && - static_key_false(¶virt_ticketlocks_enabled)) { + static_key_false(¶virt_spinlocks_enabled)) { arch_spinlock_t prev; prev = *lock; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index a489140..f318e78 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -818,7 +818,7 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(¶virt_ticketlocks_enabled); + static_key_slow_inc(¶virt_spinlocks_enabled); printk(KERN_INFO "KVM setup paravirtual spinlock\n"); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index a50032a..8c67cbe 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -17,8 +17,8 @@ struct pv_lock_ops pv_lock_ops = { }; EXPORT_SYMBOL(pv_lock_ops); -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; +EXPORT_SYMBOL(paravirt_spinlocks_enabled); #endif #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 581521c..06f4a64 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -290,7 +290,7 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_pvspin) return 0; - static_key_slow_inc(¶virt_ticketlocks_enabled); + static_key_slow_inc(¶virt_spinlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.7.1
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 09/11] pvqspinlock, x86: Add qspinlock para-virtualization support
This patch adds para-virtualization support to the queue spinlock in
the same way as was done in the PV ticket lock code. In essence, the
lock waiters will spin for a specified number of times
(QSPIN_THRESHOLD = 2^14) and then halt themselves. The queue head
waiter, unlike the other waiters, will spin 2*QSPIN_THRESHOLD times
before halting itself. Before being halted, the queue head waiter will
set a flag (_QLOCK_LOCKED_SLOWPATH) in the lock byte to indicate that
the unlock slowpath has to be invoked.

In the unlock slowpath, the current lock holder will find the queue
head by following the previous node pointer links stored in the queue
node structure until it finds one that has the qhead flag turned on.
It then attempts to kick the CPU of the queue head. After the queue
head has acquired the lock, it will also check the status of the next
node and set _QLOCK_LOCKED_SLOWPATH if that node has been halted.

Enabling the PV code does have a performance impact on spinlock
acquisitions and releases. The following table shows the execution
times (in ms) of a spinlock micro-benchmark that does lock/unlock
operations 5M times for each task versus the number of contending
tasks on a Westmere-EX system.

  # of    Ticket lock              Queue lock
  tasks   PV off/PV on/%Change     PV off/PV on/%Change
  ------  --------------------     ---------------------
    1       135/  179/+33%           137/  169/+23%
    2      1045/ 1103/ +6%          1120/ 1574/+41%
    3      1827/ 2683/+47%          2313/ 2664/+15%
    4      2689/ 4191/+56%          2914/ 3208/+10%
    5      3736/ 5830/+56%          3715/ 3776/ +2%
    6      4942/ 7609/+54%          4504/ 4536/ +1%
    7      6304/ 9570/+52%          5292/ 5294/  0%
    8      7736/11323/+46%          6037/ 6028/  0%

It can be seen that the ticket lock PV code has a fairly big decrease
in performance when there are 3 or more contending tasks. The queue
spinlock PV code, on the other hand, only has a relatively minor drop
in performance for 3 or more contending tasks. At 5 or more contending
tasks, there is practically no difference in performance. When coupled
with the unfair lock, the queue spinlock can be much faster than the
PV ticket lock.
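The spin-then-halt thresholds described above can be summarized by the
following sketch. QSPIN_THRESHOLD comes from the patch; halt_self() and
set_slowpath_flag() are stand-in names for the hypervisor halt hook and
the lock-byte cmpxchg used in the real code, so treat this purely as an
illustration rather than the actual implementation.

	#define QSPIN_THRESHOLD	(1U << 14)

	/* Queue member: halt after QSPIN_THRESHOLD spins, wait to be kicked */
	if (++count >= QSPIN_THRESHOLD)
		halt_self();			/* stand-in for the hibernate hook */

	/* Queue head: spin twice as long, then flag the unlock slowpath */
	if (++count >= 2 * QSPIN_THRESHOLD) {
		set_slowpath_flag(lock);	/* sets _QLOCK_LOCKED_SLOWPATH */
		halt_self();			/* woken by the unlocker's kick */
	}
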
Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/include/asm/paravirt.h | 12 ++- arch/x86/include/asm/paravirt_types.h | 5 + arch/x86/include/asm/pvqspinlock.h | 249 +++++++++++++++++++++++++++++++++ arch/x86/include/asm/qspinlock.h | 35 +++++ arch/x86/kernel/paravirt-spinlocks.c | 5 + kernel/locking/qspinlock.c | 121 +++++++++++++++- 6 files changed, 420 insertions(+), 7 deletions(-) create mode 100644 arch/x86/include/asm/pvqspinlock.h diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cd6e161..ab32020 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -711,7 +711,17 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, } #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK +static __always_inline void __queue_kick_cpu(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick_cpu, cpu); +} +static __always_inline void __queue_hibernate(void) +{ + PVOP_VCALL0(pv_lock_ops.hibernate); +} +#else static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -723,7 +733,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } - +#endif #endif #ifdef CONFIG_X86_32 diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..2879b32 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -334,8 +334,13 @@ typedef u16 __ticket_t; #endif struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*kick_cpu)(int cpu); + void (*hibernate)(void); +#else struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h new file mode 100644 index 0000000..67dd163 --- /dev/null +++ b/arch/x86/include/asm/pvqspinlock.h @@ -0,0 +1,249 @@ +#ifndef _ASM_X86_PVQSPINLOCK_H +#define _ASM_X86_PVQSPINLOCK_H + +/* + * Queue Spinlock Para-Virtualization (PV) Support + * + * +------+ +-----+ next +----+ + * | Lock | |Queue|----------->|Next| + * |Holder|<-----------|Head |<-----------|Node| + * +------+ prev_qcode +-----+ prev_qcode +----+ + * + * The PV support code for queue spinlock is roughly the same as that + * of the ticket spinlock. Each CPU waiting for the lock will spin until it + * reaches a threshold. When that happens, it will put itself to halt so + * that the hypervisor can reuse the CPU cycles in some other guests. + * + * A major difference between the two versions of PV support is the fact + * that the queue head will spin twice as long as the other nodes before it + * puts itself to halt. + * + * There are 2 places where race can happen: + * 1) Halting of the queue head CPU (in pv_head_spin_check) and the CPU + * kicking by the lock holder (in pv_kick_node). + * 2) Halting of the queue node CPU (in pv_queue_spin_check) and the + * the status check by the previous queue head (in pv_next_node_check). + * See the comments on those functions to see how the races are being + * addressed. + */ + +/* + * Spin threshold for queue spinlock + * This is half of the ticket lock's SPIN_THRESHOLD. The queue head will + * be halted after 2*QSPIN_THRESHOLD whereas the other nodes will be + * halted after QSPIN_THRESHOLD. 
+ */ +#define QSPIN_THRESHOLD (1U<<14) + +/* + * CPU state flags + */ +#define PV_CPU_ACTIVE 1 /* This CPU is active */ +#define PV_CPU_KICKED 2 /* This CPU is being kicked */ +#define PV_CPU_HALTED -1 /* This CPU is halted */ + +/* + * Additional fields to be added to the qnode structure + */ +#if CONFIG_NR_CPUS >= (1 << 16) +#define _cpuid_t u32 +#else +#define _cpuid_t u16 +#endif + +struct qnode; + +struct pv_qvars { + s8 cpustate; /* CPU status flag */ + s8 qhead; /* Becoming queue head */ + _cpuid_t mycpu; /* CPU number of this node */ + struct qnode *prev; /* Pointer to previous node */ +}; + +/* + * Macro to be used by the unfair lock code to access the previous node pointer + * in the pv structure. + */ +#define qprev pv.prev + +/** + * pv_init_vars - initialize fields in struct pv_qvars + * @pv : pointer to struct pv_qvars + * @cpu: current CPU number + */ +static __always_inline void pv_init_vars(struct pv_qvars *pv, int cpu) +{ + pv->cpustate = PV_CPU_ACTIVE; + pv->prev = NULL; + pv->qhead = 0; + pv->mycpu = cpu; +} + +/** + * pv_head_spin_check - perform para-virtualization checks for queue head + * @pv : pointer to struct pv_qvars + * @count : loop count + * @qcode : queue code of the supposed lock holder + * @lock : pointer to the qspinlock structure + * + * The following checks will be done: + * 2) Halt itself if lock is still not available after 2*QSPIN_THRESHOLD + */ +static __always_inline void pv_head_spin_check(struct pv_qvars *pv, int *count, + u32 qcode, struct qspinlock *lock) +{ + if (!static_key_false(¶virt_spinlocks_enabled)) + return; + + if (unlikely(*count >= 2*QSPIN_THRESHOLD)) { + u8 lockval; + + /* + * Set the lock byte to _QLOCK_LOCKED_SLOWPATH before + * trying to hibernate itself. It is possible that the + * lock byte had been set to _QLOCK_LOCKED_SLOWPATH + * already (spurious wakeup of queue head after a halt). + * In this case, just proceeds to sleeping. + * + * queue head lock holder + * ---------- ----------- + * cpustate = PV_CPU_HALTED + * [1] cmpxchg(_QLOCK_LOCKED [2] cmpxchg(_QLOCK_LOCKED => 0) + * => _QLOCK_LOCKED_SLOWPATH) if (cmpxchg fails && + * if (cmpxchg succeeds) cpustate == PV_CPU_HALTED) + * halt() kick() + * + * Sequence: + * 1,2 - slowpath flag set, queue head halted & lock holder + * will call slowpath + * 2,1 - queue head cmpxchg fails, halt is aborted + * + * If the queue head CPU is woken up by a spurious interrupt + * at the same time as the lock holder check the cpustate, + * it is possible that the lock holder will try to kick + * the queue head CPU which isn't halted. 
+ */ + ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED; + lockval = cmpxchg(&((union arch_qspinlock *)lock)->lock, + _QLOCK_LOCKED, _QLOCK_LOCKED_SLOWPATH); + if (lockval == 0) { + /* + * Can exit now as the lock is free + */ + ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE; + *count = 0; + return; + } + __queue_hibernate(); + ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE; + *count = 0; /* Reset count */ + } +} + +/** + * pv_queue_spin_check - perform para-virtualization checks for queue member + * @pv : pointer to struct pv_qvars + * @count: loop count + */ +static __always_inline void pv_queue_spin_check(struct pv_qvars *pv, int *count) +{ + if (!static_key_false(¶virt_spinlocks_enabled)) + return; + /* + * Attempt to halt oneself after QSPIN_THRESHOLD spins + */ + if (unlikely(*count >= QSPIN_THRESHOLD)) { + /* + * Time to hibernate itself + */ + ACCESS_ONCE(pv->cpustate) = PV_CPU_HALTED; + /* + * In order to avoid the racing between pv_next_node_check() + * and pv_queue_spin_check(), 2 variables handshake is used + * to make sure that pv_next_node_check() won't miss setting + * the _QLOCK_LOCKED_SLOWPATH when the CPU is about to be + * halted. + * + * pv_next_node_check pv_queue_spin_check + * ------------------ ------------------- + * [1] qhead = true [3] cpustate = PV_CPU_HALTED + * barrier() barrier() + * [2] if (cpustate [4] if (qhead) + * == PV_CPU_HALTED) + * + * Sequence: + * *,1,*,4,* - halt is aborted as the qhead flag is set, + * _QLOCK_LOCKED_SLOWPATH may or may not be set + * 3,4,1,2 - the CPU is halt and _QLOCK_LOCKED_SLOWPATH is set + */ + barrier(); + if (!ACCESS_ONCE(pv->qhead)) + __queue_hibernate(); + else + pv->qhead = false; + ACCESS_ONCE(pv->cpustate) = PV_CPU_ACTIVE; + *count = 0; /* Reset count */ + } +} + +/** + * pv_next_node_check - set _QLOCK_LOCKED_SLOWPATH flag if the next node + * is halted + * @pv : pointer to struct pv_qvars + * @count: loop count + * + * The current CPU should have gotten the lock before calling this function. + */ +static __always_inline void +pv_next_node_check(struct pv_qvars *pv, struct qspinlock *lock) +{ + if (!static_key_false(¶virt_spinlocks_enabled)) + return; + pv->qhead = true; + /* + * Make sure qhead flag is visible before checking the cpustate flag + */ + barrier(); + if (ACCESS_ONCE(pv->cpustate) == PV_CPU_HALTED) + ACCESS_ONCE(((union arch_qspinlock *)lock)->lock) + = _QLOCK_LOCKED_SLOWPATH; +} + +/** + * pv_set_prev - set previous queue node pointer + * @pv : pointer to struct pv_qvars to be set + * @prev: pointer to the previous node + */ +static __always_inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev) +{ + ACCESS_ONCE(pv->prev) = prev; +} + +/* + * The following inlined functions are being used by the + * queue_spin_unlock_slowpath() function. 
+ */ + +/** + * pv_get_prev - get previous queue node pointer + * @pv : pointer to struct pv_qvars to be set + * Return: the previous queue node pointer + */ +static __always_inline struct qnode *pv_get_prev(struct pv_qvars *pv) +{ + return ACCESS_ONCE(pv->prev); +} + +/** + * pv_kick_node - kick up the CPU of the given node + * @pv : pointer to struct pv_qvars of the node to be kicked + */ +static __always_inline void pv_kick_node(struct pv_qvars *pv) +{ + if (pv->cpustate != PV_CPU_HALTED) + return; + ACCESS_ONCE(pv->cpustate) = PV_CPU_KICKED; + __queue_kick_cpu(pv->mycpu); +} + +#endif /* _ASM_X86_PVQSPINLOCK_H */ diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 3fdc0e2..d1685d1 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -38,7 +38,11 @@ union arch_qspinlock { * that the clearing the lock bit is done ASAP without artificial delay * due to compiler optimization. */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +static __always_inline void __queue_spin_unlock(struct qspinlock *lock) +#else static inline void queue_spin_unlock(struct qspinlock *lock) +#endif { union arch_qspinlock *qlock = (union arch_qspinlock *)lock; @@ -47,6 +51,37 @@ static inline void queue_spin_unlock(struct qspinlock *lock) barrier(); } +#ifdef CONFIG_PARAVIRT_SPINLOCKS +/* + * The lock byte can have a value of _QLOCK_LOCKED_SLOWPATH to indicate + * that it needs to go through the slowpath to do the unlocking. + */ +#define _QLOCK_LOCKED_SLOWPATH 3 /* Set both bits 0 & 1 */ + +extern void queue_spin_unlock_slowpath(struct qspinlock *lock); + +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; + + barrier(); + if (static_key_false(¶virt_spinlocks_enabled)) { + /* + * Need to atomically clear the lock byte to avoid racing with + * queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH. 
+ */ + if (likely(cmpxchg(&qlock->lock, _QLOCK_LOCKED, 0) + == _QLOCK_LOCKED)) + return; + else + queue_spin_unlock_slowpath(lock); + + } else { + __queue_spin_unlock(lock); + } +} +#endif + #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */ #include <asm-generic/qspinlock.h> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 8c67cbe..d98547f 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -11,9 +11,14 @@ #ifdef CONFIG_PARAVIRT_SPINLOCKS struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP +#ifdef CONFIG_QUEUE_SPINLOCK + .kick_cpu = paravirt_nop, + .hibernate = paravirt_nop, +#else .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif +#endif }; EXPORT_SYMBOL(pv_lock_ops); diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 309b0d6..820217e 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -58,6 +58,29 @@ */ /* + * Para-virtualized queue spinlock support + */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#include <asm/pvqspinlock.h> +#else + +#define PV_SET_VAR(type, var, val) +#define PV_VAR(var) 0 + +struct qnode; +struct pv_qvars {}; +static inline void pv_init_vars(struct pv_qvars *pv, int cpu_nr) {} +static inline void pv_head_spin_check(struct pv_qvars *pv, int *count, + u32 qcode, struct qspinlock *lock) {} +static inline void pv_queue_spin_check(struct pv_qvars *pv, int *count) {} +static inline void pv_next_node_check(struct pv_qvars *pv, void *lock) {} +static inline void pv_kick_node(struct pv_qvars *pv) {} +static inline void pv_set_prev(struct pv_qvars *pv, struct qnode *prev) {} +static inline struct qnode *pv_get_prev(struct pv_qvars *pv) +{ return NULL; } +#endif + +/* * The 24-bit queue node code is divided into the following 2 fields: * Bits 0-1 : queue node index (4 nodes) * Bits 2-23: CPU number + 1 (4M - 1 CPUs) @@ -86,13 +109,19 @@ enum exitval { /* * The queue node structure + * + * If CONFIG_PARAVIRT_SPINLOCKS, the previous node pointer in the pv structure + * will be used by the unfair lock code. */ struct qnode { u32 qhead; /* Queue head flag */ #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS u32 prev_qcode; /* Queue code of previous node */ +#ifndef CONFIG_PARAVIRT_SPINLOCKS struct qnode *qprev; /* Previous queue node addr */ #endif +#endif + struct pv_qvars pv; /* Para-virtualization */ struct qnode *next; /* Next queue node addr */ }; @@ -102,6 +131,20 @@ struct qnode_set { }; /* + * Allow spinning loop count only if either PV spinlock or unfair lock is + * configured. + */ +#if defined(CONFIG_PARAVIRT_UNFAIR_LOCKS) || defined(CONFIG_PARAVIRT_SPINLOCKS) +#define DEF_LOOP_VAR(c) int c = 0 +#define INC_LOOP_VAR(c) (c)++ +#define LOOP_VAR(c) c +#else +#define DEF_LOOP_VAR(c) +#define INC_LOOP_VAR(c) +#define LOOP_VAR(c) 0 +#endif + +/* * Per-CPU queue node structures */ static DEFINE_PER_CPU_ALIGNED(struct qnode_set, qnset) = { { { 0 } }, 0 }; @@ -373,6 +416,10 @@ static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval) { return 0; } #endif +#ifndef queue_get_qcode +#define queue_get_qcode(lock) (atomic_read(&(lock)->qlcode) & ~_QLOCK_LOCKED) +#endif + #ifndef queue_get_lock_qcode /** * queue_get_lock_qcode - get the lock & qcode values @@ -512,9 +559,6 @@ cmpxchg_queue_code(struct qspinlock *lock, u32 ocode, u32 ncode) * problem as possible as well as reducing the load on the lock cacheline. 
*/ #ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS -#define DEF_LOOP_VAR(c) int c = 0 -#define INC_LOOP_VAR(c) (c)++ -#define LOOP_VAR(c) c #define LSTEAL_FREQ (1 << 4) #define LSTEAL_MASK (LSTEAL_FREQ - 1) @@ -636,9 +680,6 @@ static enum exitval unfair_get_lock(struct qspinlock *lock, struct qnode *node, } #else /* CONFIG_PARAVIRT_UNFAIR_LOCKS */ -#define DEF_LOOP_VAR(c) -#define INC_LOOP_VAR(c) -#define LOOP_VAR(c) 0 static void unfair_init_vars(struct qnode *node) {} static void unfair_set_vars(struct qnode *node, struct qnode *prev, @@ -706,6 +747,17 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) struct qnode *node, *next; u32 prev_qcode, my_qcode; enum exitval exitval; + DEF_LOOP_VAR(hcnt); + +#ifdef CONFIG_PARAVIRT_SPINLOCKS + /* + * Disable the quick spinning code path if PV spinlock is enabled to + * make sure that all the spinning CPUs can be halted when the lock + * holder is somehow scheduled out. + */ + if (static_key_false(¶virt_spinlocks_enabled)) + qsval = -1; +#endif /* * Try the quick spinning code path @@ -734,6 +786,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) node->qhead = false; node->next = NULL; unfair_init_vars(node); + pv_init_vars(&node->pv, cpu_nr); /* * The lock may be available at this point, try again if no task was @@ -760,7 +813,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) DEF_LOOP_VAR(cnt); unfair_set_vars(node, prev, prev_qcode); + pv_set_prev(&node->pv, prev); ACCESS_ONCE(prev->next) = node; + /* * Wait until the queue head flag is on */ @@ -773,7 +828,10 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) goto release_node; else if (unlikely(exitval == NOTIFY_NEXT)) goto notify_next; + pv_queue_spin_check(&node->pv, LOOP_VAR(&cnt)); } + } else { + ACCESS_ONCE(node->qhead) = true; } /* @@ -782,6 +840,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) while (true) { u32 qcode; + INC_LOOP_VAR(hcnt); if (queue_get_lock_qcode(lock, &qcode)) ; /* Lock not available yet */ else if (qcode != my_qcode) { @@ -810,6 +869,12 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval) goto release_node; } arch_mutex_cpu_relax(); + + /* + * Perform para-virtualization checks + */ + pv_head_spin_check(&node->pv, LOOP_VAR(&hcnt), prev_qcode, + lock); } notify_next: @@ -821,9 +886,53 @@ notify_next: /* * The next one in queue is now at the head */ + pv_next_node_check(&next->pv, lock); smp_store_release(&next->qhead, true); release_node: put_qnode(); } EXPORT_SYMBOL(queue_spin_lock_slowpath); + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +/** + * queue_spin_unlock_slowpath - kick up the CPU of the queue head + * @lock : Pointer to queue spinlock structure + * + * The lock is released after finding the queue head to avoid racing + * condition between the queue head and the lock holder. + */ +void queue_spin_unlock_slowpath(struct qspinlock *lock) +{ + struct qnode *node, *prev; + u32 qcode = (u32)queue_get_qcode(lock); + + /* + * Get the queue tail node + */ + node = xlate_qcode(qcode); + + /* + * Locate the queue head node by following the prev pointer from + * tail to head. + * It is assumed that the PV guests won't have that many CPUs so + * that it won't take a long time to follow the pointers. 
+ */ + while (!ACCESS_ONCE(node->qhead)) { + prev = pv_get_prev(&node->pv); + if (prev) + node = prev; + else + /* + * Delay a bit to allow the prev pointer to be set up + */ + arch_mutex_cpu_relax(); + } + /* + * Found the queue head, now release the lock before waking it up + */ + __queue_spin_unlock(lock); + pv_kick_node(&node->pv); +} +EXPORT_SYMBOL(queue_spin_unlock_slowpath); +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ -- 1.7.1
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH v7 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
This patch adds the necessary KVM specific code to allow KVM to support
the sleeping and CPU kicking operations needed by the queue spinlock PV
code.

Two KVM guests of 20 CPU cores (2 nodes) were created for performance
testing. With only one KVM guest powered on (no overcommit), the disk
workload of the AIM7 benchmark was run on both ext4 and xfs RAM disks
at 3000 users on a 3.14-rc6 based kernel. The JPM (jobs/minute) data
of the test run were:

    kernel                     XFS FS   %change   ext4 FS   %change
    ------                     ------   -------   -------   -------
    PV ticketlock (baseline)  2409639      -      1289398      -
    qspinlock                 2425876    +0.7%    1291248    +0.1%
    PV qspinlock              2425876    +0.7%    1299639    +0.8%
    unfair qspinlock          2435724    +1.1%    1441153   +11.8%
    unfair + PV qspinlock     2479339    +2.9%    1459854   +13.2%

The XFS test had moderate spinlock contention of 1.6% whereas the
ext4 test had heavy spinlock contention of 15.4% as reported by perf.

A more interesting performance comparison is when there is overcommit.
With both PV guests turned on (all 20 CPUs equally shared between the
two guests - 200% overcommit) and the "idle=poll" kernel option used
to simulate a busy guest, the same AIM7 workloads were run again. The
results are shown in the tables below.

XFS Test:
    kernel                    JPM     Real Time   Sys Time   Usr Time
    -----                     ---     ---------   --------   --------
    PV ticketlock           597015      29.99      424.44      25.85
    qspinlock               117493     153.20     1006.06      59.60
    PV qspinlock            616438      29.20      397.57      23.31
    unfair qspinlock        642398      28.02      396.42      25.42
    unfair + PV qspinlock   633803      28.40      400.45      26.16

ext4 Test:
    kernel                    JPM     Real Time   Sys Time   Usr Time
    -----                     ---     ---------   --------   --------
    PV ticketlock           120984     148.78     2378.98      29.27
    qspinlock                54995     327.30     5023.58      54.73
    PV qspinlock            124215     144.91     2282.22      28.89
    unfair qspinlock        467411      38.51      481.80      25.80
    unfair + PV qspinlock   471080      38.21      482.40      25.09

The kernel build test (make -j 20) results are as follows:

    kernel                   Real Time    Sys Time     Usr Time
    -----                    ---------    --------     --------
    PV ticketlock            20m6.158s    41m7.167s    283m3.790s
    qspinlock                26m41.294s   74m55.585s   346m31.981s
    PV qspinlock             20m17.429s   41m19.434s   281m21.238s
    unfair qspinlock         19m58.315s   40m18.011s   279m27.177s
    unfair + PV qspinlock    20m0.030s    40m35.011s   278m50.522s

With no overcommit, the PV code doesn't really do anything. The unfair
lock does provide some performance advantage depending on the workload.

In an overcommitted PV guest, however, there can be a big performance
benefit with PV and unfair locks. In terms of spinlock contention, the
ordering of the 3 workloads is:

    kernel build < AIM7 disk xfs < AIM7 disk ext4

With light spinlock contention, the PV ticketlock can be a bit faster
than PV qspinlock. On moderate to high spinlock contention, PV
qspinlock performs better. The unfair lock has the best performance
in all cases, especially with heavy spinlock contention.
Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/kernel/kvm.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++ kernel/Kconfig.locks | 2 +- 2 files changed, 83 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index f318e78..c28bc1b 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu) kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid); } +#ifndef CONFIG_QUEUE_SPINLOCK enum kvm_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -795,6 +796,82 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) } } } +#else /* !CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_KVM_DEBUG_FS +static struct dentry *d_spin_debug; +static struct dentry *d_kvm_debug; +static u32 kick_stats; /* CPU kick count */ +static u32 hibernate_stats; /* Hibernation count */ + +static int __init kvm_spinlock_debugfs(void) +{ + d_kvm_debug = debugfs_create_dir("kvm-guest", NULL); + if (!d_kvm_debug) { + printk(KERN_WARNING + "Could not create 'kvm' debugfs directory\n"); + return -ENOMEM; + } + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug); + + debugfs_create_u32("kick_stats", 0644, d_spin_debug, &kick_stats); + debugfs_create_u32("hibernate_stats", + 0644, d_spin_debug, &hibernate_stats); + return 0; +} + +static inline void inc_kick_stats(void) +{ + add_smp(&kick_stats, 1); +} + +static inline void inc_hib_stats(void) +{ + add_smp(&hibernate_stats, 1); +} + +fs_initcall(kvm_spinlock_debugfs); + +#else /* CONFIG_KVM_DEBUG_FS */ +static inline void inc_kick_stats(void) +{ +} + +static inline void inc_hib_stats(void) +{ + +} +#endif /* CONFIG_KVM_DEBUG_FS */ + +static void kvm_kick_cpu_type(int cpu) +{ + kvm_kick_cpu(cpu); + inc_kick_stats(); +} + +/* + * Halt the current CPU & release it back to the host + */ +static void kvm_hibernate(void) +{ + unsigned long flags; + + if (in_nmi()) + return; + + inc_hib_stats(); + /* + * Make sure an interrupt handler can't upset things in a + * partially setup state. + */ + local_irq_save(flags); + if (arch_irqs_disabled_flags(flags)) + halt(); + else + safe_halt(); + local_irq_restore(flags); +} +#endif /* !CONFIG_QUEUE_SPINLOCK */ /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. @@ -807,8 +884,13 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; +#ifdef CONFIG_QUEUE_SPINLOCK + pv_lock_ops.kick_cpu = kvm_kick_cpu_type; + pv_lock_ops.hibernate = kvm_hibernate; +#else pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning); pv_lock_ops.unlock_kick = kvm_unlock_kick; +#endif } static __init int kvm_spinlock_init_jump(void) diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index f185584..a70fdeb 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP && !PARAVIRT_SPINLOCKS + depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN) -- 1.7.1
Waiman Long
2014-Mar-19 20:14 UTC
[PATCH RFC v7 11/11] pvqspinlock, x86: Enable qspinlock PV support for XEN
This patch adds the necessary KVM specific code to allow XEN to support the sleeping and CPU kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long <Waiman.Long at hp.com> --- arch/x86/xen/spinlock.c | 90 ++++++++++++++++++++++++++++++++++++++++++++--- kernel/Kconfig.locks | 2 +- 2 files changed, 86 insertions(+), 6 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 06f4a64..78891ee 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,12 @@ #include "xen-ops.h" #include "debugfs.h" +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifndef CONFIG_QUEUE_SPINLOCK + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +106,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -213,6 +216,74 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } +#else /* CONFIG_QUEUE_SPINLOCK */ + +#ifdef CONFIG_XEN_DEBUG_FS +static u32 kick_stats; /* CPU kick count */ +static u32 hibernate_stats; /* Hibernation count */ + +static inline void inc_kick_stats(void) +{ + add_smp(&kick_stats, 1); +} + +static inline void inc_hib_stats(void) +{ + add_smp(&hibernate_stats, 1); +} +#else /* CONFIG_XEN_DEBUG_FS */ +static inline void inc_kick_stats(void) +{ +} + +static inline void inc_hib_stats(void) +{ + +} +#endif /* CONFIG_XEN_DEBUG_FS */ + +static void xen_kick_cpu_type(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); + inc_kick_stats(); +} + +/* + * Halt the current CPU & release it back to the host + */ +static void xen_hibernate(void) +{ + int irq = __this_cpu_read(lock_kicker_irq); + unsigned long flags; + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* + * Make sure an interrupt handler can't upset things in a + * partially setup state. + */ + local_irq_save(flags); + + inc_hib_stats(); + /* clear pending */ + xen_clear_irq_pending(irq); + + /* Allow interrupts while blocked */ + local_irq_restore(flags); + + /* + * If an interrupt happens here, it will leave the wakeup irq + * pending, which will cause xen_poll_irq() to return + * immediately. 
+ */ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); +} +#endif /* CONFIG_QUEUE_SPINLOCK */ + static irqreturn_t dummy_handler(int irq, void *dev_id) { BUG(); @@ -258,7 +329,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -275,8 +345,13 @@ void __init xen_init_spinlocks(void) return; } +#ifdef CONFIG_QUEUE_SPINLOCK + pv_lock_ops.kick_cpu = xen_kick_cpu_type; + pv_lock_ops.hibernate = xen_hibernate; +#else pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; +#endif } /* @@ -318,6 +393,7 @@ static int __init xen_spinlock_debugfs(void) d_spin_debug = debugfs_create_dir("spinlocks", d_xen); +#ifndef CONFIG_QUEUE_SPINLOCK debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats); debugfs_create_u32("taken_slow", 0444, d_spin_debug, @@ -337,7 +413,11 @@ static int __init xen_spinlock_debugfs(void) debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug, spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1); - +#else /* CONFIG_QUEUE_SPINLOCK */ + debugfs_create_u32("kick_stats", 0644, d_spin_debug, &kick_stats); + debugfs_create_u32("hibernate_stats", + 0644, d_spin_debug, &hibernate_stats); +#endif /* CONFIG_QUEUE_SPINLOCK */ return 0; } fs_initcall(xen_spinlock_debugfs); diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index a70fdeb..451e392 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN) + depends on SMP -- 1.7.1
Konrad Rzeszutek Wilk
2014-Mar-19 20:24 UTC
[PATCH v7 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
On Wed, Mar 19, 2014 at 04:14:00PM -0400, Waiman Long wrote:> This patch makes the necessary changes at the x86 architecture > specific layer to enable the use of queue spinlock for x86-64. As > x86-32 machines are typically not multi-socket. The benefit of queue > spinlock may not be apparent. So queue spinlock is not enabled. > > Currently, there is some incompatibilities between the para-virtualized > spinlock code (which hard-codes the use of ticket spinlock) and the > queue spinlock. Therefore, the use of queue spinlock is disabled when > the para-virtualized spinlock is enabled.And how does this patch do that? I think that comment is obsolete?> > The arch/x86/include/asm/qspinlock.h header file includes some x86 > specific optimization which will make the queue spinlock code perform > better than the generic implementation. > > Signed-off-by: Waiman Long <Waiman.Long at hp.com> > Acked-by: Rik van Riel <riel at redhat.com> > --- > arch/x86/Kconfig | 1 + > arch/x86/include/asm/qspinlock.h | 41 +++++++++++++++++++++++++++++++++ > arch/x86/include/asm/spinlock.h | 5 ++++ > arch/x86/include/asm/spinlock_types.h | 4 +++ > 4 files changed, 51 insertions(+), 0 deletions(-) > create mode 100644 arch/x86/include/asm/qspinlock.h > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 0af5250..de573f9 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -17,6 +17,7 @@ config X86_64 > depends on 64BIT > select X86_DEV_DMA_OPS > select ARCH_USE_CMPXCHG_LOCKREF > + select ARCH_USE_QUEUE_SPINLOCK > > ### Arch settings > config X86 > diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h > new file mode 100644 > index 0000000..44cefee > --- /dev/null > +++ b/arch/x86/include/asm/qspinlock.h > @@ -0,0 +1,41 @@ > +#ifndef _ASM_X86_QSPINLOCK_H > +#define _ASM_X86_QSPINLOCK_H > + > +#include <asm-generic/qspinlock_types.h> > + > +#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) > + > +#define _ARCH_SUPPORTS_ATOMIC_8_16_BITS_OPS > + > +/* > + * x86-64 specific queue spinlock union structure > + */ > +union arch_qspinlock { > + struct qspinlock slock; > + u8 lock; /* Lock bit */ > +}; > + > +#define queue_spin_unlock queue_spin_unlock > +/** > + * queue_spin_unlock - release a queue spinlock > + * @lock : Pointer to queue spinlock structure > + * > + * No special memory barrier other than a compiler one is needed for the > + * x86 architecture. A compiler barrier is added at the end to make sure > + * that the clearing the lock bit is done ASAP without artificial delay > + * due to compiler optimization. 
> + */ > +static inline void queue_spin_unlock(struct qspinlock *lock) > +{ > + union arch_qspinlock *qlock = (union arch_qspinlock *)lock; > + > + barrier(); > + ACCESS_ONCE(qlock->lock) = 0; > + barrier(); > +} > + > +#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */ > + > +#include <asm-generic/qspinlock.h> > + > +#endif /* _ASM_X86_QSPINLOCK_H */ > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h > index 0f62f54..958d20f 100644 > --- a/arch/x86/include/asm/spinlock.h > +++ b/arch/x86/include/asm/spinlock.h > @@ -42,6 +42,10 @@ > extern struct static_key paravirt_ticketlocks_enabled; > static __always_inline bool static_key_false(struct static_key *key); > > +#ifdef CONFIG_QUEUE_SPINLOCK > +#include <asm/qspinlock.h> > +#else > + > #ifdef CONFIG_PARAVIRT_SPINLOCKS > > static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) > @@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, > { > arch_spin_lock(lock); > } > +#endif /* CONFIG_QUEUE_SPINLOCK */ > > static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) > { > diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h > index 4f1bea1..7960268 100644 > --- a/arch/x86/include/asm/spinlock_types.h > +++ b/arch/x86/include/asm/spinlock_types.h > @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; > > #define TICKET_SHIFT (sizeof(__ticket_t) * 8) > > +#ifdef CONFIG_QUEUE_SPINLOCK > +#include <asm-generic/qspinlock_types.h> > +#else > typedef struct arch_spinlock { > union { > __ticketpair_t head_tail; > @@ -33,6 +36,7 @@ typedef struct arch_spinlock { > } arch_spinlock_t; > > #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } > +#endif /* CONFIG_QUEUE_SPINLOCK */ > > #include <asm/rwlock.h> > > -- > 1.7.1 >
Konrad Rzeszutek Wilk
2014-Mar-19 20:28 UTC
[PATCH v7 07/11] pvqspinlock, x86: Allow unfair queue spinlock in a XEN guest
On Wed, Mar 19, 2014 at 04:14:05PM -0400, Waiman Long wrote:> This patch adds a XEN init function to activate the unfair queue > spinlock in a XEN guest when the PARAVIRT_UNFAIR_LOCKS kernel config > option is selected. > > Signed-off-by: Waiman Long <Waiman.Long at hp.com> > --- > arch/x86/xen/setup.c | 19 +++++++++++++++++++ > 1 files changed, 19 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c > index 0982233..66bb6f5 100644 > --- a/arch/x86/xen/setup.c > +++ b/arch/x86/xen/setup.c > @@ -625,3 +625,22 @@ void __init xen_arch_setup(void) > numa_off = 1; > #endif > } > + > +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS > +/* > + * Enable unfair lock if running in a Xen guest > + */ > +static __init int xen_unfair_locks_init_jump(void) > +{ > + /* > + * Disable unfair lock if not running in a PV domain > + */ > + if (!xen_pv_domain()) > + return 0;I would just make this 'xen_domain'. Not sure why you need to have it only for PV while the PVHVM guests can also use it? Would it also make sense to use the same printk statement that the KVM has?> + > + static_key_slow_inc(¶virt_unfairlocks_enabled); > + > + return 0; > +} > +early_initcall(xen_unfair_locks_init_jump); > +#endif > -- > 1.7.1 >
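The change Konrad is suggesting would look roughly like the sketch below.
It is untested and only illustrative: it widens the check from PV-only to
any Xen domain via the existing xen_domain() helper and adds a boot
message similar to the KVM one.

#include <xen/xen.h>

static __init int xen_unfair_locks_init_jump(void)
{
	/* Cover PVHVM guests as well as PV guests */
	if (!xen_domain())
		return 0;

	static_key_slow_inc(&paravirt_unfairlocks_enabled);
	printk(KERN_INFO "Xen: setup unfair spinlock\n");

	return 0;
}
early_initcall(xen_unfair_locks_init_jump);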
Paolo Bonzini
2014-Mar-20 22:01 UTC
[PATCH v7 06/11] pvqspinlock, x86: Allow unfair queue spinlock in a KVM guest
Il 19/03/2014 21:14, Waiman Long ha scritto:> This patch adds a KVM init function to activate the unfair queue > spinlock in a KVM guest when the PARAVIRT_UNFAIR_LOCKS kernel config > option is selected. > > Signed-off-by: Waiman Long <Waiman.Long at hp.com> > --- > arch/x86/kernel/kvm.c | 17 +++++++++++++++++ > 1 files changed, 17 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index 713f1b3..a489140 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -826,3 +826,20 @@ static __init int kvm_spinlock_init_jump(void) > early_initcall(kvm_spinlock_init_jump); > > #endif /* CONFIG_PARAVIRT_SPINLOCKS */ > + > +#ifdef CONFIG_PARAVIRT_UNFAIR_LOCKS > +/* > + * Enable unfair lock if running in a real para-virtualized environment > + */ > +static __init int kvm_unfair_locks_init_jump(void) > +{ > + if (!kvm_para_available()) > + return 0; > + > + static_key_slow_inc(¶virt_unfairlocks_enabled); > + printk(KERN_INFO "KVM setup unfair spinlock\n"); > + > + return 0; > +} > +early_initcall(kvm_unfair_locks_init_jump); > +#endif >No! Please do what I asked you to do. You are not handling Hyper-V or VMWare. Just use X86_FEATURE_HYPERVISOR and it will cover all hypervisors that actually follow Intel's guidelines. Paolo
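What Paolo is asking for would, roughly, replace the kvm_para_available()
test with a check of the CPUID hypervisor bit. A sketch follows; the
function name and placement are illustrative only, but boot_cpu_has()
and X86_FEATURE_HYPERVISOR are existing x86 facilities.

static __init int unfair_locks_init_jump(void)
{
	/* Any hypervisor following Intel's guidelines sets this CPUID bit */
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return 0;

	static_key_slow_inc(&paravirt_unfairlocks_enabled);
	printk(KERN_INFO "Unfair spinlock enabled in paravirtualized guest\n");

	return 0;
}
early_initcall(unfair_locks_init_jump);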
Paolo Bonzini
2014-Mar-20 22:07 UTC
[PATCH v7 10/11] pvqspinlock, x86: Enable qspinlock PV support for KVM
Il 19/03/2014 21:14, Waiman Long ha scritto:> This patch adds the necessary KVM specific code to allow KVM to support > the sleeping and CPU kicking operations needed by the queue spinlock PV > code.The remaining problem of this patch is that you cannot get the generically optimal configuration: qspinlock on baremetal, PV qspinlock on KVM/Xen, unfair qspinlock on VMware/Hyper-V. You really need to disable unfair locks in kvm_spinlock_init. Paolo> The XFS test had moderate spinlock contention of 1.6% whereas the > ext4 test had heavy spinlock contention of 15.4% as reported by perf. > > A more interesting performance comparison is when there is > overcommit. With both PV guests (all 20 CPUs equally shared with > both guests - 200% overcommit) turned on and the "idle=poll" kernel > option to simulate a busy guest. The results of running the same AIM7 > workloads are shown in the table below. > > XFS Test: > kernel JPM Real Time Sys Time Usr Time > ----- --- --------- -------- -------- > PV ticketlock 597015 29.99 424.44 25.85 > qspinlock 117493 153.20 1006.06 59.60 > PV qspinlock 616438 29.20 397.57 23.31 > unfair qspinlock 642398 28.02 396.42 25.42 > unfair + PV qspinlock 633803 28.40 400.45 26.16 > > ext4 Test: > kernel JPM Real Time Sys Time Usr Time > ----- --- --------- -------- -------- > PV ticketlock 120984 148.78 2378.98 29.27 > qspinlock 54995 327.30 5023.58 54.73 > PV qspinlock 124215 144.91 2282.22 28.89 > unfair qspinlock 467411 38.51 481.80 25.80 > unfair + PV qspinlock 471080 38.21 482.40 25.09 > > The kernel build test (make -j 20) results are as follows: > > kernel Real Time Sys Time Usr Time > ----- --------- -------- -------- > PV ticketlock 20m6.158s 41m7.167s 283m3.790s > qspinlock 26m41.294s 74m55.585s 346m31.981s > PV qspinlock 20m17.429s 41m19.434s 281m21.238s > unfair qspinlock 19m58.315s 40m18.011s 279m27.177s > unfair + PV qspinlock 20m0.030s 40m35.011s 278m50.522s > > With no overcommit, the PV code doesn't really do anything. The unfair > lock does provide some performance advantage depending on the workload. > > In an overcommited PV guest, however, there can be a big performance > benefit with PV and unfair lock. In term of spinlock contention, > the ordering of the 3 workloads are: > > kernel build < AIM7 disk xfs < AIM7 disk ext4 > > With light spinlock contention, the PV ticketlock can be a bit > faster than PV qspinlock. On moderate to high spinlock contention, > PV qspinlock performs better. The unfair lock has the best performance > in all cases, especially with heavy spinlock contention. 
> > Signed-off-by: Waiman Long <Waiman.Long at hp.com> > --- > arch/x86/kernel/kvm.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++ > kernel/Kconfig.locks | 2 +- > 2 files changed, 83 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index f318e78..c28bc1b 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -568,6 +568,7 @@ static void kvm_kick_cpu(int cpu) > kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid); > } > > +#ifndef CONFIG_QUEUE_SPINLOCK > enum kvm_contention_stat { > TAKEN_SLOW, > TAKEN_SLOW_PICKUP, > @@ -795,6 +796,82 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) > } > } > } > +#else /* !CONFIG_QUEUE_SPINLOCK */ > + > +#ifdef CONFIG_KVM_DEBUG_FS > +static struct dentry *d_spin_debug; > +static struct dentry *d_kvm_debug; > +static u32 kick_stats; /* CPU kick count */ > +static u32 hibernate_stats; /* Hibernation count */ > + > +static int __init kvm_spinlock_debugfs(void) > +{ > + d_kvm_debug = debugfs_create_dir("kvm-guest", NULL); > + if (!d_kvm_debug) { > + printk(KERN_WARNING > + "Could not create 'kvm' debugfs directory\n"); > + return -ENOMEM; > + } > + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm_debug); > + > + debugfs_create_u32("kick_stats", 0644, d_spin_debug, &kick_stats); > + debugfs_create_u32("hibernate_stats", > + 0644, d_spin_debug, &hibernate_stats); > + return 0; > +} > + > +static inline void inc_kick_stats(void) > +{ > + add_smp(&kick_stats, 1); > +} > + > +static inline void inc_hib_stats(void) > +{ > + add_smp(&hibernate_stats, 1); > +} > + > +fs_initcall(kvm_spinlock_debugfs); > + > +#else /* CONFIG_KVM_DEBUG_FS */ > +static inline void inc_kick_stats(void) > +{ > +} > + > +static inline void inc_hib_stats(void) > +{ > + > +} > +#endif /* CONFIG_KVM_DEBUG_FS */ > + > +static void kvm_kick_cpu_type(int cpu) > +{ > + kvm_kick_cpu(cpu); > + inc_kick_stats(); > +} > + > +/* > + * Halt the current CPU & release it back to the host > + */ > +static void kvm_hibernate(void) > +{ > + unsigned long flags; > + > + if (in_nmi()) > + return; > + > + inc_hib_stats(); > + /* > + * Make sure an interrupt handler can't upset things in a > + * partially setup state. > + */ > + local_irq_save(flags); > + if (arch_irqs_disabled_flags(flags)) > + halt(); > + else > + safe_halt(); > + local_irq_restore(flags); > +} > +#endif /* !CONFIG_QUEUE_SPINLOCK */ > > /* > * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. > @@ -807,8 +884,13 @@ void __init kvm_spinlock_init(void) > if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) > return; > > +#ifdef CONFIG_QUEUE_SPINLOCK > + pv_lock_ops.kick_cpu = kvm_kick_cpu_type; > + pv_lock_ops.hibernate = kvm_hibernate; > +#else > pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning); > pv_lock_ops.unlock_kick = kvm_unlock_kick; > +#endif > } > > static __init int kvm_spinlock_init_jump(void) > diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks > index f185584..a70fdeb 100644 > --- a/kernel/Kconfig.locks > +++ b/kernel/Kconfig.locks > @@ -229,4 +229,4 @@ config ARCH_USE_QUEUE_SPINLOCK > > config QUEUE_SPINLOCK > def_bool y if ARCH_USE_QUEUE_SPINLOCK > - depends on SMP && !PARAVIRT_SPINLOCKS > + depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN) >
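One way to get the configuration Paolo describes is to drop the
unfair-lock key again once kvm_spinlock_init() knows that PV kick/halt
is available. The sketch below assumes kvm_spinlock_init() runs after
the early initcall that enabled the key; if the init ordering is
different, a flag consulted by that initcall would be needed instead.

void __init kvm_spinlock_init(void)
{
	if (!kvm_para_available())
		return;
	/* Without PV_UNHALT there is no kick/halt, keep the unfair lock */
	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
		return;

	/* PV qspinlock is usable here, so undo the unfair-lock enablement */
	if (static_key_enabled(&paravirt_unfairlocks_enabled))
		static_key_slow_dec(&paravirt_unfairlocks_enabled);

	pv_lock_ops.kick_cpu = kvm_kick_cpu_type;
	pv_lock_ops.hibernate = kvm_hibernate;
}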