On 03/25/2015 03:47 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:
>> Hi Waiman,
>>
>> As promised; here is the paravirt stuff I did during the trip to BOS
>> last week.
>>
>> All the !paravirt patches are more or less the same as before (the only
>> real change is the copyright lines in the first patch).
>>
>> The paravirt stuff is 'simple' and KVM only -- the Xen code was a little
>> more convoluted and I've no real way to test that but it should be
>> straightforward to make work.
>>
>> I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
>> overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has)
>> and it both booted and survived a hackbench run (perf bench sched messaging
>> -g 20 -l 5000).
>>
>> So while the paravirt code isn't the most optimal code ever conceived it
>> does work.
>>
>> Also, the paravirt patching includes replacing the call with
>> "movb $0, %arg1" for the native case, which should greatly reduce the
>> cost of having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.
> Ah nice. That could be spun out as a separate patch to optimize the
> existing ticket locks I presume.
The goal is to replace the ticket spinlock with the queue spinlock. We may
not want to support two different spinlock implementations in the kernel.
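
As a rough illustration of how the two lock types differ under contention, here is a minimal userspace sketch using C11 atomics. It is not the kernel code and not the 4-byte qspinlock from the patch series; it only puts a plain ticket lock next to a classic MCS-style queue lock, which is the queuing idea that qspinlock builds on. All struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Ticket lock: every waiter polls the same 'owner' word. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned int me = atomic_fetch_add(&l->next, 1);

    while (atomic_load(&l->owner) != me)
        ;  /* all waiters spin on one shared cache line */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);
}

/* MCS-style queue lock: each waiter spins on its own node. */
struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;
};

struct mcs_lock {
    struct mcs_node *_Atomic tail;
};

static void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
    struct mcs_node *prev;

    assert(me != NULL);
    atomic_store(&me->next, (struct mcs_node *)NULL);
    atomic_store(&me->locked, true);

    prev = atomic_exchange(&l->tail, me);
    if (prev == NULL)
        return;  /* queue was empty: lock acquired */

    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked))
        ;  /* spin on a private flag, not on the lock word */
}

static void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load(&me->next);

    if (next == NULL) {
        struct mcs_node *expected = me;

        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (struct mcs_node *)NULL))
            return;  /* nobody was waiting */
        while ((next = atomic_load(&me->next)) == NULL)
            ;  /* a waiter is in the middle of enqueueing */
    }
    atomic_store(&next->locked, false);  /* hand over to the next waiter */
}
```

In the ticket lock every release invalidates the cache line that all waiters are polling, while in the queue lock a release touches only the next waiter's node.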
>
> Now with the old pv ticketlock code a vCPU would only go to sleep once and
> be woken up when it was its turn. With this new code it is woken up twice
> (and twice it goes to sleep). With an overcommit scenario this would imply
> that we will have at least twice as many VMEXITs as with the previous code.
I did it differently in my PV portion of the qspinlock patch. Instead of
just waking up the CPU, the new lock holder will check if the new queue
head has been halted. If so, it will set the slowpath flag for the
halted queue head in the lock so as to wake it up at unlock time. This
should eliminate your concern about doing twice as many VMEXITs in an
overcommitted scenario.
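
To make that handoff easier to follow, here is a toy, single-threaded model of it in C. It is only a sketch of the idea as described above, not the code from the patch: the types, the field names and the way a "kick" is represented are all hypothetical, and there is no real atomicity or hypervisor call in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vcpu_state { VCPU_RUNNING, VCPU_HALTED };

/* Hypothetical stand-in for a waiting vCPU. */
struct toy_vcpu {
    enum vcpu_state state;
    int kicks;               /* how many times this vCPU was woken */
};

/* Hypothetical stand-in for the lock plus its queue head. */
struct toy_pv_lock {
    bool locked;
    bool slowpath;           /* set when the queue head is known to be halted */
    struct toy_vcpu *head;   /* waiter at the head of the queue, or NULL */
};

/* Called by the vCPU that has just become the lock holder. */
static void toy_lock_taken(struct toy_pv_lock *l, struct toy_vcpu *new_head)
{
    l->locked = true;
    l->head = new_head;
    /* Do not wake a halted head now; just remember to do it at unlock. */
    if (new_head != NULL && new_head->state == VCPU_HALTED)
        l->slowpath = true;
}

/* Unlock: kick the queue head only if the slowpath flag was set. */
static void toy_unlock(struct toy_pv_lock *l)
{
    assert(l->locked);
    l->locked = false;
    if (l->slowpath && l->head != NULL) {
        l->slowpath = false;
        l->head->state = VCPU_RUNNING;  /* stands in for a hypervisor kick */
        l->head->kicks++;
    }
}
```

In this model a halted queue head is woken exactly once, at the point where the lock is actually released to it, and a queue head that is still spinning is never kicked at all.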

BTW, I did some qspinlock vs. ticket spinlock benchmarks using the AIM7
high_systime workload on a 4-socket IvyBridge-EX system (60 cores, 120
threads) with some interesting results.

In terms of the performance benefit of this patch, I ran the
high_systime workload (which does a lot of fork() and exit())
at various load levels (500, 1000, 1500 and 2000 users) on a
4-socket IvyBridge-EX bare-metal system (60 cores, 120 threads)
with the intel_pstate driver and the performance scaling governor. The
JPM (jobs/minute) and execution time results were as follows:

  Kernel            JPM          Execution Time
  ------            ---          --------------
  At 500 users:
  3.19              118857.14    26.25s
  3.19-qspinlock    134889.75    23.13s
  % change          +13.5%       -11.9%

  At 1000 users:
  3.19              204255.32    30.55s
  3.19-qspinlock    239631.34    26.04s
  % change          +17.3%       -14.8%

  At 1500 users:
  3.19              177272.73    52.80s
  3.19-qspinlock    326132.40    28.70s
  % change          +84.0%       -45.6%

  At 2000 users:
  3.19              196690.31    63.45s
  3.19-qspinlock    341730.56    36.52s
  % change          +73.7%       -42.4%

It turns out that this workload was causing quite a lot of spinlock
contention in the vanilla 3.19 kernel. The performance advantage of
this patch increases with heavier loads.
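
For anyone who wants to recheck the "% change" rows, the arithmetic is just the relative difference against the vanilla kernel. A small C sketch, with the JPM figures copied from the table above (the struct and function names are only for this example):

```c
#include <assert.h>

/* Relative change of 'patched' against 'base', in percent. */
static double pct_change(double base, double patched)
{
    assert(base != 0.0);
    return (patched - base) / base * 100.0;
}

/* JPM figures from the performance-governor runs. */
struct jpm_row {
    int users;
    double vanilla;      /* 3.19 */
    double qspinlock;    /* 3.19-qspinlock */
};

static const struct jpm_row jpm_rows[] = {
    {  500, 118857.14, 134889.75 },
    { 1000, 204255.32, 239631.34 },
    { 1500, 177272.73, 326132.40 },
    { 2000, 196690.31, 341730.56 },
};
```

Calling pct_change(jpm_rows[i].vanilla, jpm_rows[i].qspinlock) for each row gives the JPM gains, and the same function applied to the execution times gives the negative percentages.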
With the powersave governor, the JPM data were as follows:

  Users    3.19         3.19-qspinlock    % Change
  -----    ----         --------------    --------
   500     112635.38    132596.69         +17.7%
  1000     171240.40    240369.80         +40.4%
  1500     130507.53    324436.74         +148.6%
  2000     175972.93    341637.01         +94.1%

With the qspinlock patch, there wasn't too much difference in
performance between the 2 scaling governors. Without this patch,
the powersave governor was much slower than the performance governor.

With the intel_pstate driver disabled and acpi_cpufreq used instead,
the benchmark performance (JPM) at the 1000-user level for the
performance and ondemand governors was:

  Governor       3.19         3.19-qspinlock    % Change
  --------       ----         --------------    --------
  performance    124949.94    219950.65         +76.0%
  ondemand         4838.90    206690.96         +4171%

The performance was just horrible when there was significant spinlock
contention with the ondemand governor. There was also significant
run-to-run variation. A second run of the same benchmark gave a result
of 22115 JPMs. With the qspinlock patch, however, the performance was
much more stable under different cpufreq drivers and governors. That
is not the case with the default ticket spinlock implementation.
The %CPU times spent on spinlock contention (from perf) with the
performance governor and the intel_pstate driver were:

  Kernel Function             3.19 kernel    3.19-qspinlock kernel
  ---------------             -----------    ---------------------
  At 500 users:
  _raw_spin_lock*             28.23%         2.25%
  queue_spin_lock_slowpath    N/A            4.05%

  At 1000 users:
  _raw_spin_lock*             23.21%         2.25%
  queue_spin_lock_slowpath    N/A            4.42%

  At 1500 users:
  _raw_spin_lock*             29.07%         2.24%
  queue_spin_lock_slowpath    N/A            4.49%

  At 2000 users:
  _raw_spin_lock*             29.15%         2.26%
  queue_spin_lock_slowpath    N/A            4.82%

The top spinlock related entries in the perf profile for the 3.19
kernel at 1000 users were:
7.40% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--58.96%-- rwsem_wake
|--20.02%-- release_pages
|--15.88%-- pagevec_lru_move_fn
|--1.53%-- get_page_from_freelist
|--0.78%-- __wake_up
|--0.55%-- try_to_wake_up
--2.28%-- [...]
3.13% reaim [kernel.kallsyms] [k] _raw_spin_lock
|--37.55%-- free_one_page
|--17.47%-- __cache_free_alien
|--4.95%-- __rcu_process_callbacks
|--2.93%-- __pte_alloc
|--2.68%-- __drain_alien_cache
|--2.56%-- ext4_do_update_inode
|--2.54%-- try_to_wake_up
|--2.46%-- pgd_free
|--2.32%-- cache_alloc_refill
|--2.32%-- pgd_alloc
|--2.32%-- free_pcppages_bulk
|--1.88%-- do_wp_page
|--1.77%-- handle_pte_fault
|--1.58%-- do_anonymous_page
|--1.56%-- rmqueue_bulk.clone.0
|--1.35%-- copy_pte_range
|--1.25%-- zap_pte_range
|--1.13%-- cache_flusharray
|--0.88%-- __pmd_alloc
|--0.70%-- wake_up_new_task
|--0.66%-- __pud_alloc
|--0.59%-- ext4_discard_preallocations
--6.53%-- [...]
With the qspinlock patch, the perf profile at 1000 users was:
3.25% reaim [kernel.kallsyms] [k] queue_spin_lock_slowpath
|--62.00%-- _raw_spin_lock_irqsave
| |--77.49%-- rwsem_wake
| |--11.99%-- release_pages
| |--4.34%-- pagevec_lru_move_fn
| |--1.93%-- get_page_from_freelist
| |--1.90%-- prepare_to_wait_exclusive
| |--1.29%-- __wake_up
| |--0.74%-- finish_wait
|--11.63%-- _raw_spin_lock
| |--31.11%-- try_to_wake_up
| |--7.77%-- free_pcppages_bulk
| |--7.12%-- __drain_alien_cache
| |--6.17%-- rmqueue_bulk.clone.0
| |--4.17%-- __rcu_process_callbacks
| |--2.22%-- cache_alloc_refill
| |--2.15%-- wake_up_new_task
| |--1.80%-- ext4_do_update_inode
| |--1.52%-- cache_flusharray
| |--0.89%-- __mutex_unlock_slowpath
| |--0.64%-- ttwu_queue
|--11.19%-- _raw_spin_lock_irq
| |--98.95%-- rwsem_down_write_failed
| |--0.93%-- __schedule
|--7.91%-- queue_read_lock_slowpath
| _raw_read_lock
| |--96.79%-- do_wait
| |--2.44%-- do_prlimit
| chrdev_open
| do_dentry_open
| vfs_open
| do_last
| path_openat
| do_filp_open
| do_sys_open
| sys_open
| system_call
| __GI___libc_open
|--7.05%-- queue_write_lock_slowpath
| _raw_write_lock_irq
| |--35.36%-- release_task
| |--32.76%-- copy_process
| do_exit
| do_group_exit
| sys_exit_group
| system_call
--0.22%-- [...]
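
When comparing the two profiles, it helps to turn the nested percentages into a share of total CPU time. The sketch below assumes that each call-graph percentage is relative to its parent entry, which is how the listings read since the children of each entry add up to roughly 100%. The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Multiply a top-level profile percentage by a chain of
 * parent-relative call-graph percentages. The result is the
 * share of total samples, in percent, for that call path.
 */
static double chain_share(double top_pct, const double *rel_pct, size_t n)
{
    double share = top_pct;
    size_t i;

    assert(rel_pct != NULL || n == 0);
    for (i = 0; i < n; i++)
        share *= rel_pct[i] / 100.0;
    return share;
}
```

Under that assumption, rwsem_wake spinning via _raw_spin_lock_irqsave is 7.40% x 58.96%, or about 4.4% of all samples on vanilla 3.19, and 3.25% x 62.00% x 77.49%, or about 1.6%, with the qspinlock patch.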

This demonstrates the benefit of this patch for those applications
that run on multi-socket machines and can cause significant spinlock
contention in the kernel.