On 2020/1/15 22:14, Marc Zyngier wrote:
> On 2020-01-13 12:12, Will Deacon wrote:
>> [+PeterZ]
>>
>> On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
>>> This patch set aims to support the vcpu_is_preempted() functionality
>>> under KVM/arm64, which allows the guest to find out whether a VCPU is
>>> currently running or not. This will enhance lock performance on
>>> overcommitted hosts (more runnable VCPUs than physical CPUs in the
>>> system), as busy-waiting on preempted VCPUs hurts system
>>> performance far more than yielding early.
>>>
>>> We have observed some performance improvements in unix benchmark tests.
>>>
>>> unix benchmark result:
>>>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>>>   guest: kernel 5.5.0-rc1, 16 VCPUs
>>>
>>>                test-case                |    after-patch    |   before-patch
>>> ----------------------------------------+-------------------+------------------
>>>  Dhrystone 2 using register variables   | 334600751.0 lps   | 335319028.3 lps
>>>  Double-Precision Whetstone             |     32856.1 MWIPS |     32849.6 MWIPS
>>>  Execl Throughput                       |      3662.1 lps   |      2718.0 lps
>>>  File Copy 1024 bufsize 2000 maxblocks  |    432906.4 KBps  |    158011.8 KBps
>>>  File Copy 256 bufsize 500 maxblocks    |    116023.0 KBps  |     37664.0 KBps
>>>  File Copy 4096 bufsize 8000 maxblocks  |   1432769.8 KBps  |    441108.8 KBps
>>>  Pipe Throughput                        |   6405029.6 lps   |   6021457.6 lps
>>>  Pipe-based Context Switching           |    185872.7 lps   |    184255.3 lps
>>>  Process Creation                       |      4025.7 lps   |      3706.6 lps
>>>  Shell Scripts (1 concurrent)           |      6745.6 lpm   |      6436.1 lpm
>>>  Shell Scripts (8 concurrent)           |       998.7 lpm   |       931.1 lpm
>>>  System Call Overhead                   |   3913363.1 lps   |   3883287.8 lps
>>> ----------------------------------------+-------------------+------------------
>>>  System Benchmarks Index Score          |      1835.1       |      1327.6
>>
>> Interesting, thanks for the numbers.
>>
>> So it looks like there is a decent improvement to be had from targeted vCPU
>> wakeup, but I really dislike the explicit PV interface and it's already been
>> shown to interact badly with the WFE-based polling in smp_cond_load_*().
>>
>> Rather than expose a divergent interface, I would instead like to explore an
>> improvement to smp_cond_load_*() and see how that performs before we commit
>> to something more intrusive. Marc and I looked at this very briefly in the
>> past, and the basic idea is to register all of the WFE sites with the
>> hypervisor, indicating which register contains the address being spun on
>> and which register contains the "bad" value. That way, you don't bother
>> rescheduling a vCPU if the value at the address is still bad, because you
>> know it will exit immediately.
>>
>> Of course, the devil is in the details because when I say "address", that's
>> a guest virtual address, so you need to play some tricks in the hypervisor
>> so that you have a separate mapping for the lockword (it's enough to keep
>> track of the physical address).
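
To make the check described above concrete, here is a minimal user-space
sketch in C. It is only an illustration under stated assumptions: the names
`struct wfe_site`, `struct vcpu_model`, `find_site()` and
`worth_rescheduling()` are hypothetical and are not taken from the pvcy
branch, the `lockword` pointer stands in for the hypervisor's separate
mapping of the lockword, translation of the guest virtual address held in
`addr_reg` is not modelled, and the spun-on value is assumed to be 64 bits
wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_GPRS 31 /* x0..x30 on arm64 */

/* Hypothetical record for one WFE polling site registered by the guest. */
struct wfe_site {
    uint64_t pc;           /* guest PC of the WFE instruction */
    unsigned int addr_reg; /* register holding the address being spun on */
    unsigned int bad_reg;  /* register holding the "bad" value */
};

/* Hypothetical snapshot of a vCPU that is currently not running. */
struct vcpu_model {
    uint64_t pc;
    uint64_t regs[NR_GPRS];
    const uint64_t *lockword; /* assumption: hypervisor-side mapping of the lockword */
};

/* Return the registered site matching this PC, or NULL if there is none. */
const struct wfe_site *find_site(const struct wfe_site *sites, size_t n,
                                 uint64_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (sites[i].pc == pc)
            return &sites[i];
    return NULL;
}

/*
 * Decide whether running the vCPU can make progress. A vCPU parked at a
 * registered WFE site is skipped while the lockword still holds the bad
 * value, because it would exit again immediately.
 */
bool worth_rescheduling(const struct wfe_site *sites, size_t n,
                        const struct vcpu_model *v)
{
    const struct wfe_site *s = find_site(sites, n, v->pc);

    if (!s)
        return true; /* not at a registered site: schedule as usual */
    assert(s->bad_reg < NR_GPRS);
    return *v->lockword != v->regs[s->bad_reg];
}
```

A scheduler loop would call `worth_rescheduling()` for each candidate vCPU
and skip the ones for which it returns false.
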
>>
>> Our hacks are here but we basically ran out of time to work on them beyond
>> an unoptimised and hacky prototype:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy
>>
>> Marc -- how would you prefer to handle this?
>
> Let me try and rebase this thing to a modern kernel (I doubt it applies without
> conflicts to mainline). We can then have a discussion about its merit on the list
> once I post it. It'd be good to have a pointer to the benchmarks that have been
> used here.

Hi Marc, Will,

My apologies for the slow reply. What is the latest status of this
PV cond yield prototype?

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

Recently, I redid the unixbench comparison between the vCPU preempted check
and PV cond yield. The results are as follows:

unix benchmark result:
  host:  kernel 5.10.0-rc6, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.10.0-rc6, 16 VCPUs

                                       | 5.10.0-rc6 | pv_cond_yield | vcpu_is_preempted
 System Benchmarks Index Values        |    INDEX   |      INDEX    |      INDEX
---------------------------------------+------------+---------------+-------------------
 Dhrystone 2 using register variables  |  29164.0   |    29156.9    |    29207.2
 Double-Precision Whetstone            |   6807.6   |     6789.2    |     6912.1
 Execl Throughput                      |    856.7   |     1195.6    |      863.1
 File Copy 1024 bufsize 2000 maxblocks |    189.9   |      923.5    |     1094.2
 File Copy 256 bufsize 500 maxblocks   |    121.9   |      578.4    |      588.7
 File Copy 4096 bufsize 8000 maxblocks |    419.9   |     1992.0    |     2733.7
 Pipe Throughput                       |   6727.2   |     6670.2    |     6743.2
 Pipe-based Context Switching          |    486.9   |      547.0    |      471.9
 Process Creation                      |    353.4   |      345.1    |      338.5
 Shell Scripts (1 concurrent)          |   3187.2   |     1432.2    |     2798.7
 Shell Scripts (8 concurrent)          |   3410.5   |     1360.1    |     2672.9
 System Call Overhead                  |   2967.0   |     3273.9    |     3497.9
---------------------------------------+------------+---------------+-------------------
 System Benchmarks Index Score         |   1410.0   |     1885.8    |     2128.5

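For reading the table, the relative change of each configuration against
the 5.10.0-rc6 baseline can be derived from the index values. A minimal
sketch in C follows; `rel_change_pct()` is a hypothetical helper name used
only for illustration.

```c
#include <assert.h>

/* Percentage change of 'score' relative to 'baseline' (hypothetical helper). */
double rel_change_pct(double baseline, double score)
{
    assert(baseline > 0.0);
    return (score - baseline) / baseline * 100.0;
}
```

For the overall index score, `rel_change_pct(1410.0, 1885.8)` comes to
roughly +33.7% for pv_cond_yield and `rel_change_pct(1410.0, 2128.5)` to
roughly +51.0% for vcpu_is_preempted. Applied to the two Shell Scripts
rows, it gives negative values for both configurations.
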
Thanks,
Zengruan
>
> Thanks,
>
>         M.