thr3ads.net - Linux Virtualization - [PATCH 0/6] x86: reduce paravirtualized spinlock overhead [May 2015]

Juergen Gross

2015-May-04 05:55 UTC

[PATCH 0/6] x86: reduce paravirtualized spinlock overhead

On 04/30/2015 06:39 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2015 03:53 AM, Juergen Gross wrote:
>> Paravirtualized spinlocks produce some overhead even if the kernel is
>> running on bare metal. The main reason is the more complex locking
>> and unlocking functions. Especially unlocking is no longer just one
>> instruction but so complex that it is no longer inlined.
>>
>> This patch series addresses this issue by adding two more pvops
>> functions to reduce the size of the inlined spinlock functions. When
>> running on bare metal unlocking is again basically one instruction.
>
> Out of curiosity, is there a measurable difference?

I did a small measurement of the pure locking functions on bare metal
without and with my patches.

spin_lock() for the first time (lock and code not in cache) dropped from
about 600 to 500 cycles.

spin_unlock() for the first time dropped from 145 to 87 cycles.

spin_lock() in a loop dropped from 48 to 45 cycles.

spin_unlock() in the same loop dropped from 24 to 22 cycles.
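
To put the four measurements above on a common scale, here is a minimal sketch that converts each before/after pair into cycles saved and a relative reduction. The cycle counts are the ones quoted in this message (the cold spin_lock() figures are stated as approximate); the dictionary keys and the helper name `reduction` are illustrative only and are not part of the patch series.

```python
# Cycle counts quoted in this message: (before, after) for each measurement.
# The cold spin_lock() values are given as "about 600 to 500" in the text.
measurements = {
    "spin_lock, cold": (600, 500),
    "spin_unlock, cold": (145, 87),
    "spin_lock, loop": (48, 45),
    "spin_unlock, loop": (24, 22),
}


def reduction(before, after):
    """Return (cycles saved, percent saved relative to 'before')."""
    saved = before - after
    return saved, 100.0 * saved / before


# Hypothetical helper structure: maps each label to its (saved, percent) pair.
results = {name: reduction(b, a) for name, (b, a) in measurements.items()}

for name, (saved, pct) in results.items():
    print(f"{name:18s} {saved:4d} cycles saved ({pct:.1f}%)")
```

Read this way, the largest relative gain is in the cold spin_unlock() case, while the in-loop cases save only two or three cycles per call.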


Juergen

Jeremy Fitzhardinge

2015-May-05 17:21 UTC

[PATCH 0/6] x86: reduce paravirtualized spinlock overhead

On 05/03/2015 10:55 PM, Juergen Gross wrote:
> I did a small measurement of the pure locking functions on bare metal
> without and with my patches.
>
> spin_lock() for the first time (lock and code not in cache) dropped from
> about 600 to 500 cycles.
>
> spin_unlock() for the first time dropped from 145 to 87 cycles.
>
> spin_lock() in a loop dropped from 48 to 45 cycles.
>
> spin_unlock() in the same loop dropped from 24 to 22 cycles.

Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
main difference will be whether the branch predictor is warmed up rather
than whether the lock itself is in dcache, but it's much more likely that
the lock code is in icache if the code is lock intensive, making the cold
case moot. But that's pure speculation.

Could you see any differences in workloads beyond microbenchmarks?

Not that it's my call at all, but I think we'd need to see some concrete
improvements in real workloads before adding the complexity of more pvops.

    J

Juergen Gross

2015-May-06 11:55 UTC

[PATCH 0/6] x86: reduce paravirtualized spinlock overhead

On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
> On 05/03/2015 10:55 PM, Juergen Gross wrote:
>> I did a small measurement of the pure locking functions on bare metal
>> without and with my patches.
>>
>> spin_lock() for the first time (lock and code not in cache) dropped from
>> about 600 to 500 cycles.
>>
>> spin_unlock() for the first time dropped from 145 to 87 cycles.
>>
>> spin_lock() in a loop dropped from 48 to 45 cycles.
>>
>> spin_unlock() in the same loop dropped from 24 to 22 cycles.
>
> Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
> main difference will be whether the branch predictor is warmed up rather
> than whether the lock itself is in dcache, but it's much more likely that
> the lock code is in icache if the code is lock intensive, making the cold case
> moot. But that's pure speculation.
>
> Could you see any differences in workloads beyond microbenchmarks?
>
> Not that it's my call at all, but I think we'd need to see some concrete
> improvements in real workloads before adding the complexity of more pvops.

I did another test on a larger machine:

25 kernel builds (time make -j 32) on a 32 core machine. Before each
build, "make clean" was called; the first result after boot was omitted
to avoid disk cache warmup effects.

System time without my patches: 861.5664 +/- 3.3665 s
System time with my patches:    852.2269 +/- 3.6629 s
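
As a rough way to read these two figures, the sketch below computes the absolute and relative difference in system time and compares it with the two quoted spreads combined in quadrature. The message does not say whether "+/-" denotes a standard deviation or a standard error of the mean, so the final ratio is only an assumption-laden indication, not a significance test; all variable names are illustrative.

```python
import math

# System times quoted in this message, in seconds: (mean, quoted spread).
without_mean, without_spread = 861.5664, 3.3665
with_mean, with_spread = 852.2269, 3.6629

# Absolute and relative reduction in system time with the patches applied.
diff = without_mean - with_mean
rel_percent = 100.0 * diff / without_mean

# Assumption: the two spreads are independent and can be combined in
# quadrature. Whether they are standard deviations or standard errors
# is not stated in the message, so interpret the ratio with care.
combined_spread = math.hypot(without_spread, with_spread)
ratio = diff / combined_spread

print(f"difference:      {diff:.4f} s ({rel_percent:.2f}% of unpatched time)")
print(f"combined spread: {combined_spread:.4f} s")
print(f"difference / combined spread: {ratio:.2f}")
```

If the quoted spreads are per-run standard deviations, the uncertainty of each mean would be smaller by roughly the square root of the number of runs, and the difference would stand out more clearly than this ratio suggests.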


Juergen
