Jeremy Fitzhardinge
2009-Jan-20 20:45 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>>> Times I believe are in nanoseconds for lmbench; anyway, lower is
>>> better.
>>>
>>>   non pv     AVG=464.22  STD=5.56
>>>   paravirt   AVG=502.87  STD=7.36
>>>
>>> Nearly 10% performance drop here, which is quite a bit... hopefully
>>> people are testing the speed of their PV implementations against
>>> non-PV bare metal :)
>>>
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap performance
> stress-test' app I made years ago.
>
> The MM benchmark app can be downloaded from:
>
>   http://redhat.com/~mingo/misc/mmap-perf.c
>
> timec.c can be picked up from:
>
>   http://redhat.com/~mingo/perfcounters/timec.c
>
> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
> the mapped area as well with a certain chance. The patterns are
> pseudo-random, and the random seed is initialized to the same value, so
> repeated runs produce exactly the same mmap sequence.
>
> I ran the test with a single thread, bound to a single core:
>
>   # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1
>
> [ I ran it as root - so that kernel-space hardware-counter statistics
>   are included as well. ]
>
> The results are surprisingly candid about the true cost of paravirt_ops
> overhead on the native kernel (CONFIG_PARAVIRT=y):
>
> -----------------------------------------------
> | Performance counter stats for './mmap-perf' |
> -----------------------------------------------
> |
> |  x86-defconfig  |  PARAVIRT=y
> |------------------------------------------------------------------
> |
> |   1311.554526   |  1360.624932   task clock ticks (msecs)   +3.74%
> |
> |             1   |            1   CPU migrations
> |            91   |           79   context switches
> |         55945   |        55943   pagefaults
> | ............................................
> |    3781392474   |   3918777174   CPU cycles                 +3.63%
> |    1957153827   |   2161280486   instructions              +10.43%

!!

> |      50234816   |     51303520   cache references           +2.12%
> |       5428258   |      5583728   cache misses               +2.86%

Is this I or D, or combined?

> |
> |   1314.782469   |  1363.694447   time elapsed (msecs)       +3.72%
> |
> -----------------------------------
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.
>
> That's an increase of over 10%!

Yow! That's pretty awful. We knew that the static instruction count was
up, but we wouldn't have thought it would hit the dynamic instruction
count so much...

I think there are some immediate tweaks we can make to the code
generated for each call site, which will help to an extent.

    J
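For readers without the original sources to hand, a minimal standalone
sketch of an mmap stress-test in the same spirit might look like the
following. The constants, the mremap-vs-munmap mix and the ~50% touch
probability are illustrative assumptions, not the contents of Ingo's
actual mmap-perf.c:

    /*
     * Hedged sketch of a fixed-seed mmap()/munmap()/mremap() stress test.
     * Not Ingo's mmap-perf.c - just an approximation of the workload he
     * describes above.
     */
    #define _GNU_SOURCE            /* for mremap() and MAP_ANONYMOUS */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ITERATIONS 1000000
    #define MAX_PAGES  64
    #define PAGE_SZ    4096

    int main(void)
    {
        void *map = NULL;
        size_t size = 0;

        srandom(1);   /* fixed seed: identical mmap sequence on every run */

        for (long i = 0; i < ITERATIONS; i++) {
            size_t new_size = ((size_t)(random() % MAX_PAGES) + 1) * PAGE_SZ;

            if (!map) {
                map = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (map == MAP_FAILED) { perror("mmap"); return 1; }
                size = new_size;
            } else if (random() % 4 == 0) {
                /* occasionally grow/shrink the existing mapping */
                map = mremap(map, size, new_size, MREMAP_MAYMOVE);
                if (map == MAP_FAILED) { perror("mremap"); return 1; }
                size = new_size;
            } else {
                munmap(map, size);
                map = NULL;
            }

            /* touch the mapped area with roughly 50% probability */
            if (map && (random() & 1))
                memset(map, 0, size);
        }

        if (map)
            munmap(map, size);
        return 0;
    }

Built with plain gcc, it can be pinned and measured with the same
taskset/timec harness as above to compare a CONFIG_PARAVIRT=y kernel
against a non-PV one.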
Ingo Molnar
2009-Jan-20 20:56 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> [ ... ]
>>
>> |    3781392474   |   3918777174   CPU cycles                 +3.63%
>> |    1957153827   |   2161280486   instructions              +10.43%
>
> !!
>
>> |      50234816   |     51303520   cache references           +2.12%
>> |       5428258   |      5583728   cache misses               +2.86%
>
> Is this I or D, or combined?

That's last-level-cache references+misses (the L2 cache):

   Bit Position    Event Name      UMask   Event Select
   CPUID.AH.EBX
        3          LLC Reference    4FH        2EH
        4          LLC Misses       41H        2EH

	Ingo
Jeremy Fitzhardinge
2009-Jan-21 22:23 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Nick Piggin wrote:
> Oh, _llc_ references/misses? Ouch.
>
> You have, what, 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark
> is seeing L2 misses increase by nearly 3%. Hmm, I wonder where that is
> coming from? Instruction fetches?

I assume so. There should be no extra data accesses with CONFIG_PARAVIRT
(hm, there's probably some extra stack/spill traffic, but I surely hope
that's not falling out of cache).

    J
Jeremy Fitzhardinge
2009-Jan-22 22:44 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Zachary Amsden wrote:
> These fragments, from native_pgd_val, certainly don't help:
>
>   c0120f60:  55                     push   %ebp
>   c0120f61:  89 e5                  mov    %esp,%ebp
>   c0120f63:  5d                     pop    %ebp
>   c0120f64:  c3                     ret
>   c0120f65:  8d 74 26 00            lea    0x0(%esi,%eiz,1),%esi
>   c0120f69:  8d bc 27 00 00 00 00   lea    0x0(%edi,%eiz,1),%edi

Yes, that's a rather awful noop; compiling without frame pointers reduces
this to a single "ret".

> That is really disgusting. We absolutely should be patching away the
> function calls here in the native case... not sure we do that today.

I did have some patches to do that at one point. If you set pgd_val =
paravirt_nop, then the patching machinery will completely nop out the
call site. The problem is that this depends on the calling convention
using the same register for the first argument and the return value -
true for 32-bit, but not for 64-bit. We could fix that with identity
functions which the patcher recognizes and can replace with either pure
nops or the appropriate inline register moves.

Also, I just posted patches to get rid of all pvops calls when fetching
or setting flags in a pte, which I hope will help.

    J
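To make the idea concrete, here is a minimal sketch of that identity
patching, operating on a plain byte buffer rather than the kernel's real
patching machinery. The function name patch_ident_call_site, the 7-byte
site length and the single-byte nop padding are assumptions for
illustration only:

    /* Hedged sketch, not the kernel patcher: how a pvops-style patcher
     * could handle "identity" hooks such as native_pgd_val(). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PVOP_SITE_LEN 7          /* assumed patchable call-site length */
    #define X86_NOP       0x90       /* single-byte nop used for padding   */

    /* 48 89 f8 = mov %rdi,%rax : first argument becomes the return value */
    static const uint8_t mov_rdi_rax[] = { 0x48, 0x89, 0xf8 };

    /*
     * Overwrite a call site with code equivalent to an identity function.
     * On 32-bit the first arg and return value share %eax, so pure nops
     * suffice; on 64-bit we must move %rdi into %rax before padding.
     */
    static size_t patch_ident_call_site(uint8_t *site, size_t len, int is_64bit)
    {
        size_t used = 0;

        if (is_64bit) {
            memcpy(site, mov_rdi_rax, sizeof(mov_rdi_rax));
            used = sizeof(mov_rdi_rax);
        }
        memset(site + used, X86_NOP, len - used);  /* pad the rest with nops */
        return len;
    }

    int main(void)
    {
        /* pretend this is an indirect call: ff 15 <disp32>, padded to 7 bytes */
        uint8_t site[PVOP_SITE_LEN] = { 0xff, 0x15, 0, 0, 0, 0, 0x90 };

        patch_ident_call_site(site, sizeof(site), /* is_64bit = */ 1);

        for (size_t i = 0; i < sizeof(site); i++)
            printf("%02x ", site[i]);
        printf("\n");
        return 0;
    }

On 32-bit the argument and return value already share %eax, so the whole
site collapses to nops; on 64-bit the three-byte mov %rdi,%rax (48 89 f8)
does the register shuffle and the remainder is padding.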
Jeremy Fitzhardinge
2009-Jan-23 00:08 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
H. Peter Anvin wrote:
> Right now a number of the call sites contain a huge push/pop sequence
> followed by an indirect call. We can patch in the native code to avoid
> the branch overhead, but the register constraints and icache footprint
> are unchanged.

That's true for the pvops hooks emitted in the .S files, but not so true
for the ones in C code (well, there are no explicit push/pops, but the
presence of the call may cause the compiler to generate them). The .S
hooks can definitely be cleaned up, but I don't think that's germane to
Nick's observation that the mm code is showing slowdowns.

    J
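As a rough illustration of the C-side hook pattern being contrasted
here, consider the standalone sketch below. The struct and field names
echo the kernel's pv_mmu_ops/pte_val, but this is a simplified model of
the call structure, not the kernel code:

    #include <stdio.h>

    typedef unsigned long pteval_t;
    typedef struct { pteval_t pte; } pte_t;

    struct pv_mmu_ops {
        pteval_t (*pte_val)(pte_t pte);   /* hook invoked per pte access */
    };

    static pteval_t native_pte_val(pte_t pte)
    {
        return pte.pte;                   /* identity: value in == value out */
    }

    static struct pv_mmu_ops pv_mmu_ops = {
        .pte_val = native_pte_val,
    };

    /*
     * In this simplified model every pte_val() caller compiles to an
     * indirect call through the ops table, so the compiler must assume
     * all caller-saved registers are clobbered at each call site - that
     * is what produces the register pressure around the calls. Runtime
     * patching can replace the call itself with native code, but those
     * compiler-imposed constraints remain.
     */
    static inline pteval_t pte_val(pte_t pte)
    {
        return pv_mmu_ops.pte_val(pte);
    }

    int main(void)
    {
        pte_t pte = { .pte = 0x1234 };
        printf("pte_val = %#lx\n", pte_val(pte));
        return 0;
    }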
Jeremy Fitzhardinge
2009-Jan-23 00:14 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Zachary Amsden wrote:
> What about removing the identity functions entirely? They are useless,
> really. All that is needed is a patch site filled with nops for Xen to
> overwrite, just stuffing the value into the proper registers. For
> 64-bit, it can be a simple mov to satisfy the constraints.

I think it comes to the same thing, really. Both end up generating a
series of nops with values entering and leaving in well-defined
registers. The x86-64 calling convention is a bit awkward because the
first arg is in rdi and the return value is in rax, so it can't quite be
pure nops - either we patch in a mov, or we use a non-standard calling
convention with appropriate thunks to call into C code. I think a mov is
the better performance-complexity tradeoff.

>> Also, I just posted patches to get rid of all pvops calls when fetching
>> or setting flags in a pte, which I hope will help.
>
> Sounds like it will help.

...but apparently not.

    J
Ingo Molnar
2009-Jan-27 07:59 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>>> Also, I just posted patches to get rid of all pvops calls when
>>> fetching or setting flags in a pte, which I hope will help.
>>
>> Sounds like it will help.
>
> ...but apparently not.

ping?

This is a very serious paravirt_ops slowdown, affecting the native
kernel's performance to the tune of 5-10% in certain workloads.

It's been about 2 years since paravirt_ops went upstream, when you told
us that something like this would never happen, that paravirt_ops is
designed so flexibly that it will never hinder the native kernel - and
that if it does, it will be easy to fix. Now is the time to fulfill that
promise.

	Ingo
Jeremy Fitzhardinge
2009-Jan-27 08:24 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> This is a very serious paravirt_ops slowdown, affecting the native
> kernel's performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years since paravirt_ops went upstream, when you told
> us that something like this would never happen, that paravirt_ops is
> designed so flexibly that it will never hinder the native kernel - and
> that if it does, it will be easy to fix. Now is the time to fulfill that
> promise.

Yep, working on it.

    J
Jeremy Fitzhardinge
2009-Jan-27 10:17 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> ping?
>
> This is a very serious paravirt_ops slowdown, affecting the native
> kernel's performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years since paravirt_ops went upstream, when you told
> us that something like this would never happen, that paravirt_ops is
> designed so flexibly that it will never hinder the native kernel - and
> that if it does, it will be easy to fix. Now is the time to fulfill that
> promise.

I couldn't exactly reproduce your results, but I guess they're similar in
shape. Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (passes
1-5). Interestingly, I'm seeing identical instruction counts for pvops vs
non-pvops, and a lower cycle count. The cache references are way up and
the miss rate is up a bit, which I guess is the source of the slowdown.

With the attached patch, I get a clear improvement: it replaces the
do-nothing pte_val/make_pte functions with inlined movs that move the
argument into the return register, overpatching the 6-byte indirect call
(on i386 it would just be all nopped out). CPU cycles and cache misses
are way down, and the tick count is down from ~5% worse to ~2%. But the
cache reference rate is even higher, which really doesn't make sense to
me. Still, the patch is a clear improvement, and it's hard to see how it
could make anything worse (it always replaces an indirect call with
simple inlined code).

(Full numbers in the spreadsheet.)

I have a couple of other patches to reduce the register pressure of the
pvops calls, but I'm trying to work out how to make sure it's not all
too complex and/or fragile.

    J