Jeremy Fitzhardinge
2009-Jan-20 20:45 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>>> Times I believe are in nanoseconds for lmbench; anyway, lower is
>>> better.
>>>
>>>   non pv     AVG=464.22  STD=5.56
>>>   paravirt   AVG=502.87  STD=7.36
>>>
>>> Nearly 10% performance drop here, which is quite a bit... hopefully
>>> people are testing the speed of their PV implementations against
>>> non-PV bare metal :)
>>>
>> Ouch, that looks unacceptably expensive. All the major distros turn
>> CONFIG_PARAVIRT on. paravirt_ops was introduced in x86 with the express
>> promise to have no measurable runtime overhead.
>
> Here are some more precise stats done via hw counters on a perfcounters
> kernel using 'timec', running a modified version of the 'mmap performance
> stress-test' app I made years ago.
>
> The MM benchmark app can be downloaded from:
>
>   http://redhat.com/~mingo/misc/mmap-perf.c
>
> timec.c can be picked up from:
>
>   http://redhat.com/~mingo/perfcounters/timec.c
>
> mmap-perf conducts 1 million mmap()/munmap()/mremap() calls, and touches
> the mapped area as well with a certain chance. The patterns are
> pseudo-random, and the random seed is initialized to the same value, so
> repeated runs produce exactly the same mmap sequence.
>
> I ran the test with a single thread, bound to a single core:
>
>   # taskset 2 timec -e -5,-4,-3,0,1,2,3 ./mmap-perf 1
>
> [ I ran it as root - so that kernel-space hardware-counter statistics
>   are included as well. ]
>
> The results are surprisingly candid about the true cost of paravirt_ops
> overhead on the native kernel (CONFIG_PARAVIRT=y):
>
> -----------------------------------------------
> | Performance counter stats for './mmap-perf' |
> -----------------------------------------------
> |
> |  x86-defconfig  |  PARAVIRT=y
> |------------------------------------------------------------------
> |
> |   1311.554526   |  1360.624932   task clock ticks (msecs)   +3.74%
> |
> |             1   |            1   CPU migrations
> |            91   |           79   context switches
> |         55945   |        55943   pagefaults
> | ............................................
> |    3781392474   |   3918777174   CPU cycles                 +3.63%
> |    1957153827   |   2161280486   instructions              +10.43%

!!

> |      50234816   |     51303520   cache references           +2.12%
> |       5428258   |      5583728   cache misses               +2.86%

Is this I or D, or combined?

> |
> |   1314.782469   |  1363.694447   time elapsed (msecs)       +3.72%
> |
> -----------------------------------
>
> The most surprising element is that in the paravirt_ops case we run 204
> million more instructions - out of the ~2000 million instructions total.
>
> That's an increase of over 10%!

Yow! That's pretty awful. We knew that the static instruction count was
up, but we wouldn't have thought it would hit the dynamic instruction
count so much...

I think there are some immediate tweaks we can make to the code
generated for each call site, which will help to an extent.

    J
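For readers without the original sources to hand, a minimal standalone
sketch of an mmap stress-test in the same spirit might look like the
following. The constants, the mremap-vs-munmap mix and the ~50% touch
probability are illustrative assumptions, not the contents of Ingo's
actual mmap-perf.c:

    /*
     * Hedged sketch of a fixed-seed mmap()/munmap()/mremap() stress test.
     * Not Ingo's mmap-perf.c - just an approximation of the workload he
     * describes above.
     */
    #define _GNU_SOURCE            /* for mremap() and MAP_ANONYMOUS */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ITERATIONS 1000000
    #define MAX_PAGES  64
    #define PAGE_SZ    4096

    int main(void)
    {
        void *map = NULL;
        size_t size = 0;

        srandom(1);   /* fixed seed: identical mmap sequence on every run */

        for (long i = 0; i < ITERATIONS; i++) {
            size_t new_size = ((size_t)(random() % MAX_PAGES) + 1) * PAGE_SZ;

            if (!map) {
                map = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (map == MAP_FAILED) { perror("mmap"); return 1; }
                size = new_size;
            } else if (random() % 4 == 0) {
                /* occasionally grow/shrink the existing mapping */
                map = mremap(map, size, new_size, MREMAP_MAYMOVE);
                if (map == MAP_FAILED) { perror("mremap"); return 1; }
                size = new_size;
            } else {
                munmap(map, size);
                map = NULL;
            }

            /* touch the mapped area with roughly 50% probability */
            if (map && (random() & 1))
                memset(map, 0, size);
        }

        if (map)
            munmap(map, size);
        return 0;
    }

Built with plain gcc, it can be pinned and measured with the same
taskset/timec harness as above to compare a CONFIG_PARAVIRT=y kernel
against a non-PV one.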
Ingo Molnar
2009-Jan-20 20:56 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
>> [ ... ]
>>
>> |    3781392474   |   3918777174   CPU cycles                 +3.63%
>> |    1957153827   |   2161280486   instructions              +10.43%
>
> !!
>
>> |      50234816   |     51303520   cache references           +2.12%
>> |       5428258   |      5583728   cache misses               +2.86%
>
> Is this I or D, or combined?

That's last-level-cache references+misses (the L2 cache):

   Bit Position    Event Name      UMask   Event Select
   CPUID.AH.EBX
        3          LLC Reference    4FH        2EH
        4          LLC Misses       41H        2EH

	Ingo
Jeremy Fitzhardinge
2009-Jan-21 22:23 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Nick Piggin wrote:
> Oh, _llc_ references/misses? Ouch.
>
> You have, what, 32K L1I, 32K L1D, and 4MB L2? And even this microbenchmark
> is seeing L2 misses increase by nearly 3%. Hmm, I wonder where that is
> coming from? Instruction fetches?

I assume so. There should be no extra data accesses with CONFIG_PARAVIRT
(hm, there's probably some extra stack/spill traffic, but I surely hope
that's not falling out of cache).

    J
Jeremy Fitzhardinge
2009-Jan-22 22:44 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Zachary Amsden wrote:
> These fragments, from native_pgd_val, certainly don't help:
>
>   c0120f60:  55                     push   %ebp
>   c0120f61:  89 e5                  mov    %esp,%ebp
>   c0120f63:  5d                     pop    %ebp
>   c0120f64:  c3                     ret
>   c0120f65:  8d 74 26 00            lea    0x0(%esi,%eiz,1),%esi
>   c0120f69:  8d bc 27 00 00 00 00   lea    0x0(%edi,%eiz,1),%edi

Yes, that's a rather awful noop; compiling without frame pointers reduces
this to a single "ret".

> That is really disgusting. We absolutely should be patching away the
> function calls here in the native case... not sure we do that today.

I did have some patches to do that at one point. If you set pgd_val =
paravirt_nop, then the patching machinery will completely nop out the
call site. The problem is that this depends on the calling convention
using the same register for the first argument and the return value -
true for 32-bit, but not for 64-bit. We could fix that with identity
functions which the patcher recognizes and can replace with either pure
nops or the appropriate inline register moves.

Also, I just posted patches to get rid of all pvops calls when fetching
or setting flags in a pte, which I hope will help.

    J
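To make the idea concrete, here is a minimal sketch of that identity
patching, operating on a plain byte buffer rather than the kernel's real
patching machinery. The function name patch_ident_call_site, the 7-byte
site length and the single-byte nop padding are assumptions for
illustration only:

    /* Hedged sketch, not the kernel patcher: how a pvops-style patcher
     * could handle "identity" hooks such as native_pgd_val(). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define PVOP_SITE_LEN 7          /* assumed patchable call-site length */
    #define X86_NOP       0x90       /* single-byte nop used for padding   */

    /* 48 89 f8 = mov %rdi,%rax : first argument becomes the return value */
    static const uint8_t mov_rdi_rax[] = { 0x48, 0x89, 0xf8 };

    /*
     * Overwrite a call site with code equivalent to an identity function.
     * On 32-bit the first arg and return value share %eax, so pure nops
     * suffice; on 64-bit we must move %rdi into %rax before padding.
     */
    static size_t patch_ident_call_site(uint8_t *site, size_t len, int is_64bit)
    {
        size_t used = 0;

        if (is_64bit) {
            memcpy(site, mov_rdi_rax, sizeof(mov_rdi_rax));
            used = sizeof(mov_rdi_rax);
        }
        memset(site + used, X86_NOP, len - used);  /* pad the rest with nops */
        return len;
    }

    int main(void)
    {
        /* pretend this is an indirect call: ff 15 <disp32>, padded to 7 bytes */
        uint8_t site[PVOP_SITE_LEN] = { 0xff, 0x15, 0, 0, 0, 0, 0x90 };

        patch_ident_call_site(site, sizeof(site), /* is_64bit = */ 1);

        for (size_t i = 0; i < sizeof(site); i++)
            printf("%02x ", site[i]);
        printf("\n");
        return 0;
    }

On 32-bit the argument and return value already share %eax, so the whole
site collapses to nops; on 64-bit the three-byte mov %rdi,%rax (48 89 f8)
does the register shuffle and the remainder is padding.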
Jeremy Fitzhardinge
2009-Jan-23 00:08 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
H. Peter Anvin wrote:
> Right now a number of the call sites contain a huge push/pop sequence
> followed by an indirect call. We can patch in the native code to avoid
> the branch overhead, but the register constraints and icache footprint
> are unchanged.

That's true for the pvops hooks emitted in the .S files, but not so true
for the ones in C code (well, there are no explicit push/pops, but the
presence of the call may cause the compiler to generate them). The .S
hooks can definitely be cleaned up, but I don't think that's germane to
Nick's observation that the mm code is showing slowdowns.

    J
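As a rough illustration of the C-side hook pattern being contrasted
here, consider the standalone sketch below. The struct and field names
echo the kernel's pv_mmu_ops/pte_val, but this is a simplified model of
the call structure, not the kernel code:

    #include <stdio.h>

    typedef unsigned long pteval_t;
    typedef struct { pteval_t pte; } pte_t;

    struct pv_mmu_ops {
        pteval_t (*pte_val)(pte_t pte);   /* hook invoked per pte access */
    };

    static pteval_t native_pte_val(pte_t pte)
    {
        return pte.pte;                   /* identity: value in == value out */
    }

    static struct pv_mmu_ops pv_mmu_ops = {
        .pte_val = native_pte_val,
    };

    /*
     * In this simplified model every pte_val() caller compiles to an
     * indirect call through the ops table, so the compiler must assume
     * all caller-saved registers are clobbered at each call site - that
     * is what produces the register pressure around the calls. Runtime
     * patching can replace the call itself with native code, but those
     * compiler-imposed constraints remain.
     */
    static inline pteval_t pte_val(pte_t pte)
    {
        return pv_mmu_ops.pte_val(pte);
    }

    int main(void)
    {
        pte_t pte = { .pte = 0x1234 };
        printf("pte_val = %#lx\n", pte_val(pte));
        return 0;
    }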
Jeremy Fitzhardinge
2009-Jan-23 00:14 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Zachary Amsden wrote:
> What about removing the identity functions entirely? They are useless,
> really. All that is needed is a patch site filled with nops for Xen to
> overwrite, just stuffing the value into the proper registers. For
> 64-bit, it can be a simple mov to satisfy the constraints.

I think it comes to the same thing, really. Both end up generating a
series of nops with values entering and leaving in well-defined
registers. The x86-64 calling convention is a bit awkward because the
first arg is in rdi and the return value is in rax, so it can't quite be
pure nops - either we patch in a mov, or we use a non-standard calling
convention with appropriate thunks to call into C code. I think a mov is
the better performance-complexity tradeoff.

>> Also, I just posted patches to get rid of all pvops calls when fetching
>> or setting flags in a pte, which I hope will help.
>
> Sounds like it will help.

...but apparently not.

    J
Ingo Molnar
2009-Jan-27 07:59 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>>> Also, I just posted patches to get rid of all pvops calls when
>>> fetching or setting flags in a pte, which I hope will help.
>>
>> Sounds like it will help.
>
> ...but apparently not.

ping?

This is a very serious paravirt_ops slowdown, affecting the native
kernel's performance to the tune of 5-10% in certain workloads.

It's been about 2 years since paravirt_ops went upstream, when you told
us that something like this would never happen, that paravirt_ops is
designed so flexibly that it will never hinder the native kernel - and
that if it does, it will be easy to fix. Now is the time to fulfill that
promise.

	Ingo
Jeremy Fitzhardinge
2009-Jan-27 08:24 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> This is a very serious paravirt_ops slowdown, affecting the native
> kernel's performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years since paravirt_ops went upstream, when you told
> us that something like this would never happen, that paravirt_ops is
> designed so flexibly that it will never hinder the native kernel - and
> that if it does, it will be easy to fix. Now is the time to fulfill that
> promise.

Yep, working on it.

    J
Jeremy Fitzhardinge
2009-Jan-27 10:17 UTC
[Xen-devel] Re: lmbench lat_mmap slowdown with CONFIG_PARAVIRT
Ingo Molnar wrote:
> ping?
>
> This is a very serious paravirt_ops slowdown, affecting the native
> kernel's performance to the tune of 5-10% in certain workloads.
>
> It's been about 2 years since paravirt_ops went upstream, when you told
> us that something like this would never happen, that paravirt_ops is
> designed so flexibly that it will never hinder the native kernel - and
> that if it does, it will be easy to fix. Now is the time to fulfill that
> promise.

I couldn't exactly reproduce your results, but I guess they're similar in
shape. Comparing 2.6.29-rc2-nopv with -pvops, I saw this ratio (passes
1-5). Interestingly, I'm seeing identical instruction counts for pvops vs
non-pvops, and a lower cycle count. The cache references are way up and
the miss rate is up a bit, which I guess is the source of the slowdown.

With the attached patch, I get a clear improvement: it replaces the
do-nothing pte_val/make_pte functions with inlined movs that move the
argument into the return register, overpatching the 6-byte indirect call
(on i386 it would just be all nopped out). CPU cycles and cache misses
are way down, and the tick count is down from ~5% worse to ~2%. But the
cache reference rate is even higher, which really doesn't make sense to
me. Still, the patch is a clear improvement, and it's hard to see how it
could make anything worse (it always replaces an indirect call with
simple inlined code).

(Full numbers in the spreadsheet.)

I have a couple of other patches to reduce the register pressure of the
pvops calls, but I'm trying to work out how to make sure it's not all
too complex and/or fragile.

    J