Jeremy Fitzhardinge
2009-May-14 00:16 UTC
[Xen-devel] Performance overhead of paravirt_ops on native identified
Hi Ingo,

Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks; the extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (it seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return sequence, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons).

This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed).

If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline:

			nopv		Pv-nospin	Pv-spin
	CPU cycles	100.00%		99.89%		102.18%
	instructions	100.00%		100.10%		100.15%
	CPI		100.00%		99.79%		102.03%
	cache ref	100.00%		100.84%		100.28%
	cache miss	100.00%		90.47%		88.56%
	cache miss rate	100.00%		89.72%		88.31%
	branches	100.00%		99.93%		100.04%
	branch miss	100.00%		103.66%		107.72%
	branch miss rt	100.00%		103.73%		107.67%
	wallclock	100.00%		99.90%		102.20%

The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time.

(The other interesting effect is that the more ops are out-of-line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flip side, the branch misses go up correspondingly...)

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  *0xffffffff805a5b30
	<_spin_lock+22>: retq

The indirect call will get patched to:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  <__ticket_spin_lock>
	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
	<_spin_lock+22>: retq

One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case (a rough sketch of this appears after the patch below). The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and it leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (a 10% saving of overall system CPU when guests block rather than spin). Still, it is a reasonable short-term workaround.

The best solution would be to work out whether this really is a problem interaction with Intel's pipelines, and come up with something that avoids it. It would be very interesting to see if there's a similar hit on AMD systems.

	J

From 839033e472c8f3b228be35e57a8b31fbb7f9cf98 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Date: Wed, 13 May 2009 11:58:17 -0700
Subject: [PATCH] x86: add config to disable PV spinlocks

Paravirtualized spinlocks seem to cause a 2-7% performance hit when
running a pvops kernel native.  Without them, the pvops kernel is
identical to a non-pvops kernel in performance.

[ Impact: reduce overhead of pvops when running native ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5f50179..a99ed71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,17 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_SPINLOCKS
+	bool "Enable paravirtualized spinlocks"
+	depends on PARAVIRT && SMP
+	default XEN
+	---help---
+	  Paravirtualized spinlocks allow a pvops backend to replace the
+	  spinlock implementation with something virtualization-friendly
+	  (for example, block the virtual CPU rather than spinning).
+	  Unfortunately the downside is an as-yet unexplained performance
+	  hit when running native.
+
 config PARAVIRT_CLOCK
 	bool
 	default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1fe5837..4fb37c8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
 }
 
-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
 	__raw_spin_lock(lock);
 }
 
-#endif
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 68a4ff6..4f78bd6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -90,7 +90,8 @@ obj-$(CONFIG_DEBUG_NX_TEST)	+= test_nx.o
 obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o
 obj-$(CONFIG_KVM_CLOCK)		+= kvmclock.o
-obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index aa34423..70ec9b9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
 		.pv_irq_ops = pv_irq_ops,
 		.pv_apic_ops = pv_apic_ops,
 		.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
+#endif
 	};
 	return *((void **)&tmpl + type);
 }
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y		:= enlighten.o setup.o multicalls.o mmu.o irq.o \
 			time.o xen-asm.o xen-asm_$(BITS).o \
 			grant-table.o suspend.o
 
-obj-$(CONFIG_SMP)		+= smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP)		+= smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 5c50a10..22494fd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -61,15 +61,26 @@ void xen_setup_vcpu_info_placement(void);
 
 #ifdef CONFIG_SMP
 void xen_smp_init(void);
 
-void __init xen_init_spinlocks(void);
-__cpuinit void xen_init_lock_cpu(int cpu);
-void xen_uninit_lock_cpu(int cpu);
-
 extern cpumask_var_t xen_cpu_initialized_map;
 #else
 static inline void xen_smp_init(void) {}
 #endif
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init xen_init_spinlocks(void);
+__cpuinit void xen_init_lock_cpu(int cpu);
+void xen_uninit_lock_cpu(int cpu);
+#else
+static inline void xen_init_spinlocks(void)
+{
+}
+static inline void xen_init_lock_cpu(int cpu)
+{
+}
+static inline void xen_uninit_lock_cpu(int cpu)
+{
+}
+#endif
 
 /* Declare an asm function, along with symbols needed to make it inlineable */
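A rough sketch of the "inline _spin_lock" option mentioned above, for the !debug case only (illustrative; the real spin_lock()/spinlock_api_smp.h definitions go through several more macro layers, and the wrapper name here is invented):

/*
 * Hypothetical sketch, not the actual kernel definition: if _spin_lock were
 * made an inline wrapper when no spinlock/preempt debugging is enabled, each
 * callsite would expand to preempt_disable() plus a single patchable call
 * into the lock implementation, instead of the current call->call->ret->ret
 * chain.
 */
static __always_inline void inlined_spin_lock(spinlock_t *lock)
{
	preempt_disable();			/* replicated at every callsite */
	__raw_spin_lock(&lock->raw_lock);	/* pvops hook; patched to a direct
						 * call to __ticket_spin_lock
						 * when running native */
}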
H. Peter Anvin
2009-May-14 01:10 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
>
> So, what's the fix?
>
> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>	<_spin_lock+22>: retq
>
> The indirect call will get patched to:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>	<_spin_lock+22>: retq
>
> One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.

That brings me to a somewhat half-arsed thought I have been walking around with for a while.

Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:

 1. [minor] no additional nop padding due to indirect function calls.
 2. [major] no need for a ton of wrapper macros manifest in the code.

paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.

	-hpa

(*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.
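To make the idea concrete, a minimal sketch of what such a link-time-generated patch-site table and call retargeting could look like (hypothetical names and layout; this is not an existing kernel facility):

#include <stdint.h>
#include <string.h>

/* One record per direct "call rel32" site found by the vmlinux.o pass. */
struct call_site {
	uint8_t *addr;		/* address of the 0xe8 opcode byte */
};

/*
 * Retarget one call site.  The rel32 operand of a call/jmp is relative to
 * the address of the *next* instruction (addr + 5).  Real kernel code would
 * also have to deal with write protection and SMP synchronization
 * (text_poke()-style); this only shows the arithmetic.
 */
static void retarget_call(struct call_site *cs, void *target)
{
	int32_t rel = (int32_t)((uint8_t *)target - (cs->addr + 5));

	memcpy(cs->addr + 1, &rel, sizeof(rel));
}

By default every such call would point at a stub that simply does an indirect jump through the pvops structure; once the backend is known, each recorded site is rewritten to call the chosen implementation directly.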
Jan Beulich
2009-May-14 08:05 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 14.05.09 02:16 >>>
> One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.
>
> The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround.

Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path. The only open issue we currently have is that while for native keeping interrupts disabled while spinning may be acceptable (though I'm not sure how the -rt folks view this), in a pv environment one should really re-enable interrupts here due to the potentially much higher latency.

Jan
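Schematically, "ticket locks everywhere, with an indirection only for the contention case" might look like the following (invented helper names and a userspace-style sketch; the actual implementation referred to above differs in detail):

/* Minimal ticket lock with pv hooks only on the contended paths. */
struct ticketlock {
	volatile unsigned short head;	/* ticket currently being served */
	volatile unsigned short tail;	/* next ticket to hand out */
};

/* Hypothetical pv hooks; on native these can spin/do nothing. */
void pv_lock_wait(struct ticketlock *lk, unsigned short ticket);  /* e.g. block on an event channel */
void pv_lock_kick(struct ticketlock *lk, unsigned short next);	   /* directed wakeup of the next holder */

static void ticket_lock(struct ticketlock *lk)
{
	unsigned short me = __sync_fetch_and_add(&lk->tail, 1);

	while (lk->head != me) {
		/* Spin for a while first; only if the lock stays contended
		 * (e.g. the holder's vcpu was preempted) fall into the pv
		 * slow path, which records (lock, ticket, cpu) per waiter. */
		pv_lock_wait(lk, me);
	}
}

static void ticket_unlock(struct ticketlock *lk)
{
	unsigned short next = ++lk->head;	/* only the holder writes head */

	if (lk->tail != next)			/* somebody is waiting */
		pv_lock_kick(lk, next);		/* wake only the owner of 'next' */
}

The uncontended fast path is exactly the native ticket lock; the indirection is confined to the two hooks on the contended paths.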
Peter Zijlstra
2009-May-14 08:25 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
On Wed, 2009-05-13 at 18:10 -0700, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> >
> > So, what's the fix?
> >
> > Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
> >
> >	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
> >	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
> >	<_spin_lock+15>: callq  *0xffffffff805a5b30
> >	<_spin_lock+22>: retq
> >
> > The indirect call will get patched to:
> >
> >	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
> >	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
> >	<_spin_lock+15>: callq  <__ticket_spin_lock>
> >	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
> >	<_spin_lock+22>: retq
> >
> > One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.
>
> The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.
>
> That brings me to a somewhat half-arsed thought I have been walking around with for a while.
>
> Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:
>
>  1. [minor] no additional nop padding due to indirect function calls.
>  2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.
>
> 	-hpa
>
> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.

This sounds remarkably like what the dynamic function call tracer does.
Peter Zijlstra
2009-May-14 08:33 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
On Thu, 2009-05-14 at 09:05 +0100, Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path. The only open issue we currently have is that while for native keeping interrupts disabled while spinning may be acceptable (though I'm not sure how the -rt folks view this), in a pv environment one should really re-enable interrupts here due to the potentially much higher latency.

The -rt folks don't have nearly as many spinlocks, and for those we do like ticket locks, because they are much fairer and give better worst-case contention behaviour.

Also, for the -rt folks, preempt disable is about as bad as irq disable.
H. Peter Anvin
2009-May-14 14:05 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Peter Zijlstra wrote:
>
> This sounds remarkably like what the dynamic function call tracer does.

I'm sure this has been invented before... probably more than once. Far too many things we invent don't get generalized.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-14 17:36 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
H. Peter Anvin wrote:
> The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.

Yes, that's an option. It has the downside of requiring changes to the common spinlock code in kernel/spinlock.c and linux/spinlock_api*.h. The amount of duplicated code is potentially quite large, but there aren't that many spinlock implementations.

Also, there's not much point in using pv spinlocks when all the instrumentation is on. Lock contention metering, for example, never does a proper lock operation, but does a spin with repeated trylocks; we can't optimise that, so there's no point in trying. So maybe if we can fast-path the fast-path to pv spinlocks, the problem is more tractable...

> That brings me to a somewhat half-arsed thought I have been walking around with for a while.
>
> Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:
>
>  1. [minor] no additional nop padding due to indirect function calls.
>  2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.

We did consider something like this at the outset. As I remember, there were a few concerns:

 * There was no relocation data available in the kernel. I played around with ways to make it work, but they ended up being fairly complex and brittle, with a tendency (of course) to trigger binutils bugs. Maybe that has changed.

 * We didn't really want to implement two separate mechanisms for the same thing. Given that we wanted to inline things like cli/sti/pushf/popf, we needed to have something capable of full patching. Having a separate mechanism for patching calls is harder to justify. Now that pvops is well settled, perhaps it makes sense to consider adding another more general patching mechanism to avoid the indirect calls (a dynamic linker, essentially).

I won't make any great claims about the beauty of the PV_CALL* gunk, but at the very least it is contained within paravirt.h.

> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.

The problem that the tracepoints people were trying to solve was harder: they wanted to replace an arbitrary set of instructions with some other arbitrary instructions (or a call). That would need some kind of SMP synchronization, both for general sanity and to keep the Intel rules happy.

In theory relinking a call should just be a single word write into the instruction, but I don't know if that gets into undefined territory or not. On older P4 systems it would end up blowing away the trace cache on all cpus when you write to code like that, so you'd want to be sure that your references are getting resolved fairly quickly. But it's hard to see how patching the offset in a call instruction would end up calling something other than the old or new function.

	J
Jeremy Fitzhardinge
2009-May-14 17:45 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path.

Do you have a patch to illustrate what you mean? How do you keep track of the target vcpu for the directed wakeup? Are you using the event-poll mechanism to block?

	J
H. Peter Anvin
2009-May-14 17:50 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> We did consider something like this at the outset. As I remember, there were a few concerns:
>
>  * There was no relocation data available in the kernel. I played around with ways to make it work, but they ended up being fairly complex and brittle, with a tendency (of course) to trigger binutils bugs. Maybe that has changed.

We already do this pass (in fact, we do something like three passes of it.) It's basically the vmlinux.o pass.

>  * We didn't really want to implement two separate mechanisms for the same thing. Given that we wanted to inline things like cli/sti/pushf/popf, we needed to have something capable of full patching. Having a separate mechanism for patching calls is harder to justify. Now that pvops is well settled, perhaps it makes sense to consider adding another more general patching mechanism to avoid the indirect calls (a dynamic linker, essentially).

Full patching is understandable (although I think sometimes the code generated was worse than out-of-line... I believe you have fixed that.)

> I won't make any great claims about the beauty of the PV_CALL* gunk, but at the very least it is contained within paravirt.h.

There is still massive spillover into other code, though, at least some of which could possibly be avoided. I don't know.

>> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.
>
> The problem that the tracepoints people were trying to solve was harder: they wanted to replace an arbitrary set of instructions with some other arbitrary instructions (or a call). That would need some kind of SMP synchronization, both for general sanity and to keep the Intel rules happy.
>
> In theory relinking a call should just be a single word write into the instruction, but I don't know if that gets into undefined territory or not. On older P4 systems it would end up blowing away the trace cache on all cpus when you write to code like that, so you'd want to be sure that your references are getting resolved fairly quickly. But it's hard to see how patching the offset in a call instruction would end up calling something other than the old or new function.

The problem is that since the call offset field can be arbitrarily aligned -- it could even cross page boundaries -- you still have absolutely no SMP atomicity guarantees. So you still have all the same problems. Without

	-hpa
Jan Beulich
2009-May-15 08:10 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 14.05.09 19:45 >>>
> Jan Beulich wrote:
>> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path.
>
> Do you have a patch to illustrate what you mean? How do you keep track of the target vcpu for the directed wakeup? Are you using the event-poll mechanism to block?

A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.

Jan
tip-bot for Jeremy Fitzhardinge
2009-May-15 18:18 UTC
[Xen-devel] [tip:x86/urgent] x86: Fix performance regression caused by paravirt_ops on native kernels
Commit-ID:  b4ecc126991b30fe5f9a59dfacda046aeac124b2
Gitweb:     http://git.kernel.org/tip/b4ecc126991b30fe5f9a59dfacda046aeac124b2
Author:     Jeremy Fitzhardinge <jeremy@goop.org>
AuthorDate: Wed, 13 May 2009 17:16:55 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 15 May 2009 20:07:42 +0200

x86: Fix performance regression caused by paravirt_ops on native kernels

Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native.

It appears that the hit is entirely due to the paravirtualized spinlocks introduced by:

 | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8
 | Date:   Mon Jul 7 12:07:51 2008 -0700
 |
 |     paravirt: introduce a "lock-byte" spinlock implementation

The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (it seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return sequence, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed).

If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline:

			nopv		Pv-nospin	Pv-spin
	CPU cycles	100.00%		99.89%		102.18%
	instructions	100.00%		100.10%		100.15%
	CPI		100.00%		99.79%		102.03%
	cache ref	100.00%		100.84%		100.28%
	cache miss	100.00%		90.47%		88.56%
	cache miss rate	100.00%		89.72%		88.31%
	branches	100.00%		99.93%		100.04%
	branch miss	100.00%		103.66%		107.72%
	branch miss rt	100.00%		103.73%		107.67%
	wallclock	100.00%		99.90%		102.20%

The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time.

(The other interesting effect is that the more ops are out-of-line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flip side, the branch misses go up correspondingly...)

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  *0xffffffff805a5b30
	<_spin_lock+22>: retq

The indirect call will get patched to:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  <__ticket_spin_lock>
	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
	<_spin_lock+22>: retq

One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case.

The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround.

[ Impact: fix pvops performance regression when running native ]

Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
Analysed-by: "Li Xin" <xin.li@intel.com>
Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Xen-devel <xen-devel@lists.xensource.com>
LKML-Reference: <4A0B62F7.5030802@goop.org>
[ fixed the help text ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/Kconfig                |   13 +++++++++++++
 arch/x86/include/asm/paravirt.h |    2 +-
 arch/x86/include/asm/spinlock.h |    4 ++--
 arch/x86/kernel/Makefile        |    3 ++-
 arch/x86/kernel/paravirt.c      |    2 ++
 arch/x86/xen/Makefile           |    5 +++--
 arch/x86/xen/xen-ops.h          |   19 +++++++++++++++----
 7 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df9e885..a6efe0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,19 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_SPINLOCKS
+	bool "Paravirtualization layer for spinlocks"
+	depends on PARAVIRT && SMP && EXPERIMENTAL
+	---help---
+	  Paravirtualized spinlocks allow a pvops backend to replace the
+	  spinlock implementation with something virtualization-friendly
+	  (for example, block the virtual CPU rather than spinning).
+
+	  Unfortunately the downside is an up to 5% performance hit on
+	  native kernels, with various workloads.
+
+	  If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
 	bool
 	default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 378e369..a53da00 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
 }
 
-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
 	__raw_spin_lock(lock);
 }
 
-#endif
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 145cce7..88d1bfc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -89,7 +89,8 @@ obj-$(CONFIG_DEBUG_NX_TEST)	+= test_nx.o
 obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o
 obj-$(CONFIG_KVM_CLOCK)		+= kvmclock.o
-obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8e45f44..9faf43b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
 		.pv_irq_ops = pv_irq_ops,
 		.pv_apic_ops = pv_apic_ops,
 		.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
+#endif
 	};
 	return *((void **)&tmpl + type);
 }
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y		:= enlighten.o setup.o multicalls.o mmu.o irq.o \
 			time.o xen-asm.o xen-asm_$(BITS).o \
 			grant-table.o suspend.o
 
-obj-$(CONFIG_SMP)		+= smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP)		+= smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2013946..ca6596b 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -62,15 +62,26 @@ void xen_setup_vcpu_info_placement(void);
 
 #ifdef CONFIG_SMP
 void xen_smp_init(void);
 
-void __init xen_init_spinlocks(void);
-__cpuinit void xen_init_lock_cpu(int cpu);
-void xen_uninit_lock_cpu(int cpu);
-
 extern cpumask_var_t xen_cpu_initialized_map;
 #else
 static inline void xen_smp_init(void) {}
 #endif
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init xen_init_spinlocks(void);
+__cpuinit void xen_init_lock_cpu(int cpu);
+void xen_uninit_lock_cpu(int cpu);
+#else
+static inline void xen_init_spinlocks(void)
+{
+}
+static inline void xen_init_lock_cpu(int cpu)
+{
+}
+static inline void xen_uninit_lock_cpu(int cpu)
+{
+}
+#endif
 
 /* Declare an asm function, along with symbols needed to make it inlineable */
Jeremy Fitzhardinge
2009-May-15 18:50 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.

Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).

What's the thinking behind the xen_spin_adjust() stuff?

> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
> {
>	unsigned int token, count;
>	bool free;
>
>	__ticket_spin_lock_preamble;
>	if (unlikely(!free))
>		token = xen_spin_adjust(lock, token);
>	do {
>		count = 1 << 10;
>		__ticket_spin_lock_body;
>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
> }

How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?

	J
Jan Beulich
2009-May-18 07:19 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
> Jan Beulich wrote:
>> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.
>
> Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).

Not sure how you'd imagine this to work: the unlock code has to look at all cpus' data in either case, so an inner lock would still be required imo.

> What's the thinking behind the xen_spin_adjust() stuff?

That's the placeholder for implementing interrupt re-enabling in the irq-save lock path. The requirement is that if a nested lock attempt hits the same lock on the same cpu that it failed to get acquired on earlier (but got a ticket already), tickets for the given (lock, cpu) pair need to be circularly shifted around so that the innermost requestor gets the earliest ticket. This is what that function's job will become if I ever get to implement this.

>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>> {
>>	unsigned int token, count;
>>	bool free;
>>
>>	__ticket_spin_lock_preamble;
>>	if (unlikely(!free))
>>		token = xen_spin_adjust(lock, token);
>>	do {
>>		count = 1 << 10;
>>		__ticket_spin_lock_body;
>>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
>> }
>
> How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?

It indeed always enters the slowpath loop, but only for a single pass through part of its body (the first compare in the body macro will make it exit the loop right away: 'token' is not only the ticket here, but the full lock->slock contents). But yes, I think you're right, one could avoid entering the body altogether by moving the containing loop into the if (!free) body. The logic went through a number of re-writes, so I must have overlooked that opportunity on the last round of adjustments.

Jan
Jeremy Fitzhardinge
2009-May-20 22:42 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
>> Jan Beulich wrote:
>>> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.
>>
>> Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).
>
> Not sure how you'd imagine this to work: the unlock code has to look at all cpus' data in either case, so an inner lock would still be required imo.

Well, rather than an explicit lock I was thinking of an optimistic scheme like the pv clock update algorithm. "struct spinning" would have a version counter it could update using the even=stable/odd=unstable algorithm. The lock side would simply save the current per-cpu "struct spinning" state onto its own stack (assuming you can even have nested spinning), and then update the per-cpu copy with the current lock. The kick side can check the version counter to make sure it gets a consistent snapshot of the target cpu's current lock state. I think that only the locking side requires locked instructions, and the kick side can be all unlocked.

>> What's the thinking behind the xen_spin_adjust() stuff?
>
> That's the placeholder for implementing interrupt re-enabling in the irq-save lock path. The requirement is that if a nested lock attempt hits the same lock on the same cpu that it failed to get acquired on earlier (but got a ticket already), tickets for the given (lock, cpu) pair need to be circularly shifted around so that the innermost requestor gets the earliest ticket. This is what that function's job will become if I ever get to implement this.

Sounds fiddly.

>>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>>> {
>>>	unsigned int token, count;
>>>	bool free;
>>>
>>>	__ticket_spin_lock_preamble;
>>>	if (unlikely(!free))
>>>		token = xen_spin_adjust(lock, token);
>>>	do {
>>>		count = 1 << 10;
>>>		__ticket_spin_lock_body;
>>>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
>>> }
>>
>> How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?
>
> It indeed always enters the slowpath loop, but only for a single pass through part of its body (the first compare in the body macro will make it exit the loop right away: 'token' is not only the ticket here, but the full lock->slock contents). But yes, I think you're right, one could avoid entering the body altogether by moving the containing loop into the if (!free) body. The logic went through a number of re-writes, so I must have overlooked that opportunity on the last round of adjustments.

I was thinking of something like this: (completely untested)

void __ticket_spin_lock(raw_spinlock_t *lock)
{
	unsigned short inc = 0x100;
	unsigned short token;

	do {
		unsigned count = 1 << 10;

		asm volatile (
			"  lock xaddw %w0, %1\n"
			"1: cmpb %h0, %b0\n"
			"   je 2f\n"
			"   rep ; nop\n"
			"   movb %1, %b0\n"
			/* don't need lfence here, because loads are in-order */
			"   sub $1,%2\n"
			"   jnz 1b\n"
			"2:"
			: "+Q" (inc), "+m" (lock->slock), "+r" (count)
			:
			: "memory", "cc");

		if (likely(count != 0))
			break;

		token = inc;
		inc = 0;
	} while (unlikely(!xen_spin_wait(lock, token)));
}

where xen_spin_wait() would actually be a pvops call, and perhaps the asm could be pulled out into an inline to deal with the 8/16 bit ticket difference.

	J
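For what it's worth, the even=stable/odd=unstable publication scheme described above might look roughly like this (invented names, schematic barriers; not the actual Xen or SUSE code, and kernel per-cpu/barrier headers are assumed):

/*
 * Hypothetical sketch: each cpu publishes which lock/ticket it is spinning
 * on; the kicking cpu reads that record without taking any lock, using a
 * seqlock-style version counter (even = stable, odd = update in progress).
 */
struct spinning {
	unsigned version;		/* even = stable, odd = being updated */
	struct raw_spinlock *lock;	/* lock this cpu is currently spinning on */
	unsigned ticket;		/* its ticket for that lock */
};

static DEFINE_PER_CPU(struct spinning, spinning_state);

/* Lock side: publish which lock/ticket we are about to spin on. */
static void publish_spinning(struct raw_spinlock *lock, unsigned ticket)
{
	struct spinning *s = &__get_cpu_var(spinning_state);

	s->version++;			/* now odd: readers will retry */
	wmb();
	s->lock = lock;
	s->ticket = ticket;
	wmb();
	s->version++;			/* even again: snapshot is stable */
}

/* Kick side: take a consistent, lock-free snapshot of another cpu's state. */
static int snapshot_spinning(int cpu, struct spinning *snap)
{
	struct spinning *s = &per_cpu(spinning_state, cpu);
	unsigned v;

	do {
		v = s->version;
		if (v & 1)
			return 0;	/* writer is mid-update; caller can retry */
		rmb();
		*snap = *s;
		rmb();
	} while (v != s->version);

	return 1;
}

The lock side would save and restore this record on its own stack around nested spins, as described above; since only the owning cpu ever writes it, the kick side needs no locked operations at all.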
Chuck Ebbert
2009-May-21 22:42 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
On Wed, 13 May 2009 17:16:55 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>	<_spin_lock+22>: retq
>
> The indirect call will get patched to:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>	<_spin_lock+22>: retq

Can't those calls be changed to jumps?
Jeremy Fitzhardinge
2009-May-21 22:48 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Chuck Ebbert wrote:
> On Wed, 13 May 2009 17:16:55 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>>	<_spin_lock+22>: retq
>>
>> The indirect call will get patched to:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>>	<_spin_lock+22>: retq
>
> Can't those calls be changed to jumps?

In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

	J
H. Peter Anvin
2009-May-21 23:10 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

"When it is immediately followed by a ret" seems like a straightforward rule to me?

	-hpa
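Expressed as code, that rule might look something like this inside a paravirt-style call patcher (hypothetical names; not the actual paravirt_patch_* implementation, and it assumes the site is at least 5 bytes and the target is within +/-2GB):

#include <stdint.h>
#include <string.h>

/*
 * Patch a pvops call site of 'len' bytes to branch directly to 'target'.
 * If the first instruction *after* the patchable region is a bare ret
 * (0xc3), the call can become a tail jump, eliding one call/return pair;
 * otherwise it must stay a call.
 */
static unsigned patch_direct_branch(uint8_t *site, unsigned len, void *target)
{
	int32_t rel = (int32_t)((uint8_t *)target - (site + 5));
	int tail = (site[len] == 0xc3);		/* immediately followed by ret? */

	site[0] = tail ? 0xe9 : 0xe8;		/* jmp rel32 : call rel32 */
	memcpy(site + 1, &rel, sizeof(rel));
	memset(site + 5, 0x90, len - 5);	/* nop out the rest of the site */
	return len;
}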
Xin, Xiaohui
2009-May-22 01:26 UTC
[Xen-devel] RE: Performance overhead of paravirt_ops on native identified
Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

Thanks
Xiaohui

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: May 22, 2009 6:49
To: Chuck Ebbert
Cc: Ingo Molnar; Xin, Xiaohui; Li, Xin; Nakajima, Jun; H. Peter Anvin; Nick Piggin; Linux Kernel Mailing List; Xen-devel
Subject: Re: Performance overhead of paravirt_ops on native identified

Chuck Ebbert wrote:
> On Wed, 13 May 2009 17:16:55 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>>	<_spin_lock+22>: retq
>>
>> The indirect call will get patched to:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>>	<_spin_lock+22>: retq
>
> Can't those calls be changed to jumps?

In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

J
H. Peter Anvin
2009-May-22 03:39 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I didn't, no. That seems extremely weird to me. (Unbalancing the call/ret stack is known to suck royally, of course.)

>> Can't those calls be changed to jumps?
>
> In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

When there is code after the call, it's rather obviously not safe.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-22 04:27 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I don't think you had mentioned that. You're saying that a call->jmp->ret is slower than a call->call->ret->ret?

	J
Xin, Xiaohui
2009-May-22 05:59 UTC
[Xen-devel] RE: Performance overhead of paravirt_ops on native identified
What I mean is that if the binary of _spin_lock is like this:

(gdb) disassemble _spin_lock
Dump of assembler code for function _spin_lock:
0xffffffff80497c0f <_spin_lock+0>:	mov    1252634(%rip),%r11	# 0xffffffff805c9930 <test_lock_ops+16>
0xffffffff80497c16 <_spin_lock+7>:	jmpq   *%r11
End of assembler dump.
(gdb)

In this situation the binary contains a jump, and the overhead is more than the call.

Thanks
Xiaohui

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: May 22, 2009 12:28
To: Xin, Xiaohui
Cc: Chuck Ebbert; Ingo Molnar; Li, Xin; Nakajima, Jun; H. Peter Anvin; Nick Piggin; Linux Kernel Mailing List; Xen-devel
Subject: Re: Performance overhead of paravirt_ops on native identified

Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I don't think you had mentioned that. You're saying that a call->jmp->ret is slower than a call->call->ret->ret?

J
H. Peter Anvin
2009-May-22 16:33 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> What I mean is that if the binary of _spin_lock is like this:
>
> (gdb) disassemble _spin_lock
> Dump of assembler code for function _spin_lock:
> 0xffffffff80497c0f <_spin_lock+0>:	mov    1252634(%rip),%r11	# 0xffffffff805c9930 <test_lock_ops+16>
> 0xffffffff80497c16 <_spin_lock+7>:	jmpq   *%r11
> End of assembler dump.
>
> In this situation the binary contains a jump, and the overhead is more than the call.

That's an indirect jump, though. I don't think anyone was suggesting using an indirect jump; the final patched version should be a direct jump (instead of a direct call.)

I can see how indirect jumps might be slower, since they are probably not optimized as aggressively in hardware as indirect calls -- indirect jumps are generally used for switch tables, which often have low predictability, whereas indirect calls are generally used for method calls, which are (a) incredibly important for OOP languages, and (b) generally highly predictable on the dynamic scale.

However, direct jumps and calls don't need prediction at all (although of course rets do.)

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-22 22:44 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
H. Peter Anvin wrote:
> That's an indirect jump, though. I don't think anyone was suggesting using an indirect jump; the final patched version should be a direct jump (instead of a direct call.)
>
> I can see how indirect jumps might be slower, since they are probably not optimized as aggressively in hardware as indirect calls -- indirect jumps are generally used for switch tables, which often have low predictability, whereas indirect calls are generally used for method calls, which are (a) incredibly important for OOP languages, and (b) generally highly predictable on the dynamic scale.
>
> However, direct jumps and calls don't need prediction at all (although of course rets do.)

I did a quick experiment to see how many sites this optimisation could actually affect. Firstly, it does absolutely nothing with frame pointers enabled. Arranging for no frame pointers is quite tricky, since it means disabling all debugging, tracing and other things.

With no frame pointers, about 26 of 5400 indirect calls are immediately followed by a ret (not all of those sites are pvops calls). With preempt disabled, this goes up to 45 sites.

I haven't done any actual runtime tests, but a quick survey of the affected sites shows that only a couple are performance-sensitive; _spin_lock, _spin_lock_irq and _spin_lock_irqsave are the most obvious.

	J
H. Peter Anvin
2009-May-22 22:47 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> I did a quick experiment to see how many sites this optimisation could actually affect. Firstly, it does absolutely nothing with frame pointers enabled. Arranging for no frame pointers is quite tricky, since it means disabling all debugging, tracing and other things.
>
> With no frame pointers, about 26 of 5400 indirect calls are immediately followed by a ret (not all of those sites are pvops calls). With preempt disabled, this goes up to 45 sites.
>
> I haven't done any actual runtime tests, but a quick survey of the affected sites shows that only a couple are performance-sensitive; _spin_lock, _spin_lock_irq and _spin_lock_irqsave are the most obvious.

OK, that doesn't seem like a very productive avenue. Problem still remains, obviously.

	-hpa