Jeremy Fitzhardinge
2009-May-14 00:16 UTC
[Xen-devel] Performance overhead of paravirt_ops on native identified
Hi Ingo,

Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native. It appears that the hit is entirely due to the paravirtualized spinlocks; the extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (it seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return sequence, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons).

This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed).

If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline:

			nopv		Pv-nospin	Pv-spin
	CPU cycles	100.00%		99.89%		102.18%
	instructions	100.00%		100.10%		100.15%
	CPI		100.00%		99.79%		102.03%
	cache ref	100.00%		100.84%		100.28%
	cache miss	100.00%		90.47%		88.56%
	cache miss rate	100.00%		89.72%		88.31%
	branches	100.00%		99.93%		100.04%
	branch miss	100.00%		103.66%		107.72%
	branch miss rt	100.00%		103.73%		107.67%
	wallclock	100.00%		99.90%		102.20%

The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time.

(The other interesting effect is that the more ops are out-of-line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flip side, the branch misses go up correspondingly...)

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  *0xffffffff805a5b30
	<_spin_lock+22>: retq

The indirect call will get patched to:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  <__ticket_spin_lock>
	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
	<_spin_lock+22>: retq

One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case (a rough sketch of this appears after the patch below). The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and it leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (a 10% saving of overall system CPU when guests block rather than spin). Still, it is a reasonable short-term workaround.

The best solution would be to work out whether this really is a problem interaction with Intel's pipelines, and come up with something that avoids it. It would be very interesting to see if there's a similar hit on AMD systems.

	J

From 839033e472c8f3b228be35e57a8b31fbb7f9cf98 Mon Sep 17 00:00:00 2001
From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Date: Wed, 13 May 2009 11:58:17 -0700
Subject: [PATCH] x86: add config to disable PV spinlocks

Paravirtualized spinlocks seem to cause a 2-7% performance hit when
running a pvops kernel native.  Without them, the pvops kernel is
identical to a non-pvops kernel in performance.

[ Impact: reduce overhead of pvops when running native ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5f50179..a99ed71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,17 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_SPINLOCKS
+	bool "Enable paravirtualized spinlocks"
+	depends on PARAVIRT && SMP
+	default XEN
+	---help---
+	  Paravirtualized spinlocks allow a pvops backend to replace the
+	  spinlock implementation with something virtualization-friendly
+	  (for example, block the virtual CPU rather than spinning).
+	  Unfortunately the downside is an as-yet unexplained performance
+	  hit when running native.
+
 config PARAVIRT_CLOCK
 	bool
 	default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1fe5837..4fb37c8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
 }
 
-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
 	__raw_spin_lock(lock);
 }
 
-#endif
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 68a4ff6..4f78bd6 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -90,7 +90,8 @@ obj-$(CONFIG_DEBUG_NX_TEST)	+= test_nx.o
 obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o
 obj-$(CONFIG_KVM_CLOCK)		+= kvmclock.o
-obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index aa34423..70ec9b9 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
 		.pv_irq_ops = pv_irq_ops,
 		.pv_apic_ops = pv_apic_ops,
 		.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
+#endif
 	};
 	return *((void **)&tmpl + type);
 }
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y		:= enlighten.o setup.o multicalls.o mmu.o irq.o \
 			time.o xen-asm.o xen-asm_$(BITS).o \
 			grant-table.o suspend.o
 
-obj-$(CONFIG_SMP)		+= smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP)		+= smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 5c50a10..22494fd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -61,15 +61,26 @@ void xen_setup_vcpu_info_placement(void);
 
 #ifdef CONFIG_SMP
 void xen_smp_init(void);
 
-void __init xen_init_spinlocks(void);
-__cpuinit void xen_init_lock_cpu(int cpu);
-void xen_uninit_lock_cpu(int cpu);
-
 extern cpumask_var_t xen_cpu_initialized_map;
 #else
 static inline void xen_smp_init(void) {}
 #endif
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init xen_init_spinlocks(void);
+__cpuinit void xen_init_lock_cpu(int cpu);
+void xen_uninit_lock_cpu(int cpu);
+#else
+static inline void xen_init_spinlocks(void)
+{
+}
+static inline void xen_init_lock_cpu(int cpu)
+{
+}
+static inline void xen_uninit_lock_cpu(int cpu)
+{
+}
+#endif
 
 /* Declare an asm function, along with symbols needed to make it inlineable */
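A rough sketch of the "inline _spin_lock" option mentioned above, for the !debug case only (illustrative; the real spin_lock()/spinlock_api_smp.h definitions go through several more macro layers, and the wrapper name here is invented):

/*
 * Hypothetical sketch, not the actual kernel definition: if _spin_lock were
 * made an inline wrapper when no spinlock/preempt debugging is enabled, each
 * callsite would expand to preempt_disable() plus a single patchable call
 * into the lock implementation, instead of the current call->call->ret->ret
 * chain.
 */
static __always_inline void inlined_spin_lock(spinlock_t *lock)
{
	preempt_disable();			/* replicated at every callsite */
	__raw_spin_lock(&lock->raw_lock);	/* pvops hook; patched to a direct
						 * call to __ticket_spin_lock
						 * when running native */
}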
H. Peter Anvin
2009-May-14 01:10 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
>
> So, what's the fix?
>
> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>	<_spin_lock+22>: retq
>
> The indirect call will get patched to:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>	<_spin_lock+22>: retq
>
> One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.

That brings me to a somewhat half-arsed thought I have been walking around with for a while.

Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:

 1. [minor] no additional nop padding due to indirect function calls.
 2. [major] no need for a ton of wrapper macros manifest in the code.

paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.

	-hpa

(*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.
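To make the idea concrete, a minimal sketch of what such a link-time-generated patch-site table and call retargeting could look like (hypothetical names and layout; this is not an existing kernel facility):

#include <stdint.h>
#include <string.h>

/* One record per direct "call rel32" site found by the vmlinux.o pass. */
struct call_site {
	uint8_t *addr;		/* address of the 0xe8 opcode byte */
};

/*
 * Retarget one call site.  The rel32 operand of a call/jmp is relative to
 * the address of the *next* instruction (addr + 5).  Real kernel code would
 * also have to deal with write protection and SMP synchronization
 * (text_poke()-style); this only shows the arithmetic.
 */
static void retarget_call(struct call_site *cs, void *target)
{
	int32_t rel = (int32_t)((uint8_t *)target - (cs->addr + 5));

	memcpy(cs->addr + 1, &rel, sizeof(rel));
}

By default every such call would point at a stub that simply does an indirect jump through the pvops structure; once the backend is known, each recorded site is rewritten to call the chosen implementation directly.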
Jan Beulich
2009-May-14 08:05 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 14.05.09 02:16 >>>
> One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.
>
> The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround.

Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path. The only open issue we currently have is that while for native keeping interrupts disabled while spinning may be acceptable (though I'm not sure how the -rt folks view this), in a pv environment one should really re-enable interrupts here due to the potentially much higher latency.

Jan
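Schematically, "ticket locks everywhere, with an indirection only for the contention case" might look like the following (invented helper names and a userspace-style sketch; the actual implementation referred to above differs in detail):

/* Minimal ticket lock with pv hooks only on the contended paths. */
struct ticketlock {
	volatile unsigned short head;	/* ticket currently being served */
	volatile unsigned short tail;	/* next ticket to hand out */
};

/* Hypothetical pv hooks; on native these can spin/do nothing. */
void pv_lock_wait(struct ticketlock *lk, unsigned short ticket);  /* e.g. block on an event channel */
void pv_lock_kick(struct ticketlock *lk, unsigned short next);	   /* directed wakeup of the next holder */

static void ticket_lock(struct ticketlock *lk)
{
	unsigned short me = __sync_fetch_and_add(&lk->tail, 1);

	while (lk->head != me) {
		/* Spin for a while first; only if the lock stays contended
		 * (e.g. the holder's vcpu was preempted) fall into the pv
		 * slow path, which records (lock, ticket, cpu) per waiter. */
		pv_lock_wait(lk, me);
	}
}

static void ticket_unlock(struct ticketlock *lk)
{
	unsigned short next = ++lk->head;	/* only the holder writes head */

	if (lk->tail != next)			/* somebody is waiting */
		pv_lock_kick(lk, next);		/* wake only the owner of 'next' */
}

The uncontended fast path is exactly the native ticket lock; the indirection is confined to the two hooks on the contended paths.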
Peter Zijlstra
2009-May-14 08:25 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
On Wed, 2009-05-13 at 18:10 -0700, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> >
> > So, what's the fix?
> >
> > Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
> >
> >	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
> >	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
> >	<_spin_lock+15>: callq  *0xffffffff805a5b30
> >	<_spin_lock+22>: retq
> >
> > The indirect call will get patched to:
> >
> >	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
> >	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
> >	<_spin_lock+15>: callq  <__ticket_spin_lock>
> >	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
> >	<_spin_lock+22>: retq
> >
> > One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case. The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.
>
> The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.
>
> That brings me to a somewhat half-arsed thought I have been walking around with for a while.
>
> Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:
>
>  1. [minor] no additional nop padding due to indirect function calls.
>  2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.
>
> 	-hpa
>
> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.

This sounds remarkably like what the dynamic function call tracer does.
Peter Zijlstra
2009-May-14 08:33 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
On Thu, 2009-05-14 at 09:05 +0100, Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path. The only open issue we currently have is that while for native keeping interrupts disabled while spinning may be acceptable (though I'm not sure how the -rt folks view this), in a pv environment one should really re-enable interrupts here due to the potentially much higher latency.

The -rt folks don't have nearly as many spinlocks, and for those we do like ticket locks, because they are much fairer and give better worst-case contention behaviour.

Also, for the -rt folks, preempt disable is about as bad as irq disable.
H. Peter Anvin
2009-May-14 14:05 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Peter Zijlstra wrote:
>
> This sounds remarkably like what the dynamic function call tracer does.

I'm sure this has been invented before... probably more than once. Far too many things we invent don't get generalized.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-14 17:36 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
H. Peter Anvin wrote:
> The other obvious option, it would seem to me, would be to eliminate the *inner* call/return pair, i.e. merging the _spin_lock setup code in with the internals of each available implementation (in the case above, __ticket_spin_lock). This is effectively what happens on native. The one problem with that is that every callsite now becomes a patching target.

Yes, that's an option. It has the downside of requiring changes to the common spinlock code in kernel/spinlock.c and linux/spinlock_api*.h. The amount of duplicated code is potentially quite large, but there aren't that many spinlock implementations.

Also, there's not much point in using pv spinlocks when all the instrumentation is on. Lock contention metering, for example, never does a proper lock operation, but does a spin with repeated trylocks; we can't optimise that, so there's no point in trying. So maybe if we can fast-path the fast-path to pv spinlocks, the problem is more tractable...

> That brings me to a somewhat half-arsed thought I have been walking around with for a while.
>
> Consider a paravirt function -- or for that matter any other call which is runtime-static; this isn't just limited to paravirt -- which looks to the C compiler just like any other external function -- no indirection. We can point it by default to a function which is really just an indirect jump to the appropriate handler, which handles the prepatching case. However, a linktime pass over vmlinux.o can find all the points where this function is called, and turn it into a list of patch sites(*). The advantages are:
>
>  1. [minor] no additional nop padding due to indirect function calls.
>  2. [major] no need for a ton of wrapper macros manifest in the code.
>
> paravirt_ops that turn into pure inline code in the native case are obviously another ball of wax entirely; there the inline assembly wrappers are simply unavoidable.

We did consider something like this at the outset. As I remember, there were a few concerns:

 * There was no relocation data available in the kernel. I played around with ways to make it work, but they ended up being fairly complex and brittle, with a tendency (of course) to trigger binutils bugs. Maybe that has changed.

 * We didn't really want to implement two separate mechanisms for the same thing. Given that we wanted to inline things like cli/sti/pushf/popf, we needed to have something capable of full patching. Having a separate mechanism for patching calls is harder to justify. Now that pvops is well settled, perhaps it makes sense to consider adding another more general patching mechanism to avoid the indirect calls (a dynamic linker, essentially).

I won't make any great claims about the beauty of the PV_CALL* gunk, but at the very least it is contained within paravirt.h.

> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.

The problem that the tracepoints people were trying to solve was harder: they wanted to replace an arbitrary set of instructions with some other arbitrary instructions (or a call). That would need some kind of SMP synchronization, both for general sanity and to keep the Intel rules happy.

In theory relinking a call should just be a single word write into the instruction, but I don't know if that gets into undefined territory or not. On older P4 systems it would end up blowing away the trace cache on all cpus when you write to code like that, so you'd want to be sure that your references are getting resolved fairly quickly. But it's hard to see how patching the offset in a call instruction would end up calling something other than the old or new function.

	J
Jeremy Fitzhardinge
2009-May-14 17:45 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path.

Do you have a patch to illustrate what you mean? How do you keep track of the target vcpu for the directed wakeup? Are you using the event-poll mechanism to block?

	J
H. Peter Anvin
2009-May-14 17:50 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> We did consider something like this at the outset. As I remember, there were a few concerns:
>
>  * There was no relocation data available in the kernel. I played around with ways to make it work, but they ended up being fairly complex and brittle, with a tendency (of course) to trigger binutils bugs. Maybe that has changed.

We already do this pass (in fact, we do something like three passes of it.) It's basically the vmlinux.o pass.

>  * We didn't really want to implement two separate mechanisms for the same thing. Given that we wanted to inline things like cli/sti/pushf/popf, we needed to have something capable of full patching. Having a separate mechanism for patching calls is harder to justify. Now that pvops is well settled, perhaps it makes sense to consider adding another more general patching mechanism to avoid the indirect calls (a dynamic linker, essentially).

Full patching is understandable (although I think sometimes the code generated was worse than out-of-line... I believe you have fixed that.)

> I won't make any great claims about the beauty of the PV_CALL* gunk, but at the very least it is contained within paravirt.h.

There is still massive spillover into other code, though, at least some of which could possibly be avoided. I don't know.

>> (*) if patching code on SMP was cheaper, we could actually do this lazily, and wouldn't have to store a list of patch sites. I don't feel brave enough to go down that route.
>
> The problem that the tracepoints people were trying to solve was harder: they wanted to replace an arbitrary set of instructions with some other arbitrary instructions (or a call). That would need some kind of SMP synchronization, both for general sanity and to keep the Intel rules happy.
>
> In theory relinking a call should just be a single word write into the instruction, but I don't know if that gets into undefined territory or not. On older P4 systems it would end up blowing away the trace cache on all cpus when you write to code like that, so you'd want to be sure that your references are getting resolved fairly quickly. But it's hard to see how patching the offset in a call instruction would end up calling something other than the old or new function.

The problem is that since the call offset field can be arbitrarily aligned -- it could even cross page boundaries -- you still have absolutely no SMP atomicity guarantees. So you still have all the same problems. Without

	-hpa
Jan Beulich
2009-May-15 08:10 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 14.05.09 19:45 >>>
> Jan Beulich wrote:
>> Wouldn't a third solution be to use ticket spinlocks everywhere, i.e. eliminate the current indirection, and replace it by an indirection for just the contention case? As I view it, the problem for Xen isn't really the ticket locks by themselves, but rather the extra spinning involved, which is of concern only if a lock is contended. We're using ticket locks quite happily in our kernels, with directed instead of global wakeup from the unlock path.
>
> Do you have a patch to illustrate what you mean? How do you keep track of the target vcpu for the directed wakeup? Are you using the event-poll mechanism to block?

A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.

Jan
tip-bot for Jeremy Fitzhardinge
2009-May-15 18:18 UTC
[Xen-devel] [tip:x86/urgent] x86: Fix performance regression caused by paravirt_ops on native kernels
Commit-ID:  b4ecc126991b30fe5f9a59dfacda046aeac124b2
Gitweb:     http://git.kernel.org/tip/b4ecc126991b30fe5f9a59dfacda046aeac124b2
Author:     Jeremy Fitzhardinge <jeremy@goop.org>
AuthorDate: Wed, 13 May 2009 17:16:55 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 15 May 2009 20:07:42 +0200

x86: Fix performance regression caused by paravirt_ops on native kernels

Xiaohui Xin and some other folks at Intel have been looking into what's behind the performance hit of paravirt_ops when running native.

It appears that the hit is entirely due to the paravirtualized spinlocks introduced by:

 | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8
 | Date:   Mon Jul 7 12:07:51 2008 -0700
 |
 |     paravirt: introduce a "lock-byte" spinlock implementation

The extra call/return in the spinlock path is somehow causing an increase in the cycles/instruction of somewhere around 2-7% (it seems to vary quite a lot from test to test). The working theory is that the CPU's pipeline is getting upset about the call->call->locked-op->return->return sequence, and seems to be failing to speculate (though I haven't seen anything definitive about the precise reasons). This doesn't entirely make sense, because the performance hit is also visible on unlock and other operations which don't involve locked instructions. But spinlock operations clearly swamp all the other pvops operations, even though I can't imagine that they're nearly as common (there's only a .05% increase in instructions executed).

If I disable just the pv-spinlock calls, my tests show that pvops is identical to non-pvops performance on native (my measurements show that it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a no-pvops build as baseline:

			nopv		Pv-nospin	Pv-spin
	CPU cycles	100.00%		99.89%		102.18%
	instructions	100.00%		100.10%		100.15%
	CPI		100.00%		99.79%		102.03%
	cache ref	100.00%		100.84%		100.28%
	cache miss	100.00%		90.47%		88.56%
	cache miss rate	100.00%		89.72%		88.31%
	branches	100.00%		99.93%		100.04%
	branch miss	100.00%		103.66%		107.72%
	branch miss rt	100.00%		103.73%		107.67%
	wallclock	100.00%		99.90%		102.20%

The clear effect here is that the 2% increase in CPI is directly reflected in the final wallclock time.

(The other interesting effect is that the more ops are out-of-line calls via pvops, the lower the cache access and miss rates. Not too surprising, but it suggests that the non-pvops kernel is over-inlined. On the flip side, the branch misses go up correspondingly...)

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  *0xffffffff805a5b30
	<_spin_lock+22>: retq

The indirect call will get patched to:

	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
	<_spin_lock+15>: callq  <__ticket_spin_lock>
	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
	<_spin_lock+22>: retq

One possibility is to inline _spin_lock, etc, when building an optimised kernel (ie, when there's no spinlock/preempt instrumentation/debugging enabled). That will remove the outer call/return pair, returning the instruction stream to a single call/return, which will presumably execute the same as the non-pvops case.

The downsides are: 1) it will replicate the preempt_disable/enable code at each lock/unlock callsite; this code is fairly small, but not nothing; and 2) the spinlock definitions are already a very heavily tangled mass of #ifdefs and other preprocessor magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks. Making them a separate config option is fairly easy, and it would be trivial to enable them only when Xen is enabled (as the only non-default user). But it doesn't really address the common case of a distro build which is going to have Xen support enabled, and leaves the open question of whether the native performance cost of pv-spinlocks is worth the performance improvement on a loaded Xen system (10% saving of overall system CPU when guests block rather than spin). Still it is a reasonable short-term workaround.

[ Impact: fix pvops performance regression when running native ]

Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
Analysed-by: "Li Xin" <xin.li@intel.com>
Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Xen-devel <xen-devel@lists.xensource.com>
LKML-Reference: <4A0B62F7.5030802@goop.org>
[ fixed the help text ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/Kconfig                |   13 +++++++++++++
 arch/x86/include/asm/paravirt.h |    2 +-
 arch/x86/include/asm/spinlock.h |    4 ++--
 arch/x86/kernel/Makefile        |    3 ++-
 arch/x86/kernel/paravirt.c      |    2 ++
 arch/x86/xen/Makefile           |    5 +++--
 arch/x86/xen/xen-ops.h          |   19 +++++++++++++++----
 7 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df9e885..a6efe0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -498,6 +498,19 @@ config PARAVIRT
 	  over full virtualization.  However, when run without a hypervisor
 	  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_SPINLOCKS
+	bool "Paravirtualization layer for spinlocks"
+	depends on PARAVIRT && SMP && EXPERIMENTAL
+	---help---
+	  Paravirtualized spinlocks allow a pvops backend to replace the
+	  spinlock implementation with something virtualization-friendly
+	  (for example, block the virtual CPU rather than spinning).
+
+	  Unfortunately the downside is an up to 5% performance hit on
+	  native kernels, with various workloads.
+
+	  If you are unsure how to answer this question, answer N.
+
 config PARAVIRT_CLOCK
 	bool
 	default n
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 378e369..a53da00 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -1443,7 +1443,7 @@ u64 _paravirt_ident_64(u64);
 
 #define paravirt_nop	((void *)_paravirt_nop)
 
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
 static inline int __raw_spin_is_locked(struct raw_spinlock *lock)
 {
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e5e6caf..b7e5db8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -172,7 +172,7 @@ static inline int __ticket_spin_is_contended(raw_spinlock_t *lock)
 	return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1;
 }
 
-#ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
@@ -206,7 +206,7 @@ static __always_inline void __raw_spin_lock_flags(raw_spinlock_t *lock,
 	__raw_spin_lock(lock);
 }
 
-#endif
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */
 
 static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
 {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 145cce7..88d1bfc 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -89,7 +89,8 @@ obj-$(CONFIG_DEBUG_NX_TEST)	+= test_nx.o
 obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
 obj-$(CONFIG_KVM_GUEST)		+= kvm.o
 obj-$(CONFIG_KVM_CLOCK)		+= kvmclock.o
-obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o paravirt-spinlocks.o
+obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8e45f44..9faf43b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -134,7 +134,9 @@ static void *get_call_destination(u8 type)
 		.pv_irq_ops = pv_irq_ops,
 		.pv_apic_ops = pv_apic_ops,
 		.pv_mmu_ops = pv_mmu_ops,
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 		.pv_lock_ops = pv_lock_ops,
+#endif
 	};
 	return *((void **)&tmpl + type);
 }
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3b767d0..172438f 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,5 +9,6 @@ obj-y		:= enlighten.o setup.o multicalls.o mmu.o irq.o \
 			time.o xen-asm.o xen-asm_$(BITS).o \
 			grant-table.o suspend.o
 
-obj-$(CONFIG_SMP)		+= smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
\ No newline at end of file
+obj-$(CONFIG_SMP)		+= smp.o
+obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
+obj-$(CONFIG_XEN_DEBUG_FS)	+= debugfs.o
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2013946..ca6596b 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -62,15 +62,26 @@ void xen_setup_vcpu_info_placement(void);
 
 #ifdef CONFIG_SMP
 void xen_smp_init(void);
 
-void __init xen_init_spinlocks(void);
-__cpuinit void xen_init_lock_cpu(int cpu);
-void xen_uninit_lock_cpu(int cpu);
-
 extern cpumask_var_t xen_cpu_initialized_map;
 #else
 static inline void xen_smp_init(void) {}
 #endif
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init xen_init_spinlocks(void);
+__cpuinit void xen_init_lock_cpu(int cpu);
+void xen_uninit_lock_cpu(int cpu);
+#else
+static inline void xen_init_spinlocks(void)
+{
+}
+static inline void xen_init_lock_cpu(int cpu)
+{
+}
+static inline void xen_uninit_lock_cpu(int cpu)
+{
+}
+#endif
 
 /* Declare an asm function, along with symbols needed to make it inlineable */
Jeremy Fitzhardinge
2009-May-15 18:50 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.

Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).

What's the thinking behind the xen_spin_adjust() stuff?

> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
> {
>	unsigned int token, count;
>	bool free;
>
>	__ticket_spin_lock_preamble;
>	if (unlikely(!free))
>		token = xen_spin_adjust(lock, token);
>	do {
>		count = 1 << 10;
>		__ticket_spin_lock_body;
>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
> }

How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?

	J
Jan Beulich
2009-May-18 07:19 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
> Jan Beulich wrote:
>> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.
>
> Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).

Not sure how you'd imagine this to work: the unlock code has to look at all cpus' data in either case, so an inner lock would still be required imo.

> What's the thinking behind the xen_spin_adjust() stuff?

That's the placeholder for implementing interrupt re-enabling in the irq-save lock path. The requirement is that if a nested lock attempt hits the same lock on the same cpu that it failed to get acquired on earlier (but got a ticket already), tickets for the given (lock, cpu) pair need to be circularly shifted around so that the innermost requestor gets the earliest ticket. This is what that function's job will become if I ever get to implement this.

>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>> {
>>	unsigned int token, count;
>>	bool free;
>>
>>	__ticket_spin_lock_preamble;
>>	if (unlikely(!free))
>>		token = xen_spin_adjust(lock, token);
>>	do {
>>		count = 1 << 10;
>>		__ticket_spin_lock_body;
>>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
>> }
>
> How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?

It indeed always enters the slowpath loop, but only for a single pass through part of its body (the first compare in the body macro will make it exit the loop right away: 'token' is not only the ticket here, but the full lock->slock contents). But yes, I think you're right, one could avoid entering the body altogether by moving the containing loop into the if (!free) body. The logic went through a number of re-writes, so I must have overlooked that opportunity on the last round of adjustments.

Jan
Jeremy Fitzhardinge
2009-May-20 22:42 UTC
Re: [Xen-devel] Performance overhead of paravirt_ops on native identified
Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <jeremy@goop.org> 15.05.09 20:50 >>>
>> Jan Beulich wrote:
>>> A patch for the pv-ops kernel would require some time. What I can give you right away - just for reference - are the sources we currently use in our kernel: attached.
>>
>> Hm, I see. Putting a call out to a pv-ops function in the ticket lock slow path looks pretty straightforward. The need for an extra lock on the contended unlock side is a bit unfortunate; have you measured to see what hit that has? Seems to me like you could avoid the problem by using per-cpu storage rather than stack storage (though you'd need to copy the per-cpu data to stack when handling a nested spinlock).
>
> Not sure how you'd imagine this to work: the unlock code has to look at all cpus' data in either case, so an inner lock would still be required imo.

Well, rather than an explicit lock I was thinking of an optimistic scheme like the pv clock update algorithm. "struct spinning" would have a version counter it could update using the even=stable/odd=unstable algorithm. The lock side would simply save the current per-cpu "struct spinning" state onto its own stack (assuming you can even have nested spinning), and then update the per-cpu copy with the current lock. The kick side can check the version counter to make sure it gets a consistent snapshot of the target cpu's current lock state. I think that only the locking side requires locked instructions, and the kick side can be all unlocked.

>> What's the thinking behind the xen_spin_adjust() stuff?
>
> That's the placeholder for implementing interrupt re-enabling in the irq-save lock path. The requirement is that if a nested lock attempt hits the same lock on the same cpu that it failed to get acquired on earlier (but got a ticket already), tickets for the given (lock, cpu) pair need to be circularly shifted around so that the innermost requestor gets the earliest ticket. This is what that function's job will become if I ever get to implement this.

Sounds fiddly.

>>> static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
>>> {
>>>	unsigned int token, count;
>>>	bool free;
>>>
>>>	__ticket_spin_lock_preamble;
>>>	if (unlikely(!free))
>>>		token = xen_spin_adjust(lock, token);
>>>	do {
>>>		count = 1 << 10;
>>>		__ticket_spin_lock_body;
>>>	} while (unlikely(!count) && !xen_spin_wait(lock, token));
>>> }
>>
>> How does this work? Doesn't it always go into the slowpath loop even if the preamble got the lock with no contention?
>
> It indeed always enters the slowpath loop, but only for a single pass through part of its body (the first compare in the body macro will make it exit the loop right away: 'token' is not only the ticket here, but the full lock->slock contents). But yes, I think you're right, one could avoid entering the body altogether by moving the containing loop into the if (!free) body. The logic went through a number of re-writes, so I must have overlooked that opportunity on the last round of adjustments.

I was thinking of something like this: (completely untested)

void __ticket_spin_lock(raw_spinlock_t *lock)
{
	unsigned short inc = 0x100;
	unsigned short token;

	do {
		unsigned count = 1 << 10;

		asm volatile (
			"  lock xaddw %w0, %1\n"
			"1: cmpb %h0, %b0\n"
			"   je 2f\n"
			"   rep ; nop\n"
			"   movb %1, %b0\n"
			/* don't need lfence here, because loads are in-order */
			"   sub $1,%2\n"
			"   jnz 1b\n"
			"2:"
			: "+Q" (inc), "+m" (lock->slock), "+r" (count)
			:
			: "memory", "cc");

		if (likely(count != 0))
			break;

		token = inc;
		inc = 0;
	} while (unlikely(!xen_spin_wait(lock, token)));
}

where xen_spin_wait() would actually be a pvops call, and perhaps the asm could be pulled out into an inline to deal with the 8/16 bit ticket difference.

	J
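For what it's worth, the even=stable/odd=unstable publication scheme described above might look roughly like this (invented names, schematic barriers; not the actual Xen or SUSE code, and kernel per-cpu/barrier headers are assumed):

/*
 * Hypothetical sketch: each cpu publishes which lock/ticket it is spinning
 * on; the kicking cpu reads that record without taking any lock, using a
 * seqlock-style version counter (even = stable, odd = update in progress).
 */
struct spinning {
	unsigned version;		/* even = stable, odd = being updated */
	struct raw_spinlock *lock;	/* lock this cpu is currently spinning on */
	unsigned ticket;		/* its ticket for that lock */
};

static DEFINE_PER_CPU(struct spinning, spinning_state);

/* Lock side: publish which lock/ticket we are about to spin on. */
static void publish_spinning(struct raw_spinlock *lock, unsigned ticket)
{
	struct spinning *s = &__get_cpu_var(spinning_state);

	s->version++;			/* now odd: readers will retry */
	wmb();
	s->lock = lock;
	s->ticket = ticket;
	wmb();
	s->version++;			/* even again: snapshot is stable */
}

/* Kick side: take a consistent, lock-free snapshot of another cpu's state. */
static int snapshot_spinning(int cpu, struct spinning *snap)
{
	struct spinning *s = &per_cpu(spinning_state, cpu);
	unsigned v;

	do {
		v = s->version;
		if (v & 1)
			return 0;	/* writer is mid-update; caller can retry */
		rmb();
		*snap = *s;
		rmb();
	} while (v != s->version);

	return 1;
}

The lock side would save and restore this record on its own stack around nested spins, as described above; since only the owning cpu ever writes it, the kick side needs no locked operations at all.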
Chuck Ebbert
2009-May-21 22:42 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
On Wed, 13 May 2009 17:16:55 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>	<_spin_lock+22>: retq
>
> The indirect call will get patched to:
>
>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>	<_spin_lock+22>: retq

Can't those calls be changed to jumps?
Jeremy Fitzhardinge
2009-May-21 22:48 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Chuck Ebbert wrote:
> On Wed, 13 May 2009 17:16:55 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>>	<_spin_lock+22>: retq
>>
>> The indirect call will get patched to:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>>	<_spin_lock+22>: retq
>
> Can't those calls be changed to jumps?

In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

	J
H. Peter Anvin
2009-May-21 23:10 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

"When it is immediately followed by a ret" seems like a straightforward rule to me?

	-hpa
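Expressed as code, that rule might look something like this inside a paravirt-style call patcher (hypothetical names; not the actual paravirt_patch_* implementation, and it assumes the site is at least 5 bytes and the target is within +/-2GB):

#include <stdint.h>
#include <string.h>

/*
 * Patch a pvops call site of 'len' bytes to branch directly to 'target'.
 * If the first instruction *after* the patchable region is a bare ret
 * (0xc3), the call can become a tail jump, eliding one call/return pair;
 * otherwise it must stay a call.
 */
static unsigned patch_direct_branch(uint8_t *site, unsigned len, void *target)
{
	int32_t rel = (int32_t)((uint8_t *)target - (site + 5));
	int tail = (site[len] == 0xc3);		/* immediately followed by ret? */

	site[0] = tail ? 0xe9 : 0xe8;		/* jmp rel32 : call rel32 */
	memcpy(site + 1, &rel, sizeof(rel));
	memset(site + 5, 0x90, len - 5);	/* nop out the rest of the site */
	return len;
}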
Xin, Xiaohui
2009-May-22 01:26 UTC
[Xen-devel] RE: Performance overhead of paravirt_ops on native identified
Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

Thanks
Xiaohui

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: May 22, 2009 6:49
To: Chuck Ebbert
Cc: Ingo Molnar; Xin, Xiaohui; Li, Xin; Nakajima, Jun; H. Peter Anvin; Nick Piggin; Linux Kernel Mailing List; Xen-devel
Subject: Re: Performance overhead of paravirt_ops on native identified

Chuck Ebbert wrote:
> On Wed, 13 May 2009 17:16:55 -0700 Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>> Paravirt patching turns all the pvops calls into direct calls, so _spin_lock etc do end up having direct calls. For example, the compiler-generated code for paravirtualized _spin_lock is:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  *0xffffffff805a5b30
>>	<_spin_lock+22>: retq
>>
>> The indirect call will get patched to:
>>
>>	<_spin_lock+0>:  mov    %gs:0xb4c8,%rax
>>	<_spin_lock+9>:  incl   0xffffffffffffe044(%rax)
>>	<_spin_lock+15>: callq  <__ticket_spin_lock>
>>	<_spin_lock+20>: nop; nop	/* or whatever 2-byte nop */
>>	<_spin_lock+22>: retq
>
> Can't those calls be changed to jumps?

In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

J
H. Peter Anvin
2009-May-22 03:39 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I didn't, no. That seems extremely weird to me. (Unbalancing the call/ret stack is known to suck royally, of course.)

>> Can't those calls be changed to jumps?
>
> In this specific instance of this example, yes. But if you start enabling various spinlock debug options then there'll be code following the call. It would be hard for the runtime patching machinery to know when it would be safe to do the substitution.

When there is code after the call, it's rather obviously not safe.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-22 04:27 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I don't think you had mentioned that. You're saying that a call->jmp->ret is slower than a call->call->ret->ret?

	J
Xin, Xiaohui
2009-May-22 05:59 UTC
[Xen-devel] RE: Performance overhead of paravirt_ops on native identified
What I mean is that if the binary of _spin_lock is like this:

(gdb) disassemble _spin_lock
Dump of assembler code for function _spin_lock:
0xffffffff80497c0f <_spin_lock+0>:	mov    1252634(%rip),%r11	# 0xffffffff805c9930 <test_lock_ops+16>
0xffffffff80497c16 <_spin_lock+7>:	jmpq   *%r11
End of assembler dump.
(gdb)

In this situation the binary contains a jump, and the overhead is more than the call.

Thanks
Xiaohui

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: May 22, 2009 12:28
To: Xin, Xiaohui
Cc: Chuck Ebbert; Ingo Molnar; Li, Xin; Nakajima, Jun; H. Peter Anvin; Nick Piggin; Linux Kernel Mailing List; Xen-devel
Subject: Re: Performance overhead of paravirt_ops on native identified

Xin, Xiaohui wrote:
> Remember we have done one experiment with "jump"; the result seems to show that the overhead is even more than the call.

I don't think you had mentioned that. You're saying that a call->jmp->ret is slower than a call->call->ret->ret?

J
H. Peter Anvin
2009-May-22 16:33 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Xin, Xiaohui wrote:
> What I mean is that if the binary of _spin_lock is like this:
>
> (gdb) disassemble _spin_lock
> Dump of assembler code for function _spin_lock:
> 0xffffffff80497c0f <_spin_lock+0>:	mov    1252634(%rip),%r11	# 0xffffffff805c9930 <test_lock_ops+16>
> 0xffffffff80497c16 <_spin_lock+7>:	jmpq   *%r11
> End of assembler dump.
>
> In this situation the binary contains a jump, and the overhead is more than the call.

That's an indirect jump, though. I don't think anyone was suggesting using an indirect jump; the final patched version should be a direct jump (instead of a direct call.)

I can see how indirect jumps might be slower, since they are probably not optimized as aggressively in hardware as indirect calls -- indirect jumps are generally used for switch tables, which often have low predictability, whereas indirect calls are generally used for method calls, which are (a) incredibly important for OOP languages, and (b) generally highly predictable on the dynamic scale.

However, direct jumps and calls don't need prediction at all (although of course rets do.)

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jeremy Fitzhardinge
2009-May-22 22:44 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
H. Peter Anvin wrote:
> That's an indirect jump, though. I don't think anyone was suggesting using an indirect jump; the final patched version should be a direct jump (instead of a direct call.)
>
> I can see how indirect jumps might be slower, since they are probably not optimized as aggressively in hardware as indirect calls -- indirect jumps are generally used for switch tables, which often have low predictability, whereas indirect calls are generally used for method calls, which are (a) incredibly important for OOP languages, and (b) generally highly predictable on the dynamic scale.
>
> However, direct jumps and calls don't need prediction at all (although of course rets do.)

I did a quick experiment to see how many sites this optimisation could actually affect. Firstly, it does absolutely nothing with frame pointers enabled. Arranging for no frame pointers is quite tricky, since it means disabling all debugging, tracing and other things.

With no frame pointers, about 26 of 5400 indirect calls are immediately followed by a ret (not all of those sites are pvops calls). With preempt disabled, this goes up to 45 sites.

I haven't done any actual runtime tests, but a quick survey of the affected sites shows that only a couple are performance-sensitive; _spin_lock, _spin_lock_irq and _spin_lock_irqsave are the most obvious.

	J
H. Peter Anvin
2009-May-22 22:47 UTC
[Xen-devel] Re: Performance overhead of paravirt_ops on native identified
Jeremy Fitzhardinge wrote:
> I did a quick experiment to see how many sites this optimisation could actually affect. Firstly, it does absolutely nothing with frame pointers enabled. Arranging for no frame pointers is quite tricky, since it means disabling all debugging, tracing and other things.
>
> With no frame pointers, about 26 of 5400 indirect calls are immediately followed by a ret (not all of those sites are pvops calls). With preempt disabled, this goes up to 45 sites.
>
> I haven't done any actual runtime tests, but a quick survey of the affected sites shows that only a couple are performance-sensitive; _spin_lock, _spin_lock_irq and _spin_lock_irqsave are the most obvious.

OK, that doesn't seem like a very productive avenue. Problem still remains, obviously.

	-hpa