Song Xiang
2010-Nov-13 10:56 UTC
[Xen-devel] [SPAM] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
Hi all,

This provides a patch to arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and
Xen 4.0.1 to mitigate the heavy contention on handle_pmt_io when running
an HVM guest configured with 32 cores on a 48-core HP machine
(8 * 6-core AMD 2.4GHz Opteron chips).

We used seven applications to profile a guest HVM (Debian GNU/Linux 5.0,
kernel version 2.6.35-rc5) on Xen 4.0.0 and Xen 4.0.1, and found that all
of them encountered heavy contention on a spin_lock inside handle_pmt_io
in Xen.

The patch is a workaround for eliminating the contention on
handle_pmt_io. As the virtual time must be fresh, someone has to update
it, but it is not necessary for a VCPU to update the virtual time when
another one is already doing so. Thus the update can be skipped when a
VCPU finds that someone else is updating the virtual time. The patch
substitutes spin_trylock for spin_lock to check whether someone is
holding the lock. If so, someone must be refreshing the virtual time, and
the others can simply skip the operation. Otherwise the lock is acquired
and the current VCPU updates the virtual time.

The performance improvement of each application (running on a 32-core
HVM, with one thread/process pinned to each core) after applying the
patch to Xen 4.0.0 and Xen 4.0.1 is as follows:

                                    original   patched   improvement
  gmake (sec)                          91.2       86.4       5.6%
  phoenix-histogram (sec)              32.79      27.43     19.5%
  phoenix-wordcount (sec)             279.22     232.85     19.9%
  phoenix-linear_regression (sec)       2.57       2.4       7.2%
  specjvm-compress (ops/min)          774.37     923.71     19.0%
  specjvm-crypto (ops/min)            209.55     251.79     20.0%
  specjvm-xml-validation (ops/min)    624.46     785.51     26.0%

  Performance of each application on a 32-core HVM on Xen 4.0.0

                                    original   patched   improvement
  gmake (sec)                          89.04      85.93      3.6%
  phoenix-histogram (sec)              42.63      28.27     50.8%
  phoenix-wordcount (sec)             280.38     238.93     17.3%
  phoenix-linear_regression (sec)       2.58       2.42      6.5%
  specjvm-compress (ops/min)          751.33     923.84     23.0%
  specjvm-crypto (ops/min)            209.33     243.28     16.2%
  specjvm-xml-validation (ops/min)    620.41     772.25     24.5%

  Performance of each application on a 32-core HVM on Xen 4.0.1

For more details, please refer to our technical report:
http://ppi.fudan.edu.cn/_media/xen-profile-report.pdf?id=system_research_group&cache=cache

The patch is the same for Xen 4.0.0 and Xen 4.0.1:

Index: arch/x86/hvm/pmtimer.c
===================================================================
--- arch/x86/hvm/pmtimer.c	(revision 4651)
+++ arch/x86/hvm/pmtimer.c	(working copy)
@@ -206,10 +206,17 @@

     if ( dir == IOREQ_READ )
     {
-        spin_lock(&s->lock);
-        pmt_update_time(s);
-        *val = s->pm.tmr_val;
-        spin_unlock(&s->lock);
+        /*
+         * if acquired the PMTState lock then update the time
+         * else other vcpu is updating it, it should be up to date.
+         */
+        if (spin_trylock(&s->lock)) {
+            pmt_update_time(s);
+            *val = s->pm.tmr_val;
+            spin_unlock(&s->lock);
+        }
+        else
+            *val = (s->pm.tmr_val & TMR_VAL_MASK);
         return X86EMUL_OKAY;
     }
Jan Beulich
2010-Nov-15 08:28 UTC
Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 13.11.10 at 11:56, Song Xiang <classicxsong@gmail.com> wrote:
> --- arch/x86/hvm/pmtimer.c	(revision 4651)
> +++ arch/x86/hvm/pmtimer.c	(working copy)
> @@ -206,10 +206,17 @@
>
>      if ( dir == IOREQ_READ )
>      {
> -        spin_lock(&s->lock);
> -        pmt_update_time(s);
> -        *val = s->pm.tmr_val;
> -        spin_unlock(&s->lock);
> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +        }
> +        else
> +            *val = (s->pm.tmr_val & TMR_VAL_MASK);
>          return X86EMUL_OKAY;
>      }

I don't think this is correct: While it seems reasonable to skip the
global update, returning potentially stale time to the guest isn't.
You'd need to mimic what pmt_update_time() does, just without
updating global state. That, however, isn't going to be that
trivial as you need to also read global state for doing the
calculations.

Jan
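For illustration, a lock-free read path of the kind Jan describes might
look roughly like the sketch below. This is only a sketch: the use of
hvm_get_guest_time(), last_gtime, scale and TMR_VAL_MASK mirrors the
locked update path in pmtimer.c being discussed, but it is an assumption
rather than a tested change, and reading those fields without the lock
raises exactly the consistency questions Jan hints at.

    /*
     * Hypothetical read-only variant of pmt_update_time(): derive a
     * fresh TMR value from unlocked snapshots of the shared state
     * without writing anything back.  Not a drop-in patch.
     */
    static uint32_t pmt_read_time_nolock(PMTState *s)
    {
        uint64_t last = s->last_gtime;               /* unlocked snapshot */
        uint32_t val  = s->pm.tmr_val;               /* unlocked snapshot */
        uint64_t curr = hvm_get_guest_time(s->vcpu); /* current guest time */

        /* Same scaling as the locked path, but no global state changes. */
        val += ((curr - last) * s->scale) >> 32;
        return val & TMR_VAL_MASK;
    }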
Keir Fraser
2010-Nov-16 07:48 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 14:51, "Song Xiang" <classicxsong@gmail.com> wrote:

> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        tmp = atomic_read(&s->ownership);
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +            atomic_inc(&s->ownership);
> +        }
> +        else {
> +            while (tmp == atomic_read(&s->ownership));

You've kind of implemented a spin_barrier(). What you implemented could
be better and equivalently done as something like:

    if (spin_trylock(&s->lock)) {
        ...
    } else {
        spin_barrier(&s->lock);
    }

No need for your new field at all! It initially seems weird that this
performs much better than the original code, but I guess it might: if all
VCPUs are piling in here at the same time, rather than having to execute
one by one, we'll have one go first and then the others will all execute
simultaneously read-only in convoy...

Anyway, please modify your patch as I suggest above, and then also
confirm what speedups you get with the revised patch. I want to know what
the wins are before applying optimisation hacks. I don't particularly
have an issue with them as long as they work! :-)

 -- Keir
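Put together, the revision Keir is asking for would look something like
the sketch below (illustrative only; it assumes spin_barrier() simply
waits until the lock has been released, as its name suggests):

    if ( dir == IOREQ_READ )
    {
        if ( spin_trylock(&s->lock) )
        {
            /* We won the race: refresh the shared timer value ourselves. */
            pmt_update_time(s);
            *val = s->pm.tmr_val;
            spin_unlock(&s->lock);
        }
        else
        {
            /* Another VCPU is updating it: wait for that update to finish. */
            spin_barrier(&s->lock);
            *val = s->pm.tmr_val;
        }
        return X86EMUL_OKAY;
    }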
Jan Beulich
2010-Nov-16 07:50 UTC
[SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 16.11.10 at 15:51, Song Xiang <classicxsong@gmail.com> wrote:
> We updated our patch as follows:
>
> Index: arch/x86/hvm/pmtimer.c
> ===================================================================
> --- arch/x86/hvm/pmtimer.c
> +++ arch/x86/hvm/pmtimer.c
> @@ -206,13 +206,23 @@
>
>      if ( dir == IOREQ_READ )
>      {
> -        spin_lock(&s->lock);
> -        pmt_update_time(s);
> -        *val = s->pm.tmr_val;
> -        spin_unlock(&s->lock);
> +        uint32_t tmp;
> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        tmp = atomic_read(&s->ownership);
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +            atomic_inc(&s->ownership);
> +        }
> +        else {
> +            while (tmp == atomic_read(&s->ownership));
> +
> +            *val = s->pm.tmr_val;
> +        }
>          return X86EMUL_OKAY;
>      }
>
> Index: include/asm-x86/hvm/vpt.h
> ===================================================================
> --- include/asm-x86/hvm/vpt.h
> +++ include/asm-x86/hvm/vpt.h
> @@ -120,6 +120,7 @@
>      uint64_t scale;        /* Multiplier to get from tsc to timer ticks */
>      struct timer timer;    /* To make sure we send SCIs */
>      spinlock_t lock;
> +    atomic_t ownership;
>  } PMTState;
>
>  struct pl_time {    /* platform time */
>
>
> The key idea is, to keep the returned time fresh, any VCPU that fails
> to acquire the PMTState lock will wait until the PMTState lock holder
> updates the global virtual time, and then return the freshest time.
>
> We add a field in PMTState, named ownership. PMTState->ownership can
> only be updated by the PMTState lock holder. It is updated after the
> lock holder has updated the global virtual time and released the
> PMTState lock. Other VCPUs that fail to acquire the PMTState lock check
> in a while loop whether PMTState->ownership has been updated. When
> PMTState->ownership changes, the global virtual time must be the
> freshest and can be returned to the guest OS.
>
> The time returned to the guest in this patch is fresher than in the
> previous one we proposed.

That's not better: Depending on timing, the reader may well loop until
the *next* acquirer releases the lock (and increments "ownership").

Jan
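One interleaving that appears to match Jan's objection (the VCPU labels
R and W are illustrative):

    /*
     * R samples "ownership" just before a new writer W takes the lock,
     * so R spins through W's entire critical section even though the
     * time R wanted was already fresh:
     *
     *   R: tmp = atomic_read(&s->ownership);          -> tmp == N
     *   W: spin_trylock(&s->lock) succeeds
     *   R: spin_trylock(&s->lock) fails
     *   R: while (tmp == atomic_read(&s->ownership)); -> spins, still N
     *   W: pmt_update_time(s); spin_unlock(&s->lock);
     *   W: atomic_inc(&s->ownership);                 -> N+1, R exits
     */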
Keir Fraser
2010-Nov-16 07:51 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 07:48, "Keir Fraser" <keir@xen.org> wrote:

> Anyway, please modify your patch as I suggest above, and then also confirm
> what speedups you get with the revised patch. I want to know what the wins
> are before applying optimisation hacks. I don't particularly have an issue
> with them as long as they work! :-)

Oh, also, if your patch is intended to be applied straight away, you also
need to always provide a suitable patch description and Signed-off-by
line(s) directly above the patch in your email. Otherwise I tend to
silently ignore the email.

 -- Keir
Jan Beulich
2010-Nov-16 07:54 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 16.11.10 at 08:48, Keir Fraser <keir@xen.org> wrote:
> On 16/11/2010 14:51, "Song Xiang" <classicxsong@gmail.com> wrote:
>
>> +        /*
>> +         * if acquired the PMTState lock then update the time
>> +         * else other vcpu is updating it, it should be up to date.
>> +         */
>> +        tmp = atomic_read(&s->ownership);
>> +        if (spin_trylock(&s->lock)) {
>> +            pmt_update_time(s);
>> +            *val = s->pm.tmr_val;
>> +            spin_unlock(&s->lock);
>> +            atomic_inc(&s->ownership);
>> +        }
>> +        else {
>> +            while (tmp == atomic_read(&s->ownership));
>
> You've kind of implemented a spin_barrier(). What you implemented could be
> better and equivalently done as something like:
>
>     if (spin_trylock(&s->lock)) {
>         ...
>     } else {
>         spin_barrier(&s->lock);
>     }
>
> No need for your new field at all! It initially seems weird that this
> performs much better than the original code, but I guess it might: if all
> VCPUs are piling in here at the same time, rather than having to execute one
> by one, we'll have one go first and then the others will all execute
> simultaneously read-only in convoy...

They'd need to re-measure whether this still provides a meaningful
benefit, I think.

Jan
Keir Fraser
2010-Nov-16 08:08 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 07:54, "Jan Beulich" <JBeulich@novell.com> wrote:

>> No need for your new field at all! It initially seems weird that this
>> performs much better than the original code, but I guess it might: if all
>> VCPUs are piling in here at the same time, rather than having to execute one
>> by one, we'll have one go first and then the others will all execute
>> simultaneously read-only in convoy...
>
> They'd need to re-measure whether this still provides a meaningful
> benefit, I think.

Definitely. I explicitly asked for that.

Oh, also the approach is currently buggy. Although s->pm.tmr_val is
atomic to read, it is not atomically updated in pmt_update_time(). Need
to calculate the new value for tmr_val in a local variable, then update
the shared field with an atomic_set()-alike
*(volatile u32 *)&s->pm.tmr_val = x type of arrangement. For all this
extra fragility, the perf win needs to be compelling.

 -- Keir
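As a sketch of the arrangement Keir describes (the surrounding
pmt_update_time() body is abbreviated and partly assumed; only the
local-compute-then-publish step is the point here):

    /* Compute the new counter value locally first... */
    uint32_t x;
    uint64_t curr_gtime = hvm_get_guest_time(s->vcpu);

    x = (s->pm.tmr_val + (((curr_gtime - s->last_gtime) * s->scale) >> 32))
        & TMR_VAL_MASK;

    /*
     * ...then publish it with a single 32-bit store so that unlocked
     * readers never observe a half-updated value.
     */
    *(volatile uint32_t *)&s->pm.tmr_val = x;
    s->last_gtime = curr_gtime;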
Song Xiang
2010-Nov-16 14:51 UTC
[SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
We updated our patch as follows:

Index: arch/x86/hvm/pmtimer.c
===================================================================
--- arch/x86/hvm/pmtimer.c
+++ arch/x86/hvm/pmtimer.c
@@ -206,13 +206,23 @@

     if ( dir == IOREQ_READ )
     {
-        spin_lock(&s->lock);
-        pmt_update_time(s);
-        *val = s->pm.tmr_val;
-        spin_unlock(&s->lock);
+        uint32_t tmp;
+        /*
+         * if acquired the PMTState lock then update the time
+         * else other vcpu is updating it, it should be up to date.
+         */
+        tmp = atomic_read(&s->ownership);
+        if (spin_trylock(&s->lock)) {
+            pmt_update_time(s);
+            *val = s->pm.tmr_val;
+            spin_unlock(&s->lock);
+            atomic_inc(&s->ownership);
+        }
+        else {
+            while (tmp == atomic_read(&s->ownership));
+
+            *val = s->pm.tmr_val;
+        }
         return X86EMUL_OKAY;
     }

Index: include/asm-x86/hvm/vpt.h
===================================================================
--- include/asm-x86/hvm/vpt.h
+++ include/asm-x86/hvm/vpt.h
@@ -120,6 +120,7 @@
     uint64_t scale;        /* Multiplier to get from tsc to timer ticks */
     struct timer timer;    /* To make sure we send SCIs */
     spinlock_t lock;
+    atomic_t ownership;
 } PMTState;

 struct pl_time {    /* platform time */


The key idea is, to keep the returned time fresh, any VCPU that fails
to acquire the PMTState lock will wait until the PMTState lock holder
updates the global virtual time, and then return the freshest time.

We add a field in PMTState, named ownership. PMTState->ownership can
only be updated by the PMTState lock holder. It is updated after the
lock holder has updated the global virtual time and released the
PMTState lock. Other VCPUs that fail to acquire the PMTState lock check
in a while loop whether PMTState->ownership has been updated. When
PMTState->ownership changes, the global virtual time must be the
freshest and can be returned to the guest OS.

The time returned to the guest in this patch is fresher than in the
previous one we proposed.

On 2010-11-15, at 8:28 AM, Jan Beulich wrote:

>>>> On 13.11.10 at 11:56, Song Xiang <classicxsong@gmail.com> wrote:
>> --- arch/x86/hvm/pmtimer.c	(revision 4651)
>> +++ arch/x86/hvm/pmtimer.c	(working copy)
>> @@ -206,10 +206,17 @@
>>
>>      if ( dir == IOREQ_READ )
>>      {
>> -        spin_lock(&s->lock);
>> -        pmt_update_time(s);
>> -        *val = s->pm.tmr_val;
>> -        spin_unlock(&s->lock);
>> +        /*
>> +         * if acquired the PMTState lock then update the time
>> +         * else other vcpu is updating it, it should be up to date.
>> +         */
>> +        if (spin_trylock(&s->lock)) {
>> +            pmt_update_time(s);
>> +            *val = s->pm.tmr_val;
>> +            spin_unlock(&s->lock);
>> +        }
>> +        else
>> +            *val = (s->pm.tmr_val & TMR_VAL_MASK);
>>          return X86EMUL_OKAY;
>>      }
>
> I don't think this is correct: While it seems reasonable to skip the
> global update, returning potentially stale time to the guest isn't.
> You'd need to mimic what pmt_update_time() does, just without
> updating global state. That, however, isn't going to be that
> trivial as you need to also read global state for doing the
> calculations.
>
> Jan