Song Xiang
2010-Nov-13 10:56 UTC
[Xen-devel] [SPAM] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
Hi all,

This provides a patch to arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and
Xen 4.0.1 to mitigate the heavy contention on handle_pmt_io when running
an HVM guest configured with 32 cores on a 48-core HP machine
(8 * 6-core AMD 2.4GHz Opteron chips).

We used seven applications to profile a guest HVM (Debian GNU/Linux 5.0,
kernel version 2.6.35-rc5) on Xen 4.0.0 and Xen 4.0.1, and found that all
of them encountered heavy contention on a spin_lock inside handle_pmt_io
in Xen.

The patch is a workaround for eliminating the contention on
handle_pmt_io. As the virtual time must be fresh, someone has to update
it, but it is not necessary for a VCPU to update the virtual time when
another one is already doing so. Thus the update can be skipped when a
VCPU finds that someone else is updating the virtual time. The patch
substitutes spin_trylock for spin_lock to check whether someone is
holding the lock. If so, someone must be refreshing the virtual time, and
the others can simply skip the operation. Otherwise the lock is acquired
and the current VCPU updates the virtual time.

The performance improvement of each application (running on a 32-core
HVM, with one thread/process pinned to each core) after applying the
patch to Xen 4.0.0 and Xen 4.0.1 is as follows:

                                    original   patched   improvement
  gmake (sec)                          91.2       86.4       5.6%
  phoenix-histogram (sec)              32.79      27.43     19.5%
  phoenix-wordcount (sec)             279.22     232.85     19.9%
  phoenix-linear_regression (sec)       2.57       2.4       7.2%
  specjvm-compress (ops/min)          774.37     923.71     19.0%
  specjvm-crypto (ops/min)            209.55     251.79     20.0%
  specjvm-xml-validation (ops/min)    624.46     785.51     26.0%

  Performance of each application on a 32-core HVM on Xen 4.0.0

                                    original   patched   improvement
  gmake (sec)                          89.04      85.93      3.6%
  phoenix-histogram (sec)              42.63      28.27     50.8%
  phoenix-wordcount (sec)             280.38     238.93     17.3%
  phoenix-linear_regression (sec)       2.58       2.42      6.5%
  specjvm-compress (ops/min)          751.33     923.84     23.0%
  specjvm-crypto (ops/min)            209.33     243.28     16.2%
  specjvm-xml-validation (ops/min)    620.41     772.25     24.5%

  Performance of each application on a 32-core HVM on Xen 4.0.1

For more details, please refer to our technical report:
http://ppi.fudan.edu.cn/_media/xen-profile-report.pdf?id=system_research_group&cache=cache

The patch is the same for Xen 4.0.0 and Xen 4.0.1:

Index: arch/x86/hvm/pmtimer.c
===================================================================
--- arch/x86/hvm/pmtimer.c	(revision 4651)
+++ arch/x86/hvm/pmtimer.c	(working copy)
@@ -206,10 +206,17 @@

     if ( dir == IOREQ_READ )
     {
-        spin_lock(&s->lock);
-        pmt_update_time(s);
-        *val = s->pm.tmr_val;
-        spin_unlock(&s->lock);
+        /*
+         * if acquired the PMTState lock then update the time
+         * else other vcpu is updating it, it should be up to date.
+         */
+        if (spin_trylock(&s->lock)) {
+            pmt_update_time(s);
+            *val = s->pm.tmr_val;
+            spin_unlock(&s->lock);
+        }
+        else
+            *val = (s->pm.tmr_val & TMR_VAL_MASK);
         return X86EMUL_OKAY;
     }
Jan Beulich
2010-Nov-15 08:28 UTC
Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 13.11.10 at 11:56, Song Xiang <classicxsong@gmail.com> wrote:
> --- arch/x86/hvm/pmtimer.c	(revision 4651)
> +++ arch/x86/hvm/pmtimer.c	(working copy)
> @@ -206,10 +206,17 @@
>
>      if ( dir == IOREQ_READ )
>      {
> -        spin_lock(&s->lock);
> -        pmt_update_time(s);
> -        *val = s->pm.tmr_val;
> -        spin_unlock(&s->lock);
> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +        }
> +        else
> +            *val = (s->pm.tmr_val & TMR_VAL_MASK);
>          return X86EMUL_OKAY;
>      }

I don't think this is correct: While it seems reasonable to skip the
global update, returning potentially stale time to the guest isn't.
You'd need to mimic what pmt_update_time() does, just without
updating global state. That, however, isn't going to be that
trivial as you need to also read global state for doing the
calculations.

Jan
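For illustration, a lock-free read path of the kind Jan describes might
look roughly like the sketch below. This is only a sketch: the use of
hvm_get_guest_time(), last_gtime, scale and TMR_VAL_MASK mirrors the
locked update path in pmtimer.c being discussed, but it is an assumption
rather than a tested change, and reading those fields without the lock
raises exactly the consistency questions Jan hints at.

    /*
     * Hypothetical read-only variant of pmt_update_time(): derive a
     * fresh TMR value from unlocked snapshots of the shared state
     * without writing anything back.  Not a drop-in patch.
     */
    static uint32_t pmt_read_time_nolock(PMTState *s)
    {
        uint64_t last = s->last_gtime;               /* unlocked snapshot */
        uint32_t val  = s->pm.tmr_val;               /* unlocked snapshot */
        uint64_t curr = hvm_get_guest_time(s->vcpu); /* current guest time */

        /* Same scaling as the locked path, but no global state changes. */
        val += ((curr - last) * s->scale) >> 32;
        return val & TMR_VAL_MASK;
    }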
Keir Fraser
2010-Nov-16 07:48 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 14:51, "Song Xiang" <classicxsong@gmail.com> wrote:

> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        tmp = atomic_read(&s->ownership);
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +            atomic_inc(&s->ownership);
> +        }
> +        else {
> +            while (tmp == atomic_read(&s->ownership));

You've kind of implemented a spin_barrier(). What you implemented could
be better and equivalently done as something like:

    if (spin_trylock(&s->lock)) {
        ...
    } else {
        spin_barrier(&s->lock);
    }

No need for your new field at all! It initially seems weird that this
performs much better than the original code, but I guess it might: if all
VCPUs are piling in here at the same time, rather than having to execute
one by one, we'll have one go first and then the others will all execute
simultaneously read-only in convoy...

Anyway, please modify your patch as I suggest above, and then also
confirm what speedups you get with the revised patch. I want to know what
the wins are before applying optimisation hacks. I don't particularly
have an issue with them as long as they work! :-)

 -- Keir
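Put together, the revision Keir is asking for would look something like
the sketch below (illustrative only; it assumes spin_barrier() simply
waits until the lock has been released, as its name suggests):

    if ( dir == IOREQ_READ )
    {
        if ( spin_trylock(&s->lock) )
        {
            /* We won the race: refresh the shared timer value ourselves. */
            pmt_update_time(s);
            *val = s->pm.tmr_val;
            spin_unlock(&s->lock);
        }
        else
        {
            /* Another VCPU is updating it: wait for that update to finish. */
            spin_barrier(&s->lock);
            *val = s->pm.tmr_val;
        }
        return X86EMUL_OKAY;
    }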
Jan Beulich
2010-Nov-16 07:50 UTC
[SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 16.11.10 at 15:51, Song Xiang <classicxsong@gmail.com> wrote:
> We updated our patch as follows:
>
> Index: arch/x86/hvm/pmtimer.c
> ===================================================================
> --- arch/x86/hvm/pmtimer.c
> +++ arch/x86/hvm/pmtimer.c
> @@ -206,13 +206,23 @@
>
>      if ( dir == IOREQ_READ )
>      {
> -        spin_lock(&s->lock);
> -        pmt_update_time(s);
> -        *val = s->pm.tmr_val;
> -        spin_unlock(&s->lock);
> +        uint32_t tmp;
> +        /*
> +         * if acquired the PMTState lock then update the time
> +         * else other vcpu is updating it, it should be up to date.
> +         */
> +        tmp = atomic_read(&s->ownership);
> +        if (spin_trylock(&s->lock)) {
> +            pmt_update_time(s);
> +            *val = s->pm.tmr_val;
> +            spin_unlock(&s->lock);
> +            atomic_inc(&s->ownership);
> +        }
> +        else {
> +            while (tmp == atomic_read(&s->ownership));
> +
> +            *val = s->pm.tmr_val;
> +        }
>          return X86EMUL_OKAY;
>      }
>
> Index: include/asm-x86/hvm/vpt.h
> ===================================================================
> --- include/asm-x86/hvm/vpt.h
> +++ include/asm-x86/hvm/vpt.h
> @@ -120,6 +120,7 @@
>      uint64_t scale;        /* Multiplier to get from tsc to timer ticks */
>      struct timer timer;    /* To make sure we send SCIs */
>      spinlock_t lock;
> +    atomic_t ownership;
>  } PMTState;
>
>  struct pl_time {    /* platform time */
>
>
> The key idea is, to keep the returned time fresh, any VCPU that fails
> to acquire the PMTState lock will wait until the PMTState lock holder
> updates the global virtual time, and then return the freshest time.
>
> We add a field in PMTState, named ownership. PMTState->ownership can
> only be updated by the PMTState lock holder. It is updated after the
> lock holder has updated the global virtual time and released the
> PMTState lock. Other VCPUs that fail to acquire the PMTState lock check
> in a while loop whether PMTState->ownership has been updated. When
> PMTState->ownership changes, the global virtual time must be the
> freshest and can be returned to the guest OS.
>
> The time returned to the guest in this patch is fresher than in the
> previous one we proposed.

That's not better: Depending on timing, the reader may well loop until
the *next* acquirer releases the lock (and increments "ownership").

Jan
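One interleaving that appears to match Jan's objection (the VCPU labels
R and W are illustrative):

    /*
     * R samples "ownership" just before a new writer W takes the lock,
     * so R spins through W's entire critical section even though the
     * time R wanted was already fresh:
     *
     *   R: tmp = atomic_read(&s->ownership);          -> tmp == N
     *   W: spin_trylock(&s->lock) succeeds
     *   R: spin_trylock(&s->lock) fails
     *   R: while (tmp == atomic_read(&s->ownership)); -> spins, still N
     *   W: pmt_update_time(s); spin_unlock(&s->lock);
     *   W: atomic_inc(&s->ownership);                 -> N+1, R exits
     */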
Keir Fraser
2010-Nov-16 07:51 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 07:48, "Keir Fraser" <keir@xen.org> wrote:

> Anyway, please modify your patch as I suggest above, and then also confirm
> what speedups you get with the revised patch. I want to know what the wins
> are before applying optimisation hacks. I don't particularly have an issue
> with them as long as they work! :-)

Oh, also, if your patch is intended to be applied straight away, you also
need to always provide a suitable patch description and Signed-off-by
line(s) directly above the patch in your email. Otherwise I tend to
silently ignore the email.

 -- Keir
Jan Beulich
2010-Nov-16 07:54 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
>>> On 16.11.10 at 08:48, Keir Fraser <keir@xen.org> wrote:
> On 16/11/2010 14:51, "Song Xiang" <classicxsong@gmail.com> wrote:
>
>> +        /*
>> +         * if acquired the PMTState lock then update the time
>> +         * else other vcpu is updating it, it should be up to date.
>> +         */
>> +        tmp = atomic_read(&s->ownership);
>> +        if (spin_trylock(&s->lock)) {
>> +            pmt_update_time(s);
>> +            *val = s->pm.tmr_val;
>> +            spin_unlock(&s->lock);
>> +            atomic_inc(&s->ownership);
>> +        }
>> +        else {
>> +            while (tmp == atomic_read(&s->ownership));
>
> You've kind of implemented a spin_barrier(). What you implemented could be
> better and equivalently done as something like:
>
>     if (spin_trylock(&s->lock)) {
>         ...
>     } else {
>         spin_barrier(&s->lock);
>     }
>
> No need for your new field at all! It initially seems weird that this
> performs much better than the original code, but I guess it might: if all
> VCPUs are piling in here at the same time, rather than having to execute one
> by one, we'll have one go first and then the others will all execute
> simultaneously read-only in convoy...

They'd need to re-measure whether this still provides a meaningful
benefit, I think.

Jan
Keir Fraser
2010-Nov-16 08:08 UTC
Re: [SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
On 16/11/2010 07:54, "Jan Beulich" <JBeulich@novell.com> wrote:

>> No need for your new field at all! It initially seems weird that this
>> performs much better than the original code, but I guess it might: if all
>> VCPUs are piling in here at the same time, rather than having to execute one
>> by one, we'll have one go first and then the others will all execute
>> simultaneously read-only in convoy...
>
> They'd need to re-measure whether this still provides a meaningful
> benefit, I think.

Definitely. I explicitly asked for that.

Oh, also the approach is currently buggy. Although s->pm.tmr_val is
atomic to read, it is not atomically updated in pmt_update_time(). Need
to calculate the new value for tmr_val in a local variable, then update
the shared field with an atomic_set()-alike
*(volatile u32 *)&s->pm.tmr_val = x type of arrangement. For all this
extra fragility, the perf win needs to be compelling.

 -- Keir
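As a sketch of the arrangement Keir describes (the surrounding
pmt_update_time() body is abbreviated and partly assumed; only the
local-compute-then-publish step is the point here):

    /* Compute the new counter value locally first... */
    uint32_t x;
    uint64_t curr_gtime = hvm_get_guest_time(s->vcpu);

    x = (s->pm.tmr_val + (((curr_gtime - s->last_gtime) * s->scale) >> 32))
        & TMR_VAL_MASK;

    /*
     * ...then publish it with a single 32-bit store so that unlocked
     * readers never observe a half-updated value.
     */
    *(volatile uint32_t *)&s->pm.tmr_val = x;
    s->last_gtime = curr_gtime;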
Song Xiang
2010-Nov-16 14:51 UTC
[SPAM] Re: [Xen-devel] A patch to xen/arch/x86/hvm/pmtimer.c for both Xen 4.0.0 and Xen 4.0.1 to improve HVM scalability
We updated our patch as follows:

Index: arch/x86/hvm/pmtimer.c
===================================================================
--- arch/x86/hvm/pmtimer.c
+++ arch/x86/hvm/pmtimer.c
@@ -206,13 +206,23 @@

     if ( dir == IOREQ_READ )
     {
-        spin_lock(&s->lock);
-        pmt_update_time(s);
-        *val = s->pm.tmr_val;
-        spin_unlock(&s->lock);
+        uint32_t tmp;
+        /*
+         * if acquired the PMTState lock then update the time
+         * else other vcpu is updating it, it should be up to date.
+         */
+        tmp = atomic_read(&s->ownership);
+        if (spin_trylock(&s->lock)) {
+            pmt_update_time(s);
+            *val = s->pm.tmr_val;
+            spin_unlock(&s->lock);
+            atomic_inc(&s->ownership);
+        }
+        else {
+            while (tmp == atomic_read(&s->ownership));
+
+            *val = s->pm.tmr_val;
+        }
         return X86EMUL_OKAY;
     }

Index: include/asm-x86/hvm/vpt.h
===================================================================
--- include/asm-x86/hvm/vpt.h
+++ include/asm-x86/hvm/vpt.h
@@ -120,6 +120,7 @@
     uint64_t scale;        /* Multiplier to get from tsc to timer ticks */
     struct timer timer;    /* To make sure we send SCIs */
     spinlock_t lock;
+    atomic_t ownership;
 } PMTState;

 struct pl_time {    /* platform time */


The key idea is, to keep the returned time fresh, any VCPU that fails
to acquire the PMTState lock will wait until the PMTState lock holder
updates the global virtual time, and then return the freshest time.

We add a field in PMTState, named ownership. PMTState->ownership can
only be updated by the PMTState lock holder. It is updated after the
lock holder has updated the global virtual time and released the
PMTState lock. Other VCPUs that fail to acquire the PMTState lock check
in a while loop whether PMTState->ownership has been updated. When
PMTState->ownership changes, the global virtual time must be the
freshest and can be returned to the guest OS.

The time returned to the guest in this patch is fresher than in the
previous one we proposed.

On 2010-11-15, at 8:28 AM, Jan Beulich wrote:

>>>> On 13.11.10 at 11:56, Song Xiang <classicxsong@gmail.com> wrote:
>> --- arch/x86/hvm/pmtimer.c	(revision 4651)
>> +++ arch/x86/hvm/pmtimer.c	(working copy)
>> @@ -206,10 +206,17 @@
>>
>>      if ( dir == IOREQ_READ )
>>      {
>> -        spin_lock(&s->lock);
>> -        pmt_update_time(s);
>> -        *val = s->pm.tmr_val;
>> -        spin_unlock(&s->lock);
>> +        /*
>> +         * if acquired the PMTState lock then update the time
>> +         * else other vcpu is updating it, it should be up to date.
>> +         */
>> +        if (spin_trylock(&s->lock)) {
>> +            pmt_update_time(s);
>> +            *val = s->pm.tmr_val;
>> +            spin_unlock(&s->lock);
>> +        }
>> +        else
>> +            *val = (s->pm.tmr_val & TMR_VAL_MASK);
>>          return X86EMUL_OKAY;
>>      }
>
> I don't think this is correct: While it seems reasonable to skip the
> global update, returning potentially stale time to the guest isn't.
> You'd need to mimic what pmt_update_time() does, just without
> updating global state. That, however, isn't going to be that
> trivial as you need to also read global state for doing the
> calculations.
>
> Jan