Wei Huang
2011-May-03 20:17 UTC
[Xen-devel] [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
FPU: create lazy and non-lazy FPU restore functions Currently Xen relies on #NM (via CR0.TS) to trigger FPU context restore. But not all FPU state is tracked by TS bit. This function creates two FPU restore functions: vcpu_restore_fpu_lazy() and vcpu_restore_fpu_eager(). vcpu_restore_fpu_lazy() is still used when #NM is triggered. vcpu_restore_fpu_eager(), as a comparision, is called for vcpu which is being scheduled in on every context switch. Signed-off-by: Wei Huang <wei.huang2@amd.com> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2011-May-04 07:09 UTC
[Xen-devel] Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
>>> On 03.05.11 at 22:17, Wei Huang <wei.huang2@amd.com> wrote:Again as pointed out earlier, ...>--- a/xen/arch/x86/domain.c Tue May 03 13:49:27 2011 -0500 >+++ b/xen/arch/x86/domain.c Tue May 03 13:59:37 2011 -0500 >@@ -1578,6 +1578,7 @@ > memcpy(stack_regs, &n->arch.user_regs, CTXT_SWITCH_STACK_BYTES); > if ( xsave_enabled(n) && n->arch.xcr0 != get_xcr0() ) > set_xcr0(n->arch.xcr0); >+ vcpu_restore_fpu_eager(n);... this call is unconditional, ...> n->arch.ctxt_switch_to(n); > } > >--- a/xen/arch/x86/i387.c Tue May 03 13:49:27 2011 -0500 >+++ b/xen/arch/x86/i387.c Tue May 03 13:59:37 2011 -0500 >@@ -160,10 +160,25 @@ > /*******************************/ > /* VCPU FPU Functions */ > /*******************************/ >+/* Restore FPU state whenever VCPU is schduled in. */ >+void vcpu_restore_fpu_eager(struct vcpu *v) >+{ >+ ASSERT(!is_idle_vcpu(v)); >+ >+ /* Avoid recursion */ >+ clts(); >+ >+ /* save the nonlazy extended state which is not tracked by CR0.TS bit */ >+ if ( xsave_enabled(v) ) >+ fpu_xrstor(v, XSTATE_NONLAZY); >+ >+ stts();... while here you do an unconditional clts followed by an xrstor only checking whether xsave is enabled (but not checking whether there''s any non-lazy state to be restored) and, possibly the most expensive of all, an unconditional write of CR0. Jan>+} >+ > /* > * Restore FPU state when #NM is triggered. > */ >-void vcpu_restore_fpu(struct vcpu *v) >+void vcpu_restore_fpu_lazy(struct vcpu *v) > { > ASSERT(!is_idle_vcpu(v)); >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Wei Huang
2011-May-04 16:33 UTC
[Xen-devel] Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
Checking whether there is a non-lazy state to save is architectural specific and very messy. For instance, we need to read LWP_CBADDR to confirm LWP''s dirty state. This MSR is AMD specific and we don''t want to add it here. Plus reading data from LWP_CBADDR MSR might be as expensive as clts/stts. My previous email showed that the overhead with LWP is around 1%-2% of __context_switch(). For non lwp-capable CPU, this overhead should be much smaller (only clts and stts) because xfeature_mask[LWP] is 0. Yes, clts() and stts() don''t have to called every time. How about this one? /* Restore FPU state whenever VCPU is schduled in. */ void vcpu_restore_fpu_eager(struct vcpu *v) { ASSERT(!is_idle_vcpu(v)); /* save the nonlazy extended state which is not tracked by CR0.TS bit */ if ( xsave_enabled(v) ) { /* Avoid recursion */ clts(); fpu_xrstor(v, XSTATE_NONLAZY); stts(); } . On 05/04/2011 02:09 AM, Jan Beulich wrote:>>>> On 03.05.11 at 22:17, Wei Huang<wei.huang2@amd.com> wrote: > Again as pointed out earlier, ... > >> --- a/xen/arch/x86/domain.c Tue May 03 13:49:27 2011 -0500 >> +++ b/xen/arch/x86/domain.c Tue May 03 13:59:37 2011 -0500 >> @@ -1578,6 +1578,7 @@ >> memcpy(stack_regs,&n->arch.user_regs, CTXT_SWITCH_STACK_BYTES); >> if ( xsave_enabled(n)&& n->arch.xcr0 != get_xcr0() ) >> set_xcr0(n->arch.xcr0); >> + vcpu_restore_fpu_eager(n); > ... this call is unconditional, ... > >> n->arch.ctxt_switch_to(n); >> } >> >> --- a/xen/arch/x86/i387.c Tue May 03 13:49:27 2011 -0500 >> +++ b/xen/arch/x86/i387.c Tue May 03 13:59:37 2011 -0500 >> @@ -160,10 +160,25 @@ >> /*******************************/ >> /* VCPU FPU Functions */ >> /*******************************/ >> +/* Restore FPU state whenever VCPU is schduled in. */ >> +void vcpu_restore_fpu_eager(struct vcpu *v) >> +{ >> + ASSERT(!is_idle_vcpu(v)); >> + >> + /* Avoid recursion */ >> + clts(); >> + >> + /* save the nonlazy extended state which is not tracked by CR0.TS bit */ >> + if ( xsave_enabled(v) ) >> + fpu_xrstor(v, XSTATE_NONLAZY); >> + >> + stts(); > ... while here you do an unconditional clts followed by an xrstor only > checking whether xsave is enabled (but not checking whether there''s > any non-lazy state to be restored) and, possibly the most expensive > of all, an unconditional write of CR0. > > Jan > >> +} >> + >> /* >> * Restore FPU state when #NM is triggered. >> */ >> -void vcpu_restore_fpu(struct vcpu *v) >> +void vcpu_restore_fpu_lazy(struct vcpu *v) >> { >> ASSERT(!is_idle_vcpu(v)); >> > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2011-May-05 07:13 UTC
[Xen-devel] Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
>>> On 04.05.11 at 18:33, Wei Huang <wei.huang2@amd.com> wrote: > Checking whether there is a non-lazy state to save is architectural > specific and very messy. For instance, we need to read LWP_CBADDR to > confirm LWP''s dirty state. This MSR is AMD specific and we don''t want to > add it here. Plus reading data from LWP_CBADDR MSR might be as expensive > as clts/stts. > > My previous email showed that the overhead with LWP is around 1%-2% of > __context_switch(). For non lwp-capable CPU, this overhead should be > much smaller (only clts and stts) because xfeature_mask[LWP] is 0.I wasn''t talking about determining whether LWP state is dirty, but much rather about LWP not being in use at all.> Yes, clts() and stts() don''t have to called every time. How about this one? > > /* Restore FPU state whenever VCPU is schduled in. */ > void vcpu_restore_fpu_eager(struct vcpu *v) > { > ASSERT(!is_idle_vcpu(v)); > > > /* save the nonlazy extended state which is not tracked by CR0.TS bit */ > if ( xsave_enabled(v) ) > { > /* Avoid recursion */ > clts(); > fpu_xrstor(v, XSTATE_NONLAZY); > stts(); > }That''s certainly better, but I''d still like to see the xsave_enabled() check to be replaced by some form of lwp_enabled() or lazy_xsave_needed() or some such (which will at once exclude all pv guests until you care to add support for them). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Wei Huang
2011-May-05 21:41 UTC
Re: [Xen-devel] Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
Hi Jan, If we want to make LWP restore optional in vcpu_restore_fpu_eager(), we have to change vcpu_save_fpu() as well. Otherwise, the extended state will become inconsistent for non-LWP VCPUs (because save and restore is asymmetric). There are two approaches: 1. In vcpu_save_fpu(), clean physical CPU''s extended state for VCPU which is being scheduled in. This prevents messy states from causing problems. The disadvantage is the cleaning cost, which would out-weight the benefits. 2. Add a new variable in VCPU to track whether nonlazy state is dirty. I think this is better. See the attached file. Let me know if it is what you want. After that, I will re-spin the patches. Thanks, -Wei On 05/05/2011 02:13 AM, Jan Beulich wrote:>>>> On 04.05.11 at 18:33, Wei Huang<wei.huang2@amd.com> wrote: >> Checking whether there is a non-lazy state to save is architectural >> specific and very messy. For instance, we need to read LWP_CBADDR to >> confirm LWP''s dirty state. This MSR is AMD specific and we don''t want to >> add it here. Plus reading data from LWP_CBADDR MSR might be as expensive >> as clts/stts. >> >> My previous email showed that the overhead with LWP is around 1%-2% of >> __context_switch(). For non lwp-capable CPU, this overhead should be >> much smaller (only clts and stts) because xfeature_mask[LWP] is 0. > I wasn''t talking about determining whether LWP state is dirty, but > much rather about LWP not being in use at all. > >> Yes, clts() and stts() don''t have to called every time. How about this one? >> >> /* Restore FPU state whenever VCPU is schduled in. */ >> void vcpu_restore_fpu_eager(struct vcpu *v) >> { >> ASSERT(!is_idle_vcpu(v)); >> >> >> /* save the nonlazy extended state which is not tracked by CR0.TS bit */ >> if ( xsave_enabled(v) ) >> { >> /* Avoid recursion */ >> clts(); >> fpu_xrstor(v, XSTATE_NONLAZY); >> stts(); >> } > That''s certainly better, but I''d still like to see the xsave_enabled() > check to be replaced by some form of lwp_enabled() or > lazy_xsave_needed() or some such (which will at once exclude all > pv guests until you care to add support for them). > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2011-May-06 07:49 UTC
Re: [Xen-devel] Re: [PATCH] FPU LWP 6/8: create lazy and non-lazy FPU restore functions
>>> On 05.05.11 at 23:41, Wei Huang <wei.huang2@amd.com> wrote: > Hi Jan, > > If we want to make LWP restore optional in vcpu_restore_fpu_eager(), we > have to change vcpu_save_fpu() as well. Otherwise, the extended state > will become inconsistent for non-LWP VCPUs (because save and restore is > asymmetric). There are two approaches: > > 1. In vcpu_save_fpu(), clean physical CPU''s extended state for VCPU > which is being scheduled in. This prevents messy states from causing > problems. The disadvantage is the cleaning cost, which would out-weight > the benefits.Cleaning cost? Wasn''t it that one can express to default-initialize fields during xrstor (which, if indeed expensive, you''d want to trigger only if you know the physical CPU''s state is dirty, i.e. in this case requiring a per-CPU variable that gets evaluated and updated on context restore).> 2. Add a new variable in VCPU to track whether nonlazy state is dirty. I > think this is better. See the attached file. > > Let me know if it is what you want. After that, I will re-spin the patches.Yes, this looks like what I meant. Two suggestions: The new field''s name (nonlazy_xstate_dirty) would perhaps better be something like nonlazy_xstate_used, so that name and use are in sync. And the check in vcpu_restore_fpu_eager() probably doesn''t need to re-evaluate xsave_enabled(v), since the flag can''t get set without this (if you absolutely want to, put in an ASSERT() to this effect). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel