Keir, please consider backing out c/s 20627. I don't believe all the cases have been properly thought through, and the consequences have an impact on applications and thus on existing customers. As far as I can tell, there is no urgency to get this into Xen 4.0, since existing apps and guest OSes that use rdtscp must check cpuid to see whether the instruction is present on the hardware. But putting a partial solution into 4.0 may cause Xen versioning issues that affect apps for years to come. This is an ABI, not a feature!

> From: Xu, Dongxiao [mailto:dongxiao.xu@intel.com]
> Sent: Friday, December 11, 2009 5:24 PM
> Subject: RE: [Xen-devel][PATCH 02/02] VMX: Add HVM RDTSCP support
>
> Whether a system has rdtscp support is indicated by
> the cpuid. Management tool or system admin should
> use CPUID to determine whether the migration is allowed.
> I think besides RDTSCP, we already have such cases.

This may be true in concept, but existing tools (including the default xm tools) do NOT check for this... I just tested a live migration between a Nehalem (which supports rdtscp) and a Conroe (which does not). The live migration works fine, and the app using rdtscp runs fine on the Nehalem and then crashes when the live migration completes on the Conroe. I *know* of existing code in Oracle that will be broken by this!

List of "open" discussions:

- virtualization of rdtscp on processors that don't support it (PV does, HVM doesn't)
- virtualizing (or not) TSC_AUX
- the Xen 32-bit vs Xen 64-bit inconsistency
- how to communicate pcpu vs vcpu and pnode vs vnode (and whether this has any relevance for TSC_AUX)
- hvm support of the pvrdtscp algorithm
- toolset capability and compatibility

On any of these points, I'm not saying I am right and anyone else is wrong; I'm just saying further discussion is warranted, and getting it wrong in 4.0 has significant risk and consequences if we proceed haphazardly. I am only urging caution and proper design.

Thanks,
Dan
On 14/12/2009 18:02, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> This may be true in concept, but existing tools (including
> the default xm tools) do NOT check for this... I just
> tested a live migration between a Nehalem (which supports
> rdtscp) and a Conroe (which does not). The live migration
> works fine, and the app using rdtscp runs fine on the
> Nehalem and then crashes when the live migration completes
> on the Conroe. I *know* of existing code in Oracle
> that will be broken by this!

This is a general problem for migration between dissimilar processors. The solution is to 'level' the feature sets by masking CPUID flags from the more-featured processor. In this case you would mask out RDTSCP (and perhaps others too). This does need the RDTSCP flag setting/clearing to be moved to xc_cpuid_x86.c, as currently the user cannot override the policy wedged into the hypervisor itself. That's an easy thing to fix.

 -- Keir
Keir Fraser wrote:
> This is a general problem for migration between dissimilar
> processors. The solution is to 'level' the feature sets by masking
> CPUID flags from the more-featured processor. In this case you would
> mask out RDTSCP (and perhaps others too). This does need the RDTSCP
> flag setting/clearing to be moved to xc_cpuid_x86.c, as currently the
> user cannot override the policy wedged into the hypervisor itself.
> That's an easy thing to fix.

I will write a patch for this. Thanks!

 -- Dongxiao
Dan Magenheimer
2009-Dec-15 15:56 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
Hi Dongxiao --

Why would you disable live migration between two very widely used Intel processors for ALL HVM domains just because some domains use the rdtscp instruction?

Why not just add the code to do rdtscp emulation, which would NOT break live migration?

There are many cases where rdtsc/rdtscp instructions are emulated, so most of the code is already there. You only need to intercept illegal instruction traps, so there is not a significant performance issue. And the code to do the emulation is necessary to implement the pvrdtscp algorithm on hvm anyway (which I think was the reason this whole discussion started).

Dan

> -----Original Message-----
> From: Xu, Dongxiao [mailto:dongxiao.xu@intel.com]
> Sent: Monday, December 14, 2009 9:40 PM
> Subject: RE: Live migration fails due to c/s 20627
>
> I will write a patch for this. Thanks!
>
> -- Dongxiao
Dan Magenheimer wrote:
> Hi Dongxiao --
>
> Why would you disable live migration between two
> very widely used Intel processors for ALL HVM domains
> just because some domains use the rdtscp instruction?

Dan, I won't disable the migration. As Keir said, I will put the cpuid logic in xc_cpuid_x86.c so that the admin can use the configuration file to mask the rdtscp feature through cpuid. This is the common usage model for live migration between two different hosts.

> Why not just add the code to do rdtscp emulation,
> which would NOT break live migration?

Adding rdtscp emulation has the problem that, in Intel VMX, the vmexit control for rdtsc and rdtscp is the same, so if we trap rdtscp for emulation the OS will suffer lots of rdtsc vmexits, which will degrade performance.

> There are many cases where rdtsc/rdtscp instructions
> are emulated and so most of the code is already there.
> You only need to intercept illegal instruction traps,
> so there is not a significant performance issue.
> And the code to do the emulation is necessary
> to implement the pvrdtscp algorithm on hvm anyway

I think in an HVM environment we should respect the native behavior. Moreover, it would be valuable for the guest if it could get node/cpu info that reflects the hardware topology.

Thanks!
Dongxiao
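[For illustration, the kind of domain config entry being described might look like the following. This is only a sketch: it assumes the xm/xend `cpuid` bit-string syntax (32 characters, leftmost character = bit 31) and that RDTSCP is reported in CPUID leaf 0x80000001, EDX bit 27, and it only takes effect once the RDTSCP flag handling has been moved out of the hypervisor into xc_cpuid_x86.c as discussed above. Check the toolstack documentation before relying on the exact form.]

    # Hide RDTSCP from the guest by forcing CPUID leaf 0x80000001, EDX bit 27 to 0.
    # (32-character string; leftmost char is bit 31, so the '0' in the 5th
    # position clears bit 27. Sketch only.)
    cpuid = [ '0x80000001:edx=xxxx0xxxxxxxxxxxxxxxxxxxxxxxxxxx' ]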
Dan Magenheimer
2009-Dec-15 16:31 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
> Adding rdtscp emulation has the problem that, in Intel VMX, the
> vmexit control for rdtsc and rdtscp is the same, so if we trap
> rdtscp for emulation the OS will suffer lots of rdtsc vmexits,
> which will degrade performance.

OK, I see... you probably are not aware of all the recent work in Xen in this area. Machines that do not have rdtscp support are highly likely to be emulating rdtsc anyway. This is true for both PV and HVM domains. Please check docs/misc/tscmode.txt in a xen tree from the last few days.

But in any case, I am not suggesting any change in RDTSC_EXITING... that code is fine. I am suggesting adding code similar to the emulate_invalid_rdtscp support for PV in c/s 20504, to catch the illegal instruction traps on machines where the instruction is illegal.

Dan
Dan Magenheimer
2009-Dec-15 17:08 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
Oops, forgot to reply to this part of your message.

>> There are many cases where rdtsc/rdtscp instructions
>> are emulated and so most of the code is already there.
>> You only need to intercept illegal instruction traps,
>> so there is not a significant performance issue.
>> And the code to do the emulation is necessary
>> to implement the pvrdtscp algorithm on hvm anyway
>
> I think in an HVM environment we should respect the native
> behavior.

I'm not sure why you think that. There are many, many native silicon features that are not exposed directly to the guest OS. The whole point of virtualization is to present an abstract hardware interface so that multiple VMs can be supported on a single machine and VMs can be moved between underlying hardware implementations. Breaking that flexibility for a rarely utilized instruction such as rdtscp seems like a very bad idea.

> Moreover, it would be valuable for the
> guest if it could get node/cpu info that reflects
> the hardware topology.

As Jeremy has pointed out, the guest OS already has other mechanisms to provide this information, and as Jun has pointed out, the non-rdtscp mechanism (lsl on Linux) may even be faster. Windows does not even provide TSC_AUX, so it definitely has other ways to obtain node/cpu info. And, as I've said before, the node/cpu info provided by Linux in TSC_AUX is wrong anyway (except in very constrained environments, such as where the admin has pinned vcpus to pcpus).
Jeremy Fitzhardinge
2009-Dec-15 17:12 UTC
[Xen-devel] Re: Live migration fails due to c/s 20627
On 12/15/09 08:10, Xu, Dongxiao wrote:
>> Why not just add the code to do rdtscp emulation,
>> which would NOT break live migration?
>
> Adding rdtscp emulation has the problem that, in Intel VMX, the
> vmexit control for rdtsc and rdtscp is the same, so if we trap
> rdtscp for emulation the OS will suffer lots of rdtsc vmexits,
> which will degrade performance.

I don't see why that's relevant. In the case where you've migrated the domain, if the CPU has rdtsc but not rdtscp, won't the rdtscp vmexit with an illegal instruction trap? In that case you can emulate rdtscp while still having direct execution of rdtsc.

Of course, having a wide difference between rdtscp and rdtsc performance may cause its own set of problems.

    J
Jeremy Fitzhardinge wrote:
> I don't see why that's relevant. In the case where you've migrated
> the domain, if the CPU has rdtsc but not rdtscp, won't the rdtscp
> vmexit with an illegal instruction trap? In that case you can
> emulate rdtscp while still having direct execution of rdtsc.

If the CPU has rdtsc but no rdtscp, then the VM-execution control bit in the VMCS won't be turned on. Therefore, if the rdtscp instruction runs, it will raise an invalid-opcode fault directly, with no VMEXIT.

> Of course, having a wide difference between rdtscp and rdtsc
> performance may cause its own set of problems.

Best Regards,
 -- Dongxiao
Dan Magenheimer wrote:
> I'm not sure why you think that. There are many, many
> native silicon features that are not exposed directly
> to the guest OS. The whole point of virtualization
> is to present an abstract hardware interface so that
> multiple VMs can be supported on a single machine and
> VMs can be moved between underlying hardware implementations.
> Breaking that flexibility for a rarely utilized instruction
> such as rdtscp seems like a very bad idea.

I didn't break the flexibility: as stated before, my next patch will move the code for guest cpuid presentation, and then the administrator can mask the bit for migration. RDTSCP is not the only feature with this problem.

> As Jeremy has pointed out, the guest OS already has
> other mechanisms to provide this information, and
> as Jun has pointed out, the non-rdtscp mechanism (lsl
> on Linux) may even be faster. Windows does not even
> provide TSC_AUX, so it definitely has other ways to
> obtain node/cpu info. And, as I've said before,
> the node/cpu info provided by Linux in TSC_AUX is
> wrong anyway (except in very constrained environments,
> such as where the admin has pinned vcpus to pcpus).

After guest NUMA is implemented, it is not a constrained environment. The guest will benefit from the information, for example for implementing a fast vgetcpu() and so on...

Best Regards,
 -- Dongxiao
HVM RDTSCP fix.

- Put the guest rdtscp cpuid logic in xc_cpuid_x86.c.
- MSR_TSC_AUX's high 32 bits are reserved, so only write the low 32 bits.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>

Xu, Dongxiao wrote:
> I will write a patch for this. Thanks!
>
> -- Dongxiao
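[A minimal sketch of what the second bullet implies for the guest WRMSR path. The structure and helper names here are illustrative only, not the names used in the actual patch.]

    #include <stdint.h>

    #define MSR_TSC_AUX 0xC0000103u   /* IA32_TSC_AUX */

    /* Illustrative per-vcpu state; the real Xen structures differ. */
    struct vcpu_msr_state {
        uint32_t tsc_aux;   /* only the low 32 bits are architecturally defined */
    };

    /* Record a guest write to TSC_AUX.  The upper 32 bits of the MSR are
     * reserved, so only the low half is kept (and later loaded into the
     * physical MSR when this vcpu runs). */
    static void guest_wrmsr_tsc_aux(struct vcpu_msr_state *v, uint64_t msr_content)
    {
        v->tsc_aux = (uint32_t)msr_content;
    }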
Jeremy Fitzhardinge
2009-Dec-15 18:25 UTC
[Xen-devel] Re: Live migration fails due to c/s 20627
On 12/15/2009 09:24 AM, Xu, Dongxiao wrote:
> If the CPU has rdtsc but no rdtscp, then the VM-execution control bit in
> the VMCS won't be turned on. Therefore, if the rdtscp instruction runs,
> it will raise an invalid-opcode fault directly, with no VMEXIT.

Ah, right. You'd need to make that particular illegal instruction vmexit.

    J
Dan Magenheimer
2009-Dec-15 19:20 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
> Sent: Tuesday, December 15, 2009 11:26 AM
> Subject: Re: Live migration fails due to c/s 20627
>
> On 12/15/2009 09:24 AM, Xu, Dongxiao wrote:
> > If the CPU has rdtsc but no rdtscp, then the VM-execution control bit in
> > the VMCS won't be turned on. Therefore, if the rdtscp instruction runs,
> > it will raise an invalid-opcode fault directly, with no VMEXIT.
>
> Ah, right. You'd need to make that particular illegal
> instruction vmexit.

Or make ALL illegal instructions vmexit, decode, and if it is rdtscp emulate it, else vmenter again.

Is this not done anyplace else in the hvm code?
On 15/12/2009 18:25, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
> Ah, right. You'd need to make that particular illegal instruction vmexit.

We'd need to VMEXIT on any guest #UD and then call into our x86 emulator. There's just a slight feeling that could have wider impact and implications than the specific case we'd want to handle here. Then again, #UD is rare and usually unexpected, so perhaps such a seemingly broad change is not so dangerous.

 -- Keir
On 15/12/2009 19:20, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>> Ah, right. You'd need to make that particular illegal
>> instruction vmexit.
>
> Or make ALL illegal instructions vmexit, decode, and if it is rdtscp
> emulate it, else vmenter again.
>
> Is this not done anyplace else in the hvm code?

Oh, in fact I am wrong in my previous email, replying to Jeremy, claiming we do not trap-and-emulate on illegal instructions. In fact we *do*, as it got added to handle SYSCALL vs SYSENTER when migrating between Intel and AMD hosts.

So all that would need to be done is to add RDTSCP support to x86_emulate.c, as it's currently missing. But that's pretty trivial.

 -- Keir
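[For anyone following along, a rough sketch of what such emulation amounts to once a guest #UD has been intercepted and decoded as RDTSCP (opcode 0F 01 F9): return the guest's TSC in EDX:EAX and its virtual TSC_AUX in ECX, zero-extended. The names below are placeholders, not the actual x86_emulate.c interfaces.]

    #include <stdint.h>

    /* Illustrative GPR file; the real emulator context differs. */
    struct emu_regs {
        uint64_t rax, rcx, rdx;
    };

    /* Emulate RDTSCP (opcode 0F 01 F9) after an intercepted #UD.  The
     * instruction returns the TSC in EDX:EAX and TSC_AUX in ECX, with the
     * upper 32 bits of each register zeroed.  guest_tsc/guest_tsc_aux stand
     * in for whatever per-domain values the hypervisor maintains (e.g. an
     * offset or scaled TSC). */
    static void emulate_rdtscp(struct emu_regs *regs,
                               uint64_t guest_tsc, uint32_t guest_tsc_aux)
    {
        regs->rax = (uint32_t)guest_tsc;          /* TSC low 32 bits   */
        regs->rdx = (uint32_t)(guest_tsc >> 32);  /* TSC high 32 bits  */
        regs->rcx = guest_tsc_aux;                /* TSC_AUX, zero-extended */
    }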
Keir Fraser
2009-Dec-15 19:52 UTC
Re: [Xen-devel] Re: Live migration fails due to c/s 20627
On 15/12/2009 19:31, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
> So all that would need to be done is to add RDTSCP support to x86_emulate.c,
> as it's currently missing. But that's pretty trivial.

...I'll sort out a patch for this tomorrow. No reason this shouldn't be added to the emulator.

 -- Keir
Dan Magenheimer wrote on Tue, 15 Dec 2009 at 09:08:40:
> As Jeremy has pointed out, the guest OS already has
> other mechanisms to provide this information, and
> as Jun has pointed out, the non-rdtscp mechanism (lsl
> on Linux) may even be faster. Windows does not even
> provide TSC_AUX, so it definitely has other ways to
> obtain node/cpu info. And, as I've said before,
> the node/cpu info provided by Linux in TSC_AUX is
> wrong anyway (except in very constrained environments,
> such as where the admin has pinned vcpus to pcpus).

I think we should distinguish architectural support from Linux-specific issues. We need to enable RDTSCP support in HVM if user apps want to use it. If we want the Linux or other kernel to stop using TSC_AUX, we can mask off the feature in CPUID. Access to the MSR should result in #GP in the guest if the feature is not advertised.

BTW, I'm hearing that the max TSC difference among CPUs is less than 10 cycles or so (virtually 0), at least on Intel-based machines (except limited ones with known issues), so such pinning should probably not be required, because a vcpu migration should take longer than that.

Jun
___
Intel Open Source Technology Center
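[As a sketch of the policy Jun describes, with placeholder names rather than Xen's actual MSR-handling interfaces: if RDTSCP has been masked out of the guest's CPUID, guest accesses to TSC_AUX fault with #GP instead of being serviced, matching what bare hardware without the feature would do.]

    #include <stdbool.h>
    #include <stdint.h>

    #define MSR_TSC_AUX 0xC0000103u

    /* Sketch only: handle a guest RDMSR of TSC_AUX.  guest_has_rdtscp
     * reflects the guest's (possibly masked) CPUID; inject_gp() is a
     * placeholder for the hypervisor's fault-injection path. */
    static bool guest_rdmsr_tsc_aux(bool guest_has_rdtscp, uint32_t vtsc_aux,
                                    uint64_t *val, void (*inject_gp)(void))
    {
        if (!guest_has_rdtscp) {
            inject_gp();       /* feature not advertised: fault the access */
            return false;
        }
        *val = vtsc_aux;       /* reserved high 32 bits read as zero */
        return true;
    }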
Zhang, Xiantao
2009-Dec-16 02:43 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
> And, as I've said before,
> the node/cpu info provided by Linux in TSC_AUX is
> wrong anyway (except in very constrained environments
> such as where the admin has pinned vcpus to pcpus).

I don't agree with you on this point. For guest NUMA support, it should be a must to pin a virtual node's vcpus to its related physical node, and cross-node vcpu migration should be disallowed by default; otherwise guest NUMA support is meaningless, right? If vcpu migration only happens within its physical node, I can't see why you think the info provided in the MSR is wrong.

Actually, each vcpu should have a virtual TSC_AUX MSR (the guest should set it before using it), and this virtual MSR is saved/restored from/to the physical TSC_AUX MSR on context switch. So in VMX non-root mode the value in the physical TSC_AUX MSR follows the guest's setting rather than the host's, it reflects the guest's virtual node/virtual cpu info, and it is still the value the guest's applications expect. In addition, we have to keep in mind that the host's TSC_AUX and the guest's TSC_AUX are two totally different things, except that they are held in the same physical register during different execution phases of the cpu; we shouldn't mix them together.

Xiantao
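[A minimal sketch of the save/restore scheme described above, assuming hypervisor (ring 0) context; the structure and function names are illustrative, not Xen's. Each vcpu carries its own virtual TSC_AUX, which is loaded into the physical MSR when the vcpu is scheduled in, so RDTSCP in VMX non-root mode observes the guest's value rather than the host's.]

    #include <stdint.h>

    #define MSR_TSC_AUX 0xC0000103u

    /* Write the physical IA32_TSC_AUX MSR (ECX = index, EDX:EAX = value). */
    static inline void wrmsr_tsc_aux(uint32_t val)
    {
        __asm__ __volatile__("wrmsr" :: "c"(MSR_TSC_AUX), "a"(val), "d"(0u));
    }

    /* Illustrative per-vcpu state. */
    struct vcpu_state {
        uint32_t guest_tsc_aux;   /* last value the guest wrote */
    };

    static void context_switch_in(const struct vcpu_state *v)
    {
        wrmsr_tsc_aux(v->guest_tsc_aux);   /* expose the guest's TSC_AUX */
    }

    static void context_switch_out(uint32_t host_tsc_aux)
    {
        wrmsr_tsc_aux(host_tsc_aux);       /* restore the host's value */
    }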
Dan Magenheimer
2009-Dec-16 04:14 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
> I don't agree with you on this point. For guest NUMA support,
> it should be a must to pin a virtual node's vcpus to its
> related physical node, and cross-node vcpu migration should
> be disallowed by default; otherwise guest NUMA support is
> meaningless, right?

It's not a must. A system administrator should always have the option of choosing flexibility vs performance. I agree that when performance is the higher priority, pinning is a must, but pinning may even have issues when the guest's nvcpus exceed the number of cores in a node.

So I am saying there are many cases where TSC_AUX, when set by a guest OS, will be incorrect. And yes, I agree there are cases (with pinning) where it will be correct. But how does an app or OS know whether to trust TSC_AUX or not? Better to have some other method of getting pcpu/pnode information that is known to be always correct (via some method of querying the hypervisor directly).

> If vcpu migration only happens within its physical node, I
> can't see why you think the info provided in the MSR is wrong.
> Actually, each vcpu should have a virtual TSC_AUX MSR (the guest
> should set it before using it), and this virtual MSR is
> saved/restored from/to the physical TSC_AUX MSR on context switch.
> So in VMX non-root mode the value in the physical TSC_AUX MSR
> follows the guest's setting rather than the host's, it reflects
> the guest's virtual node/virtual cpu info, and it is still the
> value the guest's applications expect. In addition, the host's
> TSC_AUX and the guest's TSC_AUX are two totally different things,
> except that they are held in the same physical register during
> different execution phases; we shouldn't mix them together.

My argument is simply that if TSC_AUX cannot ALWAYS be trusted by an application, apps will NEVER trust it. And if apps NEVER trust it, why expose it at all?

Thanks,
Dan
Zhang, Xiantao
2009-Dec-16 05:07 UTC
[Xen-devel] RE: Live migration fails due to c/s 20627
Dan Magenheimer wrote:
> It's not a must. A system administrator should always
> have the option of choosing flexibility vs performance.
> I agree that when performance is the higher priority, pinning
> is a must, but pinning may even have issues when the
> guest's nvcpus exceed the number of cores in a node.

Could you elaborate on the issues you can see? Normally, a virtual node's number of vcpus should be less than one physical node's cpu count. But even if the vcpu count is more than the physical cpu count in a node, why would it introduce issues?

> So I am saying there are many cases where TSC_AUX,
> when set by a guest OS, will be incorrect.

Could you spell out the incorrect cases?

> And yes, I agree there are cases (with pinning) where it will
> be correct. But how does an app or OS know whether to
> trust TSC_AUX or not?

If the hypervisor exposes this instruction to the guest, it should be trusted and safe to use, because the hypervisor is responsible for fully virtualizing the instruction and letting the guest think it is running on a native machine.

> Better to have some other
> method of getting pcpu/pnode information that is known
> to be always correct (via some method of querying the hypervisor
> directly).

I don't think the guest should learn the host's NUMA info by any means. Basically, the guest only needs to be aware of the guest's NUMA info. For example, the host's NUMA info may be 2 nodes, each configured with 16G of memory and 16 LPs, while the guest's virtual NUMA info may be 2 nodes, each with 2G of memory and 2 vcpus. In this case, the guest only needs the virtual NUMA info, not the host's NUMA info, when it enables NUMA support. At the same time, the hypervisor is responsible for how to allocate the 2G of memory from the physical node's 16G, and how to schedule the virtual node's vcpus onto physical cpus (according to performance vs flexibility, as you said).

> My argument is simply that if TSC_AUX cannot ALWAYS
> be trusted by an application, apps will NEVER trust it.
> And if apps NEVER trust it, why expose it at all?

This instruction is safe to use and has been fully virtualized in VMX non-root mode via Dongxiao's patch, so why not trust it? I can't see one reason. :-)

Xiantao
Jeremy Fitzhardinge
2009-Dec-16 06:19 UTC
[Xen-devel] Re: Live migration fails due to c/s 20627
On 12/15/2009 08:14 PM, Dan Magenheimer wrote:
> My argument is simply that if TSC_AUX cannot ALWAYS
> be trusted by an application, apps will NEVER trust it.
> And if apps NEVER trust it, why expose it at all?

The cpu/node info is only of heuristic value anyway; it is never trustworthy in an absolute sense. Apps just use it to try to optimise their own memory allocation and use patterns, but they can't rely on that info for actual correctness.

    J
Dan Magenheimer
2009-Dec-16 15:20 UTC
RE: [Xen-devel] Re: Live migration fails due to c/s 20627
> From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
>
> On 12/15/2009 08:14 PM, Dan Magenheimer wrote:
> > My argument is simply that if TSC_AUX cannot ALWAYS
> > be trusted by an application, apps will NEVER trust it.
> > And if apps NEVER trust it, why expose it at all?
>
> The cpu/node info is only of heuristic value anyway; it is never
> trustworthy in an absolute sense. Apps just use it to try to
> optimise their own memory allocation and use patterns, but they can't
> rely on that info for actual correctness.

Well, "heuristic" implies a reasonably high probability of getting the right answer. Would you agree that the probability that TSC_AUX gets the "right" answer is much higher in a physical environment than in a (non-pinned) virtual environment? And that, if the heuristic is wrong more often than right, using that heuristic is a bad idea?
Andre Przywara
2009-Dec-16 15:41 UTC
Re: [Xen-devel] RE: Live migration fails due to c/s 20627
Dan Magenheimer wrote:
> Or make ALL illegal instructions vmexit, decode, and if it is rdtscp
> emulate it, else vmenter again.

But is it useful to emulate RDTSCP? I see two use cases for this instruction:

1) NUMA-aware malloc: You need to know the current node number _quickly_ to use the right bucket to take the memory from. You do not even want to use a syscall for this; that's why getcpu in Linux is implemented as a vsyscall using either RDTSCP or LSL. If you emulate this, it will take a few thousand cycles.

2) Making sure TSC values are consistent: By looking at the core ID you learn whether two consecutive RDTSCPs are from the same core and are thus reliable. If you lose a few thousand cycles to emulation, then the whole purpose of doing the RDTSCPs is in question, as your results would be spoiled by the overhead.

These two issues are the main reason I refrained from implementing RDTSCP virtualization some months ago, as even virtualizing it introduces a slight overhead (MSR save/restore). As software seems to cope with not having this instruction (and uses the perfectly virtualized lsl instruction, for instance), I thought the benefit would not justify the effort.

Dan, can you summarize the usage of RDTSCP emulation in PV? Honestly, I got lost in all these threads...

Regards,
Andre.

--
Andre Przywara
AMD Operating System Research Center (OSRC), Dresden, Germany
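[To make the first use case concrete, application-side code along these lines is what is at stake. This is only a sketch using GCC-style inline asm; the (node << 12) | cpu layout shown is the encoding Linux writes into TSC_AUX, which matches its lsl-based fallback.]

    #include <stdint.h>

    /* Read the TSC and TSC_AUX with a single RDTSCP.  Linux initialises
     * TSC_AUX on each cpu to (node << 12) | cpu, so the same decode works
     * for its lsl-based vgetcpu fallback. */
    static inline uint64_t rdtscp_getcpu(unsigned int *cpu, unsigned int *node)
    {
        uint32_t lo, hi, aux;

        __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        if (cpu)
            *cpu = aux & 0xfff;     /* low 12 bits: cpu number   */
        if (node)
            *node = aux >> 12;      /* remaining bits: node hint */
        return ((uint64_t)hi << 32) | lo;
    }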
Dan Magenheimer
2009-Dec-16 16:23 UTC
RE: [Xen-devel] RE: Live migration fails due to c/s 20627
Since this discussion seems to be going in circles, I suspect we may have some fundamentally different assumptions. You likely have some unstated ideas, maybe about the underlying implementation of the Linux NUMA syscalls when running on Xen, or maybe about defaults for how NUMA-ness might be specified when creating an HVM domain. But all of these are mostly unrelated to rdtscp.

The only reason this discussion has involved NUMA concepts is that the rdtscp instruction, by accident rather than by design, may on some (but not all) guest OSes communicate the guest OS's concept of cpu and node information to an application. As Jeremy has pointed out, this cpu/node information is exactly the same information that can be obtained by a system call. So the only reason that rdtscp is better than using the system call would be performance.

Rdtscp is faster than a system call in many situations, but it is now often emulated in Xen (even on processors that do support the hardware instruction*), so it cannot be assumed to be much faster than a system call. And the difference in performance is only measurable if an app is executing rdtscp many thousands of times every second.

Are there apps that execute rdtscp many thousands of times every second PRIMARILY TO OBTAIN the cpu/node information? If so, I agree that it is unfortunately necessary to expose the rdtscp instruction. If not, I would highly recommend we do NOT expose it now. Otherwise, to use Keir's words, we are "Supporting CPU instructions just because they're there, [which] is not a useful effort." Once rdtscp/TSC_AUX is exposed to guests, it is very hard to remove it again (as saved guests may have tested the cpuid bit once at startup and will fail if restored).

Other brief NUMA-related replies below.

* See xen-unstable.hg/docs/misc/tscmode.txt for explanation.

> From: Zhang, Xiantao [mailto:xiantao.zhang@intel.com]
>
> Could you elaborate on the issues you can see? Normally, a
> virtual node's number of vcpus should be less than one
> physical node's cpu count. But even if the vcpu count is more
> than the physical cpu count in a node, why would it introduce issues?

Suppose a guest believes it has eight cores on a single processor/node. It is now started on a machine that has four cores per processor/node (and two or more sockets). Since the guest believes it is running on a single node, it communicates that information (via TSC_AUX or vgetcpu) to an application. The application is NUMA-aware, but since the guest OS told it that all cores are on the same node, it doesn't use its NUMA code/mode.

Suppose a guest believes it has a total of four cores, two cores on each of two nodes. It is now started on some future machine with 16 cores all on a single node. Since the guest believes it is running on two nodes, it communicates that information (via TSC_AUX or vgetcpu) to an application. The application is NUMA-aware, and the guest OS told it that there are two nodes. This app has very high memory bandwidth needs, so it spends lots of time doing NUMA-related syscalls such as Linux move_pages to ensure that the memory is on the same node as the cpu. All of these move calls are wasted.

Both of these situations are very possible in a cloud environment.

(NOTE: Since this NUMA-related discussion is orthogonal to rdtscp, we should probably start a separate thread for further discussion.)

If the above discussion doesn't clarify my concerns and I haven't answered other questions in your email, please let me know.
Dan Magenheimer
2009-Dec-16 16:54 UTC
RE: [Xen-devel] RE: Live migration fails due to c/s 20627
Hi Andre --

> But is it useful to emulate RDTSCP? I see two use cases for this
> instruction:

Thanks for your support!

> Dan, can you summarize the usage of RDTSCP emulation in PV?
> Honestly, I got lost in all these threads...

Me too :-)

In PV, the "pvcpuid" bit is not set, so guest OSes that use the proper PV access method to cpuid will believe that the hardware does not support rdtscp. Since cpuid is unprivileged, apps running in PV domains may check the bit and use rdtscp. In this case, TSC_AUX should contain 0. If this domain is saved/restored/migrated to a machine that does not support rdtscp, the instruction is emulated. If tsc_mode=3, rdtscp is handled specially.

See xen-unstable.hg/docs/misc/tscmode.txt for more info.

Thanks,
Dan
Jeremy Fitzhardinge
2009-Dec-16 17:31 UTC
Re: [Xen-devel] Re: Live migration fails due to c/s 20627
On 12/16/2009 07:20 AM, Dan Magenheimer wrote:
> Well, "heuristic" implies a reasonably high probability of
> getting the right answer. Would you agree that the probability
> that TSC_AUX gets the "right" answer is much higher
> in a physical environment than in a (non-pinned) virtual
> environment? And that, if the heuristic is wrong more often
> than right, using that heuristic is a bad idea?

It won't make a difference either way. Running in a Xen domain, the kernel will only see a single NUMA node, so the node id is constant. The CPU number may not correspond to a pcpu all the time, but scheduler affinity should make a given vcpu number correspond to the same pcpu for a while. In either case, an application paying attention to cpu+node will do at least as well as an app ignoring them.

So I don't think your argument that "if TSC_AUX cannot ALWAYS be trusted by an application, apps will NEVER trust it" is true at all. Aside from the fact that the cpu+node issue is completely irrelevant to whether we support TSC_AUX.

    J
Jeremy Fitzhardinge
2009-Dec-16 17:38 UTC
Re: [Xen-devel] RE: Live migration fails due to c/s 20627
On 12/16/2009 08:23 AM, Dan Magenheimer wrote:
> As Jeremy has pointed out, this cpu/node information is exactly
> the same information that can be obtained by a system call.
> So the only reason that rdtscp is better than using the
> system call would be performance.

No, not a system call. The vgetcpu vsyscall will return the info with no syscalls, regardless of whether rdtscp is available. It encodes the data in the segment limit of a special segment, and it can be read back with the "lsl" instruction.

> Rdtscp is faster than a system call in many situations, but
> it is now often emulated in Xen (even on processors that do support
> the hardware instruction*), so it cannot be assumed to be much
> faster than a system call. And the difference in performance
> is only measurable if an app is executing rdtscp many thousands
> of times every second.

"lsl" is probably at least as fast as rdtscp when executed natively, and definitely if rdtscp is emulated.

> Suppose a guest believes it has eight cores on a single
> processor/node. [...]
> Suppose a guest believes it has a total of four cores,
> two cores on each of two nodes.

The pvops kernel never attempts to determine the underlying machine topology; it always assumes a single NUMA node.

    J
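[For readers unfamiliar with that fallback, the lsl path looks roughly like this. The selector below is a placeholder for the kernel's reserved per-cpu GDT entry (__PER_CPU_SEG in Linux), whose segment limit carries the same (node << 12) | cpu layout as TSC_AUX; this is a sketch, not the actual vgetcpu source.]

    #include <stdint.h>

    /* Placeholder selector: the real one is the kernel's per-cpu GDT entry
     * (__PER_CPU_SEG in Linux), whose segment limit is set per cpu. */
    #define PER_CPU_SELECTOR 0x7bu

    /* Read the cpu/node hint from a segment limit via LSL -- the fallback
     * Linux's vgetcpu vsyscall uses when RDTSCP is not available.  No MSR
     * is involved, so it virtualizes cleanly. */
    static inline void getcpu_lsl(unsigned int *cpu, unsigned int *node)
    {
        unsigned long limit;

        __asm__("lsl %1, %0" : "=r"(limit)
                             : "r"((unsigned long)PER_CPU_SELECTOR));
        if (cpu)
            *cpu = limit & 0xfff;
        if (node)
            *node = limit >> 12;
    }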