While pv-ops so far doesn''t care to eliminate the two trap-and- emulate CR0 accesses from the asm/xor.h save/restore operations, the legacy x86-64 kernel uses conditional clts()/stts() for this purpose. While looking into whether to extend this to the newly added (in 3.5) AVX operations there I realized that this isn''t fully correct: It doesn''t properly nest inside a kernel_fpu_begin()/ kernel_fpu_end() pair (as it will stts() at the end no matter what the original state of CR0.TS was). In order to not introduce completely new hypercalls to overcome this (fpu_taskswitch isn''t really extensible on its own), I''m considering to add a new VM assist, altering the fpu_taskswitch behavior so that it would return an indicator whether any change to the virtual CR0.TS was done. That way, the kernel can implement a true save/restore cycle here. In order to allow the kernel to run on older hypervisors without extra conditionals (behaving the same way as it does currently, i.e. with the incorrect nesting), the return value 0 (which the hypercall currently always returns) would need to be used to indicate that the bit got actually flipped (such that on an old hypervisor an updated kernel would always think that something needs to be restored). Would that be an acceptable solution? Can someone think of other ways to deal with this? (And - is pv-ops interested in eliminating the two CR0 access emulations in what is supposed to be a fast path?) Jan
On 15/06/2012 17:03, "Jan Beulich" <JBeulich@suse.com> wrote:> While pv-ops so far doesn''t care to eliminate the two trap-and- > emulate CR0 accesses from the asm/xor.h save/restore > operations, the legacy x86-64 kernel uses conditional clts()/stts() > for this purpose. While looking into whether to extend this to the > newly added (in 3.5) AVX operations there I realized that this isn''t > fully correct: It doesn''t properly nest inside a kernel_fpu_begin()/ > kernel_fpu_end() pair (as it will stts() at the end no matter what > the original state of CR0.TS was). > > In order to not introduce completely new hypercalls to overcome > this (fpu_taskswitch isn''t really extensible on its own), I''m > considering to add a new VM assist, altering the fpu_taskswitch > behavior so that it would return an indicator whether any change > to the virtual CR0.TS was done. That way, the kernel can > implement a true save/restore cycle here.It should be possible for the guest kernel to track its CR0.TS setting shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly cleared on #NM. -- Keir> In order to allow the kernel to run on older hypervisors without > extra conditionals (behaving the same way as it does currently, > i.e. with the incorrect nesting), the return value 0 (which the > hypercall currently always returns) would need to be used to > indicate that the bit got actually flipped (such that on an old > hypervisor an updated kernel would always think that > something needs to be restored). > > Would that be an acceptable solution? Can someone think of > other ways to deal with this? (And - is pv-ops interested in > eliminating the two CR0 access emulations in what is supposed > to be a fast path?) > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
>>> On 15.06.12 at 19:06, Keir Fraser <keir@xen.org> wrote: > On 15/06/2012 17:03, "Jan Beulich" <JBeulich@suse.com> wrote: > >> While pv-ops so far doesn''t care to eliminate the two trap-and- >> emulate CR0 accesses from the asm/xor.h save/restore >> operations, the legacy x86-64 kernel uses conditional clts()/stts() >> for this purpose. While looking into whether to extend this to the >> newly added (in 3.5) AVX operations there I realized that this isn''t >> fully correct: It doesn''t properly nest inside a kernel_fpu_begin()/ >> kernel_fpu_end() pair (as it will stts() at the end no matter what >> the original state of CR0.TS was). >> >> In order to not introduce completely new hypercalls to overcome >> this (fpu_taskswitch isn''t really extensible on its own), I''m >> considering to add a new VM assist, altering the fpu_taskswitch >> behavior so that it would return an indicator whether any change >> to the virtual CR0.TS was done. That way, the kernel can >> implement a true save/restore cycle here. > > It should be possible for the guest kernel to track its CR0.TS setting > shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly > cleared on #NM.Sure, but selling this to the Linux maintainers I would expect to be harder than fitting the Xen side of things into the current save- and-restore model the native xor code uses. It would only be strait forward to implement on the legacy, forward ported trees. However, with the #NM handler in pv-ops apparently not leveraging the fact that CR0.TS is already cleared for it on entry, maybe this could indeed be introduced together. Konrad? Jan
On 18/06/2012 08:32, "Jan Beulich" <JBeulich@suse.com> wrote:>> It should be possible for the guest kernel to track its CR0.TS setting >> shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly >> cleared on #NM. > > Sure, but selling this to the Linux maintainers I would expect to be > harder than fitting the Xen side of things into the current save- > and-restore model the native xor code uses. It would only be strait > forward to implement on the legacy, forward ported trees.Wouldn''t it be hidden entirely behind pv_ops hooks and within Xen-specific SSE save/restore code? I suppose you''d need to statically allocate the per-cpu space for tracking the CR0.TS state... But overall it seems it will be of little/no concern to other kernel maintainers? -- Keir> However, with the #NM handler in pv-ops apparently not > leveraging the fact that CR0.TS is already cleared for it on entry, > maybe this could indeed be introduced together. Konrad?
>>> On 18.06.12 at 14:12, Keir Fraser <keir@xen.org> wrote: > On 18/06/2012 08:32, "Jan Beulich" <JBeulich@suse.com> wrote: > >>> It should be possible for the guest kernel to track its CR0.TS setting >>> shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly >>> cleared on #NM. >> >> Sure, but selling this to the Linux maintainers I would expect to be >> harder than fitting the Xen side of things into the current save- >> and-restore model the native xor code uses. It would only be strait >> forward to implement on the legacy, forward ported trees. > > Wouldn''t it be hidden entirely behind pv_ops hooks and within Xen-specific > SSE save/restore code? I suppose you''d need to statically allocate the > per-cpu space for tracking the CR0.TS state... But overall it seems it will > be of little/no concern to other kernel maintainers?The #NM handler part wouldn''t afaict. Everything else indeed ought to be restricted to the functions backing paravirt.h''s clts() and write_cr0(). Jan
On 18/06/2012 13:45, "Jan Beulich" <JBeulich@suse.com> wrote:>> Wouldn''t it be hidden entirely behind pv_ops hooks and within Xen-specific >> SSE save/restore code? I suppose you''d need to statically allocate the >> per-cpu space for tracking the CR0.TS state... But overall it seems it will >> be of little/no concern to other kernel maintainers? > > The #NM handler part wouldn''t afaict. Everything else indeed > ought to be restricted to the functions backing paravirt.h''s clts() > and write_cr0().Yes, the #NM handler might need extra treatment. Perhaps we could install our own #NM wrapper around the kernel''s generic handler, which first clears the saved CR0.TS flag. Then on clts() in the generic handler, our paravirt clts() would see the flag is already clear and do no hypercall. -- Keir
On Mon, Jun 18, 2012 at 08:32:05AM +0100, Jan Beulich wrote:> >>> On 15.06.12 at 19:06, Keir Fraser <keir@xen.org> wrote: > > On 15/06/2012 17:03, "Jan Beulich" <JBeulich@suse.com> wrote: > > > >> While pv-ops so far doesn''t care to eliminate the two trap-and- > >> emulate CR0 accesses from the asm/xor.h save/restore > >> operations, the legacy x86-64 kernel uses conditional clts()/stts() > >> for this purpose. While looking into whether to extend this to the > >> newly added (in 3.5) AVX operations there I realized that this isn''t > >> fully correct: It doesn''t properly nest inside a kernel_fpu_begin()/ > >> kernel_fpu_end() pair (as it will stts() at the end no matter what > >> the original state of CR0.TS was).That sounds like a bug in the generic code then?> >> > >> In order to not introduce completely new hypercalls to overcome > >> this (fpu_taskswitch isn''t really extensible on its own), I''m > >> considering to add a new VM assist, altering the fpu_taskswitch > >> behavior so that it would return an indicator whether any change > >> to the virtual CR0.TS was done. That way, the kernel can > >> implement a true save/restore cycle here.How would that work with the multi-calls? Right now clts is batched and so is cr0 write.> > > > It should be possible for the guest kernel to track its CR0.TS setting > > shouldn''t it? It gets modified only via a few paravirt hooks, and implicitlyHm, the clts() paravirt could take advantage of the per-cpu cr0 to figure out whether it truly needs to do anything.> > cleared on #NM. > Sure, but selling this to the Linux maintainers I would expect to be > harder than fitting the Xen side of things into the current save- > and-restore model the native xor code uses. It would only be strait > forward to implement on the legacy, forward ported trees. > > However, with the #NM handler in pv-ops apparently not > leveraging the fact that CR0.TS is already cleared for it on entry, > maybe this could indeed be introduced together. Konrad?Would this require an extra pvops call from the #NM handler?> > Jan
On 18/06/2012 19:24, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:>>> It should be possible for the guest kernel to track its CR0.TS setting >>> shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly > > Hm, the clts() paravirt could take advantage of the per-cpu cr0 to > figure out whether it truly needs to do anything.Exactly.>>> cleared on #NM. >> Sure, but selling this to the Linux maintainers I would expect to be >> harder than fitting the Xen side of things into the current save- >> and-restore model the native xor code uses. It would only be strait >> forward to implement on the legacy, forward ported trees. >> >> However, with the #NM handler in pv-ops apparently not >> leveraging the fact that CR0.TS is already cleared for it on entry, >> maybe this could indeed be introduced together. Konrad? > > Would this require an extra pvops call from the #NM handler?Or we wrap the #NM handler (somehow?) and clear the per-cpu cr0.ts software flag before calling into the generic #NM handler. -- Keir
>>> On 15.06.12 at 19:06, Keir Fraser <keir@xen.org> wrote: > On 15/06/2012 17:03, "Jan Beulich" <JBeulich@suse.com> wrote: > >> While pv-ops so far doesn''t care to eliminate the two trap-and- >> emulate CR0 accesses from the asm/xor.h save/restore >> operations, the legacy x86-64 kernel uses conditional clts()/stts() >> for this purpose. While looking into whether to extend this to the >> newly added (in 3.5) AVX operations there I realized that this isn''t >> fully correct: It doesn''t properly nest inside a kernel_fpu_begin()/ >> kernel_fpu_end() pair (as it will stts() at the end no matter what >> the original state of CR0.TS was). >> >> In order to not introduce completely new hypercalls to overcome >> this (fpu_taskswitch isn''t really extensible on its own), I''m >> considering to add a new VM assist, altering the fpu_taskswitch >> behavior so that it would return an indicator whether any change >> to the virtual CR0.TS was done. That way, the kernel can >> implement a true save/restore cycle here. > > It should be possible for the guest kernel to track its CR0.TS setting > shouldn''t it? It gets modified only via a few paravirt hooks, and implicitly > cleared on #NM.While this works fine and is fairly non-intrusive, it''s not really buying us much: The non-SSE variants of the xor code will still outperform the SSE one on both 32- and 64-bit x86 (and the MMX ones on 32-bit). So I now instead wonder why linux-2.6.18''s include/asm-x86_64/mach-xen/asm/xor.h doesn''t simply forward to asm-generic/xor.h, or at least doesn''t override the template selection logic. Jun, do you recall whether this was perhaps done without any actual measurements when the port to x86-64 was first done? Jan