Jan Beulich
2008-Apr-29 12:39 UTC
[Xen-devel] x86's context switch ordering of operations
In the process of designing a reasonable mechanism to support some advanced debugging features for pv guests (last exception record MSRs, last branch stack MSRs after #DE, DS area), I was considering adding another shared state area (similar to the relocated vCPU info, but read-only to the guest and not permanently mapped), where the hypervisor could store relevant information that otherwise can get destroyed before the guest is able to pick it up, as well as state the CPU is to use but which the guest must not be able to modify directly (and extensible to a reasonable degree to support future hardware enhancements).

To do so, I was considering using {un,}map_domain_page() from the context switch path, but there are two major problems with the ordering of operations:
- for the outgoing task, 'current' is changed before the ctxt_switch_from() hook is called
- for the incoming task, write_ptbase() happens only after the ctxt_switch_to() hook was already called
I'm wondering whether there are hidden dependencies that require this particular (somewhat non-natural) ordering.

While looking into this, I noticed two things that I'm not quite clear on regarding VCPUOP_register_vcpu_info:

1) How does the storing of vcpu_info_mfn in the hypervisor survive migration or save/restore? The mainline Linux code, which uses this hypercall, doesn't appear to make any attempt to revert to using the default location during suspend or to re-set up the alternate location during resume (but of course I'm not sure that guest is save/restore/migrate-ready in the first place). I would imagine it to be at least difficult for the guest to manage its state post-resume without the hypervisor having restored the previously established alternative placement.

2) The implementation in the hypervisor seems to have added yet another scalability issue (on 32 bits), as this is being carried out using map_domain_page_global() - if there are sufficiently many guests with sufficiently many vCPUs, there just won't be any space left at some point. This worries me especially in the context of seeing a call to sh_map_domain_page_global() that is followed by a BUG_ON() checking whether the call failed.

Jan
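For reference, a minimal sketch of the ordering being questioned. The hook and function names follow the message above; the function body is illustrative only, not the actual xen/arch/x86/domain.c code, and the Xen-internal types are assumed to be in scope.

/*
 * Illustrative ordering only: 'current' is updated before the outgoing
 * hook runs, and the incoming guest's page tables are installed only
 * after the incoming hook has already run.
 */
void context_switch_ordering_sketch(struct vcpu *prev, struct vcpu *next)
{
    set_current(next);                  /* 'current' already points at next ...   */

    prev->arch.ctxt_switch_from(prev);  /* ... by the time the outgoing hook runs */

    next->arch.ctxt_switch_to(next);    /* incoming hook runs first ...           */

    write_ptbase(next);                 /* ... guest page tables loaded only here */
}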
Keir Fraser
2008-Apr-29 12:50 UTC
Re: [Xen-devel] x86's context switch ordering of operations
On 29/4/08 13:39, "Jan Beulich" <jbeulich@novell.com> wrote:

> To do so, I was considering using {un,}map_domain_page() from
> the context switch path, but there are two major problems with the
> ordering of operations:
> - for the outgoing task, 'current' is changed before the
> ctxt_switch_from() hook is called
> - for the incoming task, write_ptbase() happens only after the
> ctxt_switch_to() hook was already called
> I'm wondering whether there are hidden dependencies that require
> this particular (somewhat non-natural) ordering.

ctxt_switch_{from,to} exist only in x86 Xen and are called from a single hook point out from the common scheduler. Thus either they both happen before, or both happen after, current is changed by the common scheduler. It took a while for the scheduler interfaces to settle down to something both x86 and ia64 were happy with, so I'm not particularly excited about revisiting them. I'm not sure why you'd want to map_domain_page() on context switch anyway. The map_domain_page() 32-bit implementation is inherently per-domain already.

> 1) How does the storing of vcpu_info_mfn in the hypervisor survive
> migration or save/restore? The mainline Linux code, which uses this
> hypercall, doesn't appear to make any attempt to revert to using the
> default location during suspend or to re-set up the alternate location
> during resume (but of course I'm not sure that guest is save/restore/
> migrate-ready in the first place). I would imagine it to be at least
> difficult for the guest to manage its state post-resume without the
> hypervisor having restored the previously established alternative
> placement.

I don't see that it would be hard for the guest to do it itself before bringing back all VCPUs (either by bringing them up or by exiting the stopmachine state). Is save/restore even supported by pv_ops kernels yet?

> 2) The implementation in the hypervisor seems to have added yet another
> scalability issue (on 32 bits), as this is being carried out using
> map_domain_page_global() - if there are sufficiently many guests with
> sufficiently many vCPUs, there just won't be any space left at some
> point. This worries me especially in the context of seeing a call to
> sh_map_domain_page_global() that is followed by a BUG_ON() checking
> whether the call failed.

The hypervisor generally assumes that vcpu_info's are permanently and globally mapped. That obviously places an unavoidable scalability limit for 32-bit Xen. I have no problem with telling people who are concerned about the limit to use 64-bit Xen instead.

 -- Keir
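For what it's worth, the guest-side re-registration on the resume path could be as small as the following sketch. VCPUOP_register_vcpu_info, struct vcpu_register_vcpu_info and HYPERVISOR_vcpu_op are the existing public-interface pieces; the helper name and the saved mfn/offset bookkeeping are hypothetical.

/* Hypothetical resume-path helper: re-establish the relocated vcpu_info
 * placement for one vCPU before bringing it back up.  Assumes the guest
 * recorded 'saved_mfn'/'saved_offset' when it first registered the area. */
static int restore_vcpu_info_placement(unsigned int cpu,
                                       uint64_t saved_mfn,
                                       uint32_t saved_offset)
{
    struct vcpu_register_vcpu_info info = {
        .mfn    = saved_mfn,
        .offset = saved_offset,
    };

    /* Must run before this vCPU starts relying on the relocated area again. */
    return HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);
}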
Jan Beulich
2008-Apr-29 13:39 UTC
Re: [Xen-devel] x86's context switch ordering of operations
>>> Keir Fraser <keir.fraser@eu.citrix.com> 29.04.08 14:50 >>>
> On 29/4/08 13:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> To do so, I was considering using {un,}map_domain_page() from
>> the context switch path, but there are two major problems with the
>> ordering of operations:
>> - for the outgoing task, 'current' is changed before the
>> ctxt_switch_from() hook is called
>> - for the incoming task, write_ptbase() happens only after the
>> ctxt_switch_to() hook was already called
>> I'm wondering whether there are hidden dependencies that require
>> this particular (somewhat non-natural) ordering.
>
> ctxt_switch_{from,to} exist only in x86 Xen and are called from a single
> hook point out from the common scheduler. Thus either they both happen
> before, or both happen after, current is changed by the common scheduler. It

Maybe I'm mistaken (or it is being done twice with no good reason), but I see a set_current(next) in x86's context_switch() ...

> took a while for the scheduler interfaces to settle down to something both
> x86 and ia64 were happy with, so I'm not particularly excited about revisiting
> them. I'm not sure why you'd want to map_domain_page() on context switch
> anyway. The map_domain_page() 32-bit implementation is inherently per-domain
> already.

If pages mapped that way survive context switches, then it would certainly be possible to map them once and keep them until no longer needed. Doing this during context switch was more an attempt to conserve virtual address use (so other vCPUs of the same guest not using this functionality would have less chance of running out of space). The background is that I think it'll also be necessary to extend MAX_VIRT_CPUS beyond 32 at some not too distant point (at least in dom0 for CPU frequency management - or do you have another scheme in mind for dealing with systems having more than 32 CPU threads?), resulting in more pressure on the address space.

>> 2) The implementation in the hypervisor seems to have added yet another
>> scalability issue (on 32 bits), as this is being carried out using
>> map_domain_page_global() - if there are sufficiently many guests with
>> sufficiently many vCPUs, there just won't be any space left at some
>> point. This worries me especially in the context of seeing a call to
>> sh_map_domain_page_global() that is followed by a BUG_ON() checking
>> whether the call failed.
>
> The hypervisor generally assumes that vcpu_info's are permanently and
> globally mapped. That obviously places an unavoidable scalability limit for
> 32-bit Xen. I have no problem with telling people who are concerned about
> the limit to use 64-bit Xen instead.

I know your position here, but - are all 32-on-64 migration/save/restore issues meanwhile resolved (that is, can the tools meanwhile deal with domains of either size no matter whether using a 32- or 64-bit dom0)? If not, there may be reasons beyond that of needing vm86 mode that might force people to stay with 32-bit Xen. (I certainly agree that there are unavoidable limitations, but obviously there is a big difference between requiring 64 bytes and 4k per vCPU for this particular functionality.)

Jan
Keir Fraser
2008-Apr-29 13:58 UTC
Re: [Xen-devel] x86's context switch ordering of operations
On 29/4/08 14:39, "Jan Beulich" <jbeulich@novell.com> wrote:

>> ctxt_switch_{from,to} exist only in x86 Xen and are called from a single
>> hook point out from the common scheduler. Thus either they both happen
>> before, or both happen after, current is changed by the common scheduler. It
>
> Maybe I'm mistaken (or it is being done twice with no good reason), but
> I see a set_current(next) in x86's context_switch() ...

Um, good point, I'd forgotten exactly how the code fitted together. Anyhow, the reason you see ctxt_switch_{from,to} happening after set_current() is because context_switch() and __context_switch() can actually be decoupled. When switching to the idle vcpu we run context_switch() but we do not run __context_switch().

> If pages mapped that way survive context switches, then it would
> certainly be possible to map them once and keep them until no longer
> needed. Doing this during context switch was more an attempt to
> conserve virtual address use (so other vCPUs of the same guest
> not using this functionality would have less chance of running out
> of space). The background is that I think it'll also be necessary
> to extend MAX_VIRT_CPUS beyond 32 at some not too distant point
> (at least in dom0 for CPU frequency management - or do you have
> another scheme in mind for dealing with systems having more than
> 32 CPU threads?), resulting in more pressure on the address space.

I'm hoping that Intel's patches to allow uniproc dom0 to perform multiproc Cx and Px state management will be acceptable. Apart from that, yes, we may have to increase MAX_VIRT_CPUS.

> I know your position here, but - are all 32-on-64 migration/save/restore
> issues meanwhile resolved (that is, can the tools meanwhile deal with
> domains of either size no matter whether using a 32- or 64-bit dom0)? If
> not, there may be reasons beyond that of needing vm86 mode that
> might force people to stay with 32-bit Xen. (I certainly agree that there
> are unavoidable limitations, but obviously there is a big difference
> between requiring 64 bytes and 4k per vCPU for this particular
> functionality.)

I don't really see a few kilobytes of overhead per vcpu as very significant. Given the limitations of the map_domain_page_global() address space, we're limiting ourselves to probably around 700-800 vcpus. That's quite a lot imo!

I'm not sure on our position regarding 32-on-64 save/restore compatibility. Tim Deegan made some patches a while ago, but that was mainly focused on correctly saving 64-bit HVM domUs from a 32-bit dom0. I also know that Oracle had some patches they floated a while ago. I don't think they ever got posted for inclusion into xen-unstable though.

*However* I do know that I'd rather we spent time fixing 32-on-64 save/restore compatibility than fretting about and optimising 32-bit Xen scalability. The former has greater long-term usefulness.

 -- Keir
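A rough sketch of that decoupling, illustrative only: the per-CPU 'curr_vcpu' bookkeeping and the exact conditions are simplified relative to the real context_switch()/__context_switch() pair, and the function names are suffixed to make clear this is not the actual code.

static void __context_switch_sketch(void)
{
    unsigned int cpu = smp_processor_id();
    struct vcpu *p = per_cpu(curr_vcpu, cpu);  /* vCPU whose state is still loaded */
    struct vcpu *n = current;                  /* already updated by set_current() */

    p->arch.ctxt_switch_from(p);   /* may therefore run with current != p */
    n->arch.ctxt_switch_to(n);
    write_ptbase(n);               /* page tables switched only after the hook */

    per_cpu(curr_vcpu, cpu) = n;
}

void context_switch_sketch(struct vcpu *prev, struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();

    set_current(next);             /* happens unconditionally ... */

    /* ... but the hooks only run when switching to a non-idle vCPU whose
     * state is not already loaded; the switch-to-idle case skips them,
     * leaving the previous guest's state lazily in place. */
    if ( !is_idle_vcpu(next) && per_cpu(curr_vcpu, cpu) != next )
        __context_switch_sketch();
}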
Jan Beulich
2008-Apr-29 15:37 UTC
Re: [Xen-devel] x86's context switch ordering of operations
>>> Keir Fraser <keir.fraser@eu.citrix.com> 29.04.08 15:58 >>>
> Um, good point, I'd forgotten exactly how the code fitted together. Anyhow,
> the reason you see ctxt_switch_{from,to} happening after set_current() is
> because context_switch() and __context_switch() can actually be decoupled.
> When switching to the idle vcpu we run context_switch() but we do not run
> __context_switch().

Okay, that could be easily dealt with by doing set_current() explicitly in the switch-to-idle case, and moving it into __context_switch() in the other cases.

Any word on the significance of doing write_ptbase() after calling ctxt_switch_to()?

Jan
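Spelling the proposal out as a sketch - a hypothetical rearrangement, not a patch, and as Keir's reply below notes, an intermediate switch to idle still defeats the intent:

static void __context_switch_proposed(struct vcpu *next);

void context_switch_proposed(struct vcpu *prev, struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();

    if ( is_idle_vcpu(next) || per_cpu(curr_vcpu, cpu) == next )
        set_current(next);                /* switch-to-idle / lazy case: no hooks */
    else
        __context_switch_proposed(next);  /* set_current() moves in here */
}

static void __context_switch_proposed(struct vcpu *next)
{
    unsigned int cpu = smp_processor_id();
    struct vcpu *p = per_cpu(curr_vcpu, cpu);

    p->arch.ctxt_switch_from(p);  /* intent: outgoing hook before 'current' changes */
    set_current(next);
    next->arch.ctxt_switch_to(next);
    write_ptbase(next);

    per_cpu(curr_vcpu, cpu) = next;
}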
Keir Fraser
2008-Apr-29 16:52 UTC
Re: [Xen-devel] x86's context switch ordering of operations
On 29/4/08 16:37, "Jan Beulich" <jbeulich@novell.com> wrote:

> Okay, that could be easily dealt with by doing set_current() explicitly
> in the switch-to-idle case, and moving it into __context_switch() in
> the other cases.

It wouldn't really help you. If you switch from VCPU A to idle and then to VCPU B, you would still end up calling ctxt_switch_from(A) when current == idle.

> Any word on the significance of doing write_ptbase() after calling
> ctxt_switch_to()?

It probably could be done earlier.

 -- Keir
Jeremy Fitzhardinge
2008-Apr-29 17:03 UTC
Re: [Xen-devel] x86's context switch ordering of operations
Jan Beulich wrote:

> 1) How does the storing of vcpu_info_mfn in the hypervisor survive
> migration or save/restore? The mainline Linux code, which uses this
> hypercall, doesn't appear to make any attempt to revert to using the
> default location during suspend or to re-set up the alternate location
> during resume (but of course I'm not sure that guest is save/restore/
> migrate-ready in the first place). I would imagine it to be at least
> difficult for the guest to manage its state post-resume without the
> hypervisor having restored the previously established alternative
> placement.

The only kernel which uses it is 32-on-32 pvops, and that doesn't currently support migration. It would be easy for the guest to restore that state for itself shortly after resuming.

I still need to add 32-on-64 and 64-on-64 implementations for this. Just haven't looked at it yet.

> 2) The implementation in the hypervisor seems to have added yet another
> scalability issue (on 32 bits), as this is being carried out using
> map_domain_page_global() - if there are sufficiently many guests with
> sufficiently many vCPUs, there just won't be any space left at some
> point. This worries me especially in the context of seeing a call to
> sh_map_domain_page_global() that is followed by a BUG_ON() checking
> whether the call failed.

Yes, we discussed it, and, erm, don't do that. Guests should be able to deal with VCPUOP_register_vcpu_info failing, but that doesn't address overall heap starvation.

J
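A sketch of the guest-side fallback Jeremy alludes to, assuming the usual pvops environment (HYPERVISOR_vcpu_op, HYPERVISOR_shared_info and the public vcpu_info structures); the helper name and its parameters are hypothetical.

/* Hypothetical boot-time helper: try to relocate this vCPU's vcpu_info into
 * a per-cpu area, and fall back to the fixed shared_info slot if the
 * hypervisor refuses (e.g. because its global map space is exhausted). */
static struct vcpu_info *setup_vcpu_info(unsigned int cpu,
                                         struct vcpu_info *percpu_area,
                                         uint64_t area_mfn,
                                         uint32_t area_offset)
{
    struct vcpu_register_vcpu_info info = {
        .mfn    = area_mfn,
        .offset = area_offset,
    };

    if (HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info) == 0)
        return percpu_area;                     /* relocated placement in use */

    /* Registration failed: keep using the default slot in shared_info. */
    return &HYPERVISOR_shared_info->vcpu_info[cpu];
}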