Hi Keir,

I noticed changeset 027812e4a63c, in which you split off
context_switch_finalise() from context_switch(). I really appreciate the
comments you added!

/*
 * Called by the scheduler to switch to another VCPU. On entry, although
 * VCPUF_running is no longer asserted for @prev, its context is still running
 * on the local CPU and is not committed to memory. The local scheduler lock
 * is therefore still held, and interrupts are disabled, because the local CPU
 * is in an inconsistent state.
 *
 * The callee must ensure that the local CPU is no longer running in @prev's
 * context, and that the context is saved to memory, before returning.
 * Alternatively, if implementing lazy context switching, it suffices to ensure
 * that invoking __sync_lazy_execstate() will switch and commit @prev's state.
 */
extern void context_switch(
    struct vcpu *prev,
    struct vcpu *next);

PowerPC has a relatively large set of (general-purpose) registers; half are
volatile and half are not. When we take an exception, we do not save the
nonvolatiles in the exception handler, since we may be returning to the same
domain anyways, and in that case C code will ensure that the nonvolatiles are
correct.

Later on, if it turns out we are switching domains, we save/restore all the
state we can, then return to the exception handler which saves the old set of
nonvolatiles and loads the new one. Until that point, some domain state is
spread arbitrarily across our stack.

That means that context_switch() cannot actually save all of @prev's state to
memory (and neither can __sync_lazy_execstate()) -- only by returning all the
way to assembly can we accomplish that.

Thoughts?

--
Hollis Blanchard
IBM Linux Technology Center

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 25 Aug 2005, at 22:55, Hollis Blanchard wrote:

> Later on, if it turns out we are switching domains, we save/restore all
> the state we can, then return to the exception handler which saves the
> old set of nonvolatiles and loads the new one. Until that point, some
> domain state is spread arbitrarily across our stack.
>
> That means that context_switch() cannot actually save all of @prev's
> state to memory (and neither can __sync_lazy_execstate()) -- only by
> returning all the way to assembly can we accomplish that.
>
> Thoughts?

What you need is a synchronisation point, visible to other CPUs, beyond
which things like DOM0_GETVCPUCONTEXT can be sure to read consistent
current state for the descheduled vcpu. See domain_sleep_sync() for the
current way we ensure that state is committed to memory.

If you have a lot of register state, have you considered maintaining a
Xen stack per VCPU? The context-switch interface already supports this,
for ia64.

 -- Keir
On Aug 26, 2005, at 4:37 AM, Keir Fraser wrote:

> On 25 Aug 2005, at 22:55, Hollis Blanchard wrote:
>
>> Later on, if it turns out we are switching domains, we save/restore
>> all the state we can, then return to the exception handler which
>> saves the old set of nonvolatiles and loads the new one. Until that
>> point, some domain state is spread arbitrarily across our stack.
>>
>> That means that context_switch() cannot actually save all of @prev's
>> state to memory (and neither can __sync_lazy_execstate()) -- only by
>> returning all the way to assembly can we accomplish that.
>>
>> Thoughts?
>
> What you need is a synchronisation point, visible to other CPUs,
> beyond which things like DOM0_GETVCPUCONTEXT can be sure to read
> consistent current state for the descheduled vcpu. See
> domain_sleep_sync() for the current way we ensure that state is
> committed to memory.

Hmmmmm. I think the basic problem is that in the exception handler we
don't usually know we will need this state. The exception is a debug
exception, where we know we will need it for the GDB stub.

However, we also have a hypervisor-dedicated timer, HDEC (hypervisor
decrementer). Rather than using it as a plain tick which may or may not
cause a scheduler exception, we can use it to *always* mean a context
switch. In that case, we would always save the full state on HDEC entry,
because we know it will always cause a context switch. Judging by
set_ac_timer() callers, it seems that only the scheduler really uses the
Xen timer tick. If non-scheduler components start using Xen-internal
ticks, this approach wouldn't hold up (or rather, it would start
becoming less efficient).

Would that also work for DOM0_GETVCPUCONTEXT? Let's assume the dom0 vcpu
and the target vcpu are running on separate dedicated processors. In
that case, dom0 could wait for the target vcpu to take an HDEC at some
point in the future, but if it really is a dedicated vcpu then we would
want the schedule interval to be the maximum, so that could be a long
time. Another option is to have vcpu_pause() end up resetting the target
vcpu's processor's HDEC via an IPI, which would cause a fake scheduler
HDEC to go off, synchronizing the target vcpu's state. What do you
think?

> If you have a lot of register state, have you considered maintaining a
> Xen stack per VCPU? The context-switch interface already supports
> this, for ia64.

We have plenty of space on the per-CPU stack for the register state (we
use it anyways on a debug exception for the GDB stub). And even if we
had one stack per VCPU, we would still want to avoid unnecessarily
saving/restoring the nonvolatiles...

--
Hollis Blanchard
IBM Linux Technology Center
On 26 Aug 2005, at 17:38, Hollis Blanchard wrote:

> Hmmmmm. I think the basic problem is that in the exception handler we
> don't usually know we will need this state. The exception is a debug
> exception, where we know we will need it for the GDB stub.
>
> However, we also have a hypervisor-dedicated timer, HDEC (hypervisor
> decrementer). Rather than using it as a plain tick which may or may
> not cause a scheduler exception, we can use it to *always* mean a
> context switch. In that case, we would always save the full state on
> HDEC entry, because we know it will always cause a context switch.
> Judging by set_ac_timer() callers, it seems that only the scheduler
> really uses the Xen timer tick. If non-scheduler components start
> using Xen-internal ticks, this approach wouldn't hold up (or rather,
> it would start becoming less efficient).

Why not move the non-volatile save/restore into your context switch
routine, rather than deferring it until you exit the hypervisor?

 -- Keir
On Aug 26, 2005, at 12:14 PM, Keir Fraser wrote:

> On 26 Aug 2005, at 17:38, Hollis Blanchard wrote:
>
>> Hmmmmm. I think the basic problem is that in the exception handler we
>> don't usually know we will need this state. The exception is a debug
>> exception, where we know we will need it for the GDB stub.
>>
>> However, we also have a hypervisor-dedicated timer, HDEC (hypervisor
>> decrementer). Rather than using it as a plain tick which may or may
>> not cause a scheduler exception, we can use it to *always* mean a
>> context switch. In that case, we would always save the full state on
>> HDEC entry, because we know it will always cause a context switch.
>> Judging by set_ac_timer() callers, it seems that only the scheduler
>> really uses the Xen timer tick. If non-scheduler components start
>> using Xen-internal ticks, this approach wouldn't hold up (or rather,
>> it would start becoming less efficient).
>
> Why not move the non-volatile save/restore into your context switch
> routine, rather than deferring it until you exit the hypervisor?

This is a key point. r14-r31 are nonvolatile in our C ABI, which means
that callees must preserve their contents for callers. At a high level,
our exception handlers look like this:

    exception:
        save r0-r13 to cpu_user_regs
        call c_handler
        restore r0-r13
        return

We know that c_handler() will use r14-r31, but we also know that when it
returns, their contents will have been restored. So saving and restoring
them in assembly would be a waste of time.

context_switch() will be called from somewhere beneath c_handler(). At
that point, the original nonvolatiles will have been saved across many
stack frames (starting with c_handler()'s), so we really are unable to
access them at this point. However, we trust that by the time we get
back to the exception handler, the original nonvolatiles will have been
restored off all those stack frames.
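The split between the two register halves can be modelled roughly in C. This is only a sketch under stated assumptions: `cpu_user_regs_sketch`, `nonvol_valid`, `save_volatiles()` and `save_nonvolatiles()` are hypothetical names invented for illustration, the `live` array stands in for the hardware registers, and the real entry code would be assembly rather than C.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_GPRS      32
#define FIRST_NONVOL  14   /* r14-r31 are nonvolatile in the C ABI */

/* Hypothetical model of the register frame filled in on exception entry. */
struct cpu_user_regs_sketch {
    uint64_t gpr[NUM_GPRS];
    bool     nonvol_valid;  /* true only if gpr[14..31] were saved as well */
};

/* Fast path: what every exception entry does (r0-r13 only). */
static void save_volatiles(struct cpu_user_regs_sketch *frame,
                           const uint64_t live[NUM_GPRS])
{
    for (int r = 0; r < FIRST_NONVOL; r++)
        frame->gpr[r] = live[r];
    frame->nonvol_valid = false;
}

/* Slow path: needed only when the full domain state must be in memory. */
static void save_nonvolatiles(struct cpu_user_regs_sketch *frame,
                              const uint64_t live[NUM_GPRS])
{
    for (int r = FIRST_NONVOL; r < NUM_GPRS; r++)
        frame->gpr[r] = live[r];
    frame->nonvol_valid = true;
}
```

In this model `save_nonvolatiles()` is only meaningful at the handler level, where r14-r31 still hold the domain's values; once C code is running beneath c_handler(), the live registers may hold hypervisor values instead.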
--
Hollis Blanchard
IBM Linux Technology Center
On 26 Aug 2005, at 18:40, Hollis Blanchard wrote:

> context_switch() will be called from somewhere beneath c_handler(). At
> that point, the original nonvolatiles will have been saved across many
> stack frames (starting with c_handler()'s), so we really are unable to
> access them at this point. However, we trust that by the time we get
> back to the exception handler, the original nonvolatiles will have
> been restored off all those stack frames.

Hmmmm... Anyone interested in the state of a paused vcpu will call
sync_vcpu_execstate() after descheduling the vcpu. That is an entirely
arch-specific function that you can define to wait until the
non-volatile registers are safely saved to memory.

Maybe you could add a flag to the arch-specific portion of the vcpu
structure that gets set after the non-volatile registers are saved to
memory and cleared when they are restored to active use. Then
sync_vcpu_execstate() can spin on that flag waiting for it to be
non-zero.

 -- Keir
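A minimal sketch of the flag idea, using C11 atomics, might look like the following. Everything here is an assumption made for illustration: `arch_vcpu_sketch`, `nonvol_saved`, `mark_nonvol_saved()`, `mark_nonvol_live()` and `sync_vcpu_execstate_sketch()` are hypothetical names, not Xen code, and a real implementation would use the hypervisor's own barrier and cpu-relax primitives rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_NONVOL 18  /* r14-r31 */

/* Hypothetical arch-specific portion of the vcpu structure. */
struct arch_vcpu_sketch {
    uint64_t   nonvol[NUM_NONVOL];  /* saved copies of r14-r31 */
    atomic_int nonvol_saved;        /* non-zero once nonvol[] is in memory */
};

/* Called after r14-r31 have been written into v->nonvol[]. */
static void mark_nonvol_saved(struct arch_vcpu_sketch *v)
{
    /* Release: the register stores become visible before the flag does. */
    atomic_store_explicit(&v->nonvol_saved, 1, memory_order_release);
}

/* Called when the saved values are loaded back into the live registers. */
static void mark_nonvol_live(struct arch_vcpu_sketch *v)
{
    atomic_store_explicit(&v->nonvol_saved, 0, memory_order_relaxed);
}

/* Stand-in for an arch-specific sync_vcpu_execstate(): spin on the flag. */
static void sync_vcpu_execstate_sketch(struct arch_vcpu_sketch *v)
{
    /* Acquire pairs with the release above, so nonvol[] is safe to read. */
    while (!atomic_load_explicit(&v->nonvol_saved, memory_order_acquire))
        ;  /* busy-wait; real code would add a pause/relax hint here */
}
```

The release/acquire pairing is the design point: a remote CPU that sees the flag set is also guaranteed to see the saved register contents, which is the synchronisation point asked for earlier in the thread.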
On Friday 26 August 2005 13:06, Keir Fraser wrote:

> On 26 Aug 2005, at 18:40, Hollis Blanchard wrote:
> > context_switch() will be called from somewhere beneath c_handler(). At
> > that point, the original nonvolatiles will have been saved across many
> > stack frames (starting with c_handler()'s), so we really are unable to
> > access them at this point. However, we trust that by the time we get
> > back to the exception handler, the original nonvolatiles will have
> > been restored off all those stack frames.
>
> Hmmmm... Anyone interested in the state of a paused vcpu will call
> sync_vcpu_execstate() after descheduling the vcpu. That is an entirely
> arch-specific function that you can define to wait until the
> non-volatile registers are safely saved to memory.
>
> Maybe you could add a flag to the arch-specific portion of the vcpu
> structure that gets set after the non-volatile registers are saved to
> memory and cleared when they are restored to active use. Then
> sync_vcpu_execstate() can spin on that flag waiting for it to be
> non-zero.

I think this could work, in conjunction with using the HDEC timer solely
as a context-switch interrupt (so saving all nonvolatiles on HDEC).

However, if we ever want to context switch for other reasons (e.g. a
"yield" hypercall), we're back to the same problem: the hcall exception
handler won't save the nonvolatiles...

--
Hollis Blanchard
IBM Linux Technology Center
On 26 Aug 2005, at 20:24, Hollis Blanchard wrote:

> However, if we ever want to context switch for other reasons (e.g. a
> "yield" hypercall), we're back to the same problem: the hcall
> exception handler won't save the nonvolatiles...

Instead of save/restore in just one exception handler, you could check a
flag on the exit path of all exception handlers (maybe the same flag
that sync_vcpu_execstate spins on). If that flag is clear you know you
need to take a slower path that saves the non-volatiles into the old
vcpu struct and loads from the new vcpu struct.

 -- Keir
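One way to read the exit-path check is sketched below in C. It is an interpretation, not something the thread spells out: the per-CPU `switched_from` pointer (how the exit path learns which vcpu was descheduled) is an assumption, as are all the names (`vcpu_sketch`, `cpu_sketch`, `exception_exit_sketch()`), and `live_nonvol` merely stands in for hardware r14-r31, which real code would handle in assembly.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_NONVOL 18  /* r14-r31 */

/* Hypothetical vcpu state: saved nonvolatiles plus the "saved" flag. */
struct vcpu_sketch {
    uint64_t   nonvol[NUM_NONVOL];
    atomic_int nonvol_saved;  /* non-zero once nonvol[] is in memory */
};

/* Hypothetical per-CPU state. */
struct cpu_sketch {
    uint64_t live_nonvol[NUM_NONVOL];   /* stands in for hardware r14-r31 */
    struct vcpu_sketch *switched_from;  /* set by the context switch, else NULL */
    struct vcpu_sketch *current;        /* vcpu we are about to return to */
};

/* Run on the exit path of every exception; returns true on the slow path. */
static bool exception_exit_sketch(struct cpu_sketch *cpu)
{
    struct vcpu_sketch *prev = cpu->switched_from;

    if (prev == NULL)
        return false;  /* fast path: returning to the vcpu that entered */

    /* Flag clear: prev's nonvolatiles are still only in the registers. */
    if (!atomic_load_explicit(&prev->nonvol_saved, memory_order_acquire)) {
        memcpy(prev->nonvol, cpu->live_nonvol, sizeof(prev->nonvol));
        atomic_store_explicit(&prev->nonvol_saved, 1, memory_order_release);
    }

    /* Load the incoming vcpu's nonvolatiles and mark them live again. */
    memcpy(cpu->live_nonvol, cpu->current->nonvol, sizeof(cpu->live_nonvol));
    atomic_store_explicit(&cpu->current->nonvol_saved, 0, memory_order_relaxed);

    cpu->switched_from = NULL;
    return true;
}
```

In this reading the common case costs one pointer test on exit, and the full r14-r31 save and load happens only when a switch is actually pending, whichever exception (HDEC, hcall, or anything else) led to it.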
Hi Hollis --

Sorry for the late reply... keeping up with xen-devel is getting tough!

How does Linux do this on Power? Xen/ia64 heavily leverages the
equivalent Linux/ia64 code. As you may know, Linux/ia64 scatters state
all over memory and uses "unwind descriptors" so that it can recover all
the state. I'd imagine Power does something similar...

Dan
On Tuesday 30 August 2005 10:25, Dan Magenheimer wrote:

> Sorry for the late reply... keeping up with xen-devel
> is getting tough!

Yes, I just occasionally skim...

> How does Linux do this on Power? Xen/ia64 heavily
> leverages the equivalent Linux/ia64 code. As you
> may know, Linux/ia64 scatters state all over
> memory and uses "unwind descriptors" so that it
> can recover all the state. I'd imagine Power does
> something similar...

Linux has a per-kernel thread stack, so '_switch' saves the current
(i.e. kernel, not original usermode) nonvolatiles to the previous task
structure. For Xen/PPC, we use a per-cpu stack, so we need to save the
original (domain) nonvolatiles to the previous domain structure.

--
Hollis Blanchard
IBM Linux Technology Center