Jiang, Yunhong
2010-Apr-01 09:22 UTC
[Xen-devel] [PATCH] [RFC] Fix a small window on CPU online/offline
This is an RFC patch for a small window on CPU online/offline. It is not a clean solution; I am sending it out before I finish checking all the related code, so that I can get feedback from the community.

Currently there is a small window on CPU online/offline. During take_cpu_down() in stop_machine_run() context, the CPU is marked offline and irqs are disabled. But it is only later, at play_dead() (in idle_task_exit(), called from cpu_exit_clear()), that the offlining CPU tries to sync the lazy execution state. The window is that by the time play_dead() runs, stop_machine_run() has already finished, and the vcpu whose context is out of sync may be scheduled on another CPU. This may cause several issues:

a) When the vcpu is scheduled on another CPU, that CPU will try to sync the context still held on the original CPU through flush_tlb_mask(), as in the following code in context_switch(). Because the original CPU is marked offline with irqs disabled, it will hang in flush_area_mask(). I sent changeset 21079:8ab60a883fd5 to avoid the hang.

    if ( unlikely(!cpu_isset(cpu, dirty_mask) && !cpus_empty(dirty_mask)) )
    {
        /* Other cpus call __sync_lazy_execstate from flush ipi handler. */
        flush_tlb_mask(&dirty_mask);
    }

b) However, changeset 21079 is still not the right solution, although the patch itself is OK. With that changeset the system will not hang, but the vCPU's context is still not synced.

c) Worse, when the offlining CPU executes idle_task_exit(), it may try to re-sync the vcpu context with the guest, and this will clobber the now-running vCPU.

The following patch tries to sync the vcpu context in stop_machine_run() context, so that the vCPU gets its context synced. However, it still does not resolve issue c. I'm considering also marking curr_vcpu() as idle, so that idle_task_exit() will not try to sync the context again, but I suspect that is not the right way.

Any suggestions?

BTW, the flush_local() is to make sure we flush all TLB context, so that when the CPU comes online again there is no garbage left on it, especially if the CPU has no deep C state.

--jyh

diff -r ebd84be3420a xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c	Tue Mar 30 18:31:39 2010 +0100
+++ b/xen/arch/x86/smpboot.c	Thu Apr 01 16:47:57 2010 +0800
@@ -34,6 +34,7 @@
  * Rusty Russell : Hacked into shape for new "hotplug" boot process.
  */
 #include <xen/config.h>
+#include <asm/i387.h>
 #include <xen/init.h>
 #include <xen/kernel.h>
 #include <xen/mm.h>
@@ -1308,6 +1309,22 @@ int __cpu_disable(void)
     cpu_disable_scheduler();
+
+    if ( !is_idle_vcpu(this_cpu(curr_vcpu)) )
+    {
+        struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
+        struct vcpu *v;
+
+        v = this_cpu(curr_vcpu);
+        memcpy(&v->arch.guest_context.user_regs,
+               stack_regs,
+               CTXT_SWITCH_STACK_BYTES);
+        unlazy_fpu(v);
+        current->arch.ctxt_switch_from(v);
+    }
+
+    flush_local(FLUSH_CACHE | FLUSH_TLB_GLOBAL | FLUSH_TLB);
+
     return 0;
 }
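For reference, the lazy sync being discussed is essentially the following; this is a simplified sketch of __sync_lazy_execstate() (xen/arch/x86/domain.c) as it looked in trees of this era, so details may differ from the exact source:

    int __sync_lazy_execstate(void)
    {
        unsigned long flags;
        int switch_required;

        local_irq_save(flags);

        /* If curr_vcpu != current, a guest vcpu's register state is
         * still lazily resident on this CPU and must be saved out. */
        switch_required = (this_cpu(curr_vcpu) != current);

        if ( switch_required )
            __context_switch();

        local_irq_restore(flags);

        return switch_required;
    }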
Keir Fraser
2010-Apr-01 10:25 UTC
[Xen-devel] Re: [PATCH] [RFC] Fix a small window on CPU online/offline
On 01/04/2010 10:22, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> The following patch tries to sync the vcpu context in stop_machine_run()
> context, so that the vCPU gets its context synced. However, it still does
> not resolve issue c. I'm considering also marking curr_vcpu() as idle, so
> that idle_task_exit() will not try to sync the context again, but I suspect
> that is not the right way.
>
> Any suggestions?

Painful though it is to say it, the answer may be to run the stop_machine in a 'hypervisor thread' context, as Linux does. That sidesteps the whole issue, but of course would need us to introduce a whole bunch of infrastructure which we don't have in Xen presently.

Another approach would be to defer a lot of what goes on in __cpu_disable() until play_dead()/cpu_exit_clear(). You could do the stop_machine_run() invocation from there, knowing that you can sync guest state before zapping the cpu_online_map... Actually this approach does start to unravel and need quite a lot of subtle changes itself!

I would probably investigate option A (a more Linux-y stop_machine_run), but instead of the kthread_create() infrastructure I would consider extending the purpose of our 'idle vcpus' to provide a sufficiently more generic 'hypervisor vcpu context'. For our purposes that would mean:

 1. Allow the scheduling priority of an idle vcpu to be changed to
    highest-priority (would mean some, hopefully not very major, scheduler
    surgery).
 2. Add a hook to the idle loop to call out to a hook in stop_machine.c.

Then you would loop over all online CPUs, like in Linux, whacking up the priority and getting the idle vcpu to call back to us. Umm, we would also need some kind of wait_for_completion() mechanism, which might look a bit like aspects of continue_hypercall_on_cpu() -- we would pause the running vcpu, change the schedule_tail hook, and exit. We would then get our pause count dropped when the stop_machine stuff is done, and re-gain control via schedule_tail.

Well, it wouldn't be a trivial patch by any means, but potentially not *too* bad, and less fragile than some other approaches? I think it's a major benefit that it takes us closer to Linux semantics, as this stuff is fragile, and we're already quite a way down the road of Linux but currently our stop_machine is just a bit half-arsed and causing us problems.

By the way, you could consider that c/s 21049 starts to take us down this path: the spin_trylock()/create_continuation() semantics are not totally dissimilar to Linux's mutex_lock() (in which other softirqs/vcpus can get to run while we wait for the lock to be released), which is used for the cpu-hotplug related locks in Linux.

 -- Keir
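To make option A concrete, a rough sketch of the shape it might take follows. Every name that is not existing Xen code -- sched_set_priority(), set_idle_loop_callback(), SCHED_PRIORITY_HIGHEST -- is hypothetical, invented purely for illustration; this is not a proposed implementation:

    /* Hypothetical Linux-style stop_machine built on boosted idle vcpus.
     * Assumes a scheduler hook to raise an idle vcpu's priority and a
     * per-cpu callback slot polled from the idle loop; neither exists
     * in Xen today. */
    struct stopmachine_action {
        void (*fn)(void *);
        void *arg;
    };

    static atomic_t stopmachine_done;

    static void stopmachine_idle_hook(void *data)
    {
        /* Runs in the idle vcpu of each online CPU, at highest priority. */
        struct stopmachine_action *action = data;

        action->fn(action->arg);
        atomic_inc(&stopmachine_done);
    }

    int stop_machine_run_idle(void (*fn)(void *), void *arg)
    {
        unsigned int cpu;
        struct stopmachine_action action = { .fn = fn, .arg = arg };

        atomic_set(&stopmachine_done, 0);

        for_each_online_cpu ( cpu )
        {
            /* Hypothetical hooks: boost the idle vcpu and ask it to
             * call back into stop_machine.c from its idle loop. */
            sched_set_priority(idle_vcpu[cpu], SCHED_PRIORITY_HIGHEST);
            set_idle_loop_callback(cpu, stopmachine_idle_hook, &action);
        }

        /* Crude stand-in for the wait_for_completion() mechanism
         * described above. */
        while ( atomic_read(&stopmachine_done) != num_online_cpus() )
            cpu_relax();

        return 0;
    }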
Keir Fraser
2010-Apr-01 10:33 UTC
[Xen-devel] Re: [PATCH] [RFC] Fix a small window on CPU online/offline
On 01/04/2010 11:25, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> By the way, you could consider that c/s 21049 starts to take us down this
> path: the spin_trylock()/create_continuation() semantics are not totally
> dissimilar to Linux's mutex_lock() (in which other softirqs/vcpus can get
> to run while we wait for the lock to be released), which is used for the
> cpu-hotplug related locks in Linux.

Thinking some more, we could even have more Linux-y mutex and completion semantics if we allowed vcpus to be voluntarily preemptible in the Linux way. Obviously our barrier to that currently is that we have per-cpu stacks, not per-vcpu stacks.

One way to get around this without needing to hack out the concept of per-cpu stacks would be to realise that all this mutex/completion stuff would only get used on a few control-operation paths, mainly from physdevop/sysctl/domctl -- i.e., dom0 management hypercalls. We could wrap those up in a start_preemptible()/end_preemptible() region which would alloc/free a shadow stack. Within those regions it would be safe to use new mutex/completion abstractions which could then behave just like in Linux. We would have an underlying mechanism which could copy the active stack out into the shadow, adjust schedule_tail, and then do_softirq() to get into the scheduler.

I don't know whether it is worth going this far, but perhaps it is easier to have some easier-to-use synchronisation primitives like this in the long run. You can certainly see it's going to make the dom0 paths through Xen easier to understand, and make code-borrowing from Linux a less difficult and fragile proposition.

I could help with some of this, and/or what I described in my previous email, by the way.

 -- Keir
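As an illustration of the shape such an API might take, here is a minimal sketch of start_preemptible()/end_preemptible(). The function names come from the proposal above; everything else -- the context struct, the use of a xenheap allocation for the shadow stack -- is an assumption for illustration only:

    /* Hypothetical preemptible-region bracket.  A shadow stack is
     * allocated on entry; a blocking mutex/completion primitive used
     * inside the region would copy the active per-cpu stack into it,
     * adjust schedule_tail, and do_softirq() into the scheduler, so
     * the vcpu can later resume on any CPU. */
    struct preempt_ctx {
        void *shadow_stack;
    };

    static struct preempt_ctx *start_preemptible(void)
    {
        struct preempt_ctx *ctx = xmalloc(struct preempt_ctx);

        if ( ctx == NULL )
            return NULL;

        ctx->shadow_stack = alloc_xenheap_pages(STACK_ORDER, 0);
        if ( ctx->shadow_stack == NULL )
        {
            xfree(ctx);
            return NULL;
        }

        return ctx;
    }

    static void end_preemptible(struct preempt_ctx *ctx)
    {
        free_xenheap_pages(ctx->shadow_stack, STACK_ORDER);
        xfree(ctx);
    }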
Jiang, Yunhong
2010-Apr-02 03:31 UTC
[Xen-devel] RE: [PATCH] [RFC] Fix a small window on CPU online/offline
Keir, thanks for your detailed suggestion! But I suspect that even if we backport all the kernel infrastructure, we may still need some changes for this. Per my understanding, in the kernel the task is migrated after the offline CPU is dead, and there is no lazy state sync, which means we would still need to fix this issue.

I had a short discussion with Kevin; maybe we can sync the state in cpu_disable_scheduler() if current is idle, and then set a flag so that we will not sync again in the context switch later. If current is not idle, we can leave the context switch to do the sync for us. I will investigate more to see how many changes are needed.

--jyh

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Thursday, April 01, 2010 6:25 PM
>To: Jiang, Yunhong
>Cc: Jan Beulich; xen-devel@lists.xensource.com
>Subject: Re: [PATCH] [RFC] Fix a small window on CPU online/offline
>
>On 01/04/2010 10:22, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>> The following patch tries to sync the vcpu context in stop_machine_run()
>> context, so that the vCPU gets its context synced. However, it still does
>> not resolve issue c. I'm considering also marking curr_vcpu() as idle, so
>> that idle_task_exit() will not try to sync the context again, but I
>> suspect that is not the right way.
>>
>> Any suggestions?
>
>Painful though it is to say it, the answer may be to run the stop_machine in
>a 'hypervisor thread' context, as Linux does. That sidesteps the whole
>issue, but of course would need us to introduce a whole bunch of
>infrastructure which we don't have in Xen presently.
>
>Another approach would be to defer a lot of what goes on in __cpu_disable()
>until play_dead()/cpu_exit_clear(). You could do the stop_machine_run()
>invocation from there, knowing that you can sync guest state before zapping
>the cpu_online_map... Actually this approach does start to unravel and need
>quite a lot of subtle changes itself!
>
>I would probably investigate option A (a more Linux-y stop_machine_run), but
>instead of the kthread_create() infrastructure I would consider extending
>the purpose of our 'idle vcpus' to provide a sufficiently more generic
>'hypervisor vcpu context'. For our purposes that would mean:
> 1. Allow the scheduling priority of an idle vcpu to be changed to
>highest-priority (would mean some, hopefully not very major, scheduler
>surgery).
> 2. Add a hook to the idle loop to call out to a hook in stop_machine.c.
>
>Then you would loop over all online CPUs, like in Linux, whacking up the
>priority and getting the idle vcpu to call back to us. Umm, we would also
>need some kind of wait_for_completion() mechanism, which might look a bit
>like aspects of continue_hypercall_on_cpu() -- we would pause the running
>vcpu, change the schedule_tail hook, and exit. We would then get our pause
>count dropped when the stop_machine stuff is done, and re-gain control via
>schedule_tail.
>
>Well, it wouldn't be a trivial patch by any means, but potentially not *too*
>bad, and less fragile than some other approaches? I think it's a major
>benefit that it takes us closer to Linux semantics, as this stuff is
>fragile, and we're already quite a way down the road of Linux but currently
>our stop_machine is just a bit half-arsed and causing us problems.
>
>By the way, you could consider that c/s 21049 starts to take us down this
>path: the spin_trylock()/create_continuation() semantics are not totally
>dissimilar to Linux's mutex_lock() (in which other softirqs/vcpus can get to
>run while we wait for the lock to be released), which is used for the
>cpu-hotplug related locks in Linux.
>
> -- Keir
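The approach being described might look roughly like the sketch below. The hook name, the per-cpu flag, and the exact call site are all assumptions for illustration; only __sync_lazy_execstate(), is_idle_vcpu(), and curr_vcpu are existing Xen names:

    /* Hypothetical: called from cpu_disable_scheduler() on the CPU
     * going offline, inside stop_machine_run() context. */
    static DEFINE_PER_CPU(bool_t, lazy_state_synced);

    static void sync_lazy_state_for_offline(void)
    {
        if ( is_idle_vcpu(current) && !is_idle_vcpu(this_cpu(curr_vcpu)) )
        {
            /* Current is idle, but a guest vcpu's register state is
             * still lazily resident on this CPU: save it now, and flag
             * it so idle_task_exit() does not sync (and clobber) it
             * again on the play_dead() path. */
            __sync_lazy_execstate();
            this_cpu(lazy_state_synced) = 1;
        }
        /* Otherwise current is a guest vcpu, and the context switch
         * that deschedules it will save its state for us. */
    }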
Keir Fraser
2010-Apr-02 07:14 UTC
[Xen-devel] Re: [PATCH] [RFC] Fix a small window on CPU online/offline
On 02/04/2010 04:31, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> I had a short discussion with Kevin; maybe we can sync the state in
> cpu_disable_scheduler() if current is idle, and then set a flag so that we
> will not sync again in the context switch later. If current is not idle,
> we can leave the context switch to do the sync for us. I will investigate
> more to see how many changes are needed.

Hm, actually maybe that could work. You might not even need a flag in case current is non-idle in cpu_disable_scheduler(). It might suffice to force context_switch() to do a full context switch synchronously when the CPU is going offline (see appended patch). I was thinking there was a race as soon as the cpu is cleared from cpu_online_map, but actually the problem occurs only once the vcpu is descheduled, so if we can synchronously sync its state before calling context_saved(), perhaps we are fine.

This could be a very small patch after all! :-)

 -- Keir

--- a/xen/arch/x86/domain.c	Thu Apr 01 09:55:27 2010 +0100
+++ b/xen/arch/x86/domain.c	Fri Apr 02 08:12:14 2010 +0100
@@ -1442,7 +1442,8 @@
     set_current(next);
 
-    if ( (per_cpu(curr_vcpu, cpu) == next) || is_idle_vcpu(next) )
+    if ( (per_cpu(curr_vcpu, cpu) == next) ||
+         (is_idle_vcpu(next) && !cpu_is_offline(cpu)) )
     {
         local_irq_enable();
     }
Keir Fraser
2010-Apr-02 07:27 UTC
Re: [Xen-devel] Re: [PATCH] [RFC] Fix a small window on CPU online/offline
On 02/04/2010 08:14, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:> Hm, actually maybe that could work. You might not even need a flag in case > current is non-idle in cpu_disable_scheduler(). It might suffice to force > context_switch() to do full context switch synchronously when the CPU is > going offline (see appended patch). I was thinking there was a race as soon > as the cpu is cleared from cpu_online_map, but actually the problem occurs > only once the vcpu is descheduled, so if we can synchronously sync its state > before calling context_saved(), perhaps we are fine. > > This could be a very small patch after all! :-)Perhaps even as small as the attached patch? It''s simple enough we could almost consider it for 4.0.0. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel