Yunhong, Just thinking some more about the above patch, which made the xenpf_lock and systctl_lock both retryable. I don''t think this is necessary after all, since cpu_down() is called via continue_hypercall_on_cpu(), and hence these locks are released before the hypercall continuation is executed. Hence we should be able to revert a couple of hunks of this changeset, as only cpu_add_remove_lock is actually a problem, wouldn''t you agree? -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2010-Apr-15 08:50 UTC
[Xen-devel] RE: CPU offlining patch xen-unstable:21049
I think the try_lock is not for the cpu_down(). The point is, if another CPU is trying the get the lock. Considering following scnerio: 1) cpu_down() in CPU A, and get the xenpf_lock, then call to stop_machine_run(), trying to bring all CPU to stop_machine_run context. 2) At the same time, another vcpu in CPU B do a xenpf hypercall, and try to get the xenpf_lock. If ther is no retyr for this lock, it can''t get xenpf_lock, it will never go to the softirq So the system will hang. Hope this make thing clear. Thanks --jyh>-----Original Message----- >From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >Sent: Thursday, April 15, 2010 4:31 PM >To: Jiang, Yunhong >Cc: xen-devel@lists.xensource.com >Subject: CPU offlining patch xen-unstable:21049 > >Yunhong, > >Just thinking some more about the above patch, which made the xenpf_lock and >systctl_lock both retryable. I don''t think this is necessary after all, >since cpu_down() is called via continue_hypercall_on_cpu(), and hence these >locks are released before the hypercall continuation is executed. Hence we >should be able to revert a couple of hunks of this changeset, as only >cpu_add_remove_lock is actually a problem, wouldn''t you agree? > > -- Keir >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:> I think the try_lock is not for the cpu_down(). The point is, if another CPU > is trying the get the lock. > > Considering following scnerio: > 1) cpu_down() in CPU A, and get the xenpf_lock, then call to > stop_machine_run(), trying to bring all CPU to stop_machine_run context. > 2) At the same time, another vcpu in CPU B do a xenpf hypercall, and try to > get the xenpf_lock. If ther is no retyr for this lock, it can''t get > xenpf_lock, it will never go to the softirq > So the system will hang. > > Hope this make thing clear.But CPU A doesn''t hold the xenpf_lock when it calls stop_machine_run(). It dropped it before cpu_down() got invoked, because that gets executed via continue_hypercall_on_cpu(). -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jiang, Yunhong
2010-Apr-15 10:19 UTC
[Xen-devel] RE: CPU offlining patch xen-unstable:21049
Aha, yes, you are right. So do I need create a patch, or you can simply revert some chunks? --jyh>-----Original Message----- >From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >Sent: Thursday, April 15, 2010 6:17 PM >To: Jiang, Yunhong >Cc: xen-devel@lists.xensource.com >Subject: Re: CPU offlining patch xen-unstable:21049 > >On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote: > >> I think the try_lock is not for the cpu_down(). The point is, if another CPU >> is trying the get the lock. >> >> Considering following scnerio: >> 1) cpu_down() in CPU A, and get the xenpf_lock, then call to >> stop_machine_run(), trying to bring all CPU to stop_machine_run context. >> 2) At the same time, another vcpu in CPU B do a xenpf hypercall, and try to >> get the xenpf_lock. If ther is no retyr for this lock, it can''t get >> xenpf_lock, it will never go to the softirq >> So the system will hang. >> >> Hope this make thing clear. > >But CPU A doesn''t hold the xenpf_lock when it calls stop_machine_run(). It >dropped it before cpu_down() got invoked, because that gets executed via >continue_hypercall_on_cpu(). > > -- Keir >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
I decided to keep the spin_trylock as they quell my paranoia about other possible deadlock scenarios inside those complicated hypercall functions. But I have modified the comments appropriately, in xen-unstable:21179. Note that this also depends on xen-unstable:21178 (we mustn''t execute the hypercall continuation immediately, in the context of the caller of c_h_o_c()). Thanks. But, here''s a more subtle and more tricky deadlock scenario for you. You''ll like this one :-): stop_machine_run() schedules a softirq on every CPU. Let''s say CPU A enters our softirq handler, interrupting some guest VCPU X which is still scheduled on CPU A. But some other CPU B could be waiting for X to be descheduled (one obvious example is hvmop_flush_tlb_all, which is a good one because some HVM guest can call that at any time). So we never get full softirq rendezvous because CPU B is spinning in hvmop_flush_tlb_all(), while CPU A spins in the stop_machine softirq handler. Deadlock! What do you think of that? :-D -- Keir On 15/04/2010 11:19, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:> Aha, yes, you are right. So do I need create a patch, or you can simply revert > some chunks? > > --jyh > >> -----Original Message----- >> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] >> Sent: Thursday, April 15, 2010 6:17 PM >> To: Jiang, Yunhong >> Cc: xen-devel@lists.xensource.com >> Subject: Re: CPU offlining patch xen-unstable:21049 >> >> On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote: >> >>> I think the try_lock is not for the cpu_down(). The point is, if another CPU >>> is trying the get the lock. >>> >>> Considering following scnerio: >>> 1) cpu_down() in CPU A, and get the xenpf_lock, then call to >>> stop_machine_run(), trying to bring all CPU to stop_machine_run context. >>> 2) At the same time, another vcpu in CPU B do a xenpf hypercall, and try to >>> get the xenpf_lock. If ther is no retyr for this lock, it can''t get >>> xenpf_lock, it will never go to the softirq >>> So the system will hang. >>> >>> Hope this make thing clear. >> >> But CPU A doesn''t hold the xenpf_lock when it calls stop_machine_run(). It >> dropped it before cpu_down() got invoked, because that gets executed via >> continue_hypercall_on_cpu(). >> >> -- Keir >> >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel