>>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> We created a scenario to reproduce the bug:
> ----------------------------------------------------------------
> // proc1/proc1.2 are 2 threads sharing one page table.
> // proc1 is the parent of proc2.
>
> proc1               proc2         proc1.2
> ...                 ...           // in gup_pte_range()
> ...                 ...           pte = gup_get_pte()
> ...                 ...           page1 = pte_page(pte)    // (1)
> do_wp_page(page1)   ...           ...
> ...                 exit_map()    ...
> ...                 ...           get_page(page1)          // (2)
> -----------------------------------------------------------------
>
> do_wp_page() and exit_map() cause page1 to be released into the free list
> before get_page() in proc1.2 is called. The longer the delay between
> (1) and (2), the easier the BUG_ON shows.

The scenario indeed seems to apply independent of virtualization,
but the window obviously can be unbounded unless running native.

However, going through all the comments in gup.c again I wonder
whether pv Xen guests don't violate the major assumption: there
is talk about interrupts being off preventing (or sufficiently
deferring) remote CPUs doing TLB flushes. In pv Xen guests,
however, non-local TLB flushes do not happen by sending IPIs -
the hypercall interface gets used instead. If that's indeed the
case, I would have expected quite a few bug reports, but I'm
unaware of any - Nick, am I overlooking something here?

Jan
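For readers unfamiliar with the fast path being discussed, the walk looks
roughly like the sketch below. This is a simplified sketch loosely following
the x86 gup_pte_range() of that era, not the verbatim kernel code; the helper
names match the scenario above, but the surrounding structure is abbreviated.
The whole thing runs with local interrupts disabled, which is the assumption
under discussion: holding off the remote TLB-flush IPI is what is supposed to
hold off the page free between points (1) and (2).

/*
 * Simplified sketch of the x86 fast-GUP pte walk (not the verbatim
 * kernel code).  Runs with local interrupts disabled; the assumption
 * is that this defers the remote TLB-flush IPI, and therefore the
 * page free, until after point (2).
 */
static int gup_pte_range_sketch(pmd_t pmd, unsigned long addr,
				unsigned long end, int write,
				struct page **pages, int *nr)
{
	unsigned long mask = _PAGE_PRESENT | _PAGE_USER;
	pte_t *ptep;

	if (write)
		mask |= _PAGE_RW;

	ptep = pte_offset_map(&pmd, addr);
	do {
		pte_t pte = gup_get_pte(ptep);	/* snapshot the pte */
		struct page *page;

		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
			pte_unmap(ptep);
			return 0;		/* bail to the slow gup path */
		}

		page = pte_page(pte);		/* (1) pte -> struct page */

		/*
		 * The window: if do_wp_page() plus exit_map() on other
		 * CPUs complete here, "page" may already be back on the
		 * free list when the reference is taken below.
		 */
		get_page(page);			/* (2) the reported BUG_ON fires here */
		pages[(*nr)++] = page;
	} while (ptep++, addr += PAGE_SIZE, addr != end);
	pte_unmap(ptep - 1);

	return 1;
}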
Peter Zijlstra
2011-Jan-27 15:01 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On Thu, 2011-01-27 at 14:49 +0000, Jan Beulich wrote:
> >>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> > We created a scenario to reproduce the bug:
> > ----------------------------------------------------------------
> > // proc1/proc1.2 are 2 threads sharing one page table.
> > // proc1 is the parent of proc2.
> >
> > proc1               proc2         proc1.2
> > ...                 ...           // in gup_pte_range()
> > ...                 ...           pte = gup_get_pte()
> > ...                 ...           page1 = pte_page(pte)    // (1)
> > do_wp_page(page1)   ...           ...
> > ...                 exit_map()    ...
> > ...                 ...           get_page(page1)          // (2)
> > -----------------------------------------------------------------
> >
> > do_wp_page() and exit_map() cause page1 to be released into the free list
> > before get_page() in proc1.2 is called. The longer the delay between
> > (1) and (2), the easier the BUG_ON shows.
>
> The scenario indeed seems to apply independent of virtualization,
> but the window obviously can be unbounded unless running native.
>
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: there
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead. If that's indeed the
> case, I would have expected quite a few bug reports, but I'm
> unaware of any - Nick, am I overlooking something here?

Indeed, the delay of TLB flush IPIs should ensure that the pages
aren't freed and should cover the race with unmap.

If Xen violates this then Xen needs to fix this somehow..
Jeremy Fitzhardinge
2011-Jan-27 18:27 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/27/2011 06:49 AM, Jan Beulich wrote:
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: there
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead

Yes, I was aware of that synchronization mechanism, and I think I'd
convinced myself we were OK. But I can't think what that reasoning was -
perhaps it was something as simple as "gupf isn't used under Xen" (which
may have been true at the time, but isn't now).

As clever as it is, the whole "disable interrupts -> hold off IPI ->
prevent TLB flush -> delay freeing" chain seems pretty fragile. I guess
it's OK if we assume that x86 will forever have IPI-based cross-CPU TLB
flushes, but one could imagine some kind of "remote TLB shootdown using
bus transaction" appearing in the architecture.

And even just considering virtualization, having non-IPI-based TLB
shootdown is a measurable performance win, since a hypervisor can
optimise away a cross-VCPU shootdown if it knows no physical TLB
contains the target VCPU's entries. I can imagine the KVM folks could
get some benefit from that as well.

So is there some way we can preserve the current scheme's benefits while
making it a bit more general? (If anyone else has non-IPI-based
shootdown, it would be s390; is there some inspiration there? An
instruction perhaps?)

J
Peter Zijlstra
2011-Jan-27 19:27 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On Thu, 2011-01-27 at 10:27 -0800, Jeremy Fitzhardinge wrote:
>
> So is there some way we can preserve the current scheme's benefits while
> making it a bit more general? (If anyone else has non-IPI-based
> shootdown, it would be s390; is there some inspiration there? An
> instruction perhaps?)

Well, you can provide a Xen gupf implementation based on RCU-freed
page tables like powerpc, sparc and s390 have. But you'll have to change
the mmu_gather implementation of Xen and use the get_page_unless_zero()
thing mentioned before (or use page_cache_get_speculative()).

But I see no need to change the x86 implementation; if the architecture
ever changes the way it does TLB invalidation we need to change more
than just the gupf implementation.
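To make that suggestion concrete, the fragment below would replace the body
of the pte loop in the sketch posted earlier in the thread. It is a rough
sketch modelled on the RCU-based fast GUP used by the other architectures,
not copied from any of them, and the error-path pte_unmap() cleanup is
omitted for brevity. The key requirement is that page tables are freed via
RCU, so the pte pointer itself stays valid for the duration of the walk.

		pte_t pte = gup_get_pte(ptep);
		struct page *page;

		if ((pte_flags(pte) & mask) != mask)
			return 0;			/* not present/writable: slow path */

		page = pte_page(pte);
		if (!page_cache_get_speculative(page))	/* get_page_unless_zero() inside */
			return 0;			/* page already on its way to the free list */

		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
			/* pte changed under us (COW break, unmap, ...): back out */
			put_page(page);
			return 0;
		}

		pages[(*nr)++] = page;			/* safe: we now hold a real reference */

Because the reference is only taken when the refcount is still non-zero, and
the pte is re-checked afterwards, this variant does not depend on disabled
interrupts holding off the remote TLB flush and the free.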
On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based TLB
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries. I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be
asleep) but I think the risk of deviating too much from the baremetal
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will
bite us)

--
error compiling committee.c: too many arguments to function
I agree, i.e. deviating from the underlying arch is not a good idea.

Also, agreed, the hypervisor knows which page entries are ready for TLB
flush across vCPUs.

But using that knowledge together with an IPI-based TLB flush is a
better solution; the ability to synchronize pCPU-based IPIs with TLB
flushes across vCPUs is key. IPIs themselves should be a few hundred
uSecs in terms of latency. Also, why should a pCPU be in a sleep state
when it has an active vCPU scheduled with a page workload?

-Kaushik

-----Original Message-----
From: Avi Kivity [mailto:avi@redhat.com]
Sent: Sunday, January 30, 2011 5:02 AM
To: Jeremy Fitzhardinge
Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; Marcelo Tosatti
Subject: Re: One (possible) x86 get_user_pages bug

On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based TLB
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries. I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be
asleep) but I think the risk of deviating too much from the baremetal
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will
bite us)

--
error compiling committee.c: too many arguments to function
Jeremy Fitzhardinge
2011-Jan-31 18:04 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good idea.
>
> Also, agreed, the hypervisor knows which page entries are ready for TLB
> flush across vCPUs.
>
> But using that knowledge together with an IPI-based TLB flush is a
> better solution; the ability to synchronize pCPU-based IPIs with TLB
> flushes across vCPUs is key.

I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs. In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them? How would that work?

> IPIs themselves should be a few hundred uSecs in terms of latency.
> Also, why should a pCPU be in a sleep state when it has an active vCPU
> scheduled with a page workload?

A "few hundred uSecs" is really very slow - that's nearly a millisecond.
It's worth spending some effort to avoid those kinds of delays.

J
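As a rough illustration of the optimisation being described, the
hypervisor-side handling of a "flush these VCPUs' TLBs" request could look
like the sketch below. This is purely hypothetical and is not Xen's actual
implementation; every helper in it (for_each_vcpu_in(), vcpu_is_running(),
pcpu_of(), send_flush_ipi()) and the needs_tlb_flush flag are invented for
the example.

/*
 * Hypothetical sketch (invented helpers, not Xen's real interfaces):
 * only interrupt PCPUs that are actually running a target VCPU; for
 * descheduled VCPUs, defer the flush until they next run.
 */
static void guest_tlb_flush_request(struct domain *d, vcpu_mask_t targets)
{
	cpumask_t ipi_mask = CPU_MASK_NONE;
	struct vcpu *v;

	for_each_vcpu_in(d, targets, v) {
		if (vcpu_is_running(v)) {
			/* Stale entries may sit in a real TLB right now. */
			cpumask_set_cpu(pcpu_of(v), &ipi_mask);
		} else {
			/*
			 * VCPU is descheduled or blocked: no physical TLB
			 * holds its entries, so just flush before it runs
			 * again instead of waking anything up.
			 */
			v->needs_tlb_flush = true;
		}
	}

	/* Only the PCPUs that actually need it get interrupted. */
	send_flush_ipi(&ipi_mask);
}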
<< I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush. >>

That's what I was trying to understand: what is "sleep" here? Is it ACPI
sleep or some internal scheduling state? If vCPUs are asynchronous to
pCPUs in terms of ACPI sleep state, then they need to be synced up.
That's where the entire ACPI modeling needs to be considered, and that's
where KVM may not see this issue. Maybe I am missing something here.

<< A "few hundred uSecs" is really very slow - that's nearly a
millisecond. It's worth spending some effort to avoid those kinds of
delays. >>

Actually, I just checked: IPIs are usually 1000-1500 cycles long
(comparable to a VMEXIT). My point is that the ideal solution should be
one where virtual platform behavior is closer to bare metal - interrupts,
memory, CPU state, etc. How to do it? Well, that's what needs to be
figured out :-)

-Kaushik

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: Monday, January 31, 2011 10:05 AM
To: Kaushik Barde
Cc: 'Avi Kivity'; 'Jan Beulich'; 'Xiaowei Yang'; 'Nick Piggin'; 'Peter
Zijlstra'; fanhenglong@huawei.com; 'Kenneth Lee'; 'linqaingmin';
wangzhenguo@huawei.com; 'Wu Fengguang'; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; 'Marcelo Tosatti'
Subject: Re: One (possible) x86 get_user_pages bug

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good idea.
>
> Also, agreed, the hypervisor knows which page entries are ready for TLB
> flush across vCPUs.
>
> But using that knowledge together with an IPI-based TLB flush is a
> better solution; the ability to synchronize pCPU-based IPIs with TLB
> flushes across vCPUs is key.

I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs. In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them? How would that work?

> IPIs themselves should be a few hundred uSecs in terms of latency.
> Also, why should a pCPU be in a sleep state when it has an active vCPU
> scheduled with a page workload?

A "few hundred uSecs" is really very slow - that's nearly a millisecond.
It's worth spending some effort to avoid those kinds of delays.

J
Jeremy Fitzhardinge
2011-Jan-31 22:10 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/31/2011 12:10 PM, Kaushik Barde wrote:
> << I'm not sure I follow you here. The issue with TLB flush IPIs is that
> the hypervisor doesn't know the purpose of the IPI and ends up
> (potentially) waking up a sleeping VCPU just to flush its TLB - but
> since it was sleeping there were no stale TLB entries to flush. >>
>
> That's what I was trying to understand: what is "sleep" here? Is it ACPI
> sleep or some internal scheduling state? If vCPUs are asynchronous to
> pCPUs in terms of ACPI sleep state, then they need to be synced up.
> That's where the entire ACPI modeling needs to be considered, and that's
> where KVM may not see this issue. Maybe I am missing something here.

No, nothing to do with ACPI. Multiple virtual CPUs (VCPUs) can be
multiplexed onto a single physical CPU (PCPU), in much the same way as
tasks are scheduled onto CPUs (identically, in KVM's case). If a VCPU is
not currently running - either because it is simply descheduled, or
because it is blocked (what I slightly misleadingly called "sleeping"
above) in a hypercall - then it is not currently using any physical CPU
resources, including the TLBs. In that case, there's no need to flush
that VCPU's TLB entries, because there are none.

> << A "few hundred uSecs" is really very slow - that's nearly a
> millisecond. It's worth spending some effort to avoid those kinds of
> delays. >>
>
> Actually, I just checked: IPIs are usually 1000-1500 cycles long
> (comparable to a VMEXIT). My point is that the ideal solution should be
> one where virtual platform behavior is closer to bare metal - interrupts,
> memory, CPU state, etc. How to do it? Well, that's what needs to be
> figured out :-)

The interesting number is not the raw cost of an IPI, but the overall
cost of the remote TLB flush.

J