>>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> We created a scenario to reproduce the bug:
> ----------------------------------------------------------------
> // proc1/proc1.2 are 2 threads sharing one page table.
> // proc1 is the parent of proc2.
>
> proc1               proc2         proc1.2
> ...                 ...           // in gup_pte_range()
> ...                 ...           pte = gup_get_pte()
> ...                 ...           page1 = pte_page(pte)    // (1)
> do_wp_page(page1)   ...           ...
> ...                 exit_map()    ...
> ...                 ...           get_page(page1)          // (2)
> -----------------------------------------------------------------
>
> do_wp_page() and exit_map() cause page1 to be released into the free list
> before get_page() in proc1.2 is called. The longer the delay between
> (1) and (2), the easier the BUG_ON shows.

The scenario indeed seems to apply independent of virtualization,
but the window obviously can be unbounded unless running native.

However, going through all the comments in gup.c again I wonder
whether pv Xen guests don't violate the major assumption: there
is talk about interrupts being off preventing (or sufficiently
deferring) remote CPUs doing TLB flushes. In pv Xen guests,
however, non-local TLB flushes do not happen by sending IPIs -
the hypercall interface gets used instead. If that's indeed the
case, I would have expected quite a few bug reports, but I'm
unaware of any - Nick, am I overlooking something here?

Jan
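For readers unfamiliar with the fast path being discussed, the walk looks
roughly like the sketch below. This is a simplified sketch loosely following
the x86 gup_pte_range() of that era, not the verbatim kernel code; the helper
names match the scenario above, but the surrounding structure is abbreviated.
The whole thing runs with local interrupts disabled, which is the assumption
under discussion: holding off the remote TLB-flush IPI is what is supposed to
hold off the page free between points (1) and (2).

/*
 * Simplified sketch of the x86 fast-GUP pte walk (not the verbatim
 * kernel code).  Runs with local interrupts disabled; the assumption
 * is that this defers the remote TLB-flush IPI, and therefore the
 * page free, until after point (2).
 */
static int gup_pte_range_sketch(pmd_t pmd, unsigned long addr,
				unsigned long end, int write,
				struct page **pages, int *nr)
{
	unsigned long mask = _PAGE_PRESENT | _PAGE_USER;
	pte_t *ptep;

	if (write)
		mask |= _PAGE_RW;

	ptep = pte_offset_map(&pmd, addr);
	do {
		pte_t pte = gup_get_pte(ptep);	/* snapshot the pte */
		struct page *page;

		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
			pte_unmap(ptep);
			return 0;		/* bail to the slow gup path */
		}

		page = pte_page(pte);		/* (1) pte -> struct page */

		/*
		 * The window: if do_wp_page() plus exit_map() on other
		 * CPUs complete here, "page" may already be back on the
		 * free list when the reference is taken below.
		 */
		get_page(page);			/* (2) the reported BUG_ON fires here */
		pages[(*nr)++] = page;
	} while (ptep++, addr += PAGE_SIZE, addr != end);
	pte_unmap(ptep - 1);

	return 1;
}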
Peter Zijlstra
2011-Jan-27 15:01 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On Thu, 2011-01-27 at 14:49 +0000, Jan Beulich wrote:
> >>> On 27.01.11 at 14:05, Xiaowei Yang <xiaowei.yang@huawei.com> wrote:
> > We created a scenario to reproduce the bug:
> > ----------------------------------------------------------------
> > // proc1/proc1.2 are 2 threads sharing one page table.
> > // proc1 is the parent of proc2.
> >
> > proc1               proc2         proc1.2
> > ...                 ...           // in gup_pte_range()
> > ...                 ...           pte = gup_get_pte()
> > ...                 ...           page1 = pte_page(pte)    // (1)
> > do_wp_page(page1)   ...           ...
> > ...                 exit_map()    ...
> > ...                 ...           get_page(page1)          // (2)
> > -----------------------------------------------------------------
> >
> > do_wp_page() and exit_map() cause page1 to be released into the free list
> > before get_page() in proc1.2 is called. The longer the delay between
> > (1) and (2), the easier the BUG_ON shows.
>
> The scenario indeed seems to apply independent of virtualization,
> but the window obviously can be unbounded unless running native.
>
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: there
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead. If that's indeed the
> case, I would have expected quite a few bug reports, but I'm
> unaware of any - Nick, am I overlooking something here?

Indeed, the delay of TLB flush IPIs should ensure that the pages
aren't freed and should cover the race with unmap.

If Xen violates this then Xen needs to fix this somehow..
Jeremy Fitzhardinge
2011-Jan-27 18:27 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/27/2011 06:49 AM, Jan Beulich wrote:
> However, going through all the comments in gup.c again I wonder
> whether pv Xen guests don't violate the major assumption: there
> is talk about interrupts being off preventing (or sufficiently
> deferring) remote CPUs doing TLB flushes. In pv Xen guests,
> however, non-local TLB flushes do not happen by sending IPIs -
> the hypercall interface gets used instead

Yes, I was aware of that synchronization mechanism, and I think I'd
convinced myself we were OK. But I can't think what that reasoning was -
perhaps it was something as simple as "gupf isn't used under Xen" (which
may have been true at the time, but isn't now).

As clever as it is, the whole "disable interrupts -> hold off IPI ->
prevent TLB flush -> delay freeing" chain seems pretty fragile. I guess
it's OK if we assume that x86 will forever have IPI-based cross-CPU TLB
flushes, but one could imagine some kind of "remote TLB shootdown using
bus transaction" appearing in the architecture.

And even just considering virtualization, having non-IPI-based TLB
shootdown is a measurable performance win, since a hypervisor can
optimise away a cross-VCPU shootdown if it knows no physical TLB
contains the target VCPU's entries. I can imagine the KVM folks could
get some benefit from that as well.

So is there some way we can preserve the current scheme's benefits while
making it a bit more general? (If anyone else has non-IPI-based
shootdown, it would be s390; is there some inspiration there? An
instruction perhaps?)

J
Peter Zijlstra
2011-Jan-27 19:27 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On Thu, 2011-01-27 at 10:27 -0800, Jeremy Fitzhardinge wrote:
>
> So is there some way we can preserve the current scheme's benefits while
> making it a bit more general? (If anyone else has non-IPI-based
> shootdown, it would be s390; is there some inspiration there? An
> instruction perhaps?)

Well, you can provide a Xen gupf implementation based on RCU-freed
page tables like powerpc, sparc and s390 have. But you'll have to change
the mmu_gather implementation of Xen and use the get_page_unless_zero()
thing mentioned before (or use page_cache_get_speculative()).

But I see no need to change the x86 implementation; if the architecture
ever changes the way it does TLB invalidation we need to change more
than just the gupf implementation.
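To make that suggestion concrete, the fragment below would replace the body
of the pte loop in the sketch posted earlier in the thread. It is a rough
sketch modelled on the RCU-based fast GUP used by the other architectures,
not copied from any of them, and the error-path pte_unmap() cleanup is
omitted for brevity. The key requirement is that page tables are freed via
RCU, so the pte pointer itself stays valid for the duration of the walk.

		pte_t pte = gup_get_pte(ptep);
		struct page *page;

		if ((pte_flags(pte) & mask) != mask)
			return 0;			/* not present/writable: slow path */

		page = pte_page(pte);
		if (!page_cache_get_speculative(page))	/* get_page_unless_zero() inside */
			return 0;			/* page already on its way to the free list */

		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
			/* pte changed under us (COW break, unmap, ...): back out */
			put_page(page);
			return 0;
		}

		pages[(*nr)++] = page;			/* safe: we now hold a real reference */

Because the reference is only taken when the refcount is still non-zero, and
the pte is re-checked afterwards, this variant does not depend on disabled
interrupts holding off the remote TLB flush and the free.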
On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based TLB
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries. I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be
asleep) but I think the risk of deviating too much from the baremetal
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will
bite us)

--
error compiling committee.c: too many arguments to function
I agree, i.e. deviating from the underlying arch is not a good idea.

Also, agreed, the hypervisor knows which page entries are ready for TLB
flush across vCPUs.

But using that knowledge together with an IPI-based TLB flush is a
better solution; the ability to synchronize pCPU-based IPIs with TLB
flushes across vCPUs is key. IPIs themselves should be a few hundred
uSecs in terms of latency. Also, why should a pCPU be in a sleep state
when it has an active vCPU scheduled with a page workload?

-Kaushik

-----Original Message-----
From: Avi Kivity [mailto:avi@redhat.com]
Sent: Sunday, January 30, 2011 5:02 AM
To: Jeremy Fitzhardinge
Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
fanhenglong@huawei.com; Kaushik Barde; Kenneth Lee; linqaingmin;
wangzhenguo@huawei.com; Wu Fengguang; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; Marcelo Tosatti
Subject: Re: One (possible) x86 get_user_pages bug

On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
> And even just considering virtualization, having non-IPI-based TLB
> shootdown is a measurable performance win, since a hypervisor can
> optimise away a cross-VCPU shootdown if it knows no physical TLB
> contains the target VCPU's entries. I can imagine the KVM folks could
> get some benefit from that as well.

It's nice to avoid the IPI (and waking up a cpu if it happens to be
asleep) but I think the risk of deviating too much from the baremetal
arch is too large, as demonstrated by this bug.

(well, async page faults is a counterexample, I wonder if/when it will
bite us)

--
error compiling committee.c: too many arguments to function
Jeremy Fitzhardinge
2011-Jan-31 18:04 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good idea.
>
> Also, agreed, the hypervisor knows which page entries are ready for TLB
> flush across vCPUs.
>
> But using that knowledge together with an IPI-based TLB flush is a
> better solution; the ability to synchronize pCPU-based IPIs with TLB
> flushes across vCPUs is key.

I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs. In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them? How would that work?

> IPIs themselves should be a few hundred uSecs in terms of latency.
> Also, why should a pCPU be in a sleep state when it has an active vCPU
> scheduled with a page workload?

A "few hundred uSecs" is really very slow - that's nearly a millisecond.
It's worth spending some effort to avoid those kinds of delays.

J
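As a rough illustration of the optimisation being described, the
hypervisor-side handling of a "flush these VCPUs' TLBs" request could look
like the sketch below. This is purely hypothetical and is not Xen's actual
implementation; every helper in it (for_each_vcpu_in(), vcpu_is_running(),
pcpu_of(), send_flush_ipi()) and the needs_tlb_flush flag are invented for
the example.

/*
 * Hypothetical sketch (invented helpers, not Xen's real interfaces):
 * only interrupt PCPUs that are actually running a target VCPU; for
 * descheduled VCPUs, defer the flush until they next run.
 */
static void guest_tlb_flush_request(struct domain *d, vcpu_mask_t targets)
{
	cpumask_t ipi_mask = CPU_MASK_NONE;
	struct vcpu *v;

	for_each_vcpu_in(d, targets, v) {
		if (vcpu_is_running(v)) {
			/* Stale entries may sit in a real TLB right now. */
			cpumask_set_cpu(pcpu_of(v), &ipi_mask);
		} else {
			/*
			 * VCPU is descheduled or blocked: no physical TLB
			 * holds its entries, so just flush before it runs
			 * again instead of waking anything up.
			 */
			v->needs_tlb_flush = true;
		}
	}

	/* Only the PCPUs that actually need it get interrupted. */
	send_flush_ipi(&ipi_mask);
}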
<< I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush. >>

That's what I was trying to understand: what is "sleep" here? Is it ACPI
sleep or some internal scheduling state? If vCPUs are asynchronous to
pCPUs in terms of ACPI sleep state, then they need to be synced up.
That's where the entire ACPI modeling needs to be considered, and that's
where KVM may not see this issue. Maybe I am missing something here.

<< A "few hundred uSecs" is really very slow - that's nearly a
millisecond. It's worth spending some effort to avoid those kinds of
delays. >>

Actually, I just checked: IPIs are usually 1000-1500 cycles long
(comparable to a VMEXIT). My point is that the ideal solution should be
one where virtual platform behavior is closer to bare metal - interrupts,
memory, CPU state, etc. How to do it? Well, that's what needs to be
figured out :-)

-Kaushik

-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@goop.org]
Sent: Monday, January 31, 2011 10:05 AM
To: Kaushik Barde
Cc: 'Avi Kivity'; 'Jan Beulich'; 'Xiaowei Yang'; 'Nick Piggin'; 'Peter
Zijlstra'; fanhenglong@huawei.com; 'Kenneth Lee'; 'linqaingmin';
wangzhenguo@huawei.com; 'Wu Fengguang'; xen-devel@lists.xensource.com;
linux-kernel@vger.kernel.org; 'Marcelo Tosatti'
Subject: Re: One (possible) x86 get_user_pages bug

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good idea.
>
> Also, agreed, the hypervisor knows which page entries are ready for TLB
> flush across vCPUs.
>
> But using that knowledge together with an IPI-based TLB flush is a
> better solution; the ability to synchronize pCPU-based IPIs with TLB
> flushes across vCPUs is key.

I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its TLB - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs. In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them? How would that work?

> IPIs themselves should be a few hundred uSecs in terms of latency.
> Also, why should a pCPU be in a sleep state when it has an active vCPU
> scheduled with a page workload?

A "few hundred uSecs" is really very slow - that's nearly a millisecond.
It's worth spending some effort to avoid those kinds of delays.

J
Jeremy Fitzhardinge
2011-Jan-31 22:10 UTC
[Xen-devel] Re: One (possible) x86 get_user_pages bug
On 01/31/2011 12:10 PM, Kaushik Barde wrote:
> << I'm not sure I follow you here. The issue with TLB flush IPIs is that
> the hypervisor doesn't know the purpose of the IPI and ends up
> (potentially) waking up a sleeping VCPU just to flush its TLB - but
> since it was sleeping there were no stale TLB entries to flush. >>
>
> That's what I was trying to understand: what is "sleep" here? Is it ACPI
> sleep or some internal scheduling state? If vCPUs are asynchronous to
> pCPUs in terms of ACPI sleep state, then they need to be synced up.
> That's where the entire ACPI modeling needs to be considered, and that's
> where KVM may not see this issue. Maybe I am missing something here.

No, nothing to do with ACPI. Multiple virtual CPUs (VCPUs) can be
multiplexed onto a single physical CPU (PCPU), in much the same way as
tasks are scheduled onto CPUs (identically, in KVM's case). If a VCPU is
not currently running - either because it is simply descheduled, or
because it is blocked (what I slightly misleadingly called "sleeping"
above) in a hypercall - then it is not currently using any physical CPU
resources, including the TLBs. In that case, there's no need to flush
that VCPU's TLB entries, because there are none.

> << A "few hundred uSecs" is really very slow - that's nearly a
> millisecond. It's worth spending some effort to avoid those kinds of
> delays. >>
>
> Actually, I just checked: IPIs are usually 1000-1500 cycles long
> (comparable to a VMEXIT). My point is that the ideal solution should be
> one where virtual platform behavior is closer to bare metal - interrupts,
> memory, CPU state, etc. How to do it? Well, that's what needs to be
> figured out :-)

The interesting number is not the raw cost of an IPI, but the overall
cost of the remote TLB flush.

J