>>>> Keir Fraser <keir@xensource.com> 01.03.07 11:22 >>>
>> Can we be confident that the mm_pin/mm_unpin code (which walks pagetables
>> and has to find every page to make every one read-only or writable) is
>> safe? Presumably for this to be true we need to be sure that no one can
>> meanwhile concurrently be populating the pagetable we are walking with
>> extra pgds/puds/pmds/ptes...
>
> Since the pin/unpin walking only cares about pgd/pud/pmd entries,
> synchronization is guaranteed through mm->page_table_lock. The pte lock is
> used only for leaf entries, which are of no concern to (un)pinning.

I'm afraid I have to correct myself. Stress testing has shown severe
problems, and after a few hours of staring at this I'm almost certain there
is a race condition here: while no new pte-s can ever appear, the logic in
mm/vmscan.c may try to modify pte-s in an mm currently being unpinned (at
least through ptep_clear_flush_young() called from page_referenced_one() in
mm/rmap.c). If this happens when xen_pgd_unpin() has already passed the
respective pte page, but mm_walk() hasn't reached that page yet, the update
will fail (if done directly, ptwr will not pick this up, and if done through
a hypercall, the call would fail, likely producing a BUG()).

Of course we could back out that changeset, but one of the reasons for
submitting it was to get closer to native. Therefore I'm considering
alternatives:

- lock all pte pages right after taking the page table lock in the pin/unpin
  functions, and drop them right before dropping the page table lock (this
  nesting should be no problem, as none of them can ever nest elsewhere,
  since otherwise the non-split-pt-lock case would not work)

- find a way to fix up the possibly resulting page fault (e.g. allow the
  ptwr code to update the page if it is PGT_l1_page_table but has a zero
  type reference count; the PGT_writable case would be more difficult and
  would probably need to be caught in the guest by checking the pte and
  finding it to be writable); the hypercalls don't seem to be affected
  (do_mmu_update seems okay, as it doesn't look at the type reference count,
  and do_update_va_mapping can be called only on the currently active
  address space, which cannot be the one in transition)

Dealing with an mm being pinned seems more difficult, as its L1 page table
pages will not be in PGT_l1_page_table state yet. Thus another alternative
could be to make page_check_address() in the kernel detect the
being-(un)pinned status, or even adjust pte_lockptr() to return the page
table lock for mm-s being (un)pinned (this would perhaps be the cheapest
fix).

Jan
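[A minimal sketch of the last alternative above, i.e. making pte_lockptr()
fall back to mm->page_table_lock while an mm is being (un)pinned. This is an
illustration only, not tree code: the mm->context.pinning flag is
hypothetical, and mm_pin()/mm_unpin() would additionally have to set and
clear it while holding mm->page_table_lock for the fallback to be
meaningful.]

#include <linux/mm.h>
#include <linux/spinlock.h>

/*
 * Sketch of the "cheapest fix" idea -- not actual tree code.  With
 * CONFIG_SPLIT_PTLOCK_CPUS in effect, pte_lockptr() normally returns
 * the per-page pte lock; the idea is to return mm->page_table_lock
 * instead while the mm is being (un)pinned, so that rmap/vmscan pte
 * writers serialise against the pin/unpin walk (which already runs
 * under mm->page_table_lock).
 */
static inline spinlock_t *pte_lockptr_pin_aware(struct mm_struct *mm,
						pmd_t *pmd)
{
	if (unlikely(mm->context.pinning))	/* assumed flag, see above */
		return &mm->page_table_lock;
	return __pte_lockptr(pmd_page(*pmd));	/* the usual split pte lock */
}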
On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Since the pin/unpin walking only cares about pgd/pud/pmd entries,
>> synchronization is guaranteed through mm->page_table_lock. The pte lock is
>> used only for leaf entries, which are of no concern to (un)pinning.
>
> I'm afraid I have to correct myself. Stress testing has shown severe
> problems, and after a few hours of staring at this I'm almost certain there
> is a race condition here: while no new pte-s can ever appear, the logic in
> mm/vmscan.c may try to modify pte-s in an mm currently being unpinned (at
> least through ptep_clear_flush_young() called from page_referenced_one() in
> mm/rmap.c). If this happens when xen_pgd_unpin() has already passed the
> respective pte page, but mm_walk() hasn't reached that page yet, the update
> will fail (if done directly, ptwr will not pick this up, and if done
> through a hypercall, the call would fail, likely producing a BUG()).

What kind of stress test did you run? I was expecting that unpin would be
okay because we only call mm_unpin() from _arch_exit_mmap() if the mm_count
is 1 (which I believe means the mm is not active in any task).

 -- Keir
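[For reference, a simplified sketch of the exit-time guard Keir is
describing. This is not the exact Xen-Linux source (the real
_arch_exit_mmap() also switches the exiting task off the dying mm first);
it only shows the check in question: unpin only when the pgd is pinned and
this is the last reference to the mm, so no task can still be running on
it.]

#include <linux/mm.h>
#include <linux/sched.h>
#include <asm/mmu_context.h>

/* Simplified sketch of the guard described above -- not verbatim source. */
void _arch_exit_mmap(struct mm_struct *mm)
{
	/* PG_pinned is the flag the Xen patches use to mark a pgd whose
	 * page tables are currently pinned (held read-only by Xen). */
	if (test_bit(PG_pinned, &virt_to_page(mm->pgd)->flags) &&
	    atomic_read(&mm->mm_count) == 1)
		mm_unpin(mm);
}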
On 16/3/07 12:11, "Keir Fraser" <keir@xensource.com> wrote:

>> page_referenced_one() in mm/rmap.c). If this happens when
>> xen_pgd_unpin() has already passed the respective pte page, but
>> mm_walk() hasn't reached that page yet, the update will fail (if done
>> directly, ptwr will not pick this up, and if done through a hypercall,
>> the call would fail, likely producing a BUG()).
>
> What kind of stress test did you run? I was expecting that unpin would be
> okay because we only call mm_unpin() from _arch_exit_mmap() if the mm_count
> is 1 (which I believe means the mm is not active in any task).

And actually the pinning happens on activate_mm() in most cases, which I
would expect to be 'early enough', since no one can run on the mm before
that?

If you've managed to provoke bugs then that's very interesting (and scary)!
I suppose, if I understand the rmap case correctly, we're still susceptible
to the paging kernel thread trying to page things out at any time? Is that
what you think you've been seeing go wrong?

 -- Keir
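[And, correspondingly, a rough sketch of where that pinning happens. Again
this is not the exact Xen-Linux source, and the xen_activate_mm() name is
purely illustrative; the point is only that the pgd is pinned when the mm is
first activated, i.e. before any task has ever run on it.]

#include <linux/mm.h>
#include <asm/mmu_context.h>

/* Rough sketch, not verbatim source: pin the pgd when the mm is first
 * activated, i.e. before any task can have run on it. */
static inline void xen_activate_mm(struct mm_struct *prev,
				   struct mm_struct *next)
{
	if (!test_bit(PG_pinned, &virt_to_page(next->pgd)->flags))
		mm_pin(next);		/* pin before first use */
	switch_mm(prev, next, NULL);
}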
On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> Of course we could back out that changeset, but one of the reasons for
> submitting it was to get closer to native.

None of the fixes sound straightforward. :-) I'd suggest we back out the
mm/Kconfig change for 3.0.5 (since that is close, and I'm not super
confident that we'd get any of your proposed fixes correct first time) and
then have a go at supporting split-ptlock in the 3.0.6 timeframe.

The situation after all is that our scalability beyond 4 VCPUs is currently
almost certainly bottlenecked by conservative locking in Xen rather than by
per-mm locking in Linux. So we're looking at complicating the Linux code
close to a Xen release for no actual user benefit.

 -- Keir
I agree.

>>> Keir Fraser <keir@xensource.com> 16.03.07 13:37 >>>
On 16/3/07 11:58, "Jan Beulich" <jbeulich@novell.com> wrote:

> Of course we could back out that changeset, but one of the reasons for
> submitting it was to get closer to native.

None of the fixes sound straightforward. :-) I'd suggest we back out the
mm/Kconfig change for 3.0.5 (since that is close, and I'm not super
confident that we'd get any of your proposed fixes correct first time) and
then have a go at supporting split-ptlock in the 3.0.6 timeframe.

The situation after all is that our scalability beyond 4 VCPUs is currently
almost certainly bottlenecked by conservative locking in Xen rather than by
per-mm locking in Linux. So we're looking at complicating the Linux code
close to a Xen release for no actual user benefit.

 -- Keir
>>> Keir Fraser <keir@xensource.com> 16.03.07 13:25 >>>
> On 16/3/07 12:11, "Keir Fraser" <keir@xensource.com> wrote:
>
>>> page_referenced_one() in mm/rmap.c). If this happens when
>>> xen_pgd_unpin() has already passed the respective pte page, but
>>> mm_walk() hasn't reached that page yet, the update will fail (if done
>>> directly, ptwr will not pick this up, and if done through a hypercall,
>>> the call would fail, likely producing a BUG()).
>>
>> What kind of stress test did you run? I was expecting that unpin would be
>> okay because we only call mm_unpin() from _arch_exit_mmap() if the
>> mm_count is 1 (which I believe means the mm is not active in any task).

newburn on machines with not too much (<= 2G) memory.

> And actually the pinning happens on activate_mm() in most cases, which I
> would expect to be 'early enough', since no one can run on the mm before
> that?
>
> If you've managed to provoke bugs then that's very interesting (and scary)!
>
> I suppose, if I understand the rmap case correctly, we're still susceptible
> to the paging kernel thread trying to page things out at any time? Is that
> what you think you've been seeing go wrong?

Yes, somewhere in that area. From the data I have (a page fault on the page
table write in ptep_clear_flush_young(), with the page table dump showing
the page to be writeable and present) I can only conclude that the race is
with the unpin path (otherwise I should see the page being write-protected),
while the vm scan tries to recover memory at the same time. Since this scan
is scanning zones, not mm-s, the references to the mm-s are being obtained
from struct page -> vma -> mm (i.e. the mm-s' use counts don't matter here).

Jan
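[To make that path concrete, here is a condensed, simplified rendering of
page_referenced_one() from mm/rmap.c of that era (details trimmed, so not a
verbatim copy). The scan arrives here from the zone LRU lists via
struct page -> vma -> mm, takes only the pte lock returned by
page_check_address() (never mm->page_table_lock, and no reference on the
mm), and then writes the pte through ptep_clear_flush_young() -- the access
that can fault if it lands in the (un)pin window Jan describes.]

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/swap.h>

/*
 * Condensed from mm/rmap.c (simplified -- not a verbatim copy).  Note
 * that the mm is reached via page -> vma -> mm with no reference
 * taken, and the only lock held is the (split) pte lock from
 * page_check_address(); nothing here notices that the mm is in the
 * middle of being (un)pinned.
 */
static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
			       unsigned int *mapcount)
{
	struct mm_struct *mm = vma->vm_mm;	/* mm_count/mm_users ignored */
	unsigned long address = vma_address(page, vma);
	spinlock_t *ptl;
	pte_t *pte;
	int referenced = 0;

	if (address == -EFAULT)
		return 0;

	/* Finds the pte and returns with the pte lock held. */
	pte = page_check_address(page, mm, address, &ptl);
	if (!pte)
		return 0;

	/* The pte write that races with the (un)pin walk. */
	if (ptep_clear_flush_young(vma, address, pte))
		referenced++;

	(*mapcount)--;
	pte_unmap_unlock(pte, ptl);
	return referenced;
}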