thr3ads.net - Linux Virtualization - [patch 0/6] Guest page hinting version 7. [Mar 2009]

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 0/6] Guest page hinting version 7.

Greetings,
the circus is back in town -- another version of the guest page hinting
patches. The patches differ from version 6 only in the kernel version,
they apply against 2.6.29. My short sniff test showed that the code
is still working as expected.

To recap (you can skip this if you read the boilerplate of the last
version of the patches):
The main benefit of guest page hinting over the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, for a complex algorithm that calculates the working set sizes, or
for calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.


Right after booting the page states on my 256 MB z/VM guest looked like
this (r=resident, p=preserved, z=zero, S=stable, U=unused,
P=potentially volatile, V=volatile):

<state>|--tot--|---r---|---p---|---z---|
    S  |  19719|  19673|      0|     46|
    U  | 235416|   2734|      0| 232682|
    P  |      1|      1|      0|      0|
    V  |   7008|   7008|      0|      0|
tot->  | 262144|  29416|      0| 232728|

About 25% of the resident pages are in volatile state. After grepping through the
Linux source tree this picture changes:

<state>|--tot--|---r---|---p---|---z---|
    S  |  43784|  43744|      0|     40|
    U  |  78631|   2397|      0|  76234|
    P  |      2|      2|      0|      0|
    V  | 139727| 139727|      0|      0|
tot->  | 262144| 185870|      0|  76274|

About 75% of the resident pages are now volatile. Depending on the workload you
will get different results.
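
The quoted percentages can be checked against the table rows. The helper
below is only a small arithmetic sketch (the function name is made up). It
shows that the figures match the volatile share of the resident column (r),
not of all 262144 pages.

```c
#include <assert.h>

/* Integer percentage of part in whole, rounded to nearest. */
unsigned int percent(unsigned long part, unsigned long whole)
{
	assert(whole != 0);
	return (unsigned int)((part * 100 + whole / 2) / whole);
}
```

With the numbers from the tables, percent(7008, 29416) gives 24 and
percent(139727, 185870) gives 75. Relative to all 262144 pages the same rows
give 3 and 53.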

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 1/6] Guest page hinting: core + volatile page cache.

An embedded and charset-unspecified text was scrubbed...
Name: 001-hva-core.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/5330ff63/attachment.txt

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 2/6] Guest page hinting: volatile swap cache.

An embedded and charset-unspecified text was scrubbed...
Name: 002-hva-swap.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/8a184a96/attachment.txt

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 3/6] Guest page hinting: mlocked pages.

An embedded and charset-unspecified text was scrubbed...
Name: 003-hva-mlock.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/3f48764b/attachment.txt

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 4/6] Guest page hinting: writable page table entries.

An embedded and charset-unspecified text was scrubbed...
Name: 004-hva-prot.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/b43abc1f/attachment.txt

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 5/6] Guest page hinting: minor fault optimization.

An embedded and charset-unspecified text was scrubbed...
Name: 005-hva-nohv.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/22ab8076/attachment.txt

Martin Schwidefsky

2009-Mar-27 15:09 UTC

[patch 6/6] Guest page hinting: s390 support.

An embedded and charset-unspecified text was scrubbed...
Name: 006-hva-s390.diff
Url:
http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/c0a6d14d/attachment.txt

Rik van Riel

2009-Mar-27 22:57 UTC

[patch 1/6] Guest page hinting: core + volatile page cache.

Martin Schwidefsky wrote:
> The major obstacles that need to get addressed:
> * Concurrent page state changes:
>   To guard against concurrent page state updates some kind of lock
>   is needed. If page_make_volatile() has already done the 11 checks it
>   will issue the state change primitive. If in the meantime one of
>   the conditions has changed the user that requires that page in
>   stable state will have to wait in the page_make_stable() function
>   until the make volatile operation has finished. It is up to the
>   architecture to define how this is done with the three primitives
>   page_test_set_state_change, page_clear_state_change and
>   page_state_change.
>   There are some alternatives how this can be done, e.g. a global
>   lock, or lock per segment in the kernel page table, or the per page
>   bit PG_arch_1 if it is still free.
Can this be taken care of by memory barriers and
careful ordering of operations?

If we consider the states unused -> volatile -> stable
as progressively higher, "upgrades" can be done before
any kernel operation that requires the page to be in
that state (but after setting up the things that allow
it to be found), while downgrades can be done after the
kernel is done with needing the page at a higher level.

Since the downgrade checks for users that need the page
in a higher state, no lock should be required.

In fact, it may be possible to manage the page state
bitmap with compare-and-swap, without needing a call
to the hypervisor.
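
One way to picture the compare-and-swap idea is the user-space C11 sketch
below. It is not the scheme from the patches and not necessarily what is
proposed above: packing a user count and the state into one word is an
assumption made here so that a single compare-and-swap covers both, and all
names are hypothetical. Discard by the host is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* States ordered from lowest to highest, as in the text above. */
enum { PS_UNUSED = 0, PS_VOLATILE = 1, PS_STABLE = 2 };

#define STATE_MASK 0x3u	/* low two bits hold the state */
#define USER_ONE   0x4u	/* remaining bits count users needing "stable" */

struct hinted_page {
	_Atomic uint32_t word;
};

void page_init(struct hinted_page *pg, uint32_t state)
{
	assert(state <= PS_STABLE);
	atomic_init(&pg->word, state);
}

static inline uint32_t page_state(struct hinted_page *pg)
{
	return atomic_load(&pg->word) & STATE_MASK;
}

/* Upgrade: register a user and force the state to stable in one CAS. */
void page_get_stable(struct hinted_page *pg)
{
	uint32_t old = atomic_load(&pg->word);
	uint32_t new;

	do {
		new = ((old & ~STATE_MASK) | PS_STABLE) + USER_ONE;
		/* on failure the CAS reloads old and the loop recomputes new */
	} while (!atomic_compare_exchange_weak(&pg->word, &old, new));
}

/* Drop the user reference taken by page_get_stable(). */
void page_put(struct hinted_page *pg)
{
	atomic_fetch_sub(&pg->word, USER_ONE);
}

/* Downgrade: succeeds only if the page is stable and has no users. */
bool page_try_make_volatile(struct hinted_page *pg)
{
	uint32_t expected = PS_STABLE;	/* user count 0, state stable */

	return atomic_compare_exchange_strong(&pg->word, &expected,
					      PS_VOLATILE);
}
```

Because the downgrade compares the whole word, a user that registers
concurrently makes the compare fail instead of racing with it. A separate
"check users, then change state" sequence would not have that property.
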
> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>
Some comments and questions in line.
> @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  
>  out_set_pte:
>  	set_pte_at(dst_mm, addr, dst_pte, pte);
> +	return;
> +
> +out_discard_pte:
> +	/*
> +	 * If the page referred by the pte has the PG_discarded bit set,
> +	 * copy_one_pte is racing with page_discard. The pte may not be
> +	 * copied or we can end up with a pte pointing to a page not
> +	 * in the page cache anymore. Do what try_to_unmap_one would do
> +	 * if the copy_one_pte had taken place before page_discard.
> +	 */
> +	if (page->index != linear_page_index(vma, addr))
> +		/* If nonlinear, store the file page offset in the pte. */
> +		set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
> +	else
> +		pte_clear(dst_mm, addr, dst_pte);
>  }
It would be good to document that PG_discarded can only happen for
file pages and NOT for, e.g., clean swap cache pages.
> @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag
>  			radix_tree_tag_clear(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
> +			page_make_volatile(page, 1);
>  			if (bdi_cap_account_writeback(bdi)) {
>  				__dec_bdi_stat(bdi, BDI_WRITEBACK);
>  				__bdi_writeout_inc(bdi);
Does this mark the page volatile before the IO writing the
dirty data back to disk has even started?  Is that OK?

-- 
All rights reversed.

Dave Hansen

2009-Mar-27 23:03 UTC

[patch 0/6] Guest page hinting version 7.

On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> If the host picks one of the
> pages the guest can recreate, the host can throw it away instead of writing
> it to the paging device. Simple and elegant.
Heh, simple and elegant for the hypervisor.  But I'm not sure I'm going
to call *anything* that requires a new CPU instruction elegant. ;)

I don't see any description of it in there any more, but I thought this
entire patch set was to get rid of the idiotic triple I/Os in the
following scenario:

1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
   to get it written out. (I/O #1)
2. Linux comes along (being a bit late to the party) and picks the same
   page, also decides it needs to be out to disk
3. Linux tries to write the page to disk, but touches it in the 
   process, pulling the page back in from the store where the hypervisor
   wrote it. (I/O #2)
4. Linux writes the page to its swap device (I/O #3)

I don't see that mentioned at all in the current description.
Simplifying the hypervisor is hard to get behind, but cutting system I/O
by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)

Can we persuade the hypervisor to tell us which pages it decided to page
out and just skip those when we're scanning the LRU?
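
The I/O accounting in the scenario above can be written down as a toy model.
This is only a sketch with made-up names. It counts the I/Os for a single
page and assumes the guest can be told which pages the hypervisor has
already paged out.

```c
#include <assert.h>
#include <stdbool.h>

struct io_count {
	unsigned int host_writes;	/* I/O #1: hypervisor pages out */
	unsigned int host_reads;	/* I/O #2: page faulted back in */
	unsigned int guest_writes;	/* I/O #3: Linux writes to swap */
};

/*
 * Count the I/Os caused when the guest's LRU scan reaches one page.
 * host_paged_out: the hypervisor already evicted the page (step 1).
 * guest_skips:    the guest knows that and skips the page.
 */
struct io_count guest_reclaim_one(bool host_paged_out, bool guest_skips)
{
	struct io_count io = { 0, 0, 0 };

	if (host_paged_out)
		io.host_writes = 1;
	if (host_paged_out && guest_skips)
		return io;		/* guest leaves the page alone */
	if (host_paged_out)
		io.host_reads = 1;	/* touching the page pulls it back in */
	io.guest_writes = 1;
	return io;
}

unsigned int io_total(struct io_count io)
{
	return io.host_writes + io.host_reads + io.guest_writes;
}
```

In this model io_total(guest_reclaim_one(true, false)) is 3 and
io_total(guest_reclaim_one(true, true)) is 1, which is the 2/3 reduction
mentioned above.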

-- Dave

Rusty Russell

2009-Mar-28 06:35 UTC

[patch 0/6] Guest page hinting version 7.

On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> Greetings,
> the circus is back in town -- another version of the guest page hinting
> patches. The patches differ from version 6 only in the kernel version,
> they apply against 2.6.29. My short sniff test showed that the code
> is still working as expected.
> 
> To recap (you can skip this if you read the boiler plate of the last
> version of the patches):
> The main benefit for guest page hinting vs. the ballooner is that there
> is no need for a monitor that keeps track of the memory usage of all the
> guests, a complex algorithm that calculates the working set sizes and for
> the calls into the guest kernel to control the size of the balloons.
I thought you weren't convinced of the concrete benefits over ballooning,
or am I misremembering?

Thanks,
Rusty.

Rik van Riel

2009-Apr-01 02:10 UTC

[patch 2/6] Guest page hinting: volatile swap cache.

Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
> 
> The volatile page state can be used for anonymous pages as well, if
> they have been added to the swap cache and the swap write is finished.
> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>
Acked-by: Rik van Riel <riel at redhat.com>

-- 
All rights reversed.

Rik van Riel

2009-Apr-01 02:52 UTC

[patch 3/6] Guest page hinting: mlocked pages.

Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
> 
> Add code to get mlock() working with guest page hinting. The problem
> with mlock is that locked pages may not be removed from page cache.
> That means they need to be stable. 
> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>
Acked-by: Rik van Riel <riel at redhat.com>

-- 
All rights reversed.

Rik van Riel

2009-Apr-01 13:25 UTC

[patch 4/6] Guest page hinting: writable page table entries.

Martin Schwidefsky wrote:

This code has me stumped.  Does it mean that if a page already
has the PageWritable bit set (and count_ok stays 0), we will
always mark the page as volatile?

How does that work out on !s390?
>  /**
> + * __page_check_writable() - check page state for new writable pte
> + *
> + * @page: the page the new writable pte refers to
> + * @pte: the new writable pte
> + */
> +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> +{
> +	int count_ok = 0;
> +
> +	preempt_disable();
> +	while (page_test_set_state_change(page))
> +		cpu_relax();
> +
> +	if (!TestSetPageWritable(page)) {
> +		count_ok = check_counts(page, offset);
> +		if (check_bits(page) && count_ok)
> +			page_set_volatile(page, 1);
> +		else
> +			/*
> +			 * If two processes create a write mapping at the
> +			 * same time check_counts will return false or if
> +			 * the page is currently isolated from the LRU
> +			 * check_bits will return false but the page might
> +			 * be in volatile state.
> +			 * We have to take care about the dirty bit so the
> +			 * only option left is to make the page stable but
> +			 * we can try to make it volatile a bit later.
> +			 */
> +			page_set_stable_if_present(page);
> +	}
> +	page_clear_state_change(page);
> +	if (!count_ok)
> +		page_make_volatile(page, 1);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(__page_check_writable);

-- 
All rights reversed.

Martin Schwidefsky

2009-Apr-01 14:36 UTC

[patch 4/6] Guest page hinting: writable page table entries.

On Wed, 01 Apr 2009 09:25:34 -0400
Rik van Riel <riel at redhat.com> wrote:
> Martin Schwidefsky wrote:
> 
> This code has me stumped.  Does it mean that if a page already
> has the PageWritable bit set (and count_ok stays 0), we will
> always mark the page as volatile?
> 
> How does that work out on !s390?
No, we will not always mark the page as volatile. If PG_writable is
already set, count_ok stays 0 and page_make_volatile is called. This
differs from page_set_volatile in that it repeats all the required
checks and then calls page_set_volatile with PageWritable(page) as the
second argument. Which state the page ends up in depends on the
architecture's definition of page_set_volatile. On s390 this does a
state transition to potentially volatile because the PG_writable bit is
set. On architectures that cannot check the dirty bit on a
per-physical-page basis, you need to make the page stable.
> >  /**
> > + * __page_check_writable() - check page state for new writable pte
> > + *
> > + * @page: the page the new writable pte refers to
> > + * @pte: the new writable pte
> > + */
> > +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> > +{
> > +	int count_ok = 0;
> > +
> > +	preempt_disable();
> > +	while (page_test_set_state_change(page))
> > +		cpu_relax();
> > +
> > +	if (!TestSetPageWritable(page)) {
> > +		count_ok = check_counts(page, offset);
> > +		if (check_bits(page) && count_ok)
> > +			page_set_volatile(page, 1);
> > +		else
> > +			/*
> > +			 * If two processes create a write mapping at the
> > +			 * same time check_counts will return false or if
> > +			 * the page is currently isolated from the LRU
> > +			 * check_bits will return false but the page might
> > +			 * be in volatile state.
> > +			 * We have to take care about the dirty bit so the
> > +			 * only option left is to make the page stable but
> > +			 * we can try to make it volatile a bit later.
> > +			 */
> > +			page_set_stable_if_present(page);
> > +	}
> > +	page_clear_state_change(page);
> > +	if (!count_ok)
> > +		page_make_volatile(page, 1);
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL(__page_check_writable);
> 
> 
-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

Rik van Riel

2009-Apr-01 14:45 UTC

[patch 4/6] Guest page hinting: writable page table entries.

Martin Schwidefsky wrote:
> On Wed, 01 Apr 2009 09:25:34 -0400
> Rik van Riel <riel at redhat.com> wrote:
> 
>> Martin Schwidefsky wrote:
>>
>> This code has me stumped.  Does it mean that if a page already
>> has the PageWritable bit set (and count_ok stays 0), we will
>> always mark the page as volatile?
>>
>> How does that work out on !s390?
> 
> No, we will not always mark the page as volatile. If PG_writable is
> already set count_ok will stay 0 and a call to page_make_volatile is
> done. This differs from page_set_volatile as it repeats all the
> required checks, then calls page_set_volatile with a PageWritable(page)
> as second argument. What state the page will get depends on the
> architecture definition of page_set_volatile. For s390 this will do a
> state transition to potentially volatile as the PG_writable bit is set.
> On architecture that cannot check the dirty bit on a physical page basis
> you need to make the page stable.
Good point. I guess that means patch 4/6 checks out right, then :)

Acked-by: Rik van Riel <riel at redhat.com>

-- 
All rights reversed.

Rik van Riel

2009-Apr-01 15:33 UTC

[patch 5/6] Guest page hinting: minor fault optimization.

Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
> 
> One of the challenges of the guest page hinting scheme is the cost for
> the state transitions. If the cost gets too high the whole concept of
> page state information is in question. Therefore it is important to
> avoid the state transitions when possible. 
> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>
Acked-by: Rik van Riel <riel at redhat.com>

-- 
All rights reversed.

Rik van Riel

2009-Apr-01 16:18 UTC

[patch 6/6] Guest page hinting: s390 support.

Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
> 
> s390 uses the milli-coded ESSA instruction to set the page state. The
> page state is formed by four guest page states called block usage states
> and three host page states called block content states.
> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>
Acked-by: Rik van Riel <riel at redhat.com>
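
For readers who want to relate the description above to the tables in the
cover letter, here is a small sketch of the combined page state. It assumes
that the four block usage states are the S/U/P/V rows and that the three
block content states are the r/p/z columns. The enum values are arbitrary
and are not the encodings used by the ESSA instruction.

```c
#include <assert.h>

/* Guest side: block usage states (rows S, U, P, V in the tables). */
enum block_usage {
	BU_STABLE, BU_UNUSED, BU_POT_VOLATILE, BU_VOLATILE, BU_COUNT
};

/* Host side: block content states (columns r, p, z in the tables). */
enum block_content { BC_RESIDENT, BC_PRESERVED, BC_ZERO, BC_COUNT };

struct block_state {
	enum block_usage usage;
	enum block_content content;
};

/* Index of a combined state in a BU_COUNT x BC_COUNT counter table. */
unsigned int block_state_index(struct block_state s)
{
	assert(s.usage < BU_COUNT && s.content < BC_COUNT);
	return (unsigned int)s.usage * BC_COUNT + (unsigned int)s.content;
}

/* Row letter as printed in the tables. */
char block_usage_letter(enum block_usage u)
{
	assert(u < BU_COUNT);
	return "SUPV"[u];
}

/* Column letter as printed in the tables. */
char block_content_letter(enum block_content c)
{
	assert(c < BC_COUNT);
	return "rpz"[c];
}
```

A statistics array of BU_COUNT * BC_COUNT counters indexed this way would
hold the 12 cells that the tables print per snapshot.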

-- 
All rights reversed.
