Greetings,
the circus is back in town -- another version of the guest page hinting
patches. The patches differ from version 6 only in the kernel version,
they apply against 2.6.29. My short sniff test showed that the code
is still working as expected.

To recap (you can skip this if you read the boiler plate of the last
version of the patches):
The main benefit for guest page hinting vs. the ballooner is that there
is no need for a monitor that keeps track of the memory usage of all the
guests, a complex algorithm that calculates the working set sizes and for
the calls into the guest kernel to control the size of the balloons.
The host just does normal LRU based paging. If the host picks one of the
pages the guest can recreate, the host can throw it away instead of writing
it to the paging device. Simple and elegant.
The main disadvantage is the added complexity that is introduced to the
guest's memory management code to do the page state changes and to deal
with discard faults.

Right after booting the page states on my 256 MB z/VM guest looked like
this (r=resident, p=preserved, z=zero, S=stable, U=unused,
P=potentially volatile, V=volatile):

<state>|--tot--|---r---|---p---|---z---|
    S  |  19719|  19673|      0|     46|
    U  | 235416|   2734|      0| 232682|
    P  |      1|      1|      0|      0|
    V  |   7008|   7008|      0|      0|
 tot-> | 262144|  29416|      0| 232728|

About 25% of the resident pages are in volatile state. After grepping
through the linux source tree this picture changes:

<state>|--tot--|---r---|---p---|---z---|
    S  |  43784|  43744|      0|     40|
    U  |  78631|   2397|      0|  76234|
    P  |      2|      2|      0|      0|
    V  | 139727| 139727|      0|      0|
 tot-> | 262144| 185870|      0|  76274|

About 75% of the resident pages are now volatile. Depending on the
workload you will get different results.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
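The host-side decision in the recap ("If the host picks one of the pages the guest can recreate, the host can throw it away") can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the enum and function names are invented for illustration and do not come from the patches, and the handling of unused and potentially volatile pages (discard unless dirty) is an assumption about the host that this thread does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Guest page states, named after the legend of the tables above. */
enum guest_state { GS_STABLE, GS_UNUSED, GS_POT_VOLATILE, GS_VOLATILE };

/* Hypothetical host actions for a page picked from the host LRU. */
enum host_action { HOST_WRITE_TO_PAGING_DEVICE, HOST_DISCARD };

/*
 * Decide what the host does with a page it wants to evict.
 * 'dirty' only matters for potentially volatile pages (assumption:
 * such a page may be discarded only while it is still clean).
 */
static enum host_action host_evict(enum guest_state st, bool dirty)
{
    assert(st <= GS_VOLATILE);

    switch (st) {
    case GS_VOLATILE:
    case GS_UNUSED:
        /* The guest can recreate the content or does not need it. */
        return HOST_DISCARD;
    case GS_POT_VOLATILE:
        return dirty ? HOST_WRITE_TO_PAGING_DEVICE : HOST_DISCARD;
    case GS_STABLE:
    default:
        /* The content must survive, so it costs a paging write. */
        return HOST_WRITE_TO_PAGING_DEVICE;
    }
}
```

With the second table above, roughly three quarters of the resident pages would take the HOST_DISCARD path in this model.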
Martin Schwidefsky
2009-Mar-27 15:09 UTC
[patch 1/6] Guest page hinting: core + volatile page cache.
An embedded and charset-unspecified text was scrubbed... Name: 001-hva-core.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/5330ff63/attachment.txt
Martin Schwidefsky
2009-Mar-27 15:09 UTC
[patch 2/6] Guest page hinting: volatile swap cache.
An embedded and charset-unspecified text was scrubbed... Name: 002-hva-swap.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/8a184a96/attachment.txt
An embedded and charset-unspecified text was scrubbed... Name: 003-hva-mlock.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/3f48764b/attachment.txt
Martin Schwidefsky
2009-Mar-27 15:09 UTC
[patch 4/6] Guest page hinting: writable page table entries.
An embedded and charset-unspecified text was scrubbed... Name: 004-hva-prot.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/b43abc1f/attachment.txt
Martin Schwidefsky
2009-Mar-27 15:09 UTC
[patch 5/6] Guest page hinting: minor fault optimization.
An embedded and charset-unspecified text was scrubbed... Name: 005-hva-nohv.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/22ab8076/attachment.txt
An embedded and charset-unspecified text was scrubbed... Name: 006-hva-s390.diff Url: http://lists.linux-foundation.org/pipermail/virtualization/attachments/20090327/c0a6d14d/attachment.txt
Rik van Riel
2009-Mar-27 22:57 UTC
[patch 1/6] Guest page hinting: core + volatile page cache.
Martin Schwidefsky wrote:

> The major obstacles that need to get addressed:
> * Concurrent page state changes:
>   To guard against concurrent page state updates some kind of lock
>   is needed. If page_make_volatile() has already done the 11 checks it
>   will issue the state change primitive. If in the meantime one of
>   the conditions has changed the user that requires that page in
>   stable state will have to wait in the page_make_stable() function
>   until the make volatile operation has finished. It is up to the
>   architecture to define how this is done with the three primitives
>   page_test_set_state_change, page_clear_state_change and
>   page_state_change.
>   There are some alternatives how this can be done, e.g. a global
>   lock, or lock per segment in the kernel page table, or the per page
>   bit PG_arch_1 if it is still free.

Can this be taken care of by memory barriers and
careful ordering of operations?

If we consider the states unused -> volatile -> stable
as progressively higher, "upgrades" can be done before
any kernel operation that requires the page to be in
that state (but after setting up the things that allow
it to be found), while downgrades can be done after the
kernel is done with needing the page at a higher level.

Since the downgrade checks for users that need the page
in a higher state, no lock should be required.

In fact, it may be possible to manage the page state
bitmap with compare-and-swap, without needing a call
to the hypervisor.

> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>

Some comments and questions in line.

> @@ -601,6 +604,21 @@ copy_one_pte(struct mm_struct *dst_mm, s
>
>  out_set_pte:
>  	set_pte_at(dst_mm, addr, dst_pte, pte);
> +	return;
> +
> +out_discard_pte:
> +	/*
> +	 * If the page referred by the pte has the PG_discarded bit set,
> +	 * copy_one_pte is racing with page_discard. The pte may not be
> +	 * copied or we can end up with a pte pointing to a page not
> +	 * in the page cache anymore. Do what try_to_unmap_one would do
> +	 * if the copy_one_pte had taken place before page_discard.
> +	 */
> +	if (page->index != linear_page_index(vma, addr))
> +		/* If nonlinear, store the file page offset in the pte. */
> +		set_pte_at(dst_mm, addr, dst_pte, pgoff_to_pte(page->index));
> +	else
> +		pte_clear(dst_mm, addr, dst_pte);
>  }

It would be good to document that PG_discarded can only happen for
file pages and NOT for e.g. clean swap cache pages.

> @@ -1390,6 +1391,7 @@ int test_clear_page_writeback(struct pag
>  			radix_tree_tag_clear(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
> +			page_make_volatile(page, 1);
>  			if (bdi_cap_account_writeback(bdi)) {
>  				__dec_bdi_stat(bdi, BDI_WRITEBACK);
>  				__bdi_writeout_inc(bdi);

Does this mark the page volatile before the IO writing the
dirty data back to disk has even started?  Is that OK?

--
All rights reversed.
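The compare-and-swap idea raised above (states ordered unused -> volatile -> stable, upgrades before use, downgrades after use) could look roughly like the following user-space sketch with C11 atomics. Everything here is hypothetical: the struct, the `users` counter and both function names are invented for illustration and are not part of the patch set. The sketch shows only the atomic transitions; it does not close the window between the `users` check and the downgrade, which is exactly what the "careful ordering" would have to address.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* States ordered so that a larger value is a "higher" state. */
enum page_state { PS_UNUSED = 0, PS_VOLATILE = 1, PS_STABLE = 2 };

struct hint_page {
    _Atomic int state;  /* one of enum page_state */
    _Atomic int users;  /* hypothetical count of users needing PS_STABLE */
};

/* Raise the state to at least 'want'; never lowers it. */
static void page_upgrade(struct hint_page *p, int want)
{
    assert(want >= PS_UNUSED && want <= PS_STABLE);

    int cur = atomic_load(&p->state);
    while (cur < want &&
           !atomic_compare_exchange_weak(&p->state, &cur, want))
        ;  /* a failed CAS reloads 'cur'; the loop re-checks it */
}

/*
 * Try to lower stable -> volatile.  Refuses while a user is
 * registered, and fails if the page is not currently stable.
 */
static bool page_try_downgrade(struct hint_page *p)
{
    int expected = PS_STABLE;

    if (atomic_load(&p->users) > 0)
        return false;
    return atomic_compare_exchange_strong(&p->state, &expected,
                                          PS_VOLATILE);
}
```

A caller that needs the page stable would register itself in `users` first and then call `page_upgrade(p, PS_STABLE)`; whether that ordering alone is sufficient without the state-change lock is the open question in this mail.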
On Fri, 2009-03-27 at 16:09 +0100, Martin Schwidefsky wrote:
> If the host picks one of the
> pages the guest can recreate, the host can throw it away instead of writing
> it to the paging device. Simple and elegant.

Heh, simple and elegant for the hypervisor.  But I'm not sure I'm going
to call *anything* that requires a new CPU instruction elegant. ;)

I don't see any description of it in there any more, but I thought this
entire patch set was to get rid of the idiotic triple I/Os in the
following scenario:

1. Hypervisor picks a page and evicts it out to disk, pays the I/O cost
   to get it written out. (I/O #1)
2. Linux comes along (being a bit late to the party) and picks the same
   page, also decides it needs to be out to disk
3. Linux tries to write the page to disk, but touches it in the
   process, pulling the page back in from the store where the
   hypervisor wrote it. (I/O #2)
4. Linux writes the page to its swap device (I/O #3)

I don't see that mentioned at all in the current description.
Simplifying the hypervisor is hard to get behind, but cutting system I/O
by 2/3 is a much nicer benefit for 1200 lines of invasive code. ;)

Can we persuade the hypervisor to tell us which pages it decided to page
out and just skip those when we're scanning the LRU?

-- Dave
On Saturday 28 March 2009 01:39:05 Martin Schwidefsky wrote:
> Greetings,
> the circus is back in town -- another version of the guest page hinting
> patches. The patches differ from version 6 only in the kernel version,
> they apply against 2.6.29. My short sniff test showed that the code
> is still working as expected.
>
> To recap (you can skip this if you read the boiler plate of the last
> version of the patches):
> The main benefit for guest page hinting vs. the ballooner is that there
> is no need for a monitor that keeps track of the memory usage of all the
> guests, a complex algorithm that calculates the working set sizes and for
> the calls into the guest kernel to control the size of the balloons.

I thought you weren't convinced of the concrete benefits over ballooning,
or am I misremembering?

Thanks,
Rusty.
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
>
> The volatile page state can be used for anonymous pages as well, if
> they have been added to the swap cache and the swap write is finished.

> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>

Acked-by: Rik van Riel <riel at redhat.com>

--
All rights reversed.
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
>
> Add code to get mlock() working with guest page hinting. The problem
> with mlock is that locked pages may not be removed from page cache.
> That means they need to be stable.

> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>

Acked-by: Rik van Riel <riel at redhat.com>

--
All rights reversed.
Rik van Riel
2009-Apr-01 13:25 UTC
[patch 4/6] Guest page hinting: writable page table entries.
Martin Schwidefsky wrote:

This code has me stumped.  Does it mean that if a page already
has the PageWritable bit set (and count_ok stays 0), we will
always mark the page as volatile?

How does that work out on !s390?

> /**
> + * __page_check_writable() - check page state for new writable pte
> + *
> + * @page: the page the new writable pte refers to
> + * @pte: the new writable pte
> + */
> +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> +{
> +	int count_ok = 0;
> +
> +	preempt_disable();
> +	while (page_test_set_state_change(page))
> +		cpu_relax();
> +
> +	if (!TestSetPageWritable(page)) {
> +		count_ok = check_counts(page, offset);
> +		if (check_bits(page) && count_ok)
> +			page_set_volatile(page, 1);
> +		else
> +			/*
> +			 * If two processes create a write mapping at the
> +			 * same time check_counts will return false or if
> +			 * the page is currently isolated from the LRU
> +			 * check_bits will return false but the page might
> +			 * be in volatile state.
> +			 * We have to take care about the dirty bit so the
> +			 * only option left is to make the page stable but
> +			 * we can try to make it volatile a bit later.
> +			 */
> +			page_set_stable_if_present(page);
> +	}
> +	page_clear_state_change(page);
> +	if (!count_ok)
> +		page_make_volatile(page, 1);
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL(__page_check_writable);

--
All rights reversed.
Martin Schwidefsky
2009-Apr-01 14:36 UTC
[patch 4/6] Guest page hinting: writable page table entries.
On Wed, 01 Apr 2009 09:25:34 -0400
Rik van Riel <riel at redhat.com> wrote:

> Martin Schwidefsky wrote:
>
> This code has me stumped.  Does it mean that if a page already
> has the PageWritable bit set (and count_ok stays 0), we will
> always mark the page as volatile?
>
> How does that work out on !s390?

No, we will not always mark the page as volatile. If PG_writable is
already set count_ok will stay 0 and a call to page_make_volatile is
done. This differs from page_set_volatile as it repeats all the
required checks, then calls page_set_volatile with PageWritable(page)
as second argument. What state the page will get depends on the
architecture definition of page_set_volatile. For s390 this will do a
state transition to potentially volatile as the PG_writable bit is set.
On architectures that cannot check the dirty bit on a physical page basis
you need to make the page stable.

> > /**
> > + * __page_check_writable() - check page state for new writable pte
> > + *
> > + * @page: the page the new writable pte refers to
> > + * @pte: the new writable pte
> > + */
> > +void __page_check_writable(struct page *page, pte_t pte, unsigned int offset)
> > +{
> > +	int count_ok = 0;
> > +
> > +	preempt_disable();
> > +	while (page_test_set_state_change(page))
> > +		cpu_relax();
> > +
> > +	if (!TestSetPageWritable(page)) {
> > +		count_ok = check_counts(page, offset);
> > +		if (check_bits(page) && count_ok)
> > +			page_set_volatile(page, 1);
> > +		else
> > +			/*
> > +			 * If two processes create a write mapping at the
> > +			 * same time check_counts will return false or if
> > +			 * the page is currently isolated from the LRU
> > +			 * check_bits will return false but the page might
> > +			 * be in volatile state.
> > +			 * We have to take care about the dirty bit so the
> > +			 * only option left is to make the page stable but
> > +			 * we can try to make it volatile a bit later.
> > +			 */
> > +			page_set_stable_if_present(page);
> > +	}
> > +	page_clear_state_change(page);
> > +	if (!count_ok)
> > +		page_make_volatile(page, 1);
> > +	preempt_enable();
> > +}
> > +EXPORT_SYMBOL(__page_check_writable);
> >

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
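The distinction drawn in the reply above, page_make_volatile repeating the checks and then delegating to an architecture-defined page_set_volatile, can be mimicked in a small user-space model. This is a sketch only and not the kernel code: the `model_` names, the `checks_pass` flag (standing in for the real checks) and the `has_page_dirty_bit` switch (standing in for "s390 versus an architecture without a per-physical-page dirty bit") are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { ST_STABLE, ST_VOLATILE, ST_POT_VOLATILE };

struct model_page {
    bool writable;     /* models PG_writable */
    bool checks_pass;  /* stand-in for the checks page_make_volatile repeats */
    enum model_state state;
};

/*
 * Models the architecture hook.  A page with a writable mapping can
 * only become potentially volatile, and only if the architecture can
 * test the dirty bit per physical page; otherwise it has to be stable.
 */
static void model_set_volatile(struct model_page *pg, bool writable,
                               bool has_page_dirty_bit)
{
    assert(pg != NULL);

    if (!writable)
        pg->state = ST_VOLATILE;
    else if (has_page_dirty_bit)
        pg->state = ST_POT_VOLATILE;
    else
        pg->state = ST_STABLE;
}

/* Models page_make_volatile: redo the checks, then delegate. */
static void model_make_volatile(struct model_page *pg,
                                bool has_page_dirty_bit)
{
    assert(pg != NULL);

    if (!pg->checks_pass)
        return;  /* conditions not met: leave the state untouched */
    model_set_volatile(pg, pg->writable, has_page_dirty_bit);
}
```

In this model a page that already has the writable bit set ends up potentially volatile when `has_page_dirty_bit` is true and stable otherwise, which is the answer given to the "!s390" question.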
Rik van Riel
2009-Apr-01 14:45 UTC
[patch 4/6] Guest page hinting: writable page table entries.
Martin Schwidefsky wrote:
> On Wed, 01 Apr 2009 09:25:34 -0400
> Rik van Riel <riel at redhat.com> wrote:
>
>> Martin Schwidefsky wrote:
>>
>> This code has me stumped.  Does it mean that if a page already
>> has the PageWritable bit set (and count_ok stays 0), we will
>> always mark the page as volatile?
>>
>> How does that work out on !s390?
>
> No, we will not always mark the page as volatile. If PG_writable is
> already set count_ok will stay 0 and a call to page_make_volatile is
> done. This differs from page_set_volatile as it repeats all the
> required checks, then calls page_set_volatile with PageWritable(page)
> as second argument. What state the page will get depends on the
> architecture definition of page_set_volatile. For s390 this will do a
> state transition to potentially volatile as the PG_writable bit is set.
> On architectures that cannot check the dirty bit on a physical page basis
> you need to make the page stable.

Good point.  I guess that means patch 4/6 checks out right, then :)

Acked-by: Rik van Riel <riel at redhat.com>

--
All rights reversed.
Rik van Riel
2009-Apr-01 15:33 UTC
[patch 5/6] Guest page hinting: minor fault optimization.
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
>
> One of the challenges of the guest page hinting scheme is the cost for
> the state transitions. If the cost gets too high the whole concept of
> page state information is in question. Therefore it is important to
> avoid the state transitions when possible.

> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>

Acked-by: Rik van Riel <riel at redhat.com>

--
All rights reversed.
Martin Schwidefsky wrote:
> From: Martin Schwidefsky <schwidefsky at de.ibm.com>
> From: Hubertus Franke <frankeh at watson.ibm.com>
> From: Himanshu Raj
>
> s390 uses the milli-coded ESSA instruction to set the page state. The
> page state is formed by four guest page states called block usage states
> and three host page states called block content states.

> Signed-off-by: Martin Schwidefsky <schwidefsky at de.ibm.com>

Acked-by: Rik van Riel <riel at redhat.com>

--
All rights reversed.