Rework populate-on-demand sweeping

Last summer I did some work on testing whether our PoD sweeping code was
achieving its goals: namely, never crashing unnecessarily, minimizing boot
time, and maximizing the number of superpages in the p2m table.  This is one
of the resulting patch series.  I'm posting it to make sure that maintainers
think it's still suitable for inclusion in 4.2.

The patches against 4.1 have been tested extensively in the XenServer testing
framework and have been in use by XenServer customers for over 9 months now.
But the p2m code has changed extensively in that time, so one could argue that
the testing doesn't give us the same degree of confidence in the patches
against 4.2 as against 4.1.  (On the other hand, the PoD code hasn't changed
that much.)

I haven't done more than compile-test it at this point, so please just
review ideas and "is this 4.2 material".  If I get positive feedback,
I'll do more testing and re-submit.
George Dunlap
2012-Jun-08 11:45 UTC
[PATCH 1 of 2 RFC] xen, pod: Zero-check recently populated pages (checklast)
When demand-populating pages due to guest accesses, check recently populated
pages to see if we can reclaim them for the cache.  This should keep the PoD
cache filled when the start-of-day scrubber is going through.

The number 128 was chosen by experiment.  Windows does its page
scrubbing in parallel; while a small number like 4 works well for
single VMs, it breaks down as multiple vcpus are scrubbing different
pages in parallel.  Increasing to 128 works well for higher numbers of
vcpus.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -909,6 +909,26 @@ p2m_pod_emergency_sweep_super(struct p2m
     p2m->pod.reclaim_super = i ? i - SUPERPAGE_PAGES : 0;
 }

+/* When populating a new superpage, look at recently populated superpages
+ * hoping that they've been zeroed.  This will snap up zeroed pages as soon as
+ * the guest OS is done with them. */
+static void
+p2m_pod_check_last_super(struct p2m_domain *p2m, unsigned long gfn_aligned)
+{
+    unsigned long check_gfn;
+
+    ASSERT(p2m->pod.last_populated_index < POD_HISTORY_MAX);
+
+    check_gfn = p2m->pod.last_populated[p2m->pod.last_populated_index];
+
+    p2m->pod.last_populated[p2m->pod.last_populated_index] = gfn_aligned;
+
+    p2m->pod.last_populated_index = ( p2m->pod.last_populated_index + 1 ) % POD_HISTORY_MAX;
+
+    p2m->pod.reclaim_super += p2m_pod_zero_check_superpage(p2m, check_gfn);
+}
+
+
 #define POD_SWEEP_STRIDE  16
 static void
 p2m_pod_emergency_sweep(struct p2m_domain *p2m)
@@ -1066,6 +1086,12 @@ p2m_pod_demand_populate(struct p2m_domai
         __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
     }

+    /* Check the last guest demand-populate */
+    if ( p2m->pod.entry_count > p2m->pod.count
+         && order == 9
+         && q & P2M_ALLOC )
+        p2m_pod_check_last_super(p2m, gfn_aligned);
+
     pod_unlock(p2m);
     return 0;
 out_of_memory:
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -287,6 +287,9 @@ struct p2m_domain {
         unsigned         reclaim_super;    /* Last gpfn of a scan */
         unsigned         reclaim_single;   /* Last gpfn of a scan */
         unsigned         max_guest;        /* gpfn of max guest demand-populate */
+#define POD_HISTORY_MAX 128
+        unsigned         last_populated[POD_HISTORY_MAX]; /* gpfn of last guest page demand-populated */
+        int              last_populated_index;
         mm_lock_t        lock;         /* Locking of private pod structs,   *
                                         * not relying on the p2m lock.      */
     } pod;
George Dunlap
2012-Jun-08 11:45 UTC
[PATCH 2 of 2 RFC] xen, pod: Only sweep in an emergency, and only for 4k pages
Testing has shown that doing sweeps for superpages slows down boot
significantly, but does not result in a significantly higher number of
superpages after boot.  Early sweeping for 4k pages causes superpages to be
broken up unnecessarily.

Only sweep if we're really out of memory.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -880,34 +880,6 @@ p2m_pod_zero_check(struct p2m_domain *p2
 }

 #define POD_SWEEP_LIMIT 1024
-static void
-p2m_pod_emergency_sweep_super(struct p2m_domain *p2m)
-{
-    unsigned long i, start, limit;
-
-    if ( p2m->pod.reclaim_super == 0 )
-    {
-        p2m->pod.reclaim_super = (p2m->pod.max_guest>>PAGE_ORDER_2M)<<PAGE_ORDER_2M;
-        p2m->pod.reclaim_super -= SUPERPAGE_PAGES;
-    }
-
-    start = p2m->pod.reclaim_super;
-    limit = (start > POD_SWEEP_LIMIT) ? (start - POD_SWEEP_LIMIT) : 0;
-
-    for ( i=p2m->pod.reclaim_super ; i > 0 ; i -= SUPERPAGE_PAGES )
-    {
-        p2m_pod_zero_check_superpage(p2m, i);
-        /* Stop if we're past our limit and we have found *something*.
-         *
-         * NB that this is a zero-sum game; we're increasing our cache size
-         * by increasing our 'debt'.  Since we hold the p2m lock,
-         * (entry_count - count) must remain the same. */
-        if ( !page_list_empty(&p2m->pod.super) && i < limit )
-            break;
-    }
-
-    p2m->pod.reclaim_super = i ? i - SUPERPAGE_PAGES : 0;
-}

 /* When populating a new superpage, look at recently populated superpages
  * hoping that they've been zeroed.  This will snap up zeroed pages as soon as
@@ -1021,34 +993,18 @@ p2m_pod_demand_populate(struct p2m_domai
         return 0;
     }

-    /* Once we've ballooned down enough that we can fill the remaining
-     * PoD entries from the cache, don't sweep even if the particular
-     * list we want to use is empty: that can lead to thrashing zero pages
-     * through the cache for no good reason.  */
-    if ( p2m->pod.entry_count > p2m->pod.count )
-    {
+    /* Only sweep if we're actually out of memory.  Doing anything else
+     * causes unnecessary time and fragmentation of superpages in the p2m. */
+    if ( p2m->pod.count == 0 )
+        p2m_pod_emergency_sweep(p2m);

-        /* If we're low, start a sweep */
-        if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
-            /* Note that sweeps scan other ranges in the p2m.  In an scenario
-             * in which p2m locks are fine-grained, this may result in deadlock.
-             * Using trylock on the gfn's as we sweep would avoid it. */
-            p2m_pod_emergency_sweep_super(p2m);
-
-        if ( page_list_empty(&p2m->pod.single) &&
-             ( ( order == PAGE_ORDER_4K )
-               || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
-            /* Same comment regarding deadlock applies */
-            p2m_pod_emergency_sweep(p2m);
-    }
+    if ( p2m->pod.count == 0 )
+        goto out_of_memory;

     /* Keep track of the highest gfn demand-populated by a guest fault */
     if ( gfn > p2m->pod.max_guest )
         p2m->pod.max_guest = gfn;

-    if ( p2m->pod.count == 0 )
-        goto out_of_memory;
-
     /* Get a page f/ the cache.  A NULL return value indicates that the
      * 2-meg range should be marked singleton PoD, and retried */
     if ( (p = p2m_pod_cache_get(p2m, order)) == NULL )
Jan Beulich
2012-Jun-08 12:02 UTC
Re: [PATCH 1 of 2 RFC] xen, pod: Zero-check recently populated pages (checklast)
>>> On 08.06.12 at 13:45, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> When demand-populating pages due to guest accesses, check recently populated
> pages to see if we can reclaim them for the cache.  This should keep the PoD
> cache filled when the start-of-day scrubber is going through.
>
> The number 128 was chosen by experiment.  Windows does its page
> scrubbing in parallel; while a small number like 4 works well for
> single VMs, it breaks down as multiple vcpus are scrubbing different
> pages in parallel.  Increasing to 128 works well for higher numbers of
> vcpus.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>
> diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -909,6 +909,26 @@ p2m_pod_emergency_sweep_super(struct p2m
>      p2m->pod.reclaim_super = i ? i - SUPERPAGE_PAGES : 0;
>  }
>
> +/* When populating a new superpage, look at recently populated superpages
> + * hoping that they've been zeroed.  This will snap up zeroed pages as soon as
> + * the guest OS is done with them. */
> +static void
> +p2m_pod_check_last_super(struct p2m_domain *p2m, unsigned long gfn_aligned)
> +{
> +    unsigned long check_gfn;
> +
> +    ASSERT(p2m->pod.last_populated_index < POD_HISTORY_MAX);
> +
> +    check_gfn = p2m->pod.last_populated[p2m->pod.last_populated_index];
> +
> +    p2m->pod.last_populated[p2m->pod.last_populated_index] = gfn_aligned;
> +
> +    p2m->pod.last_populated_index = ( p2m->pod.last_populated_index + 1 ) % POD_HISTORY_MAX;
> +
> +    p2m->pod.reclaim_super += p2m_pod_zero_check_superpage(p2m, check_gfn);
> +}
> +
> +
>  #define POD_SWEEP_STRIDE  16
>  static void
>  p2m_pod_emergency_sweep(struct p2m_domain *p2m)
> @@ -1066,6 +1086,12 @@ p2m_pod_demand_populate(struct p2m_domai
>          __trace_var(TRC_MEM_POD_POPULATE, 0, sizeof(t), &t);
>      }
>
> +    /* Check the last guest demand-populate */
> +    if ( p2m->pod.entry_count > p2m->pod.count
> +         && order == 9
> +         && q & P2M_ALLOC )
> +        p2m_pod_check_last_super(p2m, gfn_aligned);
> +
>      pod_unlock(p2m);
>      return 0;
>  out_of_memory:
> diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -287,6 +287,9 @@ struct p2m_domain {
>          unsigned         reclaim_super;    /* Last gpfn of a scan */
>          unsigned         reclaim_single;   /* Last gpfn of a scan */
>          unsigned         max_guest;        /* gpfn of max guest demand-populate */
> +#define POD_HISTORY_MAX 128
> +        unsigned         last_populated[POD_HISTORY_MAX]; /* gpfn of last guest page demand-populated */

unsigned long?

Also, wouldn't it be better to allocate this table dynamically, at
once allowing its size to scale with the number of vCPU-s in the
guest?

> +        int              last_populated_index;

'unsigned int' is generally better suited for array indexes (and
definitely on x86-64).

Jan

>          mm_lock_t        lock;         /* Locking of private pod structs,   *
>                                          * not relying on the p2m lock.      */
>      } pod;
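For illustration, a minimal sketch of the dynamic-allocation idea Jan describes, sizing the history with the guest's vcpu count at the point PoD is set up. The helper name, the 32-slots-per-vcpu factor, and the replacement struct fields are assumptions made for the sketch, not code from this series; xzalloc_array() and d->max_vcpus are the hypervisor's existing interfaces.

/* Illustrative only: size the populate history by vcpu count instead of a
 * fixed POD_HISTORY_MAX.  Assumes the pod struct gains
 *     unsigned long *last_populated;
 *     unsigned int last_populated_size, last_populated_index;
 * in place of the fixed array. */
static int p2m_pod_init_history(struct p2m_domain *p2m)
{
    struct domain *d = p2m->domain;

    /* 32 slots per vcpu is a guess, not a tested heuristic. */
    p2m->pod.last_populated_size = d->max_vcpus * 32;
    p2m->pod.last_populated = xzalloc_array(unsigned long,
                                            p2m->pod.last_populated_size);
    if ( p2m->pod.last_populated == NULL )
        return -ENOMEM;

    p2m->pod.last_populated_index = 0;
    return 0;
}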
Tim Deegan
2012-Jun-14 09:07 UTC
Re: [PATCH 1 of 2 RFC] xen, pod: Zero-check recently populated pages (checklast)
At 13:02 +0100 on 08 Jun (1339160536), Jan Beulich wrote:
> >>> On 08.06.12 at 13:45, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > --- a/xen/include/asm-x86/p2m.h
> > +++ b/xen/include/asm-x86/p2m.h
> > @@ -287,6 +287,9 @@ struct p2m_domain {
> >          unsigned         reclaim_super;    /* Last gpfn of a scan */
> >          unsigned         reclaim_single;   /* Last gpfn of a scan */
> >          unsigned         max_guest;        /* gpfn of max guest demand-populate */
> > +#define POD_HISTORY_MAX 128
> > +        unsigned         last_populated[POD_HISTORY_MAX]; /* gpfn of last guest page demand-populated */

This is the gpfns of the last 128 order-9 superpages populated, right?

Also, this line is >80 columns - I think I saw a few others in this series.

> unsigned long?
>
> Also, wouldn't it be better to allocate this table dynamically, at
> once allowing its size to scale with the number of vCPU-s in the
> guest?

You could even make it a small per-vcpu array, assuming that the parallel
scrubbing will be symmetric across vcpus.

Cheers,

Tim.
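As a rough sketch of the per-vcpu variant, keeping a small fixed window per vcpu and recycling it exactly as the posted p2m_pod_check_last_super() does. The POD_HISTORY_PER_VCPU size and where the per-vcpu state would be stored are assumptions; none of this has been tested.

#define POD_HISTORY_PER_VCPU 8           /* assumed size, not benchmarked */

struct pod_vcpu_history {
    unsigned long gfn[POD_HISTORY_PER_VCPU];
    unsigned int index;
};

/* Same recycle-and-check logic as p2m_pod_check_last_super(), but against
 * the calling vcpu's own small ring rather than a global table. */
static void
p2m_pod_check_last_super_vcpu(struct p2m_domain *p2m,
                              struct pod_vcpu_history *h,
                              unsigned long gfn_aligned)
{
    unsigned long check_gfn = h->gfn[h->index];

    h->gfn[h->index] = gfn_aligned;
    h->index = (h->index + 1) % POD_HISTORY_PER_VCPU;

    p2m->pod.reclaim_super += p2m_pod_zero_check_superpage(p2m, check_gfn);
}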
Tim Deegan
2012-Jun-14 09:11 UTC
Re: [PATCH 2 of 2 RFC] xen, pod: Only sweep in an emergency, and only for 4k pages
At 11:45 +0000 on 08 Jun (1339155933), George Dunlap wrote:
> +    if ( p2m->pod.count == 0 )
> +        goto out_of_memory;
>
>      /* Keep track of the highest gfn demand-populated by a guest fault */
>      if ( gfn > p2m->pod.max_guest )
>          p2m->pod.max_guest = gfn;
>
> -    if ( p2m->pod.count == 0 )
> -        goto out_of_memory;
> -

Is this code motion just noise?  Since out_of_memory crashes the guest,
I assume so.

Cheers,

Tim.
Tim Deegan
2012-Jun-14 09:12 UTC
Re: [PATCH 0 of 2 RFC] Rework populate-on-demand sweeping
At 11:45 +0000 on 08 Jun (1339155931), George Dunlap wrote:
> I haven't done more than compile-test it at this point, so please just
> review ideas and "is this 4.2 material".  If I get positive feedback,
> I'll do more testing and re-submit.

Yes, AFAIC this is 4.2 material.

Tim.
George Dunlap
2012-Jun-14 12:42 UTC
Re: [PATCH 2 of 2 RFC] xen, pod: Only sweep in an emergency, and only for 4k pages
On Thu, Jun 14, 2012 at 10:11 AM, Tim Deegan <tim@xen.org> wrote:
> At 11:45 +0000 on 08 Jun (1339155933), George Dunlap wrote:
>> +    if ( p2m->pod.count == 0 )
>> +        goto out_of_memory;
>>
>>      /* Keep track of the highest gfn demand-populated by a guest fault */
>>      if ( gfn > p2m->pod.max_guest )
>>          p2m->pod.max_guest = gfn;
>>
>> -    if ( p2m->pod.count == 0 )
>> -        goto out_of_memory;
>> -
>
> Is this code motion just noise?  Since out_of_memory crashes the guest,
> I assume so.

It will have no practical effect, other than improving the efficiency
of crashing a guest by a few cycles, if that's what you're asking.
It's more about taste: it just seemed silly to make an assignment and
*then* see if you were going to crash. :-)

 -George
Tim Deegan
2012-Jun-14 13:13 UTC
Re: [PATCH 2 of 2 RFC] xen, pod: Only sweep in an emergency, and only for 4k pages
At 13:42 +0100 on 14 Jun (1339681375), George Dunlap wrote:
> On Thu, Jun 14, 2012 at 10:11 AM, Tim Deegan <tim@xen.org> wrote:
> > At 11:45 +0000 on 08 Jun (1339155933), George Dunlap wrote:
> >> +    if ( p2m->pod.count == 0 )
> >> +        goto out_of_memory;
> >>
> >>      /* Keep track of the highest gfn demand-populated by a guest fault */
> >>      if ( gfn > p2m->pod.max_guest )
> >>          p2m->pod.max_guest = gfn;
> >>
> >> -    if ( p2m->pod.count == 0 )
> >> -        goto out_of_memory;
> >> -
> >
> > Is this code motion just noise?  Since out_of_memory crashes the guest,
> > I assume so.
>
> It will have no practical effect, other than improving the efficiency
> of crashing a guest by a few cycles, if that's what you're asking.
> It's more about taste: it just seemed silly to make an assignment and
> *then* see if you were going to crash. :-)

Well, yes. :)  I don't think it quite falls under the description of the
patch, though.  Can you split it out as a cleanup patch?

Cheers,

Tim.
George Dunlap
2012-Jun-14 13:32 UTC
Re: [PATCH 2 of 2 RFC] xen, pod: Only sweep in an emergency, and only for 4k pages
[re-including xen-devel in the cc]

On Thu, Jun 14, 2012 at 2:13 PM, Tim Deegan <tim@xen.org> wrote:
> At 13:42 +0100 on 14 Jun (1339681375), George Dunlap wrote:
>> On Thu, Jun 14, 2012 at 10:11 AM, Tim Deegan <tim@xen.org> wrote:
>> > At 11:45 +0000 on 08 Jun (1339155933), George Dunlap wrote:
>> >> +    if ( p2m->pod.count == 0 )
>> >> +        goto out_of_memory;
>> >>
>> >>      /* Keep track of the highest gfn demand-populated by a guest fault */
>> >>      if ( gfn > p2m->pod.max_guest )
>> >>          p2m->pod.max_guest = gfn;
>> >>
>> >> -    if ( p2m->pod.count == 0 )
>> >> -        goto out_of_memory;
>> >> -
>> >
>> > Is this code motion just noise?  Since out_of_memory crashes the guest,
>> > I assume so.
>>
>> It will have no practical effect, other than improving the efficiency
>> of crashing a guest by a few cycles, if that's what you're asking.
>> It's more about taste: it just seemed silly to make an assignment and
>> *then* see if you were going to crash. :-)
>
> Well, yes. :)  I don't think it quite falls under the description of the
> patch, though.  Can you split it out as a cleanup patch?

If that's what you meant, you should have said so. :-)  I shall do so for
the next series.

 -George
George Dunlap
2012-Jun-14 14:24 UTC
Re: [PATCH 1 of 2 RFC] xen, pod: Zero-check recently populated pages (checklast)
On 14/06/12 10:07, Tim Deegan wrote:
>
> At 13:02 +0100 on 08 Jun (1339160536), Jan Beulich wrote:
>>>>> On 08.06.12 at 13:45, George Dunlap<george.dunlap@eu.citrix.com> wrote:
>>> --- a/xen/include/asm-x86/p2m.h
>>> +++ b/xen/include/asm-x86/p2m.h
>>> @@ -287,6 +287,9 @@ struct p2m_domain {
>>>          unsigned         reclaim_super;    /* Last gpfn of a scan */
>>>          unsigned         reclaim_single;   /* Last gpfn of a scan */
>>>          unsigned         max_guest;        /* gpfn of max guest demand-populate */
>>> +#define POD_HISTORY_MAX 128
>>> +        unsigned         last_populated[POD_HISTORY_MAX]; /* gpfn of last guest page demand-populated */
>
> This is the gpfns of the last 128 order-9 superpages populated, right?

Ah, yes -- just order 9.

> Also, this line is >80 columns - I think I saw a few others in this series.

I'll go through and check, thanks.

>
>> unsigned long?
>>
>> Also, wouldn't it be better to allocate this table dynamically, at
>> once allowing its size to scale with the number of vCPU-s in the
>> guest?
>
> You could even make it a small per-vcpu array, assuming that the parallel
> scrubbing will be symmetric across vcpus.

I can't remember exactly what I found here (this was last summer I was
doing the tests); it may be that Windows creates a bunch of tasks which
may migrate to various cpus.  If that were the case, a global list would
be better than per-vcpu lists.

The problem with dynamically scaling the list is that I don't have a
heuristic to hand for how to scale it.

In both cases, it's not unlikely that making a change without testing
will significantly reduce the effectiveness of the patch.  Would you
rather hold off and wait until I can get a chance to run my benchmarks
again (which may miss the 4.2 cycle), or accept a tidied-up version of
this patch first, and hope to get a revised method (using dynamic
scaling or per-vcpu arrays) in before 4.2, but for sure by 4.3?

 -George
Tim Deegan
2012-Jun-14 15:36 UTC
Re: [PATCH 1 of 2 RFC] xen, pod: Zero-check recently populated pages (checklast)
At 15:24 +0100 on 14 Jun (1339687486), George Dunlap wrote:
> >> Also, wouldn't it be better to allocate this table dynamically, at
> >> once allowing its size to scale with the number of vCPU-s in the
> >> guest?
> > You could even make it a small per-vcpu array, assuming that the parallel
> > scrubbing will be symmetric across vcpus.
> I can't remember exactly what I found here (this was last summer I was
> doing the tests); it may be that Windows creates a bunch of tasks which
> may migrate to various cpus.  If that were the case, a global list would
> be better than per-vcpu lists.
>
> The problem with dynamically scaling the list is that I don't have a
> heuristic to hand for how to scale it.
>
> In both cases, it's not unlikely that making a change without testing
> will significantly reduce the effectiveness of the patch.  Would you
> rather hold off and wait until I can get a chance to run my benchmarks
> again (which may miss the 4.2 cycle), or accept a tidied-up version of
> this patch first, and hope to get a revised method (using dynamic
> scaling or per-vcpu arrays) in before 4.2, but for sure by 4.3?

I'd be happy with the fixed size for 4.2.

Tim.