Dan Magenheimer
2010-Feb-12 17:24 UTC
[Xen-devel] Tmem vs order>0 allocation, workaround RFC
I just had an idea for a workaround that might be low enough impact to get in for 4.0 and allow tmem to be enabled by default. I think it will not eliminate the fragmentation problem entirely, but would greatly reduce the probability of it causing problems for domain creation/migration when tmem is enabled, and possibly for the other memory utilization features as well.

Simply, avail_heap_pages would fail if total_avail_pages is less than 1%(?) of the total memory on the system AND the request is order==0. Essentially, this is reserving a "zone" for order>0 allocations.

It could be tied to tmem_enabled but, as previously discussed, even frequent ballooning can fragment memory and cause problems for domain creation/migration... and since, without memory utilization features, it is highly unlikely that a system will "accidentally" pack in enough domains to use between 99% and 100% of physical memory anyway, always enabling this restriction would affect very few systems.

Comments? I'm not sure I've thought this all the way through and certainly haven't tested it yet, but it seems like it should be easy to implement in a low-impact patch.

Thanks,
Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Dan Magenheimer
2010-Feb-12 18:07 UTC
RE: [Xen-devel] Tmem vs order>0 allocation, workaround RFC
> Simply, avail_heap_pages would fail if total_avail_pages
> is less than 1%(?) of the total memory on the system AND
> the request is order==0. Essentially, this is reserving
> a "zone" for order>0 allocations.

Avoid worst fragmentation issues by reserving a "zone" of physical
memory only for order>0 allocations.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

--- a/xen/common/page_alloc.c  Fri Feb 12 09:24:18 2010 +0000
+++ b/xen/common/page_alloc.c  Fri Feb 12 11:05:19 2010 -0700
@@ -223,6 +223,10 @@ static heap_by_zone_and_order_t *_heap[M
 static unsigned long *avail[MAX_NUMNODES];

 static long total_avail_pages;
+static long max_total_avail_pages; /* highwater mark */
+#define ORDER_NONZERO_FRAC 128
+static long order_nonzero_zonesize; /* reserved for order>0 allocations */
+
 static DEFINE_SPINLOCK(heap_lock);
@@ -304,6 +308,13 @@ static struct page_info *alloc_heap_page
     spin_lock(&heap_lock);

     /*
+     When available memory is scarce, allow only larger allocations
+     to avoid worst of fragmentation issues
+     */
+    if ( !order && (total_avail_pages <= order_nonzero_zonesize) )
+        goto fail;
+
+    /*
      * Start with requested node, but exhaust all node memory in requested
      * zone before failing, only calc new node value if we fail to find memory
      * in target node, this avoids needless computation on fast-path.
@@ -337,6 +348,7 @@ static struct page_info *alloc_heap_page
     }

     /* No suitable memory blocks. Fail the request. */
+fail:
     spin_unlock(&heap_lock);
     return NULL;
@@ -503,6 +515,11 @@ static void free_heap_pages(
     avail[node][zone] += 1 << order;
     total_avail_pages += 1 << order;
+    if ( total_avail_pages > max_total_avail_pages )
+    {
+        max_total_avail_pages = total_avail_pages;
+        order_nonzero_zonesize = max_total_avail_pages / ORDER_NONZERO_FRAC;
+    }

     /* Merge chunks as far as possible. */
     while ( order < MAX_ORDER )
Keir Fraser
2010-Feb-15 08:21 UTC
[Xen-devel] Re: Tmem vs order>0 allocation, workaround RFC
On 12/02/2010 17:24, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I just had an idea for a workaround that might be low enough
> impact to get in for 4.0 and allow tmem to be enabled by
> default. I think it will not eliminate the fragmentation
> problem entirely, but would greatly reduce the probability
> of it causing problems for domain creation/migration when tmem
> is enabled, and possibly for the other memory utilization
> features as well.
>
> Simply, avail_heap_pages would fail if total_avail_pages
> is less than 1%(?) of the total memory on the system AND
> the request is order==0. Essentially, this is reserving
> a "zone" for order>0 allocations.

I don't see how that necessarily works. Pages can be allocated in order>0 chunks and freed order==0, so even that last 1% can get fragmented. For example, guests get their memory allocated in 2MB chunks where possible; but their balloon drivers may then free arbitrary 4kB pages within those chunks.

 -- Keir
Dan Magenheimer
2010-Feb-15 14:31 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>
> > I just had an idea for a workaround that might be low enough
> > impact to get in for 4.0 and allow tmem to be enabled by
> > default. I think it will not eliminate the fragmentation
> > problem entirely, but would greatly reduce the probability
> > of it causing problems for domain creation/migration when tmem
> > is enabled, and possibly for the other memory utilization
> > features as well.
> >
> > Simply, avail_heap_pages would fail if total_avail_pages
> > is less than 1%(?) of the total memory on the system AND
> > the request is order==0. Essentially, this is reserving
> > a "zone" for order>0 allocations.
>
> I don't see how that necessarily works. Pages can be allocated in
> order>0 chunks and freed order==0, so even that last 1% can get
> fragmented. For example, guests get their memory allocated in 2MB
> chunks where possible; but their balloon drivers may then free
> arbitrary 4kB pages within those chunks.

Good point. BUT... do you know of any other asymmetric allocs/frees? Since the 2MB allocation does fall back if it fails (to 4K I think?), if the patch is modified to restrict the "zone" to order>0 && order<9, will that be sufficient?

I know this is quite a hack... I don't like it much either. But I expect the process of restructuring all data structures to limit them to order==0 to take a long time with an even longer bug tail (AND be a whack-a-mole game in the future unless we disallow order>0 entirely). In that light (and with the low impact of this workaround), this hack may be just fine for a while.
Keir Fraser
2010-Feb-15 15:40 UTC
[Xen-devel] Re: Tmem vs order>0 allocation, workaround RFC
On 15/02/2010 14:31, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Good point. BUT... do you know of any other asymmetric
> allocs/frees? Since the 2MB allocation does fall back
> if it fails (to 4K I think?), if the patch is modified
> to restrict the "zone" to order>0 && order<9 will that
> be sufficient?

Even though that one can fall back, the point is that even one asymmetric alloc/free (and that is by far going to be the most common one) can hoover up the 1% 'pool' and fragment it, so that allocations that cannot fall back can no longer use the pool.

> I know this is quite a hack... I don't like it much
> either. But I expect the process of restructuring all
> data structures to limit them to order==0 to take a long
> time with an even longer bug tail (AND be a whack-a-mole
> game in the future unless we disallow order>0 entirely).
> In that light (and with the low impact of this workaround),
> this hack may be just fine for a while.

Well, I think it's not only not very nice but also dubious whether it will work in practice very well.

 -- Keir
Dan Magenheimer
2010-Feb-15 15:55 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
> Even though that one can fall back, the point is that even one
> asymmetric alloc/free (and that is by far going to be the most
> common one) can hoover up the 1% 'pool' and fragment it, so that
> allocations that cannot fall back can no longer use the pool.

Understood. If we eliminate this case, can you think of any others that are asymmetric, except possibly very uncommon ones?

> > I know this is quite a hack... I don't like it much
> > either. But I expect the process of restructuring all
> > data structures to limit them to order==0 to take a long
> > time with an even longer bug tail (AND be a whack-a-mole
> > game in the future unless we disallow order>0 entirely).
> > In that light (and with the low impact of this workaround),
> > this hack may be just fine for a while.
>
> Well, I think it's not only not very nice but also dubious whether
> it will work in practice very well.

Other than the above, can you (or Jan? or others?) think of other cases where it won't work in practice? If not, it's at least worth a try to see if Jan's test cases continue to see a problem.
Dan Magenheimer
2010-Feb-15 16:36 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
This version should have zero impact if tmem is not enabled.

======

When tmem is enabled, reserve a fraction of memory for allocations of 0<order<9 to avoid fragmentation issues.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

diff -r 3bb163b74673 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c  Fri Feb 12 09:24:18 2010 +0000
+++ b/xen/common/page_alloc.c  Mon Feb 15 09:28:01 2010 -0700
@@ -223,6 +223,12 @@ static heap_by_zone_and_order_t *_heap[M
 static unsigned long *avail[MAX_NUMNODES];

 static long total_avail_pages;
+static long max_total_avail_pages; /* highwater mark */
+
+/* reserved for midsize (0<order<9) allocations, tmem only for now */
+static long midsize_alloc_zone_pages;
+#define MIDSIZE_ALLOC_FRAC 128
+
 static DEFINE_SPINLOCK(heap_lock);
@@ -304,6 +310,15 @@ static struct page_info *alloc_heap_page
     spin_lock(&heap_lock);

     /*
+     When available memory is scarce, allow only mid-size allocations
+     to avoid worst of fragmentation issues. For now, only special-case
+     this when transcendent memory is enabled
+     */
+    if ( opt_tmem && ((order == 0) || (order >= 9)) &&
+         (total_avail_pages <= midsize_alloc_zone_pages) )
+        goto fail;
+
+    /*
      * Start with requested node, but exhaust all node memory in requested
      * zone before failing, only calc new node value if we fail to find memory
      * in target node, this avoids needless computation on fast-path.
@@ -337,6 +352,7 @@ static struct page_info *alloc_heap_page
     }

     /* No suitable memory blocks. Fail the request. */
+fail:
     spin_unlock(&heap_lock);
     return NULL;
@@ -503,6 +519,11 @@ static void free_heap_pages(
     avail[node][zone] += 1 << order;
     total_avail_pages += 1 << order;
+    if ( total_avail_pages > max_total_avail_pages )
+    {
+        max_total_avail_pages = total_avail_pages;
+        midsize_alloc_zone_pages = max_total_avail_pages / MIDSIZE_ALLOC_FRAC;
+    }

     /* Merge chunks as far as possible. */
     while ( order < MAX_ORDER )
@@ -842,6 +863,8 @@ static unsigned long avail_heap_pages(
 unsigned long total_free_pages(void)
 {
+    if ( opt_tmem )
+        return total_avail_pages - midsize_alloc_zone_pages;
     return total_avail_pages;
 }
Keir Fraser
2010-Feb-15 16:37 UTC
[Xen-devel] Re: Tmem vs order>0 allocation, workaround RFC
On 15/02/2010 15:55, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Other than the above, can you (or Jan? or others?) think of
> other cases where it won't work in practice? If not, it's
> at least worth a try to see if Jan's test cases continue
> to see a problem.

I think that's the only obvious one.

 -- Keir
Jan Beulich
2010-Feb-16 08:20 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
Besides generally not liking hackery like this (but we all seem to agree on that part), and besides having an unexplained feeling that there may be other bad effects from this, I also think that on large systems this may not work well: when you have 1Tb, you'd reserve 8G, making Dom0 single-page-below-4G allocations impossible (unless dom0_mem= was used), if I read the logic correctly.

Jan

>>> Dan Magenheimer <dan.magenheimer@oracle.com> 15.02.10 17:36 >>>
This version should have zero impact if tmem is not enabled. [...]
Dan Magenheimer
2010-Feb-16 15:05 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
Hi Jan --

Thanks for thinking about this.

> may not work well: When you have 1Tb, you'd reserve 8G, making Dom0
> single-page-below-4G-allocations impossible (unless dom0_mem= was
> used) if I read the logic correctly.

Good point. But tmem doesn't work very well at all if dom0_mem isn't set, as dom0 is hogging all the spare memory in the system, so only fallow memory reclaimed from selfballooning domains can be used by tmem.

Under what circumstances does dom0 require single-page-below-4G allocations? Is it only for bounce buffers for PCI passthrough of old devices with 32-bit addressing limitations? Or am I missing a much more common case? (I think it's important to enumerate and understand -- and document -- all special needs of memory pages, as Xen has been fairly careless/lucky with fragmentation so far, but with all the memory optimization technologies in 4.0, we need to root out all the cases.)

Thanks,
Dan
Jan Beulich
2010-Feb-16 15:15 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:05 >>>
> Under what circumstances does dom0 require single-page-below-4G
> allocations? Is it only for bounce buffers for PCI passthrough
> of old devices with 32-bit addressing limitations? Or am I
> missing a much more common case?

Not just for pass-through; all devices only supporting 32-bit addressing would have such requirements, and particularly common ones are display adapters which have DRM/AGP drivers loaded for them.

Jan
Dan Magenheimer
2010-Feb-16 15:31 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Subject: RE: Tmem vs order>0 allocation, workaround RFC
>
> > Under what circumstances does dom0 require single-page-below-4G
> > allocations? Is it only for bounce buffers for PCI passthrough
> > of old devices with 32-bit addressing limitations? Or am I
> > missing a much more common case?
>
> Not just for pass-through; all devices only supporting 32-bit
> addressing would have such requirements, and particularly common
> ones are display adapters which have DRM/AGP drivers loaded for
> them.

Right, but those are statically allocated when dom0 is launched, not dynamically allocated later after tmem (or other memory allocation technologies) begin working, right? Whereas pass-through devices would need below-4G pages later?

(And 32-bit devices in a 1TB machine seems a bit of a stretch, but I suppose it is good to enumerate all the cases.)
Jan Beulich
2010-Feb-16 15:45 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 16.02.10 16:31 >>>
> Right, but those are statically allocated when dom0 is
> launched, not dynamically allocated later after tmem
> (or other memory allocation technologies) begin working,
> right? Whereas pass-through devices would need below-4G
> pages later?

No, consistent/coherent allocations can happen at run time. Typically the largest share of the allocations would happen when the respective driver loads, but especially for the DRM/AGP case I think allocations get triggered by user mode (X initializing a display), which may happen at any time.

> (And 32-bit devices in a 1TB machine seems a bit of a
> stretch, but I suppose it is good to enumerate all the
> cases.)

Yes, but the 1Tb was just taken as an extreme example. Issues may arise earlier. And the display adapter part would likely remain valid even there - just see the use of vmalloc_32() in drivers/gpu/drm/drm_scatter.c for an example.

Jan
Dan Magenheimer
2010-Feb-16 16:44 UTC
[Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
Hi Jan --

Fair enough. You've convinced me that I shouldn't push for tmem to be turned back on by default for the official Xen 4.0 release. But the patch as just checked in by Keir limits allocations only if tmem is enabled, so I will just document that tmem may cause problems if 32-bit-limited devices are in the system. (I'd expect that to be rare in the cloud environment where tmem would be most used.)

I do think it's unfortunate (turning off tmem by default) as I suspect that "thar be (more) dragons" in Xen, when trying to do any kind of memory utilization optimization, that will come back and bite us. Tmem is just the first to aggressively pursue this, and disabling it only delays the inevitable. For example, I'll bet improvements to NUMA support will have many similar problems.

Anyway, thanks as usual for thinking deeply through the issue and for trying out tmem... any new technology is going to have some growing pains.

Thanks again,
Dan

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@novell.com]
> Sent: Tuesday, February 16, 2010 8:46 AM
> To: Dan Magenheimer
> Cc: Grzegorz Milos; Patrick Colp; Andrew Peace; George Dunlap; Ian
> Pratt; Keir Fraser; Tim Deegan; xen-devel@lists.xensource.com; Kurt
> Hackel
> Subject: RE: Tmem vs order>0 allocation, workaround RFC
> [...]
Konrad Rzeszutek Wilk
2010-Feb-16 18:20 UTC
Re: [Xen-devel] RE: Tmem vs order>0 allocation, workaround RFC
On Tue, Feb 16, 2010 at 07:05:48AM -0800, Dan Magenheimer wrote:
> Hi Jan --
>
> Thanks for thinking about this.
>
> > may not work well: When you have 1Tb, you'd reserve 8G, making Dom0
> > single-page-below-4G-allocations impossible (unless dom0_mem= was
> > used) if I read the logic correctly.
>
> Good point. But tmem doesn't work very well at all if dom0_mem
> isn't set as dom0 is hogging all the spare memory in the system
> so only fallow memory reclaimed from selfballooning domains
> can be used by tmem.
>
> Under what circumstances does dom0 require single-page-below-4G
> allocations? Is it only for bounce buffers for PCI passthrough
> of old devices with 32-bit addressing limitations? Or am I
> missing a much more common case? (I think it's important to

The software IO TLB is initialized unconditionally if no IOMMUs are found. This is a 64MB + 32kB chunk of memory that is exchanged with Xen to make sure it is under the 32-bit mark.

> enumerate and understand -- and document -- all special needs
> of memory pages as Xen has been fairly careless/lucky with
> fragmentation so far, but with all the memory optimization
> technologies in 4.0, we need to root out all the cases.)
>
> Thanks,
> Dan