Dan Magenheimer
2012-Nov-28 15:50 UTC
[PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
This is patch 1 of 2 of an eighth cut of the proposed
XENMEM_claim_pages hypercall/subop, taking into account review
feedback from Jan and Keir and IanC and Matthew Daley, plus some
fixes found via runtime debugging (using printk and privcmd only).

v7->v8:
- Replace missing tot_pages adjustment [JBeulich]
- Combine domain_inc/dec_tot_pages into one function [JBeulich]
- Added a few more clarifying comments [djm]

v6->v7:
- Need co-existence with low_mem_virq [JBeulich]

v5->v6:
- Fix post-inc problem [mattjd]

v4->v5:
- Split tools part into separate patch [JBeulich]
- Minor coding style fixups [JBeulich]
- Use rcu_lock_domain_by_id, not _target version [JBeulich]

v3->v4: (please ignore v3)
- Process error, sent stale patch, sorry [djm]

v2->v3:
- Add per-domain and global "get" for unclaimed info [JBeulich]
- New hypercall(s) should fail for unpriv callers [IanC]
- Non-zero extent_order disallowed [JBeulich]
- Remove bonehead ASSERTs [JBeulich]
- Need not hold heaplock for decrease case too [JBeulich]
- More descriptive failure return values [JBeulich]
- Don't allow a claim to exceed max_pages [IanC]
- Subops must be in correct ifdef block in memory.c [keir]

v1->v2:
- Add reset-to-zero page claim in domain_kill [JBeulich]
- Proper handling of struct passed to hypercall [JBeulich]
- Fix alloc_heap_pages when a domain has a claim [JBeulich]
- Need not hold heap_lock if !d->unclaimed_pages [keir]
- Fix missed tot_pages call in donate_page [djm]
- Remove domain_reset_unclaimed_pages; use set with zero [djm]
- Bugfixes found through testing in set_unclaimed [djm]
- More comments in code [djm]
- Code formatting fixes [djm]

==

Motivation:

The goal of this hypercall is to attempt to atomically and very
quickly determine if there are sufficient pages available in the
system and, if so, "set aside" that quantity of pages for future
allocations by that domain. Unlike an existing hypercall such as
increase_reservation or populate_physmap, specific physical
pageframes are not assigned to the domain because this cannot be done
sufficiently quickly (especially for very large allocations in an
arbitrarily fragmented system) and so the existing mechanisms result
in classic time-of-check-time-of-use (TOCTOU) races. One can think of
claiming as similar to a "lazy" allocation, but subsequent hypercalls
are required to do the actual physical pageframe allocation.

In order for a toolstack to "get" information about whether a domain
has a claim and, if so, how large, and also for the toolstack to
measure the total system-wide claim, a second subop has been added
and exposed through domctl and libxl (see PATCH 2/2).

Note that this patchset does not include a full toolstack
implementation, in part because Oracle uses xm/xend and a proprietary
toolstack. The following pseudo-code describes the proposed
claim-based use model, which could easily be implemented as an xl
option such as "xl create --claim".
Current:
- call populate_physmap repeatedly to achieve mem=N memory
- if any populate_physmap call fails, report -ENOMEM up the stack
- memory is held until domain dies or the toolstack decreases it

Proposed:
- call claim for mem=N amount of memory
- if claim succeeds:
    call populate_physmap repeatedly to achieve mem=N memory (failsafe)
  else
    report -ENOMEM up the stack
- claim is held until mem=N is achieved or the domain dies or
  the toolstack changes it to 0
- memory is held until domain dies or the toolstack decreases it

It has been noted that this claim mechanism solves the underlying
problem (slow failure of domain creation) for a large class of
domains but not all, specifically not handling (but also not making
the problem worse for) PV domains that specify the "superpages" flag,
and 32-bit PV domains on large RAM systems. These may be addressed at
a later time.

Code overview:

Though the hypercall simply does arithmetic within locks, some of the
semantics in the code may be a bit subtle, so let me explain a bit.

The key variables (d->unclaimed_pages and total_unclaimed_pages)
start at zero if no claim has yet been staked for any domain.
(Perhaps a better name is "claimed_but_not_yet_possessed" but that's
a bit unwieldy.) If no claim hypercalls are executed, there should be
no impact on existing usage.

When a claim is successfully staked by a domain, it is like a
watermark, but there is no record kept of the size of the claim.
Instead, d->unclaimed_pages is set to the difference between
d->tot_pages and the claim. When d->tot_pages increases or decreases,
d->unclaimed_pages atomically decreases or increases. Once
d->unclaimed_pages reaches zero, the claim is satisfied and
d->unclaimed_pages stays at zero -- unless a new claim is
subsequently staked.

The systemwide variable total_unclaimed_pages is always the sum of
d->unclaimed_pages across all domains. A non-domain-specific heap
allocation will fail if total_unclaimed_pages exceeds free (plus, on
tmem-enabled systems, freeable) pages.

Claim semantics may be modified by flags. The initial implementation
has one flag, which discerns whether the caller would like tmem
freeable pages to be considered in determining whether or not the
claim can be successfully staked. Future flags may, for example,
specify that the caller would like the claim to be successful only if
there are sufficient pages available on a single node (per Dario's
suggestion).

A claim can be cancelled by requesting a claim of zero pages.

A second subop returns the total outstanding claimed pages
systemwide.

Note: Save/restore/migrate may need to be modified, else it can be
documented that all claims are cancelled.
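To make the proposed flow concrete, here is a minimal toolstack-side
sketch. This is illustrative only: xc_domain_claim_pages() is the
libxc wrapper added by PATCH 2/2, and populate_domain() is a
hypothetical stand-in for the existing repeated populate_physmap
loop.

    #include <xenctrl.h>

    /* Sketch: stake the claim first, then do the slow allocation. */
    static int build_domain_memory(xc_interface *xch, uint32_t domid,
                                   unsigned long nr_pages)
    {
        int rc;

        /* Fast, arithmetic-only check-and-reserve; fails immediately,
         * with no TOCTOU race, if memory is insufficient. */
        rc = xc_domain_claim_pages(xch, domid, nr_pages);
        if ( rc )
            return rc;

        /* Slow path: allocate the actual pageframes. The claim shrinks
         * as tot_pages grows and silently expires once satisfied. */
        rc = populate_domain(xch, domid, nr_pages);

        /* On failure, a claim of zero pages cancels any residue. */
        if ( rc )
            xc_domain_claim_pages(xch, domid, 0);

        return rc;
    }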
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

 arch/x86/mm.c             |    2
 arch/x86/mm/mem_sharing.c |    4 -
 common/domain.c           |    1
 common/domctl.c           |    1
 common/grant_table.c      |    2
 common/memory.c           |   33 ++++++++++++-
 common/page_alloc.c       |  114 ++++++++++++++++++++++++++++++++++++++++++++--
 include/public/domctl.h   |    3 -
 include/public/memory.h   |   39 +++++++++++++++
 include/xen/mm.h          |    6 ++
 include/xen/sched.h       |    1
 11 files changed, 197 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fad3d33..480edf7 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3841,7 +3841,7 @@ int donate_page(
     {
         if ( d->tot_pages >= d->max_pages )
             goto fail;
-        d->tot_pages++;
+        domain_adjust_tot_pages(d, 1);
     }
 
     page->count_info = PGC_allocated | 1;
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index 5103285..e91aac5 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -639,7 +639,7 @@ static int page_make_sharable(struct domain *d,
     }
 
     page_set_owner(page, dom_cow);
-    d->tot_pages--;
+    domain_adjust_tot_pages(d, -1);
     drop_dom_ref = (d->tot_pages == 0);
     page_list_del(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
@@ -680,7 +680,7 @@ static int page_make_private(struct domain *d, struct page_info *page)
     ASSERT(page_get_owner(page) == dom_cow);
     page_set_owner(page, d);
 
-    if ( d->tot_pages++ == 0 )
+    if ( domain_adjust_tot_pages(d, 1) == 1 )
         get_domain(d);
     page_list_add_tail(page, &d->page_list);
     spin_unlock(&d->page_alloc_lock);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 12c8e24..f2cc4f5 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -492,6 +492,7 @@ int domain_kill(struct domain *d)
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
+        domain_set_unclaimed_pages(d, 0, 0);
         d->tmem = NULL;
         /* fallthrough */
     case DOMDYING_dying:
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index a7a6b9f..e47a991 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -154,6 +154,7 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
 
     info->tot_pages         = d->tot_pages;
     info->max_pages         = d->max_pages;
+    info->unclaimed_pages   = d->unclaimed_pages;
     info->shr_pages         = atomic_read(&d->shr_pages);
     info->paged_pages       = atomic_read(&d->paged_pages);
     info->shared_info_frame = mfn_to_gmfn(d, __pa(d->shared_info)>>PAGE_SHIFT);
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 7912769..ca8d861 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1656,7 +1656,7 @@ gnttab_transfer(
         }
 
         /* Okay, add the page to 'e'. */
-        if ( unlikely(e->tot_pages++ == 0) )
+        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
             get_knownalive_domain(e);
         page_list_add_tail(page, &e->page_list);
         page_set_owner(page, e);
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 83e2666..8efa6a3 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -454,7 +454,7 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
                                  (j * (1UL << exch.out.extent_order)));
 
                 spin_lock(&d->page_alloc_lock);
-                d->tot_pages -= dec_count;
+                domain_adjust_tot_pages(d, -dec_count);
                 drop_dom_ref = (dec_count && !d->tot_pages);
                 spin_unlock(&d->page_alloc_lock);
 
@@ -685,6 +685,37 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+    case XENMEM_claim_pages:
+        if ( !IS_PRIV(current->domain) )
+            return -EPERM;
+
+        if ( copy_from_guest(&reservation, arg, 1) )
+            return -EFAULT;
+
+        if ( !guest_handle_is_null(reservation.extent_start) )
+            return -EINVAL;
+
+        if ( reservation.extent_order != 0 )
+            return -EINVAL;
+
+        d = rcu_lock_domain_by_id(reservation.domid);
+        if ( d == NULL )
+            return -EINVAL;
+
+        rc = domain_set_unclaimed_pages(d, reservation.nr_extents,
+                                        reservation.mem_flags);
+
+        rcu_unlock_domain(d);
+
+        break;
+
+    case XENMEM_get_unclaimed_pages:
+        if ( !IS_PRIV(current->domain) )
+            return -EPERM;
+
+        rc = get_total_unclaimed_pages();
+        break;
+
     default:
         rc = arch_memory_op(op, arg);
         break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 15ebc66..bb3e346 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -238,6 +238,105 @@ static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
 
 static DEFINE_SPINLOCK(heap_lock);
+static long total_unclaimed_pages; /* total outstanding claims by all domains */
+
+unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
+{
+    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
+
+    ASSERT(spin_is_locked(&d->page_alloc_lock));
+    /*
+     * can test d->claimed_pages race-free because it can only change
+     * if d->page_alloc_lock and heap_lock are both held, see also
+     * domain_set_unclaimed_pages below
+     */
+    if ( !d->unclaimed_pages )
+        return d->tot_pages += pages;
+    spin_lock(&heap_lock);
+    d->tot_pages += pages;
+    /* adjust domain unclaimed_pages; may not go negative */
+    dom_before = d->unclaimed_pages;
+    dom_after = dom_before - pages;
+    if ( (dom_before > 0) && (dom_after < 0) )
+        dom_claimed = 0;
+    else
+        dom_claimed = dom_after;
+    d->unclaimed_pages = dom_claimed;
+    /* flag accounting bug if system unclaimed_pages would go negative */
+    sys_before = total_unclaimed_pages;
+    sys_after = sys_before - (dom_before - dom_claimed);
+    BUG_ON((sys_before > 0) && (sys_after < 0));
+    total_unclaimed_pages = sys_after;
+    spin_unlock(&heap_lock);
+    return d->tot_pages;
+}
+
+int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
+                               unsigned long flags)
+{
+    int ret = -ENOMEM;
+    unsigned long claim, avail_pages;
+
+    /*
+     * take the domain's page_alloc_lock, else all d->tot_page adjustments
+     * must always take the global heap_lock rather than only in the much
+     * rarer case that d->unclaimed_pages is non-zero
+     */
+    spin_lock(&d->page_alloc_lock);
+    spin_lock(&heap_lock);
+
+    /* pages==0 means "unset" the claim (and flags is ignored) */
+    if ( pages == 0 )
+    {
+        total_unclaimed_pages -= d->unclaimed_pages;
+        d->unclaimed_pages = 0;
+        ret = 0;
+        goto out;
+    }
+
+    /* only one active claim per domain please */
+    if ( d->unclaimed_pages )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* disallow a claim not exceeding current tot_pages or above max_pages */
+    if ( (pages <= d->tot_pages) || (pages > d->max_pages) )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    /* how much memory is available? */
+    avail_pages = total_avail_pages;
+    if ( !(flags & XENMEM_CLAIMF_free_only) )
+        avail_pages += tmem_freeable_pages();
+    avail_pages -= total_unclaimed_pages;
+
+    /*
+     * note, if domain has already allocated memory before making a claim
+     * then the claim must take tot_pages into account
+     */
+    claim = pages - d->tot_pages;
+    if ( claim > avail_pages )
+        goto out;
+
+    /* yay, claim fits in available memory, stake the claim, success! */
+    d->unclaimed_pages = claim;
+    total_unclaimed_pages += d->unclaimed_pages;
+    ret = 0;
+
+out:
+    spin_unlock(&heap_lock);
+    spin_unlock(&d->page_alloc_lock);
+    return ret;
+}
+
+long get_total_unclaimed_pages(void)
+{
+    return total_unclaimed_pages;
+}
 
 static unsigned long init_node_heap(int node, unsigned long mfn,
                                     unsigned long nr, bool_t *use_tail)
@@ -374,7 +473,7 @@ static void __init setup_low_mem_virq(void)
 static void check_low_mem_virq(void)
 {
     unsigned long avail_pages = total_avail_pages +
-        (opt_tmem ? tmem_freeable_pages() : 0);
+        (opt_tmem ? tmem_freeable_pages() : 0) - total_unclaimed_pages;
 
     if ( unlikely(avail_pages <= low_mem_virq_th) )
     {
@@ -443,6 +542,15 @@ static struct page_info *alloc_heap_pages(
     spin_lock(&heap_lock);
 
     /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (total_unclaimed_pages + request >
+           total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->unclaimed_pages < request) )
+        goto not_found;
+
+    /*
      * TMEM: When available memory is scarce due to tmem absorbing it, allow
      * only mid-size allocations to avoid worst of fragmentation issues.
      * Others try tmem pools then fail. This is a workaround until all
@@ -1291,7 +1399,7 @@ int assign_pages(
         if ( unlikely(d->tot_pages == 0) )
             get_knownalive_domain(d);
 
-        d->tot_pages += 1 << order;
+        domain_adjust_tot_pages(d, 1 << order);
     }
 
     for ( i = 0; i < (1 << order); i++ )
@@ -1375,7 +1483,7 @@ void free_domheap_pages(struct page_info *pg, unsigned int order)
             page_list_del2(&pg[i], &d->page_list, &d->arch.relmem_list);
         }
 
-        d->tot_pages -= 1 << order;
+        domain_adjust_tot_pages(d, -(1 << order));
         drop_dom_ref = (d->tot_pages == 0);
 
         spin_unlock_recursive(&d->page_alloc_lock);
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 7c0f23d..5735c28 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -36,7 +36,7 @@
 #include "grant_table.h"
 #include "hvm/save.h"
 
-#define XEN_DOMCTL_INTERFACE_VERSION 0x00000008
+#define XEN_DOMCTL_INTERFACE_VERSION 0x00000009
 
 /*
  * NB. xen_domctl.domain is an IN/OUT parameter for this operation.
@@ -95,6 +95,7 @@ struct xen_domctl_getdomaininfo {
     uint32_t flags;              /* XEN_DOMINF_* */
     uint64_aligned_t tot_pages;
     uint64_aligned_t max_pages;
+    uint64_aligned_t unclaimed_pages;
     uint64_aligned_t shr_pages;
     uint64_aligned_t paged_pages;
     uint64_aligned_t shared_info_frame; /* GMFN of shared_info struct */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f1ddbc0..15d6c72 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -68,6 +68,8 @@ struct xen_memory_reservation {
      *   IN:  GPFN bases of extents to populate with memory
      *   OUT: GMFN bases of extents that were allocated
      *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
      */
     XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
 
@@ -421,6 +423,43 @@ struct xen_mem_sharing_op {
 typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
 
+/*
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes. Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation. Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim. When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain. In this case, the claim is not incremental,
+ * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
+ * only 7 additional pages are claimed.
+ *
+ * Caller must be privileged or the hypercall fails.
+ */
+#define XENMEM_claim_pages 24
+/*
+ * XENMEM_claim_pages flags:
+ *  free_only: claim is successful only if sufficient free pages
+ *    are available. If not set and tmem is enabled, hypervisor
+ *    may also consider tmem "freeable" pages to satisfy the claim.
+ */
+#define _XENMEM_CLAIMF_free_only 0
+#define XENMEM_CLAIMF_free_only (1U<<_XENMEM_CLAIMF_free_only)
+/*
+ * Get the number of pages currently claimed (but not yet "possessed")
+ * across all domains. The caller must be privileged but otherwise
+ * the call never fails.
+ */
+#define XENMEM_get_unclaimed_pages 25
+
 #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 64a0cc1..86351d0 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -48,6 +48,12 @@ void free_xenheap_pages(void *v, unsigned int order);
 #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
 #define free_xenheap_page(v) (free_xenheap_pages(v,0))
 
+/* Claim handling */
+unsigned long domain_adjust_tot_pages(struct domain *d, long pages);
+int domain_set_unclaimed_pages(
+    struct domain *d, unsigned long pages, unsigned long flags);
+long get_total_unclaimed_pages(void);
+
 /* Domain suballocator. These functions are *not* interrupt-safe.*/
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 6c55039..480ef39 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -242,6 +242,7 @@ struct domain
     struct page_list_head page_list;  /* linked list */
     struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
     unsigned int     tot_pages;       /* number of pages currently possesed */
+    unsigned int     unclaimed_pages; /* pages claimed but not possessed */
     unsigned int     max_pages;       /* maximum value for tot_pages */
     atomic_t         shr_pages;       /* number of shared pages */
     atomic_t         paged_pages;     /* number of paged-out pages */
Jan Beulich
2012-Nov-28 16:19 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
>>> On 28.11.12 at 16:50, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3841,7 +3841,7 @@ int donate_page(
>      {
>          if ( d->tot_pages >= d->max_pages )
>              goto fail;
> -        d->tot_pages++;
> +        domain_adjust_tot_pages(d, 1);

I would generally expect there to always be paired increments
and decrements, yet I fail to spot this one's counterpart. There
is one a few lines down in steal_page().

> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -1656,7 +1656,7 @@ gnttab_transfer(
>          }
>
>          /* Okay, add the page to 'e'. */
> -        if ( unlikely(e->tot_pages++ == 0) )
> +        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )

This one's counterpart ought to also be the one in steal_page(),
but did you really check? As that isn't the first one you missed,
what did you do to find them all (apparently you didn't simply
grep for "tot_pages")?

> +unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> +{
> +    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> +
> +    ASSERT(spin_is_locked(&d->page_alloc_lock));
> +    /*
> +     * can test d->claimed_pages race-free because it can only change
> +     * if d->page_alloc_lock and heap_lock are both held, see also
> +     * domain_set_unclaimed_pages below
> +     */
> +    if ( !d->unclaimed_pages )
> +        return d->tot_pages += pages;
> +    spin_lock(&heap_lock);
> +    d->tot_pages += pages;

Why don't you do this first thing, right after the comment above?
Would remove redundancy. Perhaps then even invert the if()
condition and have just a single return point.

Plus, if you went the route Keir suggested and split out the
conversion to domain_adjust_tot_pages() without anything
else, that would be what could probably go in
uncontroversially, making the remaining amount of changes
quite a bit smaller (and touching a more limited set of files).

Jan

> +    /* adjust domain unclaimed_pages; may not go negative */
> +    dom_before = d->unclaimed_pages;
> +    dom_after = dom_before - pages;
> +    if ( (dom_before > 0) && (dom_after < 0) )
> +        dom_claimed = 0;
> +    else
> +        dom_claimed = dom_after;
> +    d->unclaimed_pages = dom_claimed;
> +    /* flag accounting bug if system unclaimed_pages would go negative */
> +    sys_before = total_unclaimed_pages;
> +    sys_after = sys_before - (dom_before - dom_claimed);
> +    BUG_ON((sys_before > 0) && (sys_after < 0));
> +    total_unclaimed_pages = sys_after;
> +    spin_unlock(&heap_lock);
> +    return d->tot_pages;
> +}
Dan Magenheimer
2012-Nov-28 18:05 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
>
> >>> On 28.11.12 at 16:50, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -3841,7 +3841,7 @@ int donate_page(
> >      {
> >          if ( d->tot_pages >= d->max_pages )
> >              goto fail;
> > -        d->tot_pages++;
> > +        domain_adjust_tot_pages(d, 1);
>
> I would generally expect there to always be paired increments
> and decrements, yet I fail to spot this one's counterpart. There
> is one a few lines down in steal_page().
>
> > --- a/xen/common/grant_table.c
> > +++ b/xen/common/grant_table.c
> > @@ -1656,7 +1656,7 @@ gnttab_transfer(
> >          }
> >
> >          /* Okay, add the page to 'e'. */
> > -        if ( unlikely(e->tot_pages++ == 0) )
> > +        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
>
> This one's counterpart ought to also be the one in steal_page(),
> but did you really check? As that isn't the first one you missed,
> what did you do to find them all (apparently you didn't simply
> grep for "tot_pages")?

Urk. I did grep (-r) many times. I'm embarrassed that this one
slipped past me among the dozens of unchanged read-not-modify uses
of d->tot_pages. I'm glad for your additional set of eyes. I have
now carefully audited to ensure the steal_page one is the last one.
Sorry!

> > +unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> > +{
> > +    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> > +
> > +    ASSERT(spin_is_locked(&d->page_alloc_lock));
> > +    /*
> > +     * can test d->claimed_pages race-free because it can only change
> > +     * if d->page_alloc_lock and heap_lock are both held, see also
> > +     * domain_set_unclaimed_pages below
> > +     */
> > +    if ( !d->unclaimed_pages )
> > +        return d->tot_pages += pages;
> > +    spin_lock(&heap_lock);
> > +    d->tot_pages += pages;
>
> Why don't you do this first thing, right after the comment above?
> Would remove redundancy.

Hmmm... doesn't that create a window where d->tot_pages and
d->unclaimed_pages are inconsistent?

Hmmm... yes, but this matters only if code where their consistency
is required is not also protected by both locks. AFAICT such code
doesn't exist, so... will do.

> Perhaps then even invert the if()
> condition and have just a single return point.

Rather than invert the if(), unless you prefer the extra level of
indentation, I'd prefer:

+    d->tot_pages += pages;
+    if ( !d->unclaimed_pages )
+        goto out;
<snip>
+out:
+    return d->tot_pages;

Would that be acceptable?

> Plus, if you went the route Keir suggested and split out the
> conversion to domain_adjust_tot_pages() without anything
> else, that would be what could probably go in
> uncontroversially, making the remaining amount of changes
> quite a bit smaller (and touching a more limited set of files).

OK, will do.

Thanks again!
Dan
Jan Beulich
2012-Nov-29 08:07 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
>>> On 28.11.12 at 19:05, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Perhaps then even invert the if()
>> condition and have just a single return point.
>
> Rather than invert the if(), unless you prefer the extra level of
> indentation, I'd prefer:
>
> +    d->tot_pages += pages;
> +    if ( !d->unclaimed_pages )
> +        goto out;
> <snip>
> +out:
> +    return d->tot_pages;
>
> Would that be acceptable?

Acceptable as per the coding guidelines - yes. But I personally
dislike goto-s (and would hence prefer the extra level of
indentation, considering that this is a straight code block with no
deep nesting of inner scopes).

Jan
Keir Fraser
2012-Nov-29 09:29 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
On 29/11/2012 08:07, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 28.11.12 at 19:05, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Perhaps then even invert the if()
>>> condition and have just a single return point.
>>
>> Rather than invert the if(), unless you prefer the extra level of
>> indentation, I'd prefer:
>>
>> +    d->tot_pages += pages;
>> +    if ( !d->unclaimed_pages )
>> +        goto out;
>> <snip>
>> +out:
>> +    return d->tot_pages;
>>
>> Would that be acceptable?
>
> Acceptable as per the coding guidelines - yes. But I personally
> dislike goto-s (and would hence prefer the extra level of
> indentation, considering that this is a straight code block with no
> deep nesting of inner scopes).

Personally I think I prefer the bailing goto, as I like straightline
code with no indentation where possible. But it is personal
preference for a function this short. Either is perfectly acceptable.

 -- Keir

> Jan
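For reference, domain_adjust_tot_pages() reshaped along the lines
agreed above (tot_pages adjusted first, with an early bail when no
claim is outstanding) might look like this; a sketch only, not
necessarily what will be committed:

    unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
    {
        long dom_before, dom_after, dom_claimed, sys_before, sys_after;

        ASSERT(spin_is_locked(&d->page_alloc_lock));
        d->tot_pages += pages;

        /*
         * can test d->unclaimed_pages race-free because it can only
         * change if d->page_alloc_lock and heap_lock are both held
         */
        if ( !d->unclaimed_pages )
            goto out;

        spin_lock(&heap_lock);
        /* adjust domain unclaimed_pages; may not go negative */
        dom_before = d->unclaimed_pages;
        dom_after = dom_before - pages;
        dom_claimed = ((dom_before > 0) && (dom_after < 0)) ? 0 : dom_after;
        d->unclaimed_pages = dom_claimed;
        /* flag accounting bug if system total would go negative */
        sys_before = total_unclaimed_pages;
        sys_after = sys_before - (dom_before - dom_claimed);
        BUG_ON((sys_before > 0) && (sys_after < 0));
        total_unclaimed_pages = sys_after;
        spin_unlock(&heap_lock);

     out:
        return d->tot_pages;
    }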
David Vrabel
2012-Nov-29 11:25 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
On 28/11/12 15:50, Dan Magenheimer wrote:
> This is patch 1 of 2 of an eighth cut of the proposed
> XENMEM_claim_pages hypercall/subop, taking into account review
> feedback from Jan and Keir and IanC and Matthew Daley, plus some
> fixes found via runtime debugging (using printk and privcmd only).
>
[...]
>
> Proposed:
> - call claim for mem=N amount of memory
> - if claim succeeds:
>     call populate_physmap repeatedly to achieve mem=N memory (failsafe)
>   else
>     report -ENOMEM up the stack
> - claim is held until mem=N is achieved or the domain dies or
>   the toolstack changes it to 0
> - memory is held until domain dies or the toolstack decreases it

There is no mechanism for a per-NUMA-node claim. Isn't this needed?

More fundamentally, doesn't this approach result in a worse user
experience? It's guaranteeing that a new VM can be started, but at
the expense of existing VMs on that node.

When making a VM placement decision, the toolstack needs to consider
the future memory requirements of the new and existing VMs on the
host and not just the current (or more correctly, the recent) memory
usage.

It seems more useful to me to have the toolstack (for example) track
historical memory usage of a VM to allow it to make better
predictions about memory usage. With a better prediction, the number
of failed VM creates due to memory shortage will be minimized. Then,
combined with reducing the cost of a VM create by optimizing the
allocator, the cost of occasionally failing a create will be minimal.

For example, Sally starts her CAD application at 9am, tripling her
desktop VM instance's memory usage. If at 0858 the toolstack claimed
most of the remaining memory for a new VM, then Sally's VM is going
to grind to a halt as it swaps to death.

If the toolstack could predict that the desktop instance's memory
usage was about to spike (because it had historical data showing
this), it could have selected a different host and Sally's VM would
perform as expected.

David
Dan Magenheimer
2012-Dec-03 21:28 UTC
Re: [PATCH v8 1/2] hypervisor: XENMEM_claim_pages (subop of existing) hypercall
> From: David Vrabel [mailto:david.vrabel@citrix.com]

Hi David --

Thanks for your reply!

> On 28/11/12 15:50, Dan Magenheimer wrote:
> > This is patch 1 of 2 of an eighth cut of the proposed
> > XENMEM_claim_pages hypercall/subop, taking into account review
> > feedback from Jan and Keir and IanC and Matthew Daley, plus some
> > fixes found via runtime debugging (using printk and privcmd only).
> >
> [...]
> >
> > Proposed:
> > - call claim for mem=N amount of memory
> > - if claim succeeds:
> >     call populate_physmap repeatedly to achieve mem=N memory (failsafe)
> >   else
> >     report -ENOMEM up the stack
> > - claim is held until mem=N is achieved or the domain dies or
> >   the toolstack changes it to 0
> > - memory is held until domain dies or the toolstack decreases it
>
> There is no mechanism for a per-NUMA-node claim. Isn't this needed?

It would be a useful extension but is not a necessary one; IIUC a
domain creation succeeds even if optimal NUMA positioning is not
available. As is, the proposed XENMEM_claim_pages patch does exactly
the same. If there is a domain creation option that forces domain
creation to fail if there is _not_ an optimal NUMA positioning
available, XENMEM_claim_pages has flag fields to pass the same
requirement, so an extension should be easy to add.

> More fundamentally, doesn't this approach result in a worse user
> experience? It's guaranteeing that a new VM can be started, but at
> the expense of existing VMs on that node.

Well, we are talking about a race. Somebody has to win. Traditionally,
software races are decided first-come-first-served, and that's exactly
how the proposed XENMEM_claim_pages works. If you have a chance,
please read the document I just posted ("Proposed XENMEM_claim_pages
hypercall: Analysis of problems and alternate solutions").

> When making a VM placement decision, the toolstack needs to consider
> the future memory requirements of the new and existing VMs on the
> host and not just the current (or more correctly, the recent) memory
> usage.
>
> It seems more useful to me to have the toolstack (for example) track
> historical memory usage of a VM to allow it to make better
> predictions about memory usage. With a better prediction, the number
> of failed VM creates due to memory shortage will be minimized. Then,
> combined with reducing the cost of a VM create by optimizing the
> allocator, the cost of occasionally failing a create will be minimal.
>
> For example, Sally starts her CAD application at 9am, tripling her
> desktop VM instance's memory usage. If at 0858 the toolstack claimed
> most of the remaining memory for a new VM, then Sally's VM is going
> to grind to a halt as it swaps to death.
>
> If the toolstack could predict that the desktop instance's memory
> usage was about to spike (because it had historical data showing
> this), it could have selected a different host and Sally's VM would
> perform as expected.

You are drifting the thread a bit here, but...

The last 4+ years of my life have been built on the fundamental
assumption that nobody, not even the guest kernel itself, can
adequately predict when memory usage is going to spike. Accurate
inference by an external entity across potentially dozens of VMs is
IMHO... well... um... unlikely. I could be wrong, but I believe that,
even in academia, there is no realistic research solution proposed
for this. (If I'm wrong, please send a pointer.)

If one accepts this assumption as true, one must instead plan to be
able to adapt very dynamically when spikes occur.
That's what tmem does to solve Sally's problem, though admittedly
tmem doesn't work for proprietary guest kernels. +1 for open source.
;-)

Thanks,
Dan

P.S. If you'd like to learn more about tmem, please let me know, as
it is now available in Fedora and Ubuntu guests as well as Oracle
Linux (and, of course, Xen itself).