Andres Lagar-Cavilla
2012-Feb-16 03:57 UTC
[PATCH 0 of 4] Handling of (some) low memory conditions
After some experiments with the sharing code under low memory conditions, we
post the following series:

- Fix a bug in the sharing unshare path when we run out of memory.
- Sort out the situations in which we can go to sleep on a wait queue if
  unshare fails (and, by extension, the semantics of unshare error handling).
- Prevent a certain Ocaml-based toolstack from crashing domains doing sharing
  or paging.
- Add a VIRQ that the hypervisor can emit when reaching a low memory
  threshold.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scannell <adin@scannell.ca>

 xen/common/domctl.c               |   8 +++++-
 xen/arch/x86/mm/mem_event.c       |   5 ++-
 xen/include/asm-x86/mem_event.h   |  30 ++++++++++++++++++---
 xen/arch/x86/hvm/hvm.c            |  20 +++++++++++++-
 xen/arch/x86/mm.c                 |   8 +++--
 xen/arch/x86/mm/mem_sharing.c     |  52 ++++++++++++++++++++++++---------------
 xen/arch/x86/mm/p2m.c             |  18 ++++++++++++-
 xen/common/grant_table.c          |  11 ++++---
 xen/common/memory.c               |   1 +
 xen/include/asm-x86/mem_sharing.h |  15 +++++++++++
 xen/common/page_alloc.c           |  10 +++++++
 xen/include/public/xen.h          |   1 +
 12 files changed, 140 insertions(+), 39 deletions(-)
Andres Lagar-Cavilla
2012-Feb-16 03:57 UTC
[PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
 xen/common/domctl.c |  8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

Apparently, setting d->max_pages to something lower than d->tot_pages is
used as a mechanism for controlling a domain's footprint. It will result
in all new page allocations failing.

This is a really bad idea with paging or sharing, as regular guest memory
accesses may need to be satisfied by allocating new memory (either to
page in or to unshare).

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 62b1fe67b8d1 -r 11fd4e0a1e1a xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -813,8 +813,14 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
          * NB. We removed a check that new_max >= current tot_pages; this means
          * that the domain will now be allowed to "ratchet" down to new_max. In
          * the meantime, while tot > max, all new allocations are disallowed.
+         *
+         * Except that this is not a great idea for domains doing sharing or
+         * paging, as they need to perform new allocations on the fly.
          */
-        d->max_pages = new_max;
+        if ( (new_max > d->max_pages) ||
+             !((d->mem_event->paging.ring_page != NULL) ||
+               d->arch.hvm_domain.mem_sharing_enabled) )
+            d->max_pages = new_max;
         ret = 0;
         spin_unlock(&d->page_alloc_lock);
Andres Lagar-Cavilla
2012-Feb-16 03:57 UTC
[PATCH 2 of 4] x86/mm: Allow to not sleep on mem event ring
 xen/arch/x86/mm/mem_event.c     |  5 +++--
 xen/include/asm-x86/mem_event.h | 30 +++++++++++++++++++++++++-----
 2 files changed, 28 insertions(+), 7 deletions(-)

Under extreme congestion conditions, generating a mem event may put the vcpu
to sleep on a wait queue if the ring is full. This is generally desirable,
although fairly convoluted to work with, since sleeping on a wait queue
requires a non-atomic context (i.e. no locks held).

Introduce an allow_sleep flag to make this optional. The API remains such that
all current callers set allow_sleep to true and thus will sleep if necessary.

The end-use is for cases in which loss of guest mem events is tolerable. One
such consumer to be added later is the unsharing code under ENOMEM conditions.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scannell <adin@scannell.ca>

diff -r 11fd4e0a1e1a -r 09746decbd28 xen/arch/x86/mm/mem_event.c
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -441,9 +441,10 @@ bool_t mem_event_check_ring(struct mem_e
  *  0: a spot has been reserved
  *
  */
-int mem_event_claim_slot(struct domain *d, struct mem_event_domain *med)
+int __mem_event_claim_slot(struct domain *d, struct mem_event_domain *med,
+                           bool_t allow_sleep)
 {
-    if ( current->domain == d )
+    if ( (current->domain == d) && allow_sleep )
         return mem_event_wait_slot(med);
     else
         return mem_event_grab_slot(med, 1);

diff -r 11fd4e0a1e1a -r 09746decbd28 xen/include/asm-x86/mem_event.h
--- a/xen/include/asm-x86/mem_event.h
+++ b/xen/include/asm-x86/mem_event.h
@@ -28,11 +28,31 @@
 bool_t mem_event_check_ring(struct mem_event_domain *med);

 /* Returns 0 on success, -ENOSYS if there is no ring, -EBUSY if there is no
- * available space. For success or -EBUSY, the vCPU may be left blocked
- * temporarily to ensure that the ring does not lose future events. In
- * general, you must follow a claim_slot() call with either put_request() or
- * cancel_slot(), both of which are guaranteed to succeed. */
-int mem_event_claim_slot(struct domain *d, struct mem_event_domain *med);
+ * available space and the caller is a foreign domain. If the guest itself
+ * is the caller, -EBUSY is avoided by sleeping on a wait queue to ensure
+ * that the ring does not lose future events.
+ *
+ * However, the allow_sleep flag can be set to false in cases in which it is ok
+ * to lose future events, and thus -EBUSY can be returned to guest vcpus
+ * (handle with care!).
+ *
+ * In general, you must follow a claim_slot() call with either put_request()
+ * or cancel_slot(), both of which are guaranteed to succeed.
+ */
+int __mem_event_claim_slot(struct domain *d, struct mem_event_domain *med,
+                           bool_t allow_sleep);
+static inline int mem_event_claim_slot(struct domain *d,
+                                       struct mem_event_domain *med)
+{
+    return __mem_event_claim_slot(d, med, 1);
+}
+
+static inline int mem_event_claim_slot_nosleep(struct domain *d,
+                                               struct mem_event_domain *med)
+{
+    return __mem_event_claim_slot(d, med, 0);
+}

 void mem_event_cancel_slot(struct domain *d, struct mem_event_domain *med);
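As a concrete illustration of the claim/put contract described in the comment
above, the sketch below shows how a producer that can tolerate losing an event
might use the non-sleeping variant. It is only a sketch: the function name and
the request initialisation are assumptions for illustration, not code from
this series.

#include <xen/sched.h>
#include <asm/mem_event.h>

/* Illustrative only: claim a slot without blocking, then put the request.
 * Dropping the event on -EBUSY/-ENOSYS is acceptable here by assumption. */
static void example_notify_nosleep(struct domain *d, unsigned long gfn)
{
    mem_event_request_t req = { .type = MEM_EVENT_TYPE_SHARED, .gfn = gfn };

    if ( mem_event_claim_slot_nosleep(d, &d->mem_event->share) < 0 )
        return;  /* ring absent or full: the event is simply lost */

    /* A successful claim must be followed by put_request() or cancel_slot(). */
    mem_event_put_request(d, &d->mem_event->share, &req);
}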
Andres Lagar-Cavilla
2012-Feb-16 03:57 UTC
[PATCH 3 of 4] Memory sharing: better handling of ENOMEM while unsharing
 xen/arch/x86/hvm/hvm.c            |  20 +++++++++++++-
 xen/arch/x86/mm.c                 |   8 +++--
 xen/arch/x86/mm/mem_sharing.c     |  52 ++++++++++++++++++++++++---------------
 xen/arch/x86/mm/p2m.c             |  18 ++++++++++++-
 xen/common/grant_table.c          |  11 ++++---
 xen/common/memory.c               |   1 +
 xen/include/asm-x86/mem_sharing.h |  15 +++++++++++
 7 files changed, 94 insertions(+), 31 deletions(-)

If unsharing fails with ENOMEM, we were:
- leaving the list of gfns backed by the shared page in an inconsistent state
- cycling forever in the hap page fault handler
- attempting to produce a mem event (which could sleep on a wait queue)
  while holding locks
- not checking, for all callers, that unshare could have indeed failed

Fix the bugs above, and sanitize callers to place a ring event in an unlocked
context, or without requiring them to go to sleep on a wait queue.

A note on the rationale for unshare error handling:
1. Unshare can only fail with ENOMEM. Any other error condition BUG_ON()s.
2. We notify a potential dom0 helper through a mem_event ring. But we
   allow the notification to not go to sleep. If the event ring is full
   of ENOMEM warnings, then the helper will already have been kicked enough.
3. We cannot "just" go to sleep until the unshare is resolved, because we
   might be buried deep into locks (e.g. something -> copy_to_user ->
   __hvm_copy).
4. So, we make sure we:
   4.1. return an error
   4.2. do not corrupt shared memory
   4.3. do not corrupt guest memory
   4.4. let the guest deal with the error if propagation will reach that far

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 09746decbd28 -r 407b5ac709aa xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1196,7 +1196,7 @@ int hvm_hap_nested_page_fault(unsigned l
     mfn_t mfn;
     struct vcpu *v = current;
     struct p2m_domain *p2m;
-    int rc, fall_through = 0, paged = 0;
+    int rc, fall_through = 0, paged = 0, sharing_enomem = 0;
     mem_event_request_t *req_ptr = NULL;

     /* On Nested Virtualization, walk the guest page table.
@@ -1305,7 +1305,8 @@ int hvm_hap_nested_page_fault(unsigned l
         if ( access_w && (p2mt == p2m_ram_shared) )
         {
             ASSERT(!p2m_is_nestedp2m(p2m));
-            mem_sharing_unshare_page(p2m->domain, gfn, 0);
+            sharing_enomem =
+                (mem_sharing_unshare_page(p2m->domain, gfn, 0) < 0);
             rc = 1;
             goto out_put_gfn;
         }
@@ -1345,8 +1346,23 @@ int hvm_hap_nested_page_fault(unsigned l
 out_put_gfn:
     put_gfn(p2m->domain, gfn);
+    /* All of these are delayed until we exit, since we might
+     * sleep on event ring wait queues, and we must not hold
+     * locks in such circumstance */
     if ( paged )
         p2m_mem_paging_populate(v->domain, gfn);
+    if ( sharing_enomem )
+    {
+        int rv;
+        if ( (rv = mem_sharing_notify_enomem(v->domain, gfn, 1)) < 0 )
+        {
+            gdprintk(XENLOG_ERR, "Domain %hu attempt to unshare "
+                     "gfn %lx, ENOMEM and no helper (rc %d)\n",
+                     v->domain->domain_id, gfn, rv);
+            /* Crash the domain */
+            rc = 0;
+        }
+    }
     if ( req_ptr )
     {
         mem_access_send_req(v->domain, req_ptr);

diff -r 09746decbd28 -r 407b5ac709aa xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3589,12 +3589,14 @@ int do_mmu_update(
                     /* Unshare the page for RW foreign mappings */
                     if ( l1e_get_flags(l1e) & _PAGE_RW )
                     {
-                        rc = mem_sharing_unshare_page(pg_owner,
-                                                      l1e_get_pfn(l1e),
-                                                      0);
+                        unsigned long gfn = l1e_get_pfn(l1e);
+                        rc = mem_sharing_unshare_page(pg_owner, gfn, 0);
                         if ( rc )
                         {
                             put_gfn(pg_owner, l1egfn);
+                            /* Notify helper, don't care about errors, will not
+                             * sleep on wq, since we're a foreign domain. */
+                            (void)mem_sharing_notify_enomem(pg_owner, gfn, 0);
                             break;
                         }
                     }

diff -r 09746decbd28 -r 407b5ac709aa xen/arch/x86/mm/mem_sharing.c
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -343,35 +343,30 @@ int mem_sharing_audit(void)
 #endif

-static void mem_sharing_notify_helper(struct domain *d, unsigned long gfn)
+int mem_sharing_notify_enomem(struct domain *d, unsigned long gfn,
+                              bool_t allow_sleep)
 {
     struct vcpu *v = current;
+    int rc;
     mem_event_request_t req = { .type = MEM_EVENT_TYPE_SHARED };

-    if ( v->domain != d )
+    if ( (rc = __mem_event_claim_slot(d,
+                        &d->mem_event->share, allow_sleep)) < 0 )
+        return rc;
+
+    if ( v->domain == d )
     {
-        /* XXX This path needs some attention. For now, just fail foreign
-         * XXX requests to unshare if there's no memory. This replaces
-         * XXX old code that BUG()ed here; the callers now BUG()
-         * XXX elewhere. */
-        gdprintk(XENLOG_ERR,
-                 "Failed alloc on unshare path for foreign (%d) lookup\n",
-                 d->domain_id);
-        return;
+        req.flags = MEM_EVENT_FLAG_VCPU_PAUSED;
+        vcpu_pause_nosync(v);
     }

-    if (mem_event_claim_slot(d, &d->mem_event->share) < 0)
-    {
-        return;
-    }
-
-    req.flags = MEM_EVENT_FLAG_VCPU_PAUSED;
-    vcpu_pause_nosync(v);
-
     req.gfn = gfn;
     req.p2mt = p2m_ram_shared;
     req.vcpu_id = v->vcpu_id;
+
     mem_event_put_request(d, &d->mem_event->share, &req);
+
+    return 0;
 }

 unsigned int mem_sharing_get_nr_saved_mfns(void)
@@ -903,6 +898,20 @@ err_out:
     return ret;
 }

+
+/* A note on the rationale for unshare error handling:
+ *  1. Unshare can only fail with ENOMEM. Any other error conditions BUG_ON()'s
+ *  2. We notify a potential dom0 helper through a mem_event ring. But we
+ *     allow the notification to not go to sleep. If the event ring is full
+ *     of ENOMEM warnings, then it's on the ball.
+ *  3. We cannot go to sleep until the unshare is resolved, because we might
+ *     be buried deep into locks (e.g. something -> copy_to_user -> __hvm_copy)
+ *  4. So, we make sure we:
+ *     4.1. return an error
+ *     4.2. do not corrupt shared memory
+ *     4.3. do not corrupt guest memory
+ *     4.4. let the guest deal with it if the error propagation will reach it
+ */
 int mem_sharing_unshare_page(struct domain *d,
                              unsigned long gfn,
                              uint16_t flags)
@@ -945,7 +954,6 @@ gfn_found:
     /* Do the accounting first. If anything fails below, we have bigger
      * bigger fish to fry. First, remove the gfn from the list. */
     last_gfn = list_has_one_entry(&page->sharing->gfns);
-    mem_sharing_gfn_destroy(d, gfn_info);
     if ( last_gfn )
     {
         /* Clean up shared state */
@@ -959,6 +967,7 @@ gfn_found:
      * (possibly freeing the page), and exit early */
     if ( flags & MEM_SHARING_DESTROY_GFN )
     {
+        mem_sharing_gfn_destroy(d, gfn_info);
         put_page_and_type(page);
         mem_sharing_page_unlock(page);
         if ( last_gfn &&
@@ -971,6 +980,7 @@ gfn_found:

     if ( last_gfn )
     {
+        mem_sharing_gfn_destroy(d, gfn_info);
         /* Making a page private atomically unlocks it */
         BUG_ON(page_make_private(d, page) != 0);
         goto private_page_found;
@@ -982,7 +992,8 @@ gfn_found:
     {
         mem_sharing_page_unlock(old_page);
         put_gfn(d, gfn);
-        mem_sharing_notify_helper(d, gfn);
+        /* Caller is responsible for placing an event
+         * in the ring */
         return -ENOMEM;
     }
@@ -993,6 +1004,7 @@ gfn_found:
     unmap_domain_page(t);

     BUG_ON(set_shared_p2m_entry(d, gfn, page_to_mfn(page)) == 0);
+    mem_sharing_gfn_destroy(d, gfn_info);
     mem_sharing_page_unlock(old_page);
     put_page_and_type(old_page);

diff -r 09746decbd28 -r 407b5ac709aa xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -170,7 +170,10 @@ mfn_t __get_gfn_type_access(struct p2m_d
     if ( q == p2m_unshare && p2m_is_shared(*t) )
     {
         ASSERT(!p2m_is_nestedp2m(p2m));
-        mem_sharing_unshare_page(p2m->domain, gfn, 0);
+        /* Try to unshare. If we fail, communicate ENOMEM without
+         * sleeping. */
+        if ( mem_sharing_unshare_page(p2m->domain, gfn, 0) < 0 )
+            (void)mem_sharing_notify_enomem(p2m->domain, gfn, 0);
         mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order);
     }
 #endif
@@ -371,6 +374,7 @@ void p2m_teardown(struct p2m_domain *p2m
         if ( mfn_valid(mfn) && (t == p2m_ram_shared) )
         {
             ASSERT(!p2m_is_nestedp2m(p2m));
+            /* Does not fail with ENOMEM given the DESTROY flag */
            BUG_ON(mem_sharing_unshare_page(d, gfn, MEM_SHARING_DESTROY_GFN));
         }
     }
@@ -510,6 +514,18 @@ guest_physmap_add_entry(struct domain *d
             if ( rc )
             {
                 p2m_unlock(p2m);
+                /* NOTE: Should a guest domain bring this upon itself,
+                 * there is not a whole lot we can do. We are buried
+                 * deep in locks from most code paths by now. So, fail
+                 * the call and don't try to sleep on a wait queue
+                 * while placing the mem event.
+                 *
+                 * However, all current (changeset 3432abcf9380) code
+                 * paths avoid this unsavoury situation. For now.
+                 *
+                 * Foreign domains are okay to place an event as they
+                 * won't go to sleep. */
+                (void)mem_sharing_notify_enomem(p2m->domain, gfn + i, 0);
                 return rc;
             }
             omfn = p2m->get_entry(p2m, gfn + i, &ot, &a, p2m_query, NULL);

diff -r 09746decbd28 -r 407b5ac709aa xen/common/grant_table.c
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -112,8 +112,7 @@ static unsigned inline int max_nr_maptra
     p2m_type_t __p2mt;                                    \
     unsigned long __x;                                    \
     __x = mfn_x(get_gfn_unshare((_d), (_gfn), &__p2mt));  \
-    BUG_ON(p2m_is_shared(__p2mt)); /* XXX fixme */        \
-    if ( !p2m_is_valid(__p2mt) )                          \
+    if ( p2m_is_shared(__p2mt) || !p2m_is_valid(__p2mt) ) \
         __x = INVALID_MFN;                                \
     __x; })
 #else
@@ -155,9 +154,11 @@ static int __get_paged_frame(unsigned lo
     else
     {
         mfn = get_gfn_unshare(rd, gfn, &p2mt);
-        BUG_ON(p2m_is_shared(p2mt));
-        /* XXX Here, and above in gfn_to_mfn_private, need to handle
-         * XXX failure to unshare. */
+        if ( p2m_is_shared(p2mt) )
+        {
+            put_gfn(rd, gfn);
+            return GNTST_eagain;
+        }
     }
     if ( p2m_is_valid(p2mt) ) {

diff -r 09746decbd28 -r 407b5ac709aa xen/common/memory.c
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -200,6 +200,7 @@ int guest_remove_page(struct domain *d,
         if ( mem_sharing_unshare_page(d, gmfn, 0) )
         {
             put_gfn(d, gmfn);
+            (void)mem_sharing_notify_enomem(d, gmfn, 0);
             return 0;
         }
         /* Maybe the mfn changed */

diff -r 09746decbd28 -r 407b5ac709aa xen/include/asm-x86/mem_sharing.h
--- a/xen/include/asm-x86/mem_sharing.h
+++ b/xen/include/asm-x86/mem_sharing.h
@@ -53,9 +53,24 @@ int mem_sharing_nominate_page(struct dom
                               int expected_refcnt,
                               shr_handle_t *phandle);
 #define MEM_SHARING_DESTROY_GFN       (1<<1)
+/* Only fails with -ENOMEM */
 int mem_sharing_unshare_page(struct domain *d,
                              unsigned long gfn,
                              uint16_t flags);
+/* If called by a foreign domain, possible errors are
+ *   -EBUSY  -> ring full
+ *   -ENOSYS -> no ring to begin with
+ * and the foreign mapper is responsible for retrying.
+ *
+ * If called by the guest vcpu itself and allow_sleep is set, may
+ * sleep on a wait queue, so the caller is responsible for not
+ * holding locks on entry. It may only fail with ENOSYS.
+ *
+ * If called by the guest vcpu itself and allow_sleep is not set,
+ * then it's the same as a foreign domain.
+ */
+int mem_sharing_notify_enomem(struct domain *d, unsigned long gfn,
+                              bool_t allow_sleep);
 int mem_sharing_sharing_resume(struct domain *d);
 int mem_sharing_memop(struct domain *d,
                       xen_mem_sharing_op_t *mec);
Andres Lagar-Cavilla
2012-Feb-16 03:57 UTC
[PATCH 4 of 4] Global virq for low memory situations
 xen/common/page_alloc.c  | 10 ++++++++++
 xen/include/public/xen.h |  1 +
 2 files changed, 11 insertions(+), 0 deletions(-)

When a low memory threshold on the Xen heap is reached, we fire a global
dom0 virq. If someone's listening, they can free up some more memory.

The low threshold is configurable via the command line token
"low_mem_virq_limit", and defaults to 64MiB.

We define a new virq, VIRQ_ENOMEM. Potential listeners include squeezed,
xenballoond, or anything else that can be fired through xencommons.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 407b5ac709aa -r 65c15f3c9611 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -35,6 +35,7 @@
 #include <xen/perfc.h>
 #include <xen/numa.h>
 #include <xen/nodemask.h>
+#include <xen/event.h>
 #include <xen/tmem.h>
 #include <xen/tmem_xen.h>
 #include <public/sysctl.h>
@@ -300,6 +301,12 @@ static unsigned long init_node_heap(int
     return needed;
 }

+/* Default to 64 MiB */
+#define DEFAULT_LOW_MEM_VIRQ_MIB 64
+static unsigned long long __read_mostly opt_low_mem_virq =
+    (DEFAULT_LOW_MEM_VIRQ_MIB << 20);
+size_param("low_mem_virq_limit", opt_low_mem_virq);
+
 /* Allocate 2^@order contiguous pages. */
 static struct page_info *alloc_heap_pages(
     unsigned int zone_lo, unsigned int zone_hi,
@@ -420,6 +427,9 @@ static struct page_info *alloc_heap_page
     total_avail_pages -= request;
     ASSERT(total_avail_pages >= 0);

+    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )
+        send_global_virq(VIRQ_ENOMEM);
+
     if ( d != NULL )
         d->last_alloc_node = node;

diff -r 407b5ac709aa -r 65c15f3c9611 xen/include/public/xen.h
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -157,6 +157,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
 #define VIRQ_PCPU_STATE  9  /* G. (DOM0) PCPU state changed                */
 #define VIRQ_MEM_EVENT  10  /* G. (DOM0) A memory event has occured        */
 #define VIRQ_XC_RESERVED 11 /* G. Reserved for XenClient                   */
+#define VIRQ_ENOMEM     12  /* G. (DOM0) Dangerously low on heap memory    */

 /* Architecture-specific VIRQ definitions. */
 #define VIRQ_ARCH_0    16
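To show what a consumer of the new virq might look like, here is a sketch of a
minimal dom0 listener. It is an assumption for illustration only: the libxc
event-channel calls (xc_evtchn_open/bind_virq/pending/unmask) are assumed to
have the signatures of this era, and VIRQ_ENOMEM is hard-coded to the value
added by the hunk above rather than taken from a patched header.

#include <stdio.h>
#include <xenctrl.h>

#define VIRQ_ENOMEM 12  /* assumption: mirrors the patched public/xen.h */

int main(void)
{
    xc_evtchn *xce = xc_evtchn_open(NULL, 0);
    evtchn_port_or_error_t port, pending;

    if ( xce == NULL )
        return 1;

    port = xc_evtchn_bind_virq(xce, VIRQ_ENOMEM);
    if ( port < 0 )
        return 1;

    /* Block until the hypervisor signals low heap memory. */
    while ( (pending = xc_evtchn_pending(xce)) >= 0 )
    {
        xc_evtchn_unmask(xce, pending);
        /* Policy goes here: e.g. ask the ballooning/squeezing daemon to
         * return memory to the Xen heap. */
        fprintf(stderr, "Xen heap is running low\n");
    }

    xc_evtchn_close(xce);
    return 0;
}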
Jan Beulich
2012-Feb-16 09:16 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org> wrote:
> xen/common/domctl.c |  8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
>
> Apparently, setting d->max_pages to something lower than d->tot_pages is
> used as a mechanism for controlling a domain's footprint. It will result
> in all new page allocations failing.
>
> This is a really bad idea with paging or sharing, as regular guest memory
> accesses may need to be satisfied by allocating new memory (either to
> page in or to unshare).
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>
> diff -r 62b1fe67b8d1 -r 11fd4e0a1e1a xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -813,8 +813,14 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
>           * NB. We removed a check that new_max >= current tot_pages; this means
>           * that the domain will now be allowed to "ratchet" down to new_max. In
>           * the meantime, while tot > max, all new allocations are disallowed.
> +         *
> +         * Except that this is not a great idea for domains doing sharing or
> +         * paging, as they need to perform new allocations on the fly.
>           */
> -        d->max_pages = new_max;
> +        if ( (new_max > d->max_pages) ||
> +             !((d->mem_event->paging.ring_page != NULL) ||
> +               d->arch.hvm_domain.mem_sharing_enabled) )
> +            d->max_pages = new_max;

I don't think this is appropriate here, as it will allow not only
paging/sharing allocations to succeed, but also ones that aren't intended to.

I think that *if* an override to the limit enforcement is needed, it ought to
be implemented by passing a special flag to the page allocator to have it make
an exception. But that's questionable in the first place - if a domain is told
to shrink, it ought to follow that directive (and the paging/sharing code
should then probably obtain a page from inside the guest; I realize that might
be problematic when sharing is enabled but paging isn't).

Jan

>         ret = 0;
>         spin_unlock(&d->page_alloc_lock);
>
Jan Beulich
2012-Feb-16 09:31 UTC
Re: [PATCH 0 of 4] Handling of (some) low memory conditions
>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org> wrote:
> - Add a VIRQ that the hypervisor can emit when reaching a low memory
>   threshold.

In this patch, which didn't make it to my inbox yet, you will want to
change this

+    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )

to

    if ( total_avail_pages <= PFN_DOWN(opt_low_mem_virq) )

to avoid the case (on 32-bit hypervisors) where total_avail_pages,
being just 'long', would get significant bits shifted out.

I'm further wondering whether the default value shouldn't be set
dynamically based on available memory and/or taking into account
an eventual dom0_mem= option.

Finally, rate limiting will be needed for sending the vIRQ - currently,
once below the threshold you're sending one on each and every
allocation (which may become harmful if the listener can't really do
anything about the situation).

Jan
Tim Deegan
2012-Feb-16 10:20 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
At 22:57 -0500 on 15 Feb (1329346624), Andres Lagar-Cavilla wrote:
> xen/common/domctl.c |  8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
>
> Apparently, setting d->max_pages to something lower than d->tot_pages is
> used as a mechanism for controlling a domain's footprint. It will result
> in all new page allocations failing.

Yep.

> This is a really bad idea with paging or sharing, as regular guest memory
> accesses may need to be satisfied by allocating new memory (either to
> page in or to unshare).

Nack. If a domain ends up with a max_pages so low that it can't page in,
that's a tools bug. This patch doesn't fix it, because the toolstack could
set new max == current tot (er, +1) and then you have the same problem if
you page in twice. (And also it silently ignores the update rather than
reporting an error.)

Tim.

> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>
> diff -r 62b1fe67b8d1 -r 11fd4e0a1e1a xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -813,8 +813,14 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
>           * NB. We removed a check that new_max >= current tot_pages; this means
>           * that the domain will now be allowed to "ratchet" down to new_max. In
>           * the meantime, while tot > max, all new allocations are disallowed.
> +         *
> +         * Except that this is not a great idea for domains doing sharing or
> +         * paging, as they need to perform new allocations on the fly.
>           */
> -        d->max_pages = new_max;
> +        if ( (new_max > d->max_pages) ||
> +             !((d->mem_event->paging.ring_page != NULL) ||
> +               d->arch.hvm_domain.mem_sharing_enabled) )
> +            d->max_pages = new_max;
>         ret = 0;
>         spin_unlock(&d->page_alloc_lock);
>
Andres Lagar-Cavilla
2012-Feb-16 14:40 UTC
Re: [PATCH 0 of 4] Handling of (some) low memory conditions
>>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>>> wrote:
>> - Add a VIRQ that the hypervisor can emit when reaching a low memory
>>   threshold.
>
> In this patch, which didn't make it to my inbox yet, you will want to
> change this
>
> +    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )
>
> to
>
>     if ( total_avail_pages <= PFN_DOWN(opt_low_mem_virq) )
>
> to avoid the case (on 32-bit hypervisors) where total_avail_pages,
> being just 'long', would get significant bits shifted out.
>
> I'm further wondering whether the default value shouldn't be set
> dynamically based on available memory and/or taking into account
> an eventual dom0_mem= option.

I can cap or get rid of the threshold (and the virq) if it doesn't make
sense with respect to total memory. I'm not sure about integrating
dom0_mem, since dom0's footprint is also a quantity manipulated by the
receiver of the virq.

> Finally, rate limiting will be needed for sending the vIRQ - currently,
> once below the threshold you're sending one on each and every
> allocation (which may become harmful if the listener can't really do
> anything about the situation).

All three are sensible. I'll look into printk rate-limiting for inspiration.

Thanks,
Andres

> Jan
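To make the rate-limiting point concrete, one possible shape is an
edge-triggered check with a little hysteresis, so the virq fires once when the
heap crosses the threshold and is only re-armed after it recovers. The helper
below is purely a sketch under that assumption (the function name and the
re-arm margin are invented), not the eventual implementation; alloc_heap_pages()
would call it where the posted patch currently sends the virq unconditionally.

/* Sketch only: fire the virq on the downward crossing, re-arm after the
 * heap climbs comfortably back above the threshold. */
static bool_t __read_mostly low_mem_virq_armed = 1;

static void check_low_mem_virq(void)
{
    unsigned long thresh = PFN_DOWN(opt_low_mem_virq);

    if ( low_mem_virq_armed && (total_avail_pages <= thresh) )
    {
        low_mem_virq_armed = 0;
        send_global_virq(VIRQ_ENOMEM);
    }
    else if ( !low_mem_virq_armed &&
              (total_avail_pages > thresh + (thresh >> 3)) )
        low_mem_virq_armed = 1;  /* re-arm with ~12% headroom */
}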
Andres Lagar-Cavilla
2012-Feb-16 14:45 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
> At 22:57 -0500 on 15 Feb (1329346624), Andres Lagar-Cavilla wrote:
>> xen/common/domctl.c |  8 +++++++-
>> 1 files changed, 7 insertions(+), 1 deletions(-)
>>
>>
>> Apparently, setting d->max_pages to something lower than d->tot_pages is
>> used as a mechanism for controlling a domain's footprint. It will result
>> in all new page allocations failing.
>
> Yep.
>
>> This is a really bad idea with paging or sharing, as regular guest memory
>> accesses may need to be satisfied by allocating new memory (either to
>> page in or to unshare).
>
> Nack. If a domain ends up with a max_pages so low that it can't page in,
> that's a tools bug. This patch doesn't fix it, because the toolstack could
> set new max == current tot (er, +1) and then you have the same problem if
> you page in twice. (And also it silently ignores the update rather than
> reporting an error.)

Fair enough (also referring to Jan's comments). We would be building policy
into the hypervisor.

But I've seen squeezed set a criminally low max_pages value (e.g. 256).
Granted, this is squeezed's problem, but shouldn't some sanity checking be
wired into the hypervisor? Why should we even allow max_pages < tot_pages?

Also please note that paging can somewhat deal with this because the pager
gets synchronous kicks in a "clean" context -- the pager can swap out some
stuff with ease. Sharing is seriously disadvantaged here -- the enomem ring
is not mandatory, and failed allocations can happen in runtime paths rather
than on the pager's request.

Andres

> Tim.
>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>
>> diff -r 62b1fe67b8d1 -r 11fd4e0a1e1a xen/common/domctl.c
>> --- a/xen/common/domctl.c
>> +++ b/xen/common/domctl.c
>> @@ -813,8 +813,14 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
>>           * NB. We removed a check that new_max >= current tot_pages; this means
>>           * that the domain will now be allowed to "ratchet" down to new_max. In
>>           * the meantime, while tot > max, all new allocations are disallowed.
>> +         *
>> +         * Except that this is not a great idea for domains doing sharing or
>> +         * paging, as they need to perform new allocations on the fly.
>>           */
>> -        d->max_pages = new_max;
>> +        if ( (new_max > d->max_pages) ||
>> +             !((d->mem_event->paging.ring_page != NULL) ||
>> +               d->arch.hvm_domain.mem_sharing_enabled) )
>> +            d->max_pages = new_max;
>>         ret = 0;
>>         spin_unlock(&d->page_alloc_lock);
>>
Tim Deegan
2012-Feb-16 14:58 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
At 06:45 -0800 on 16 Feb (1329374751), Andres Lagar-Cavilla wrote:
> Fair enough (also referring to Jan's comments). We would be building policy
> into the hypervisor.
>
> But I've seen squeezed set a criminally low max_pages value (e.g. 256).
> Granted, this is squeezed's problem, but shouldn't some sanity checking be
> wired into the hypervisor?

Some operating systems do just fine in 640K. :) But seriously, what lower
limit would we use? Stupidly low max_pages for some uses would be just fine
for others.

> Why should we even allow max_pages < tot_pages?

The reasoning is:
 - the tools want a hard guarantee that a rogue balloon driver can't mess
   up their calculations of how much free RAM there is.
 - when a VM is ballooning down we don't want to have the tools spinning,
   watching actual tot_pages and adjusting max_pages down as it changes.

It's not a particularly nice interface, though, and I'd be happy to see it
revert to the old one (where new max had to be >= current tot). But that
will need a fix for Xapi/squeezed to handle the ballooning-down case some
other way.

Cc-ing xen-API. Any XCP folks got an opinion about this? Would it be easy to
make squeezed not need this behaviour?

Cheers,

Tim.
Jan Beulich
2012-Feb-16 15:22 UTC
Re: [PATCH 0 of 4] Handling of (some) low memory conditions
>>> On 16.02.12 at 15:40, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
>>>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>>>> wrote:
>>> - Add a VIRQ that the hypervisor can emit when reaching a low memory
>>>   threshold.
>>
>> In this patch, which didn't make it to my inbox yet, you will want to
>> change this
>>
>> +    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )
>>
>> to
>>
>>     if ( total_avail_pages <= PFN_DOWN(opt_low_mem_virq) )
>>
>> to avoid the case (on 32-bit hypervisors) where total_avail_pages,
>> being just 'long', would get significant bits shifted out.
>>
>> I'm further wondering whether the default value shouldn't be set
>> dynamically based on available memory and/or taking into account
>> an eventual dom0_mem= option.
>
> I can cap or get rid of the threshold (and the virq) if it doesn't make
> sense with respect to total memory. I'm not sure about integrating
> dom0_mem, since dom0's footprint is also a quantity manipulated by the
> receiver of the virq.

No, dom0_mem= is only specifying the starting value, and the case that would
be of possible interest is that of having a negative amount specified.

Jan
Jan Beulich
2012-Feb-16 15:32 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
>>> On 16.02.12 at 15:45, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
> But I've seen squeezed set a criminally low max_pages value (e.g. 256).
> Granted, this is squeezed's problem, but shouldn't some sanity checking be
> wired into the hypervisor? Why should we even allow max_pages < tot_pages?

The only lower boundary that the hypervisor could (perhaps should) enforce
is that of not being able to guarantee guest forward progress: on x86, up to
2 text pages plus up to 4 data pages, times the number of page table levels
(i.e. 24 pages on 64-bit).

Jan
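For illustration, the kind of floor being suggested could be expressed as a
helper that the max_mem domctl consults before accepting a new ceiling. The
constant follows the arithmetic above, (2 text + 4 data) pages times 4
page-table levels = 24; the helper name and its use are assumptions, not
existing Xen code.

/* Sketch only: refuse a max_pages under which no single guest instruction
 * can be guaranteed to complete on 64-bit x86. */
#define MIN_FORWARD_PROGRESS_PAGES ((2 + 4) * 4)   /* = 24 */

static int check_new_max_pages(unsigned long new_max)
{
    if ( new_max < MIN_FORWARD_PROGRESS_PAGES )
        return -EINVAL;   /* below the forward-progress floor */
    return 0;             /* otherwise, policy stays with the toolstack */
}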
Andres Lagar-Cavilla
2012-Feb-16 15:34 UTC
Re: [PATCH 0 of 4] Handling of (some) low memory conditions
>>>> On 16.02.12 at 15:40, "Andres Lagar-Cavilla" <andres@lagarcavilla.org>
>>>> wrote:
>>>>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>>>>> wrote:
>>>> - Add a VIRQ that the hypervisor can emit when reaching a low memory
>>>>   threshold.
>>>
>>> In this patch, which didn't make it to my inbox yet, you will want to
>>> change this
>>>
>>> +    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )
>>>
>>> to
>>>
>>>     if ( total_avail_pages <= PFN_DOWN(opt_low_mem_virq) )
>>>
>>> to avoid the case (on 32-bit hypervisors) where total_avail_pages,
>>> being just 'long', would get significant bits shifted out.
>>>
>>> I'm further wondering whether the default value shouldn't be set
>>> dynamically based on available memory and/or taking into account
>>> an eventual dom0_mem= option.
>>
>> I can cap or get rid of the threshold (and the virq) if it doesn't make
>> sense with respect to total memory. I'm not sure about integrating
>> dom0_mem, since dom0's footprint is also a quantity manipulated by the
>> receiver of the virq.
>
> No, dom0_mem= is only specifying the starting value, and the case that
> would be of possible interest is that of having a negative amount
> specified.

Sorry, not following entirely. The negative quantity you refer to would be
dom0_mem or the low mem virq threshold? Additional checks pertaining only to
dom0_mem, not in relation to this threshold, would most likely go in a
separate patch. And both quantities are parsed using strtoull, so you get at
most really large numbers.

I certainly need to add additional checks for unsuitable low_mem_virq
thresholds to this patch.

Thanks,
Andres

> Jan
Tim Deegan
2012-Feb-16 16:08 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
At 15:32 +0000 on 16 Feb (1329406348), Jan Beulich wrote:
> >>> On 16.02.12 at 15:45, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
> > But I've seen squeezed set a criminally low max_pages value (e.g. 256).
> > Granted, this is squeezed's problem, but shouldn't some sanity checking be
> > wired into the hypervisor? Why should we even allow max_pages < tot_pages?
>
> The only lower boundary that the hypervisor could (perhaps should) enforce
> is that of not being able to guarantee guest forward progress: on x86, up
> to 2 text pages plus up to 4 data pages, times the number of page table
> levels (i.e. 24 pages on 64-bit).

OT: ISTR the actual number of memory accesses needed to complete one x86
instruction is much crazier than that (e.g. a memory-to-memory copy that
faults on the second page of the destination, pushing most of a stack frame
before hitting the SS limit and double-faulting via a task gate, &c). We did
try to count it exactly for the shadow pagetables once but gave up and wildly
overestimated instead (because perf near the limit would have been too awful
anyway).

Tim.
Tim Deegan
2012-Feb-16 16:11 UTC
Re: [PATCH 2 of 4] x86/mm: Allow to not sleep on mem event ring
At 22:57 -0500 on 15 Feb (1329346625), Andres Lagar-Cavilla wrote:
> xen/arch/x86/mm/mem_event.c     |  5 +++--
> xen/include/asm-x86/mem_event.h | 30 +++++++++++++++++++++++++-----
> 2 files changed, 28 insertions(+), 7 deletions(-)
>
>
> Under extreme congestion conditions, generating a mem event may put the
> vcpu to sleep on a wait queue if the ring is full. This is generally
> desirable, although fairly convoluted to work with, since sleeping on a
> wait queue requires a non-atomic context (i.e. no locks held).
>
> Introduce an allow_sleep flag to make this optional. The API remains such
> that all current callers set allow_sleep to true and thus will sleep if
> necessary.
>
> The end-use is for cases in which loss of guest mem events is tolerable.
> One such consumer to be added later is the unsharing code under ENOMEM
> conditions.
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
> Signed-off-by: Adin Scannell <adin@scannell.ca>

Hmm. This interface is getting a bit twisty now, but assuming the patches
that use it are OK, then Ack.

Tim.
Tim Deegan
2012-Feb-16 16:19 UTC
Re: [PATCH 3 of 4] Memory sharing: better handling of ENOMEM while unsharing
At 22:57 -0500 on 15 Feb (1329346626), Andres Lagar-Cavilla wrote:
> xen/arch/x86/hvm/hvm.c            |  20 +++++++++++++-
> xen/arch/x86/mm.c                 |   8 +++--
> xen/arch/x86/mm/mem_sharing.c     |  52 ++++++++++++++++++++++++---------------
> xen/arch/x86/mm/p2m.c             |  18 ++++++++++++-
> xen/common/grant_table.c          |  11 ++++---
> xen/common/memory.c               |   1 +
> xen/include/asm-x86/mem_sharing.h |  15 +++++++++++
> 7 files changed, 94 insertions(+), 31 deletions(-)
>
>
> If unsharing fails with ENOMEM, we were:
> - leaving the list of gfns backed by the shared page in an inconsistent
>   state
> - cycling forever in the hap page fault handler
> - attempting to produce a mem event (which could sleep on a wait queue)
>   while holding locks
> - not checking, for all callers, that unshare could have indeed failed
>
> Fix the bugs above, and sanitize callers to place a ring event in an
> unlocked context, or without requiring them to go to sleep on a wait queue.
>
> A note on the rationale for unshare error handling:
> 1. Unshare can only fail with ENOMEM. Any other error condition BUG_ON()s.

Please enforce this in the code -- either the function should just return
success/failure or (better, I think) the caller should ASSERT that it
doesn't see any other error codes.

> 2. We notify a potential dom0 helper through a mem_event ring. But we
>    allow the notification to not go to sleep. If the event ring is full
>    of ENOMEM warnings, then the helper will already have been kicked
>    enough.

Could we just keep a count (or flag) to remind us that an ENOMEM is in the
pipe and avoid having an exception to the waiting rules?

> 3. We cannot "just" go to sleep until the unshare is resolved, because we
>    might be buried deep into locks (e.g. something -> copy_to_user ->
>    __hvm_copy).
> 4. So, we make sure we:
>    4.1. return an error
>    4.2. do not corrupt shared memory
>    4.3. do not corrupt guest memory
>    4.4. let the guest deal with the error if propagation will reach that far

Yep, the other cleanups look good to me.

Cheers,

Tim.
Jan Beulich
2012-Feb-16 16:26 UTC
Re: [PATCH 0 of 4] Handling of (some) low memory conditions
>>> On 16.02.12 at 16:34, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
>>>>> On 16.02.12 at 15:40, "Andres Lagar-Cavilla" <andres@lagarcavilla.org>
>>>>> wrote:
>>>>>>> On 16.02.12 at 04:57, Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>>>>>> wrote:
>>>>> - Add a VIRQ that the hypervisor can emit when reaching a low memory
>>>>>   threshold.
>>>>
>>>> In this patch, which didn't make it to my inbox yet, you will want to
>>>> change this
>>>>
>>>> +    if ( (total_avail_pages << PAGE_SHIFT) <= opt_low_mem_virq )
>>>>
>>>> to
>>>>
>>>>     if ( total_avail_pages <= PFN_DOWN(opt_low_mem_virq) )
>>>>
>>>> to avoid the case (on 32-bit hypervisors) where total_avail_pages,
>>>> being just 'long', would get significant bits shifted out.
>>>>
>>>> I'm further wondering whether the default value shouldn't be set
>>>> dynamically based on available memory and/or taking into account
>>>> an eventual dom0_mem= option.
>>>
>>> I can cap or get rid of the threshold (and the virq) if it doesn't make
>>> sense with respect to total memory. I'm not sure about integrating
>>> dom0_mem, since dom0's footprint is also a quantity manipulated by the
>>> receiver of the virq.
>>
>> No, dom0_mem= is only specifying the starting value, and the case that
>> would be of possible interest is that of having a negative amount
>> specified.
>
> Sorry, not following entirely. The negative quantity you refer to would be
> dom0_mem or the low mem virq threshold? Additional checks pertaining only
> to dom0_mem, not in relation to this threshold, would most likely go in a
> separate patch. And both quantities are parsed using strtoull, so you get
> at most really large numbers.
>
> I certainly need to add additional checks for unsuitable low_mem_virq
> thresholds to this patch.

Consider someone using "dom0_mem=-64M" - that'd make your code raise the
vIRQ immediately, even though the admin decided that leaving 64M inside Xen
is okay at boot time at least.

Jan
Jan Beulich
2012-Feb-16 16:44 UTC
Re: [PATCH 1 of 4] Prevent low values of max_pages for domains doing sharing or paging
>>> On 16.02.12 at 17:08, Tim Deegan <tim@xen.org> wrote:
> At 15:32 +0000 on 16 Feb (1329406348), Jan Beulich wrote:
>> >>> On 16.02.12 at 15:45, "Andres Lagar-Cavilla" <andres@lagarcavilla.org> wrote:
>> > But I've seen squeezed set a criminally low max_pages value (e.g. 256).
>> > Granted, this is squeezed's problem, but shouldn't some sanity checking be
>> > wired into the hypervisor? Why should we even allow max_pages < tot_pages?
>>
>> The only lower boundary that the hypervisor could (perhaps should) enforce
>> is that of not being able to guarantee guest forward progress: on x86, up
>> to 2 text pages plus up to 4 data pages, times the number of page table
>> levels (i.e. 24 pages on 64-bit).
>
> OT: ISTR the actual number of memory accesses needed to complete one x86
> instruction is much crazier than that (e.g. a memory-to-memory copy that
> faults on the second page of the destination, pushing most of a stack frame
> before hitting the SS limit and double-faulting via a task gate, &c). We
> did try to count it exactly for the shadow pagetables once but gave up and
> wildly overestimated instead (because perf near the limit would have been
> too awful anyway).

Hmm, indeed: when a page fault occurs, you don't need the text or data pages
the original instruction referenced anymore. Instead, you now need up to 2
IDT pages, up to 4 GDT pages, and up to 2 stack pages. So yes, the number is
higher (but not that much). An exception going through a task gate is of
course much uglier, but luckily that doesn't need to be considered, at least
for x86-64.

Jan
Andres Lagar-Cavilla
2012-Feb-17 16:57 UTC
Re: [PATCH 2 of 4] x86/mm: Allow to not sleep on mem event ring
> At 22:57 -0500 on 15 Feb (1329346625), Andres Lagar-Cavilla wrote:
>> xen/arch/x86/mm/mem_event.c     |  5 +++--
>> xen/include/asm-x86/mem_event.h | 30 +++++++++++++++++++++++++-----
>> 2 files changed, 28 insertions(+), 7 deletions(-)
>>
>>
>> Under extreme congestion conditions, generating a mem event may put the
>> vcpu to sleep on a wait queue if the ring is full. This is generally
>> desirable, although fairly convoluted to work with, since sleeping on a
>> wait queue requires a non-atomic context (i.e. no locks held).
>>
>> Introduce an allow_sleep flag to make this optional. The API remains such
>> that all current callers set allow_sleep to true and thus will sleep if
>> necessary.
>>
>> The end-use is for cases in which loss of guest mem events is tolerable.
>> One such consumer to be added later is the unsharing code under ENOMEM
>> conditions.
>>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>> Signed-off-by: Adin Scannell <adin@scannell.ca>
>
> Hmm. This interface is getting a bit twisty now, but assuming the patches
> that use it are OK, then Ack.

Another option is to mark a ring as "best effort", so that the event producer
doesn't have to specify whether sleeping is allowable. The event producer
should know, though, whether the ring is best effort. I think that makes for
a cleaner interface.

Andres

> Tim.
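A sketch of what the "best effort" idea could look like, assuming a new
per-ring flag (the field name is invented for illustration): the
producer-facing call stays unchanged, and the ring itself decides whether
sleeping is ever allowed.

struct mem_event_domain
{
    /* ... existing ring state elided ... */
    bool_t best_effort;   /* hypothetical: drop events rather than sleep */
};

int mem_event_claim_slot(struct domain *d, struct mem_event_domain *med)
{
    /* Only a guest vcpu targeting a non-best-effort ring may sleep. */
    if ( (current->domain == d) && !med->best_effort )
        return mem_event_wait_slot(med);
    else
        return mem_event_grab_slot(med, 1);
}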
Andres Lagar-Cavilla
2012-Feb-17 17:01 UTC
Re: [PATCH 3 of 4] Memory sharing: better handling of ENOMEM while unsharing
> At 22:57 -0500 on 15 Feb (1329346626), Andres Lagar-Cavilla wrote:
>> xen/arch/x86/hvm/hvm.c            |  20 +++++++++++++-
>> xen/arch/x86/mm.c                 |   8 +++--
>> xen/arch/x86/mm/mem_sharing.c     |  52 ++++++++++++++++++++++++---------------
>> xen/arch/x86/mm/p2m.c             |  18 ++++++++++++-
>> xen/common/grant_table.c          |  11 ++++---
>> xen/common/memory.c               |   1 +
>> xen/include/asm-x86/mem_sharing.h |  15 +++++++++++
>> 7 files changed, 94 insertions(+), 31 deletions(-)
>>
>>
>> If unsharing fails with ENOMEM, we were:
>> - leaving the list of gfns backed by the shared page in an inconsistent
>>   state
>> - cycling forever in the hap page fault handler
>> - attempting to produce a mem event (which could sleep on a wait queue)
>>   while holding locks
>> - not checking, for all callers, that unshare could have indeed failed
>>
>> Fix the bugs above, and sanitize callers to place a ring event in an
>> unlocked context, or without requiring them to go to sleep on a wait
>> queue.
>>
>> A note on the rationale for unshare error handling:
>> 1. Unshare can only fail with ENOMEM. Any other error condition BUG_ON()s.
>
> Please enforce this in the code -- either the function should just return
> success/failure or (better, I think) the caller should ASSERT that it
> doesn't see any other error codes.

I'll turn unshare into __unshare and make unshare a wrapper that asserts the
return value semantics of __unshare.

>> 2. We notify a potential dom0 helper through a mem_event ring. But we
>>    allow the notification to not go to sleep. If the event ring is full
>>    of ENOMEM warnings, then the helper will already have been kicked
>>    enough.
>
> Could we just keep a count (or flag) to remind us that an ENOMEM is in the
> pipe and avoid having an exception to the waiting rules?

If we follow the "best effort ring" idea I just proposed for the previous
patch, it would make this one cleaner as well.

I don't think there's a point in remembering that ENOMEMs are in the pipe.
The helper will be woken up by the first ENOMEM; it won't need over 64
notifications to do something about it.

Andres

>> 3. We cannot "just" go to sleep until the unshare is resolved, because we
>>    might be buried deep into locks (e.g. something -> copy_to_user ->
>>    __hvm_copy).
>> 4. So, we make sure we:
>>    4.1. return an error
>>    4.2. do not corrupt shared memory
>>    4.3. do not corrupt guest memory
>>    4.4. let the guest deal with the error if propagation will reach that far
>
> Yep, the other cleanups look good to me.
>
> Cheers,
>
> Tim.
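A sketch of the wrapper mentioned above, assuming the existing entry point is
renamed __mem_sharing_unshare_page(); the exact shape is an assumption rather
than the final patch.

int __mem_sharing_unshare_page(struct domain *d, unsigned long gfn,
                               uint16_t flags);

static inline int mem_sharing_unshare_page(struct domain *d,
                                           unsigned long gfn,
                                           uint16_t flags)
{
    int rc = __mem_sharing_unshare_page(d, gfn, flags);

    /* Enforce the documented semantics: success or -ENOMEM, nothing else. */
    BUG_ON(rc && (rc != -ENOMEM));
    return rc;
}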