Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 0 of 5] Memory event interface improvements
In this patch series we improve the management of congestion in the memory
events ring. We ensure that no guest events are lost, even in the face of
unbounded flooding from foreign maps or from the balloon. We also enable
resumption of mem events via an event channel kick from user space to Xen.
This is more lightweight and scalable than the current domctl interface,
and allows for batching as well.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scannell <adin@scannell.ca>

 xen/common/memory.c             |   29 +++++-
 xen/arch/x86/hvm/hvm.c          |   20 ++-
 xen/arch/x86/mm/mem_event.c     |  205 ++++++++++++++++++++++++++++++---------
 xen/arch/x86/mm/mem_sharing.c   |   27 +++-
 xen/arch/x86/mm/p2m.c           |  104 ++++++++++---------
 xen/common/memory.c             |    7 +-
 xen/include/asm-x86/mem_event.h |   16 ++-
 xen/include/asm-x86/p2m.h       |    6 +-
 xen/include/xen/mm.h            |    2 +
 xen/include/xen/sched.h         |    5 +-
 xen/arch/ia64/vmx/vmx_init.c    |    2 +-
 xen/arch/x86/hvm/hvm.c          |    7 +-
 xen/arch/x86/mm/mem_event.c     |    3 +-
 xen/common/event_channel.c      |   75 +++++++++++---
 xen/include/xen/event.h         |    5 +-
 xen/include/xen/sched.h         |    2 +-
 xen/arch/x86/mm/mem_access.c    |    3 +-
 xen/arch/x86/mm/p2m.c           |    3 +-
 xen/include/asm-x86/p2m.h       |    2 +-
 xen/arch/x86/mm/mem_event.c     |   26 +++-
 20 files changed, 389 insertions(+), 160 deletions(-)
Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 1 of 5] Allow decrease_reservation to be preempted if remove_page returns negative
xen/common/memory.c | 29 ++++++++++++++++++++++++++++- 1 files changed, 28 insertions(+), 1 deletions(-) This will allow for handling of full paging rings in a more graceful manner. ATM, remove_page does not return negative values, so this code is not yet exercised. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 15da109b0c7d -r 8d67f4d5bafa xen/common/memory.c --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -226,6 +226,8 @@ static void decrease_reservation(struct for ( i = a->nr_done; i < a->nr_extents; i++ ) { + int one_remove_succeeded = 0; + if ( hypercall_preempt_check() ) { a->preempted = 1; @@ -255,8 +257,33 @@ static void decrease_reservation(struct continue; for ( j = 0; j < (1 << a->extent_order); j++ ) - if ( !guest_remove_page(a->domain, gmfn + j) ) + { + int rv = guest_remove_page(a->domain, gmfn + j); + /* negative rv is a result of a mem event ring full + * in the presence of paging. We preempt the hypercall */ + if ( rv < 0 ) + { + /* Indicate we''re done with this extent */ + if ((j+1) == (1 << a->extent_order)) + i++; + a->preempted = 1; + raise_softirq(SCHEDULE_SOFTIRQ); goto out; + } + /* rv of zero means we tried to remove a gfn with no backing + * mfn. It can be a bad argument, or, a continuation in the midst + * of an extent. Heuristically determine second case. */ + if ( !rv ) + { + /* Has to be first extent */ + if ( i != a->nr_done ) + goto out; + /* No previous remove in this extent must have succeeded */ + if ( one_remove_succeeded ) + goto out; + } else + one_remove_succeeded = 1; + } } out:
Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
xen/arch/x86/hvm/hvm.c | 20 ++- xen/arch/x86/mm/mem_event.c | 205 ++++++++++++++++++++++++++++++--------- xen/arch/x86/mm/mem_sharing.c | 27 +++- xen/arch/x86/mm/p2m.c | 104 ++++++++++--------- xen/common/memory.c | 7 +- xen/include/asm-x86/mem_event.h | 16 ++- xen/include/asm-x86/p2m.h | 6 +- xen/include/xen/mm.h | 2 + xen/include/xen/sched.h | 5 +- 9 files changed, 268 insertions(+), 124 deletions(-) The memevent code currently has a mechanism for reserving space in the ring before putting an event, but each caller must individually ensure that the vCPUs are correctly paused if no space is available. This fixes that issue by reversing the semantics: we ensure that enough space is always left for one event per vCPU in the ring. If, after putting the current request, this constraint will be violated by the current vCPU when putting putting another request in the ring, we pause the vCPU. For mem events caused by foreign vcpus, we simply drop the event if the space constraint will be violated. Foreign vcpus are expected to retry mapping operations (and thus the associated paging populate mem events will be re issued). Finally, guests can cause an unbound number of paging drop events via the balloon -> decrease_reservation hypercall. Thus, notify the hypercall there is no more space in the ring so a continuation is scheduled. With all these mechanisms, no guest events are lost and there is no need for wait-queues. Our testing includes 1. ballooning down 512 MiBs; 2. a mem event mode in which every page access in a four-vCPU guest results in an event, with no vCPU pausing, and the four vCPUs touching all RAM. No guest events were lost in either case, and qemu-dm had no mapping problems. Additionally, we ensure that no events are lost by Xen as a consumer if multiple responses land on the ring for a single domctl. While the current domctl-based notifications to Xen disallow batching, this is required for later patches. Finally, mark_and_pause_vcpus was misleading, beacause it wasn''t strictly pausing the vcpu''s, rather sending them to sleep. Change to actual vcpu_pause, which protects wakeups via an atomic count. This is useful when an event pauses a vcpu and the vcpu gets mark and paused due to ring congestion. Signed-off-by: Adin Scannell <adin@scannell.ca> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 8d67f4d5bafa -r 43dc614d543c xen/arch/x86/hvm/hvm.c --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -4108,7 +4108,6 @@ static int hvm_memory_event_traps(long p struct vcpu* v = current; struct domain *d = v->domain; mem_event_request_t req; - int rc; if ( !(p & HVMPME_MODE_MASK) ) return 0; @@ -4116,10 +4115,6 @@ static int hvm_memory_event_traps(long p if ( (p & HVMPME_onchangeonly) && (value == old) ) return 1; - rc = mem_event_check_ring(d, &d->mem_access); - if ( rc ) - return rc; - memset(&req, 0, sizeof(req)); req.type = MEM_EVENT_TYPE_ACCESS; req.reason = reason; @@ -4139,7 +4134,20 @@ static int hvm_memory_event_traps(long p req.gla_valid = 1; } - mem_event_put_request(d, &d->mem_access, &req); + if ( mem_event_put_request(d, &d->mem_access, &req) == -ENOSYS ) + { + /* rc == -ENOSYS + * If there was no ring to send the event, then simply unpause the + * vcpu (if we had previously paused it). 
It will retry the + * instruction (with the exception "handled") and go on */ + if ( req.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) + vcpu_unpause(v); + /* rc == -EBUSY + * If the ring is full, the vcpu has been marked and paused, + * and the exception has been communicated to the consumer. + * Once there is room in the ring, the vcpu will be woken up + * and will retry. */ + } return 1; } diff -r 8d67f4d5bafa -r 43dc614d543c xen/arch/x86/mm/mem_event.c --- a/xen/arch/x86/mm/mem_event.c +++ b/xen/arch/x86/mm/mem_event.c @@ -91,6 +91,9 @@ static int mem_event_enable(struct domai put_gfn(dom_mem_event, ring_gfn); put_gfn(dom_mem_event, shared_gfn); + /* Set the number of currently blocked vCPUs to 0. */ + med->blocked = 0; + /* Allocate event channel */ rc = alloc_unbound_xen_event_channel(d->vcpu[0], current->domain->domain_id); @@ -108,7 +111,7 @@ static int mem_event_enable(struct domai mem_event_ring_lock_init(med); /* Wake any VCPUs paused for memory events */ - mem_event_unpause_vcpus(d); + mem_event_unpause_vcpus(d, med); return 0; @@ -122,8 +125,57 @@ static int mem_event_enable(struct domai return rc; } -static int mem_event_disable(struct mem_event_domain *med) +static void __mem_event_unpause_vcpus(struct domain *d, + struct mem_event_domain *med, int free) { + struct vcpu *v; + int i, j, k; + int online = d->max_vcpus; + + /* + * We ensure that we only have vCPUs online if there are enough free slots + * for their memory events to be processed. This will ensure that no + * memory events are lost (due to the fact that certain types of events + * cannot be replayed, we need to ensure that there is space in the ring + * for when they are hit). + * See large comment above in mem_event_put_request(). + */ + for_each_vcpu ( d, v ) + if ( test_bit(_VPF_mem_event, &v->pause_flags) ) + online--; + + /* We remember which vcpu last woke up to avoid scanning always linearly + * from zero and starving higher-numbered vcpus under high load */ + if ( d->vcpu ) + { + for (i = med->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++) + { + k = i % d->max_vcpus; + v = d->vcpu[k]; + if ( !v ) continue; + + if ( !(med->blocked) || online >= free ) + break; + if ( test_and_clear_bit(_VPF_mem_event, &v->pause_flags) ) + { + vcpu_unpause(v); + online++; + med->blocked--; + med->last_vcpu_wake_up = k; + } + } + } +} + +static int mem_event_disable(struct domain *d, struct mem_event_domain *med) +{ + if ( !med ) + return 0; + + /* Wake up everybody, regardless */ + med->last_vcpu_wake_up = 0; + __mem_event_unpause_vcpus(d, med, d->max_vcpus); + unmap_domain_page(med->ring_page); med->ring_page = NULL; @@ -133,31 +185,95 @@ static int mem_event_disable(struct mem_ return 0; } -void mem_event_put_request(struct domain *d, struct mem_event_domain *med, mem_event_request_t *req) +static inline int mem_event_ring_free(struct domain *d, struct mem_event_domain *med) { + int free_requests; + + free_requests = RING_FREE_REQUESTS(&med->front_ring); + if ( unlikely(free_requests < d->max_vcpus) ) + { + /* This may happen during normal operation (hopefully not often). */ + gdprintk(XENLOG_INFO, "mem_event request slots for domain %d: %d\n", + d->domain_id, free_requests); + } + + return free_requests; +} + +/* Return values + * zero: success + * -ENOSYS: no ring + * -EAGAIN: ring is full and the event has not been transmitted. 
+ * Only foreign vcpu''s get EAGAIN + * -EBUSY: guest vcpu has been paused due to ring congestion + */ +int mem_event_put_request(struct domain *d, struct mem_event_domain *med, mem_event_request_t *req) +{ + int ret = 0; + int foreign = (d != current->domain); mem_event_front_ring_t *front_ring; RING_IDX req_prod; + if ( mem_event_check_ring(d, med) ) + return -ENOSYS; + mem_event_ring_lock(med); - front_ring = &med->front_ring; - req_prod = front_ring->req_prod_pvt; + if ( foreign && + (mem_event_ring_free(d, med) <= (d->max_vcpus - med->blocked)) ) + { + /* This is an event caused by a foreign mapper. Putting the event + * in the ring could preclude a guest vcpu from putting an event. + * The foreign mapper has to retry. */ + gdprintk(XENLOG_DEBUG, "Foreign-caused (%u) event will not be put" + " in ring with %d slots %d sleepers", current->domain->domain_id, + mem_event_ring_free(d, med), med->blocked); + mem_event_ring_unlock(med); + return -EAGAIN; + } - /* Copy request */ - memcpy(RING_GET_REQUEST(front_ring, req_prod), req, sizeof(*req)); - req_prod++; + if ( mem_event_ring_free(d, med) == 0 ) + { + /* This should *never* happen */ + gdprintk(XENLOG_WARNING, "possible lost mem_event for domain %d\n", + d->domain_id); + ret = -EBUSY; + } + else + { + front_ring = &med->front_ring; + req_prod = front_ring->req_prod_pvt; - /* Update ring */ - med->req_producers--; - front_ring->req_prod_pvt = req_prod; - RING_PUSH_REQUESTS(front_ring); + /* Copy request */ + memcpy(RING_GET_REQUEST(front_ring, req_prod), req, sizeof(*req)); + req_prod++; + + /* Update ring */ + front_ring->req_prod_pvt = req_prod; + RING_PUSH_REQUESTS(front_ring); + } + + /* + * We ensure that each vcpu can put at least *one* event -- because some + * events are not repeatable, such as dropping a page. This will ensure no + * vCPU is left with an event that they must place on the ring, but cannot. + * They will be paused after the event is placed. + * See large comment below in mem_event_unpause_vcpus(). 
+ */ + if ( !foreign && mem_event_ring_free(d, med) < d->max_vcpus ) + { + mem_event_mark_and_pause(current, med); + ret = -EBUSY; + } mem_event_ring_unlock(med); notify_via_xen_event_channel(d, med->xen_port); + + return ret; } -void mem_event_get_response(struct mem_event_domain *med, mem_event_response_t *rsp) +int mem_event_get_response(struct domain *d, struct mem_event_domain *med, mem_event_response_t *rsp) { mem_event_front_ring_t *front_ring; RING_IDX rsp_cons; @@ -167,6 +283,12 @@ void mem_event_get_response(struct mem_e front_ring = &med->front_ring; rsp_cons = front_ring->rsp_cons; + if ( !RING_HAS_UNCONSUMED_RESPONSES(front_ring) ) + { + mem_event_ring_unlock(med); + return 0; + } + /* Copy response */ memcpy(rsp, RING_GET_RESPONSE(front_ring, rsp_cons), sizeof(*rsp)); rsp_cons++; @@ -176,54 +298,35 @@ void mem_event_get_response(struct mem_e front_ring->sring->rsp_event = rsp_cons + 1; mem_event_ring_unlock(med); + + return 1; } -void mem_event_unpause_vcpus(struct domain *d) +void mem_event_unpause_vcpus(struct domain *d, struct mem_event_domain *med) { - struct vcpu *v; + mem_event_ring_lock(med); - for_each_vcpu ( d, v ) - if ( test_and_clear_bit(_VPF_mem_event, &v->pause_flags) ) - vcpu_wake(v); + if ( med->blocked ) + __mem_event_unpause_vcpus(d, med, mem_event_ring_free(d, med)); + + mem_event_ring_unlock(med); } -void mem_event_mark_and_pause(struct vcpu *v) +void mem_event_mark_and_pause(struct vcpu *v, struct mem_event_domain *med) { - set_bit(_VPF_mem_event, &v->pause_flags); - vcpu_sleep_nosync(v); -} - -void mem_event_put_req_producers(struct mem_event_domain *med) -{ - mem_event_ring_lock(med); - med->req_producers--; - mem_event_ring_unlock(med); + if ( !test_and_set_bit(_VPF_mem_event, &v->pause_flags) ) + { + vcpu_pause_nosync(v); + med->blocked++; + } } int mem_event_check_ring(struct domain *d, struct mem_event_domain *med) { - struct vcpu *curr = current; - int free_requests; - int ring_full = 1; + if ( !med->ring_page ) + return -ENOSYS; - if ( !med->ring_page ) - return -1; - - mem_event_ring_lock(med); - - free_requests = RING_FREE_REQUESTS(&med->front_ring); - if ( med->req_producers < free_requests ) - { - med->req_producers++; - ring_full = 0; - } - - if ( ring_full && (curr->domain == d) ) - mem_event_mark_and_pause(curr); - - mem_event_ring_unlock(med); - - return ring_full; + return 0; } int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec, @@ -294,7 +397,7 @@ int mem_event_domctl(struct domain *d, x case XEN_DOMCTL_MEM_EVENT_OP_PAGING_DISABLE: { if ( med->ring_page ) - rc = mem_event_disable(med); + rc = mem_event_disable(d, med); } break; @@ -333,7 +436,7 @@ int mem_event_domctl(struct domain *d, x case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE: { if ( med->ring_page ) - rc = mem_event_disable(&d->mem_access); + rc = mem_event_disable(d, &d->mem_access); } break; diff -r 8d67f4d5bafa -r 43dc614d543c xen/arch/x86/mm/mem_sharing.c --- a/xen/arch/x86/mm/mem_sharing.c +++ b/xen/arch/x86/mm/mem_sharing.c @@ -281,12 +281,19 @@ static struct page_info* mem_sharing_all vcpu_pause_nosync(v); req.flags |= MEM_EVENT_FLAG_VCPU_PAUSED; - if(mem_event_check_ring(d, &d->mem_share)) return page; - req.gfn = gfn; req.p2mt = p2m_ram_shared; req.vcpu_id = v->vcpu_id; - mem_event_put_request(d, &d->mem_share, &req); + + /* If there is no ring, and we''re out of memory, then + * there is no way out. 
*/ + if ( mem_event_put_request(d, &d->mem_share, &req) == -ENOSYS ) + { + gdprintk(XENLOG_ERR, + "Failed alloc on unshare path for %hu and no ring " + "to upcall\n", d->domain_id); + domain_crash(d); + } return page; } @@ -300,12 +307,16 @@ int mem_sharing_sharing_resume(struct do { mem_event_response_t rsp; - /* Get request off the ring */ - mem_event_get_response(&d->mem_share, &rsp); + /* Get all requests off the ring */ + while ( mem_event_get_response(d, &d->mem_share, &rsp) ) + { + /* Unpause domain/vcpu */ + if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) + vcpu_unpause(d->vcpu[rsp.vcpu_id]); + } - /* Unpause domain/vcpu */ - if( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) - vcpu_unpause(d->vcpu[rsp.vcpu_id]); + /* Unpause any domains that were paused because the ring was full */ + mem_event_unpause_vcpus(d, &d->mem_paging); return 0; } diff -r 8d67f4d5bafa -r 43dc614d543c xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -871,13 +871,15 @@ int p2m_mem_paging_evict(struct domain * * p2m_mem_paging_drop_page() will notify the pager that a paged-out gfn was * released by the guest. The pager is supposed to drop its reference of the * gfn. + * Returns zero on success, EAGAIN for foreign mappers and EBUSY for guest + * vcpus if the ring is congested. */ -void p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn) +int p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn) { struct vcpu *v = current; mem_event_request_t req; - /* Check that there''s space on the ring for this request */ + /* Check that there''s a paging listener who cares about this */ if ( mem_event_check_ring(d, &d->mem_paging) == 0) { /* Send release notification to pager */ @@ -886,8 +888,11 @@ void p2m_mem_paging_drop_page(struct dom req.gfn = gfn; req.vcpu_id = v->vcpu_id; - mem_event_put_request(d, &d->mem_paging, &req); + /* rc cannot be ENOSYS, as we checked before */ + return mem_event_put_request(d, &d->mem_paging, &req); } + + return 0; } /** @@ -920,9 +925,14 @@ void p2m_mem_paging_populate(struct doma mfn_t mfn; struct p2m_domain *p2m = p2m_get_hostp2m(d); - /* Check that there''s space on the ring for this request */ + /* We''re paging. There should be a ring */ if ( mem_event_check_ring(d, &d->mem_paging) ) + { + gdprintk(XENLOG_ERR, "Domain %hu paging gfn %lx yet no ring " + "in place\n", d->domain_id, gfn); + domain_crash(d); return; + } memset(&req, 0, sizeof(req)); req.type = MEM_EVENT_TYPE_PAGING; @@ -951,7 +961,6 @@ void p2m_mem_paging_populate(struct doma else if ( p2mt != p2m_ram_paging_out && p2mt != p2m_ram_paged ) { /* gfn is already on its way back and vcpu is not paused */ - mem_event_put_req_producers(&d->mem_paging); return; } @@ -960,7 +969,10 @@ void p2m_mem_paging_populate(struct doma req.p2mt = p2mt; req.vcpu_id = v->vcpu_id; - mem_event_put_request(d, &d->mem_paging, &req); + (void)mem_event_put_request(d, &d->mem_paging, &req); + /* If the ring is full, foreign mappers will retry, and guest + * vcpus remain paused. Guest vcpu will wake up and retry once + * the consumer has made some space in the ring. 
*/ } /** @@ -1084,33 +1096,34 @@ void p2m_mem_paging_resume(struct domain p2m_access_t a; mfn_t mfn; - /* Pull the response off the ring */ - mem_event_get_response(&d->mem_paging, &rsp); + /* Pull all responses off the ring */ + while( mem_event_get_response(d, &d->mem_paging, &rsp) ) + { + /* Fix p2m entry if the page was not dropped */ + if ( !(rsp.flags & MEM_EVENT_FLAG_DROP_PAGE) ) + { + p2m_lock(p2m); + mfn = p2m->get_entry(p2m, rsp.gfn, &p2mt, &a, p2m_query, NULL); + /* Allow only pages which were prepared properly, or pages which + * were nominated but not evicted */ + if ( mfn_valid(mfn) && + (p2mt == p2m_ram_paging_in || p2mt == p2m_ram_paging_in_start) ) + { + set_p2m_entry(p2m, rsp.gfn, mfn, PAGE_ORDER_4K, + paging_mode_log_dirty(d) ? p2m_ram_logdirty : + p2m_ram_rw, a); + set_gpfn_from_mfn(mfn_x(mfn), rsp.gfn); + } + p2m_unlock(p2m); + } - /* Fix p2m entry if the page was not dropped */ - if ( !(rsp.flags & MEM_EVENT_FLAG_DROP_PAGE) ) - { - p2m_lock(p2m); - mfn = p2m->get_entry(p2m, rsp.gfn, &p2mt, &a, p2m_query, NULL); - /* Allow only pages which were prepared properly, or pages which - * were nominated but not evicted */ - if ( mfn_valid(mfn) && - (p2mt == p2m_ram_paging_in || p2mt == p2m_ram_paging_in_start) ) - { - set_p2m_entry(p2m, rsp.gfn, mfn, PAGE_ORDER_4K, - paging_mode_log_dirty(d) ? p2m_ram_logdirty : p2m_ram_rw, - a); - set_gpfn_from_mfn(mfn_x(mfn), rsp.gfn); - } - p2m_unlock(p2m); + /* Unpause domain */ + if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) + vcpu_unpause(d->vcpu[rsp.vcpu_id]); } - /* Unpause domain */ - if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) - vcpu_unpause(d->vcpu[rsp.vcpu_id]); - /* Unpause any domains that were paused because the ring was full */ - mem_event_unpause_vcpus(d); + mem_event_unpause_vcpus(d, &d->mem_paging); } void p2m_mem_access_check(unsigned long gpa, bool_t gla_valid, unsigned long gla, @@ -1121,7 +1134,6 @@ void p2m_mem_access_check(unsigned long unsigned long gfn = gpa >> PAGE_SHIFT; struct domain *d = v->domain; struct p2m_domain* p2m = p2m_get_hostp2m(d); - int res; mfn_t mfn; p2m_type_t p2mt; p2m_access_t p2ma; @@ -1139,17 +1151,14 @@ void p2m_mem_access_check(unsigned long p2m_unlock(p2m); /* Otherwise, check if there is a memory event listener, and send the message along */ - res = mem_event_check_ring(d, &d->mem_access); - if ( res < 0 ) + if ( mem_event_check_ring(d, &d->mem_access) == -ENOSYS ) { /* No listener */ if ( p2m->access_required ) { - printk(XENLOG_INFO - "Memory access permissions failure, no mem_event listener: pausing VCPU %d, dom %d\n", - v->vcpu_id, d->domain_id); - - mem_event_mark_and_pause(v); + gdprintk(XENLOG_INFO, "Memory access permissions failure, no mem_event " + "listener VCPU %d, dom %d\n", v->vcpu_id, d->domain_id); + domain_crash(v->domain); } else { @@ -1161,8 +1170,6 @@ void p2m_mem_access_check(unsigned long return; } - else if ( res > 0 ) - return; /* No space in buffer; VCPU paused */ memset(&req, 0, sizeof(req)); req.type = MEM_EVENT_TYPE_ACCESS; @@ -1183,9 +1190,8 @@ void p2m_mem_access_check(unsigned long req.vcpu_id = v->vcpu_id; - mem_event_put_request(d, &d->mem_access, &req); - - /* VCPU paused, mem event request sent */ + (void)mem_event_put_request(d, &d->mem_access, &req); + /* VCPU paused */ } void p2m_mem_access_resume(struct p2m_domain *p2m) @@ -1193,15 +1199,17 @@ void p2m_mem_access_resume(struct p2m_do struct domain *d = p2m->domain; mem_event_response_t rsp; - mem_event_get_response(&d->mem_access, &rsp); - - /* Unpause domain */ - if ( rsp.flags & 
MEM_EVENT_FLAG_VCPU_PAUSED ) - vcpu_unpause(d->vcpu[rsp.vcpu_id]); + /* Pull all responses off the ring */ + while( mem_event_get_response(d, &d->mem_access, &rsp) ) + { + /* Unpause domain */ + if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED ) + vcpu_unpause(d->vcpu[rsp.vcpu_id]); + } /* Unpause any domains that were paused because the ring was full or no listener * was available */ - mem_event_unpause_vcpus(d); + mem_event_unpause_vcpus(d, &d->mem_access); } diff -r 8d67f4d5bafa -r 43dc614d543c xen/common/memory.c --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -165,10 +165,13 @@ int guest_remove_page(struct domain *d, mfn = mfn_x(get_gfn(d, gmfn, &p2mt)); if ( unlikely(p2m_is_paging(p2mt)) ) { + int rc; guest_physmap_remove_page(d, gmfn, mfn, 0); - p2m_mem_paging_drop_page(d, gmfn); + rc = p2m_mem_paging_drop_page(d, gmfn); put_gfn(d, gmfn); - return 1; + /* drop_page returns zero on success, we return 1 on success. + * drop_page returns negative on error, never returns 1. */ + return rc ? rc : 1; } #else mfn = gmfn_to_mfn(d, gmfn); diff -r 8d67f4d5bafa -r 43dc614d543c xen/include/asm-x86/mem_event.h --- a/xen/include/asm-x86/mem_event.h +++ b/xen/include/asm-x86/mem_event.h @@ -25,12 +25,18 @@ #define __MEM_EVENT_H__ /* Pauses VCPU while marking pause flag for mem event */ -void mem_event_mark_and_pause(struct vcpu *v); +void mem_event_mark_and_pause(struct vcpu *v, struct mem_event_domain *med); int mem_event_check_ring(struct domain *d, struct mem_event_domain *med); -void mem_event_put_req_producers(struct mem_event_domain *med); -void mem_event_put_request(struct domain *d, struct mem_event_domain *med, mem_event_request_t *req); -void mem_event_get_response(struct mem_event_domain *med, mem_event_response_t *rsp); -void mem_event_unpause_vcpus(struct domain *d); +void mem_event_unpause_vcpus(struct domain *d, struct mem_event_domain *med); + +/* Returns 0 on success, -ENOSYS if there is no ring, -EBUSY if there is + * a ring or no space. FOr success or -EBUSY. the vCPU is left blocked to + * ensure that the ring does not lose events. In general, put_request should + * not fail, as it attempts to ensure that each vCPU can put at least one req. 
*/ +int mem_event_put_request(struct domain *d, struct mem_event_domain *med, + mem_event_request_t *req); +int mem_event_get_response(struct domain *d, struct mem_event_domain *med, + mem_event_response_t *rsp); int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec, XEN_GUEST_HANDLE(void) u_domctl); diff -r 8d67f4d5bafa -r 43dc614d543c xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -475,7 +475,7 @@ int p2m_mem_paging_nominate(struct domai /* Evict a frame */ int p2m_mem_paging_evict(struct domain *d, unsigned long gfn); /* Tell xenpaging to drop a paged out frame */ -void p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn); +int p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn); /* Start populating a paged out frame */ void p2m_mem_paging_populate(struct domain *d, unsigned long gfn); /* Prepare the p2m for paging a frame in */ @@ -483,8 +483,8 @@ int p2m_mem_paging_prep(struct domain *d /* Resume normal operation (in case a domain was paused) */ void p2m_mem_paging_resume(struct domain *d); #else -static inline void p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn) -{ } +static inline int p2m_mem_paging_drop_page(struct domain *d, unsigned long gfn) +{ return 0; } static inline void p2m_mem_paging_populate(struct domain *d, unsigned long gfn) { } #endif diff -r 8d67f4d5bafa -r 43dc614d543c xen/include/xen/mm.h --- a/xen/include/xen/mm.h +++ b/xen/include/xen/mm.h @@ -318,6 +318,8 @@ page_list_splice(struct page_list_head * void scrub_one_page(struct page_info *); +/* Returns 1 on success, 0 on error, negative if the ring + * for event propagation is full in the presence of paging */ int guest_remove_page(struct domain *d, unsigned long gmfn); #define RAM_TYPE_CONVENTIONAL 0x00000001 diff -r 8d67f4d5bafa -r 43dc614d543c xen/include/xen/sched.h --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -183,13 +183,16 @@ struct mem_event_domain { /* ring lock */ spinlock_t ring_lock; - unsigned int req_producers; /* shared page */ mem_event_shared_page_t *shared_page; /* shared ring page */ void *ring_page; /* front-end ring */ mem_event_front_ring_t front_ring; + /* the number of vCPUs blocked */ + unsigned int blocked; + /* The last vcpu woken up */ + unsigned int last_vcpu_wake_up; /* event channel port (vcpu0 only) */ int xen_port; };
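The core of the ring-space policy in the patch above can be condensed into a
small model. This is an illustration only, with simplified names, no locking
and no actual ring manipulation; it is not code from the patch:

#include <errno.h>

/*
 * Invariant the patch enforces: keep at least one free request slot per
 * vCPU, so that a guest vCPU which must report a non-repeatable event
 * (e.g. a dropped page) can always do so.
 */
int put_request_policy(int free_slots, int max_vcpus, int blocked_vcpus,
                       int is_foreign)
{
    if ( is_foreign )
    {
        /* Foreign mappers may not eat into the reserve of one slot per
         * not-yet-blocked vCPU; they retry the mapping later, which
         * re-issues the event. */
        if ( free_slots <= (max_vcpus - blocked_vcpus) )
            return -EAGAIN;   /* event dropped, mapper retries */
        return 0;             /* event placed */
    }

    /* Guest vCPU: the event is placed (the patch treats a completely full
     * ring as a should-never-happen case and only logs it). If placing it
     * leaves fewer free slots than vCPUs, pause this vCPU until the
     * consumer drains the ring and mem_event_unpause_vcpus() wakes it. */
    if ( (free_slots - 1) < max_vcpus )
        return -EBUSY;        /* event placed, vCPU paused */
    return 0;
}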
Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 3 of 5] Create a generic callback mechanism for Xen-bound event channels
xen/arch/ia64/vmx/vmx_init.c | 2 +- xen/arch/x86/hvm/hvm.c | 7 ++- xen/arch/x86/mm/mem_event.c | 3 +- xen/common/event_channel.c | 75 ++++++++++++++++++++++++++++++++++--------- xen/include/xen/event.h | 5 ++- xen/include/xen/sched.h | 2 +- 6 files changed, 70 insertions(+), 24 deletions(-) For event channels for which Xen is the consumer, there currently is a single action. With this patch, we allow event channel creators to specify a generic callback (or no callback). Because the expectation is that there will be few callbacks, they are stored in a small table. Signed-off-by: Adin Scannell <adin@scannell.ca> Signed-off-by: Keir Fraser <keir@xen.org> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 43dc614d543c -r 96ba0cb6e255 xen/arch/ia64/vmx/vmx_init.c --- a/xen/arch/ia64/vmx/vmx_init.c +++ b/xen/arch/ia64/vmx/vmx_init.c @@ -377,7 +377,7 @@ vmx_vcpu_initialise(struct vcpu *v) { struct vmx_ioreq_page *iorp = &v->domain->arch.hvm_domain.ioreq; - int rc = alloc_unbound_xen_event_channel(v, 0); + int rc = alloc_unbound_xen_event_channel(v, 0, NULL); if (rc < 0) return rc; v->arch.arch_vmx.xen_port = rc; diff -r 43dc614d543c -r 96ba0cb6e255 xen/arch/x86/hvm/hvm.c --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -979,7 +979,7 @@ int hvm_vcpu_initialise(struct vcpu *v) goto fail3; /* Create ioreq event channel. */ - rc = alloc_unbound_xen_event_channel(v, 0); + rc = alloc_unbound_xen_event_channel(v, 0, NULL); if ( rc < 0 ) goto fail4; @@ -989,7 +989,7 @@ int hvm_vcpu_initialise(struct vcpu *v) if ( v->vcpu_id == 0 ) { /* Create bufioreq event channel. */ - rc = alloc_unbound_xen_event_channel(v, 0); + rc = alloc_unbound_xen_event_channel(v, 0, NULL); if ( rc < 0 ) goto fail2; v->domain->arch.hvm_domain.params[HVM_PARAM_BUFIOREQ_EVTCHN] = rc; @@ -3551,7 +3551,8 @@ long do_hvm_op(unsigned long op, XEN_GUE for_each_vcpu ( d, v ) { int old_port, new_port; - new_port = alloc_unbound_xen_event_channel(v, a.value); + new_port = alloc_unbound_xen_event_channel( + v, a.value, NULL); if ( new_port < 0 ) { rc = new_port; diff -r 43dc614d543c -r 96ba0cb6e255 xen/arch/x86/mm/mem_event.c --- a/xen/arch/x86/mm/mem_event.c +++ b/xen/arch/x86/mm/mem_event.c @@ -96,7 +96,8 @@ static int mem_event_enable(struct domai /* Allocate event channel */ rc = alloc_unbound_xen_event_channel(d->vcpu[0], - current->domain->domain_id); + current->domain->domain_id, + NULL); if ( rc < 0 ) goto err; diff -r 43dc614d543c -r 96ba0cb6e255 xen/common/event_channel.c --- a/xen/common/event_channel.c +++ b/xen/common/event_channel.c @@ -57,6 +57,51 @@ goto out; \ } while ( 0 ) +#define consumer_is_xen(e) (!!(e)->xen_consumer) + +/* + * The function alloc_unbound_xen_event_channel() allows an arbitrary + * notifier function to be specified. However, very few unique functions + * are specified in practice, so to prevent bloating the evtchn structure + * with a pointer, we stash them dynamically in a small lookup array which + * can be indexed by a small integer. + */ +static xen_event_channel_notification_t xen_consumers[8]; + +/* Default notification action: wake up from wait_on_xen_event_channel(). */ +static void default_xen_notification_fn(struct vcpu *v, unsigned int port) +{ + /* Consumer needs notification only if blocked. */ + if ( test_and_clear_bit(_VPF_blocked_in_xen, &v->pause_flags) ) + vcpu_wake(v); +} + +/* + * Given a notification function, return the value to stash in + * the evtchn->xen_consumer field. 
+ */ +static uint8_t get_xen_consumer(xen_event_channel_notification_t fn) +{ + unsigned int i; + + if ( fn == NULL ) + fn = default_xen_notification_fn; + + for ( i = 0; i < ARRAY_SIZE(xen_consumers); i++ ) + { + if ( xen_consumers[i] == NULL ) + xen_consumers[i] = fn; + if ( xen_consumers[i] == fn ) + break; + } + + BUG_ON(i >= ARRAY_SIZE(xen_consumers)); + return i+1; +} + +/* Get the notification function for a given Xen-bound event channel. */ +#define xen_notification_fn(e) (xen_consumers[(e)->xen_consumer-1]) + static int evtchn_set_pending(struct vcpu *v, int port); static int virq_is_global(int virq) @@ -397,7 +442,7 @@ static long __evtchn_close(struct domain chn1 = evtchn_from_port(d1, port1); /* Guest cannot close a Xen-attached event channel. */ - if ( unlikely(chn1->consumer_is_xen) ) + if ( unlikely(consumer_is_xen(chn1)) ) { rc = -EINVAL; goto out; @@ -537,7 +582,7 @@ int evtchn_send(struct domain *d, unsign lchn = evtchn_from_port(ld, lport); /* Guest cannot send via a Xen-attached event channel. */ - if ( unlikely(lchn->consumer_is_xen) ) + if ( unlikely(consumer_is_xen(lchn)) ) { spin_unlock(&ld->event_lock); return -EINVAL; @@ -554,13 +599,8 @@ int evtchn_send(struct domain *d, unsign rport = lchn->u.interdomain.remote_port; rchn = evtchn_from_port(rd, rport); rvcpu = rd->vcpu[rchn->notify_vcpu_id]; - if ( rchn->consumer_is_xen ) - { - /* Xen consumers need notification only if they are blocked. */ - if ( test_and_clear_bit(_VPF_blocked_in_xen, - &rvcpu->pause_flags) ) - vcpu_wake(rvcpu); - } + if ( consumer_is_xen(rchn) ) + (*xen_notification_fn(rchn))(rvcpu, rport); else { evtchn_set_pending(rvcpu, rport); @@ -787,7 +827,7 @@ long evtchn_bind_vcpu(unsigned int port, chn = evtchn_from_port(d, port); /* Guest cannot re-bind a Xen-attached event channel. */ - if ( unlikely(chn->consumer_is_xen) ) + if ( unlikely(consumer_is_xen(chn)) ) { rc = -EINVAL; goto out; @@ -998,7 +1038,8 @@ long do_event_channel_op(int cmd, XEN_GU int alloc_unbound_xen_event_channel( - struct vcpu *local_vcpu, domid_t remote_domid) + struct vcpu *local_vcpu, domid_t remote_domid, + xen_event_channel_notification_t notification_fn) { struct evtchn *chn; struct domain *d = local_vcpu->domain; @@ -1011,7 +1052,7 @@ int alloc_unbound_xen_event_channel( chn = evtchn_from_port(d, port); chn->state = ECS_UNBOUND; - chn->consumer_is_xen = 1; + chn->xen_consumer = get_xen_consumer(notification_fn); chn->notify_vcpu_id = local_vcpu->vcpu_id; chn->u.unbound.remote_domid = remote_domid; @@ -1038,8 +1079,8 @@ void free_xen_event_channel( BUG_ON(!port_is_valid(d, port)); chn = evtchn_from_port(d, port); - BUG_ON(!chn->consumer_is_xen); - chn->consumer_is_xen = 0; + BUG_ON(!consumer_is_xen(chn)); + chn->xen_consumer = 0; spin_unlock(&d->event_lock); @@ -1063,7 +1104,7 @@ void notify_via_xen_event_channel(struct ASSERT(port_is_valid(ld, lport)); lchn = evtchn_from_port(ld, lport); - ASSERT(lchn->consumer_is_xen); + ASSERT(consumer_is_xen(lchn)); if ( likely(lchn->state == ECS_INTERDOMAIN) ) { @@ -1106,7 +1147,7 @@ void evtchn_destroy(struct domain *d) /* Close all existing event channels. 
*/ for ( i = 0; port_is_valid(d, i); i++ ) { - evtchn_from_port(d, i)->consumer_is_xen = 0; + evtchn_from_port(d, i)->xen_consumer = 0; (void)__evtchn_close(d, i); } @@ -1192,7 +1233,7 @@ static void domain_dump_evtchn_info(stru printk(" v=%d", chn->u.virq); break; } - printk(" x=%d\n", chn->consumer_is_xen); + printk(" x=%d\n", chn->xen_consumer); } spin_unlock(&d->event_lock); diff -r 43dc614d543c -r 96ba0cb6e255 xen/include/xen/event.h --- a/xen/include/xen/event.h +++ b/xen/include/xen/event.h @@ -51,8 +51,11 @@ int evtchn_unmask(unsigned int port); void evtchn_move_pirqs(struct vcpu *v); /* Allocate/free a Xen-attached event channel port. */ +typedef void (*xen_event_channel_notification_t)( + struct vcpu *v, unsigned int port); int alloc_unbound_xen_event_channel( - struct vcpu *local_vcpu, domid_t remote_domid); + struct vcpu *local_vcpu, domid_t remote_domid, + xen_event_channel_notification_t notification_fn); void free_xen_event_channel( struct vcpu *local_vcpu, int port); diff -r 43dc614d543c -r 96ba0cb6e255 xen/include/xen/sched.h --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -47,7 +47,7 @@ struct evtchn #define ECS_VIRQ 5 /* Channel is bound to a virtual IRQ line. */ #define ECS_IPI 6 /* Channel is bound to a virtual IPI line. */ u8 state; /* ECS_* */ - u8 consumer_is_xen; /* Consumed by Xen or by guest? */ + u8 xen_consumer; /* Consumer in Xen, if any? (0 = send to guest) */ u16 notify_vcpu_id; /* VCPU for local delivery notification */ union { struct {
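A minimal usage sketch of the new interface follows. The callback and its
helpers are hypothetical, for illustration only; patch 5 below adds the real
mem_event notification functions:

/* Hypothetical Xen-internal consumer of a Xen-bound event channel. */
static void my_notification_fn(struct vcpu *v, unsigned int port)
{
    /* Called from evtchn_send() when the remote domain kicks the port. */
    process_my_ring(v->domain);   /* hypothetical helper */
}

static int my_consumer_init(struct domain *d, domid_t remote_domid)
{
    /* Bind an unbound Xen-consumed port and attach the callback. Passing
     * NULL instead keeps the previous behaviour of waking a vCPU blocked
     * in wait_on_xen_event_channel(). */
    int port = alloc_unbound_xen_event_channel(d->vcpu[0], remote_domid,
                                               my_notification_fn);
    if ( port < 0 )
        return port;
    /* ... advertise the port to the remote domain ... */
    return 0;
}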
Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 4 of 5] Make the prototype of p2m_mem_access_resume consistent
xen/arch/x86/mm/mem_access.c | 3 +-- xen/arch/x86/mm/p2m.c | 3 +-- xen/include/asm-x86/p2m.h | 2 +- 3 files changed, 3 insertions(+), 5 deletions(-) Signed-off-by: Adin Scannell <adin@scannell.ca> Signed-off-by: Keir Fraser <keir@xen.org> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 96ba0cb6e255 -r c2b5e65331ee xen/arch/x86/mm/mem_access.c --- a/xen/arch/x86/mm/mem_access.c +++ b/xen/arch/x86/mm/mem_access.c @@ -29,13 +29,12 @@ int mem_access_domctl(struct domain *d, XEN_GUEST_HANDLE(void) u_domctl) { int rc; - struct p2m_domain *p2m = p2m_get_hostp2m(d); switch( mec->op ) { case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_RESUME: { - p2m_mem_access_resume(p2m); + p2m_mem_access_resume(d); rc = 0; } break; diff -r 96ba0cb6e255 -r c2b5e65331ee xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -1194,9 +1194,8 @@ void p2m_mem_access_check(unsigned long /* VCPU paused */ } -void p2m_mem_access_resume(struct p2m_domain *p2m) +void p2m_mem_access_resume(struct domain *d) { - struct domain *d = p2m->domain; mem_event_response_t rsp; /* Pull all responses off the ring */ diff -r 96ba0cb6e255 -r c2b5e65331ee xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -495,7 +495,7 @@ static inline void p2m_mem_paging_popula void p2m_mem_access_check(unsigned long gpa, bool_t gla_valid, unsigned long gla, bool_t access_r, bool_t access_w, bool_t access_x); /* Resumes the running of the VCPU, restarting the last instruction */ -void p2m_mem_access_resume(struct p2m_domain *p2m); +void p2m_mem_access_resume(struct domain *d); /* Set access type for a region of pfns. * If start_pfn == -1ul, sets the default access type */
Andres Lagar-Cavilla
2011-Nov-29 21:55 UTC
[PATCH 5 of 5] Allow memevent responses to be signaled via the event channel
xen/arch/x86/mm/mem_event.c | 26 ++++++++++++++++++++------ 1 files changed, 20 insertions(+), 6 deletions(-) Don''t require a separate domctl to notify the memevent interface that an event has occured. This domctl can be taxing, particularly when you are scaling events and paging to many domains across a single system. Instead, we use the existing event channel to signal when we place something in the ring (as per normal ring operation). Signed-off-by: Adin Scannell <adin@scannell.ca> Signed-off-by: Keir Fraser <keir@xen.org> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r c2b5e65331ee -r 29701f5bdd84 xen/arch/x86/mm/mem_event.c --- a/xen/arch/x86/mm/mem_event.c +++ b/xen/arch/x86/mm/mem_event.c @@ -37,9 +37,11 @@ #define mem_event_ring_lock(_med) spin_lock(&(_med)->ring_lock) #define mem_event_ring_unlock(_med) spin_unlock(&(_med)->ring_lock) -static int mem_event_enable(struct domain *d, - xen_domctl_mem_event_op_t *mec, - struct mem_event_domain *med) +static int mem_event_enable( + struct domain *d, + xen_domctl_mem_event_op_t *mec, + struct mem_event_domain *med, + xen_event_channel_notification_t notification_fn) { int rc; struct domain *dom_mem_event = current->domain; @@ -97,7 +99,7 @@ static int mem_event_enable(struct domai /* Allocate event channel */ rc = alloc_unbound_xen_event_channel(d->vcpu[0], current->domain->domain_id, - NULL); + notification_fn); if ( rc < 0 ) goto err; @@ -330,6 +332,18 @@ int mem_event_check_ring(struct domain * return 0; } +/* Registered with Xen-bound event channel for incoming notifications. */ +static void mem_paging_notification(struct vcpu *v, unsigned int port) +{ + p2m_mem_paging_resume(v->domain); +} + +/* Registered with Xen-bound event channel for incoming notifications. */ +static void mem_access_notification(struct vcpu *v, unsigned int port) +{ + p2m_mem_access_resume(v->domain); +} + int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec, XEN_GUEST_HANDLE(void) u_domctl) { @@ -391,7 +405,7 @@ int mem_event_domctl(struct domain *d, x if ( p2m->pod.entry_count ) break; - rc = mem_event_enable(d, mec, med); + rc = mem_event_enable(d, mec, med, mem_paging_notification); } break; @@ -430,7 +444,7 @@ int mem_event_domctl(struct domain *d, x if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ) break; - rc = mem_event_enable(d, mec, med); + rc = mem_event_enable(d, mec, med, mem_access_notification); } break;
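On the user-space side, the consumer can now batch its responses and resume
Xen with a single event channel kick rather than one resume domctl per
event. A rough sketch along the lines of the xenpaging tool follows; it
assumes xce is an open xc_evtchn handle already bound to the ring's port,
get_request()/put_response() stand for the usual local ring helpers, and
handle_request() is a hypothetical policy function:

    for ( ; ; )
    {
        mem_event_request_t req;
        mem_event_response_t rsp;
        int port;

        /* Block until Xen signals that new requests are on the ring. */
        port = xc_evtchn_pending(xce);
        if ( port < 0 )
            break;
        xc_evtchn_unmask(xce, port);

        /* Drain every pending request, queueing responses on the ring. */
        while ( get_request(&mem_event, &req) )
        {
            handle_request(&req);            /* pager/access policy work */
            memset(&rsp, 0, sizeof(rsp));
            rsp.gfn     = req.gfn;
            rsp.vcpu_id = req.vcpu_id;
            rsp.flags   = req.flags;
            put_response(&mem_event, &rsp);
        }

        /* A single kick lets Xen consume the whole batch of responses;
         * no XEN_DOMCTL_MEM_EVENT_OP_*_RESUME call per event is needed. */
        xc_evtchn_notify(xce, port);
    }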
Olaf Hering
2011-Nov-30 12:58 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
On Tue, Nov 29, Andres Lagar-Cavilla wrote:

> xen/arch/x86/hvm/hvm.c          |   20 ++-
> xen/arch/x86/mm/mem_event.c     |  205 ++++++++++++++++++++++++++++++---------
> xen/arch/x86/mm/mem_sharing.c   |   27 +++-
> xen/arch/x86/mm/p2m.c           |  104 ++++++++++---------
> xen/common/memory.c             |    7 +-
> xen/include/asm-x86/mem_event.h |   16 ++-
> xen/include/asm-x86/p2m.h       |    6 +-
> xen/include/xen/mm.h            |    2 +
> xen/include/xen/sched.h         |    5 +-
> 9 files changed, 268 insertions(+), 124 deletions(-)
>
> The memevent code currently has a mechanism for reserving space in the ring
> before putting an event, but each caller must individually ensure that the
> vCPUs are correctly paused if no space is available.

I have an improved patch which uses wait queues in mem_event_put_request()
and also the new wake_up_nr(). Using pause here and wait queues in get_gfn
does not mix well, AFAICS. My wait queue patch for get_gfn is not yet
finished.

I propose to use wait queues for both mem_event and get_gfn.

Olaf
Andres Lagar-Cavilla
2011-Nov-30 15:11 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
> On Tue, Nov 29, Andres Lagar-Cavilla wrote:
>
>> xen/arch/x86/hvm/hvm.c          |   20 ++-
>> xen/arch/x86/mm/mem_event.c     |  205 ++++++++++++++++++++++++++++++---------
>> xen/arch/x86/mm/mem_sharing.c   |   27 +++-
>> xen/arch/x86/mm/p2m.c           |  104 ++++++++++---------
>> xen/common/memory.c             |    7 +-
>> xen/include/asm-x86/mem_event.h |   16 ++-
>> xen/include/asm-x86/p2m.h       |    6 +-
>> xen/include/xen/mm.h            |    2 +
>> xen/include/xen/sched.h         |    5 +-
>> 9 files changed, 268 insertions(+), 124 deletions(-)
>>
>> The memevent code currently has a mechanism for reserving space in the
>> ring before putting an event, but each caller must individually ensure
>> that the vCPUs are correctly paused if no space is available.
>
> I have an improved patch which uses wait queues in
> mem_event_put_request() and also the new wake_up_nr(). Using pause here
> and wait queues in get_gfn does not mix well, AFAICS. My wait queue patch
> for get_gfn is not yet finished.
>
> I propose to use wait queues for both mem_event and get_gfn.

Well, given the patch I submitted, my position ought to be clear :) I have
four cents to add:

- Our patch works. We're blasting the ring with multi-vcpu events. Nothing
  is ever lost; no vcpu is left blocked and forgotten.

- This isn't a problem that necessitates wait queues to solve; just
  careful logic.

- I am not sure what your concerns about the mix are. get_gfn* would call
  populate on a paged-out gfn, and then go to sleep if it's a guest vcpu.
  With our patch, the guest vcpu event is guaranteed to go in the ring.
  vcpu pausing will stack (and unwind) properly.

- This patch does not preclude wait queues where they are absolutely
  needed. I think once wait queues are in, they'll be a welcome
  breakthrough for hypervisor code that just can't handle paged-out pages
  gracefully. I look forward to that.

Andres

> Olaf
Keir Fraser
2011-Dec-01 10:19 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
On 01/12/2011 18:10, "Tim Deegan" <tim@xen.org> wrote:

> At 16:55 -0500 on 29 Nov (1322585711), Andres Lagar-Cavilla wrote:
>> The memevent code currently has a mechanism for reserving space in the ring
>> before putting an event, but each caller must individually ensure that the
>> vCPUs are correctly paused if no space is available.
>>
>> This fixes that issue by reversing the semantics: we ensure that enough space
>> is always left for one event per vCPU in the ring. If, after putting the
>> current request, this constraint would be violated by the current vCPU
>> putting another request in the ring, we pause the vCPU.
>
> What about operations that touch more than one page of guest memory?
> (E.g., pagetable walks, emulated faults and task switches.) Can't they
> still fill up the ring?
>
> IIRC there are still cases where we need wait queues anyway (when we hit
> a paged-out page after a non-idempotent action has already been taken).
> Is the purpose of this change just to reduce the number of wait-queue
> uses, or do you think you can do without them entirely?

It's definitely not possible to do without them entirely. I'm pretty sure
Andres agrees at least that far.

 -- Keir

> Cheers,
>
> Tim.
Olaf Hering
2011-Dec-01 14:51 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
On Tue, Nov 29, Andres Lagar-Cavilla wrote:

> @@ -133,31 +185,95 @@ static int mem_event_disable(struct mem_
>      return 0;
>  }
>
> -void mem_event_put_request(struct domain *d, struct mem_event_domain *med, mem_event_request_t *req)
> +static inline int mem_event_ring_free(struct domain *d, struct mem_event_domain *med)
>  {
> +    int free_requests;
> +
> +    free_requests = RING_FREE_REQUESTS(&med->front_ring);
> +    if ( unlikely(free_requests < d->max_vcpus) )
> +    {
> +        /* This may happen during normal operation (hopefully not often). */
> +        gdprintk(XENLOG_INFO, "mem_event request slots for domain %d: %d\n",
> +                 d->domain_id, free_requests);
> +    }
> +
> +    return free_requests;
> +}
> +
> +/* Return values
> + * zero: success
> + * -ENOSYS: no ring
> + * -EAGAIN: ring is full and the event has not been transmitted.
> + *          Only foreign vcpu's get EAGAIN
> + * -EBUSY: guest vcpu has been paused due to ring congestion
> + */
> +int mem_event_put_request(struct domain *d, struct mem_event_domain *med, mem_event_request_t *req)
> +{
> +    int ret = 0;
> +    int foreign = (d != current->domain);

> +    /*
> +     * We ensure that each vcpu can put at least *one* event -- because some
> +     * events are not repeatable, such as dropping a page. This will ensure no
> +     * vCPU is left with an event that they must place on the ring, but cannot.
> +     * They will be paused after the event is placed.
> +     * See large comment below in mem_event_unpause_vcpus().
> +     */
> +    if ( !foreign && mem_event_ring_free(d, med) < d->max_vcpus )
> +    {
> +        mem_event_mark_and_pause(current, med);
> +        ret = -EBUSY;
> +    }
>
>      mem_event_ring_unlock(med);
>
>      notify_via_xen_event_channel(d, med->xen_port);
> +
> +    return ret;

What will happen if the guest has more vcpus than r->nr_ents in the ring
buffer? To me it looks like no event can be placed into the ring and
-EBUSY is returned instead.

Olaf
Andres Lagar-Cavilla
2011-Dec-01 15:25 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
> On Tue, Nov 29, Andres Lagar-Cavilla wrote:
>
>> @@ -133,31 +185,95 @@ static int mem_event_disable(struct mem_
>>      return 0;
>>  }
>>
>> -void mem_event_put_request(struct domain *d, struct mem_event_domain
>> *med, mem_event_request_t *req)
>> +static inline int mem_event_ring_free(struct domain *d, struct
>> mem_event_domain *med)
>>  {
>> +    int free_requests;
>> +
>> +    free_requests = RING_FREE_REQUESTS(&med->front_ring);
>> +    if ( unlikely(free_requests < d->max_vcpus) )
>> +    {
>> +        /* This may happen during normal operation (hopefully not
>> often). */
>> +        gdprintk(XENLOG_INFO, "mem_event request slots for domain %d:
>> %d\n",
>> +                 d->domain_id, free_requests);
>> +    }
>> +
>> +    return free_requests;
>> +}
>> +
>> +/* Return values
>> + * zero: success
>> + * -ENOSYS: no ring
>> + * -EAGAIN: ring is full and the event has not been transmitted.
>> + *          Only foreign vcpu's get EAGAIN
>> + * -EBUSY: guest vcpu has been paused due to ring congestion
>> + */
>> +int mem_event_put_request(struct domain *d, struct mem_event_domain
>> *med, mem_event_request_t *req)
>> +{
>> +    int ret = 0;
>> +    int foreign = (d != current->domain);
>
>> +    /*
>> +     * We ensure that each vcpu can put at least *one* event -- because
>> some
>> +     * events are not repeatable, such as dropping a page. This will
>> ensure no
>> +     * vCPU is left with an event that they must place on the ring, but
>> cannot.
>> +     * They will be paused after the event is placed.
>> +     * See large comment below in mem_event_unpause_vcpus().
>> +     */
>> +    if ( !foreign && mem_event_ring_free(d, med) < d->max_vcpus )
>> +    {
>> +        mem_event_mark_and_pause(current, med);
>> +        ret = -EBUSY;
>> +    }
>>
>>      mem_event_ring_unlock(med);
>>
>>      notify_via_xen_event_channel(d, med->xen_port);
>> +
>> +    return ret;
>
> What will happen if the guest has more vcpus than r->nr_ents in the ring
> buffer? To me it looks like no event can be placed into the ring and
> -EBUSY is returned instead.

MAX_HVM_VCPUS sits at 128 right now. Haven't compile checked, but that
probably means we would need a two page ring. And then, when 1024-cpu
hosts arrive and we grow MAX_HVM_VCPUS, we grow the ring size again.

Or, we could limit the constraint to the number of online vcpus, which
would get somewhat tricky for vcpu hot-plugging.

I can fix that separately, once there is a decision on which way to go re
ring management.

Andres

> Olaf
Olaf Hering
2011-Dec-01 15:36 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
On Wed, Nov 30, Andres Lagar-Cavilla wrote:

> - This isn't a problem that necessitates wait queues to solve; just
>   careful logic.
>
> - I am not sure what your concerns about the mix are. get_gfn* would call
>   populate on a paged-out gfn, and then go to sleep if it's a guest vcpu.
>   With our patch, the guest vcpu event is guaranteed to go in the ring.
>   vcpu pausing will stack (and unwind) properly.

Today I sent my current version of wait queues for mem_event and paging.
Unfortunately the earlier versions of mem_event had bugs which caused
trouble also for the paging change. Please have a look at whether my
changes work for you.

I see you have a change to mem_event_get_response() which pulls all
responses instead of just one. That's currently a noop for paging, but it
will be used once paging can get rid of the domctls.

I added p2m_mem_paging_wait(), which calls p2m_mem_paging_populate(), then
goes to sleep until p2m_mem_paging_get_entry() indicates the gfn is back.

If I understand your change correctly, a guest vcpu can always place a
request. If p2m_mem_paging_populate() happens to fail to put a request,
what is supposed to happen in p2m_mem_paging_wait()? Should it skip the
wait_event() and return to its caller?

With my implementation, both p2m_mem_paging_wait() and
p2m_mem_paging_populate() will stop execution until either the gfn is back
or there is room in the buffer. There is no need for exit code handling.

Olaf
Tim Deegan
2011-Dec-01 18:01 UTC
Re: [PATCH 4 of 5] Make the prototype of p2m_mem_access_resume consistent
At 16:55 -0500 on 29 Nov (1322585713), Andres Lagar-Cavilla wrote:

> xen/arch/x86/mm/mem_access.c |  3 +--
> xen/arch/x86/mm/p2m.c        |  3 +--
> xen/include/asm-x86/p2m.h    |  2 +-
> 3 files changed, 3 insertions(+), 5 deletions(-)
>
> Signed-off-by: Adin Scannell <adin@scannell.ca>
> Signed-off-by: Keir Fraser <keir@xen.org>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

I've applied this one obvious fixup in isolation; the rest of the series
clearly needs some further discussion.

Cheers,

Tim.
Andres Lagar-Cavilla
2011-Dec-01 18:05 UTC
Re: [PATCH 4 of 5] Make the prototype of p2m_mem_access_resume consistent
> At 16:55 -0500 on 29 Nov (1322585713), Andres Lagar-Cavilla wrote:
>> xen/arch/x86/mm/mem_access.c |  3 +--
>> xen/arch/x86/mm/p2m.c        |  3 +--
>> xen/include/asm-x86/p2m.h    |  2 +-
>> 3 files changed, 3 insertions(+), 5 deletions(-)
>>
>> Signed-off-by: Adin Scannell <adin@scannell.ca>
>> Signed-off-by: Keir Fraser <keir@xen.org>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>
> I've applied this one obvious fixup in isolation; the rest of the series
> clearly needs some further discussion.

Thanks Tim. I had to rebase the main patch in the series given the recent
commit "mem_event: move mem_event_domain out of struct domain". The rebase
is obvious and does not affect the gist of the patch, but I can resend
that single patch, the whole series, etc.

Andres

> Cheers,
>
> Tim.
Tim Deegan
2011-Dec-01 18:10 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
At 16:55 -0500 on 29 Nov (1322585711), Andres Lagar-Cavilla wrote:

> The memevent code currently has a mechanism for reserving space in the ring
> before putting an event, but each caller must individually ensure that the
> vCPUs are correctly paused if no space is available.
>
> This fixes that issue by reversing the semantics: we ensure that enough space
> is always left for one event per vCPU in the ring. If, after putting the
> current request, this constraint would be violated by the current vCPU
> putting another request in the ring, we pause the vCPU.

What about operations that touch more than one page of guest memory?
(E.g., pagetable walks, emulated faults and task switches.) Can't they
still fill up the ring?

IIRC there are still cases where we need wait queues anyway (when we hit
a paged-out page after a non-idempotent action has already been taken).
Is the purpose of this change just to reduce the number of wait-queue
uses, or do you think you can do without them entirely?

Cheers,

Tim.
Andres Lagar-Cavilla
2011-Dec-01 18:23 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
> At 16:55 -0500 on 29 Nov (1322585711), Andres Lagar-Cavilla wrote:
>> The memevent code currently has a mechanism for reserving space in the
>> ring before putting an event, but each caller must individually ensure
>> that the vCPUs are correctly paused if no space is available.
>>
>> This fixes that issue by reversing the semantics: we ensure that enough
>> space is always left for one event per vCPU in the ring. If, after
>> putting the current request, this constraint would be violated by the
>> current vCPU putting another request in the ring, we pause the vCPU.
>
> What about operations that touch more than one page of guest memory?
> (E.g., pagetable walks, emulated faults and task switches.) Can't they
> still fill up the ring?

Those only generate events on paging, which would go to sleep on the first
fault with a wait queue. There is one case in which the guest vcpu can
generate unbounded events, and that is balloon down -> decrease_reservation
-> paging_drop events. I handle that with preemption of the
decrease_reservation hypercall.

> IIRC there are still cases where we need wait queues anyway (when we hit
> a paged-out page after a non-idempotent action has already been taken).
> Is the purpose of this change just to reduce the number of wait-queue
> uses, or do you think you can do without them entirely?

Certainly, there's no way around wait queues for, say, hvm_copy with a
paged-out page.

Andres

> Cheers,
>
> Tim.
Andres Lagar-Cavilla
2011-Dec-01 18:27 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
> On 01/12/2011 18:10, "Tim Deegan" <tim@xen.org> wrote:
>
>> At 16:55 -0500 on 29 Nov (1322585711), Andres Lagar-Cavilla wrote:
>>> The memevent code currently has a mechanism for reserving space in the
>>> ring before putting an event, but each caller must individually ensure
>>> that the vCPUs are correctly paused if no space is available.
>>>
>>> This fixes that issue by reversing the semantics: we ensure that enough
>>> space is always left for one event per vCPU in the ring. If, after
>>> putting the current request, this constraint would be violated by the
>>> current vCPU putting another request in the ring, we pause the vCPU.
>>
>> What about operations that touch more than one page of guest memory?
>> (E.g., pagetable walks, emulated faults and task switches.) Can't they
>> still fill up the ring?
>>
>> IIRC there are still cases where we need wait queues anyway (when we hit
>> a paged-out page after a non-idempotent action has already been taken).
>> Is the purpose of this change just to reduce the number of wait-queue
>> uses, or do you think you can do without them entirely?
>
> It's definitely not possible to do without them entirely. I'm pretty sure
> Andres agrees at least that far.

I do agree that far :)

Andres

> -- Keir
>
>> Cheers,
>>
>> Tim.
Olaf Hering
2011-Dec-05 11:23 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
On Thu, Dec 01, Andres Lagar-Cavilla wrote:

> MAX_HVM_VCPUS sits at 128 right now. Haven't compile checked, but that
> probably means we would need a two page ring. And then, when 1024-cpu
> hosts arrive and we grow MAX_HVM_VCPUS, we grow the ring size again.

The ring has 64 entries.

> Or, we could limit the constraint to the number of online vcpus, which
> would get somewhat tricky for vcpu hot-plugging.
>
> I can fix that separately, once there is a decision on which way to go re
> ring management.

I just sent "[PATCH] mem_event: use wait queue when ring is full" to the
list. This version ought to work: it takes the requests from both target
and foreign cpus into account and leaves at least one slot for the target.

Olaf
Andres Lagar-Cavilla
2011-Dec-05 15:27 UTC
Re: [PATCH 2 of 5] Improve ring management for memory events. Do not lose guest events
I just pushed all that I had on mem event. I split it into two series.

One adds features that are independent of how we choose to manage the
ring. Hopefully we'll concur these features are useful (such as kicking
Xen to consume batched responses directly with an event channel, no
domctl needed) and that they should go in the tree. I've been using them
for a month already.

The second series is a refreshed post of our version of ring management,
sans wait queues.

With both up-to-date versions in hand, we should be able to reach a
consensus on which to use.

Thanks,
Andres

> On Thu, Dec 01, Andres Lagar-Cavilla wrote:
>
>> MAX_HVM_VCPUS sits at 128 right now. Haven't compile checked, but that
>> probably means we would need a two page ring. And then, when 1024-cpu
>> hosts arrive and we grow MAX_HVM_VCPUS, we grow the ring size again.
>
> The ring has 64 entries.
>
>> Or, we could limit the constraint to the number of online vcpus, which
>> would get somewhat tricky for vcpu hot-plugging.
>>
>> I can fix that separately, once there is a decision on which way to go
>> re ring management.
>
> I just sent "[PATCH] mem_event: use wait queue when ring is full" to the
> list. This version ought to work: it takes the requests from both target
> and foreign cpus into account and leaves at least one slot for the
> target.
>
> Olaf