This patch series aggregates a number of fixes:

- Fix the paging load operation to correct the m2p entry, and promote the
  p2m entry to the final guest-accessible type
- Fix locking around p2m_teardown
- Fix read-only mapping of shared pages
- Output to the console the per-domain count of paged pages
- Eliminate a stale var from a p2m audit debugtrace printk
- Correct accounting of paged out pages when a paging-out is interrupted by
  a guest access
- Simplify (and in some cases eliminate) use of p2m get_gfn*_unlocked
- Allow for recursive locking in the mm-locks.h deadlock detection scheme

Patches 1, 4 and 6 involve paging, cc'ed Olaf Hering.

Patches 2, 7 and 8 lay more groundwork in preparation for fully synchronized
p2m access.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

 xen/arch/x86/mm/p2m.c         |  18 ++++++++----------
 xen/arch/x86/mm/p2m.c         |   4 ++--
 xen/arch/x86/mm.c             |   8 +++++++-
 xen/common/keyhandler.c       |   6 ++++--
 xen/arch/x86/mm/p2m.c         |   4 ++--
 xen/arch/x86/mm/p2m.c         |   3 ++-
 xen/arch/x86/hvm/emulate.c    |  35 ++++++++++++++++++++++++++++-------
 xen/arch/x86/hvm/hvm.c        |   5 ++++-
 xen/arch/x86/hvm/stdvga.c     |  12 ++++++++++--
 xen/arch/x86/hvm/vmx/vmx.c    |   2 +-
 xen/arch/x86/mm/mem_sharing.c |   2 +-
 xen/arch/x86/mm/mm-locks.h    |   3 ++-
 12 files changed, 71 insertions(+), 31 deletions(-)
 xen/arch/x86/mm/p2m.c |  18 ++++++++----------
 1 files changed, 8 insertions(+), 10 deletions(-)


When restoring a p2m entry in the paging_load path, we were not updating the
m2p entry correctly.

Also take advantage of this to act on an old suggestion: once done with the
load, promote the p2m entry to the final guest-accessible type. This
simplifies logic.

Tested to work with xenpaging.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r f09f62ae92b7 -r 143e4982c9bf xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -975,7 +975,7 @@ void p2m_mem_paging_populate(struct doma
 int p2m_mem_paging_prep(struct domain *d, unsigned long gfn, uint64_t buffer)
 {
     struct page_info *page;
-    p2m_type_t p2mt, target_p2mt;
+    p2m_type_t p2mt;
     p2m_access_t a;
     mfn_t mfn;
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
@@ -1033,15 +1033,13 @@ int p2m_mem_paging_prep(struct domain *d
         }
     }
 
-    target_p2mt = (p2mt == p2m_ram_paging_in_start) ?
-        /* If we kicked the pager with a populate event, the pager will send
-         * a resume event back */
-        p2m_ram_paging_in :
-        /* If this was called asynchronously by the pager, then we can
-         * transition directly to the final guest-accessible type */
-        (paging_mode_log_dirty(d) ? p2m_ram_logdirty : p2m_ram_rw);
-    /* Fix p2m mapping */
-    set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, target_p2mt, a);
+    /* Make the page already guest-accessible. If the pager still has a
+     * pending resume operation, it will be idempotent p2m entry-wise,
+     * but will unpause the vcpu */
+    set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K,
+                  paging_mode_log_dirty(d) ? p2m_ram_logdirty :
+                  p2m_ram_rw, a);
+    set_gpfn_from_mfn(mfn_x(mfn), gfn);
 
     atomic_dec(&d->paged_pages);
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 2 of 8] x86/mm: Fix p2m teardown locking
 xen/arch/x86/mm/p2m.c |  4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


Holding the p2m lock during a p2m teardown, while unsharing entries pointing
to shared frames, causes a locking inversion and deadlock panic.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 143e4982c9bf -r 2e7c77585adb xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -343,8 +343,6 @@ void p2m_teardown(struct p2m_domain *p2m
     if (p2m == NULL)
         return;
 
-    p2m_lock(p2m);
-
 #ifdef __x86_64__
     for ( gfn=0; gfn < p2m->max_mapped_pfn; gfn++ )
     {
@@ -358,6 +356,8 @@ void p2m_teardown(struct p2m_domain *p2m
     }
 #endif
 
+    p2m_lock(p2m);
+
     p2m->phys_table = pagetable_null();
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 3 of 8] x86/mm: Allow foreign read-only mappings of shared pages
 xen/arch/x86/mm.c |  8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)


Because shared pages are owned by dom_cow, the ownership test during foreign
mapping fails.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 2e7c77585adb -r c4b96d2cba06 xen/arch/x86/mm.c
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -878,7 +878,8 @@ get_page_from_l1e(
         return 1;
     }
 
-    if ( unlikely(real_pg_owner != pg_owner) )
+    if ( unlikely( (real_pg_owner != pg_owner) &&
+                   (real_pg_owner != dom_cow) ) )
     {
         /*
          * Let privileged domains transfer the right to map their target
@@ -892,6 +893,11 @@ get_page_from_l1e(
         pg_owner = real_pg_owner;
     }
 
+    /* Extra paranoid check for shared memory. Writable mappings
+     * disallowed (unshare first!) */
+    if ( (l1f & _PAGE_RW) && (real_pg_owner == dom_cow) )
+        goto could_not_pin;
+
     /* Foreign mappings into guests in shadow external mode don't
      * contribute to writeable mapping refcounts. (This allows the
      * qemu-dm helper process in dom0 to map the domain's memory without
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 4 of 8] x86/mm: Output domain count of paged pages in console
 xen/common/keyhandler.c |  6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)


Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r c4b96d2cba06 -r 39e2fc847c57 xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -250,8 +250,10 @@ static void dump_domains(unsigned char k
         printk("    refcnt=%d dying=%d pause_count=%d\n",
                atomic_read(&d->refcnt), d->is_dying,
                atomic_read(&d->pause_count));
-        printk("    nr_pages=%d xenheap_pages=%d shared_pages=%u dirty_cpus=%s max_pages=%u\n",
-               d->tot_pages, d->xenheap_pages, atomic_read(&d->shr_pages), tmpstr, d->max_pages);
+        printk("    nr_pages=%d xenheap_pages=%d shared_pages=%u paged_pages=%u "
+               "dirty_cpus=%s max_pages=%u\n", d->tot_pages, d->xenheap_pages,
+               atomic_read(&d->shr_pages), atomic_read(&d->paged_pages),
+               tmpstr, d->max_pages);
         printk("    handle=%02x%02x%02x%02x-%02x%02x-%02x%02x-"
                "%02x%02x-%02x%02x%02x%02x%02x%02x vm_assist=%08lx\n",
                d->handle[ 0], d->handle[ 1], d->handle[ 2], d->handle[ 3],
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 5 of 8] x86/mm: Remove stale variable from debugtrace printk in p2m audit
 xen/arch/x86/mm/p2m.c |  4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)


Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 39e2fc847c57 -r c41436e555cd xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1534,8 +1534,8 @@ void audit_p2m(struct domain *d,
         }
 
         __put_gfn(p2m, gfn);
-        P2M_PRINTK("OK: mfn=%#lx, gfn=%#lx, p2mfn=%#lx, lp2mfn=%#lx\n",
-                   mfn, gfn, mfn_x(p2mfn), lp2mfn);
+        P2M_PRINTK("OK: mfn=%#lx, gfn=%#lx, p2mfn=%#lx\n",
+                   mfn, gfn, mfn_x(p2mfn));
     }
 
     spin_unlock(&d->page_alloc_lock);
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 6 of 8] x86/mm: Properly account for paged out pages
 xen/arch/x86/mm/p2m.c |  3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


If we hit the page after nominate but before paging it out, don't decrement
the domain count of paged out pages.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r c41436e555cd -r d4336e35f0bc xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1041,7 +1041,8 @@ int p2m_mem_paging_prep(struct domain *d
                   p2m_ram_rw, a);
     set_gpfn_from_mfn(mfn_x(mfn), gfn);
 
-    atomic_dec(&d->paged_pages);
+    if ( !page_extant )
+        atomic_dec(&d->paged_pages);
 
     ret = 0;
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 7 of 8] x86/mm: clean use of p2m unlocked queries
 xen/arch/x86/hvm/emulate.c    |  35 ++++++++++++++++++++++++++++-------
 xen/arch/x86/hvm/hvm.c        |   5 ++++-
 xen/arch/x86/hvm/stdvga.c     |  12 ++++++++++--
 xen/arch/x86/hvm/vmx/vmx.c    |   2 +-
 xen/arch/x86/mm/mem_sharing.c |   2 +-
 5 files changed, 44 insertions(+), 12 deletions(-)


Limit such queries only to p2m_query types. This is more compatible with
the name and intended semantics: perform only a lookup, and explicitly in
an unlocked way.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r d4336e35f0bc -r 2dc0ef05662e xen/arch/x86/hvm/emulate.c
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -660,7 +660,7 @@ static int hvmemul_rep_movs(
 {
     struct hvm_emulate_ctxt *hvmemul_ctxt =
         container_of(ctxt, struct hvm_emulate_ctxt, ctxt);
-    unsigned long saddr, daddr, bytes;
+    unsigned long saddr, daddr, bytes, sgfn, dgfn;
     paddr_t sgpa, dgpa;
     uint32_t pfec = PFEC_page_present;
     p2m_type_t p2mt;
@@ -693,17 +693,28 @@ static int hvmemul_rep_movs(
     if ( rc != X86EMUL_OKAY )
         return rc;
 
-    /* Unlocked works here because we get_gfn for real in whatever
-     * we call later. */
-    (void)get_gfn_unlocked(current->domain, sgpa >> PAGE_SHIFT, &p2mt);
+    /* XXX In a fine-grained p2m locking scenario, we need to sort this
+     * get_gfn's, or else we might deadlock */
+    sgfn = sgpa >> PAGE_SHIFT;
+    (void)get_gfn(current->domain, sgfn, &p2mt);
     if ( !p2m_is_ram(p2mt) && !p2m_is_grant(p2mt) )
-        return hvmemul_do_mmio(
+    {
+        rc = hvmemul_do_mmio(
             sgpa, reps, bytes_per_rep, dgpa, IOREQ_READ, df, NULL);
+        put_gfn(current->domain, sgfn);
+        return rc;
+    }
 
-    (void)get_gfn_unlocked(current->domain, dgpa >> PAGE_SHIFT, &p2mt);
+    dgfn = dgpa >> PAGE_SHIFT;
+    (void)get_gfn(current->domain, dgfn, &p2mt);
     if ( !p2m_is_ram(p2mt) && !p2m_is_grant(p2mt) )
-        return hvmemul_do_mmio(
+    {
+        rc = hvmemul_do_mmio(
             dgpa, reps, bytes_per_rep, sgpa, IOREQ_WRITE, df, NULL);
+        put_gfn(current->domain, sgfn);
+        put_gfn(current->domain, dgfn);
+        return rc;
+    }
 
     /* RAM-to-RAM copy: emulate as equivalent of memmove(dgpa, sgpa, bytes). */
     bytes = *reps * bytes_per_rep;
@@ -718,7 +729,11 @@ static int hvmemul_rep_movs(
      * can be emulated by a source-to-buffer-to-destination block copy.
      */
     if ( ((dgpa + bytes_per_rep) > sgpa) && (dgpa < (sgpa + bytes)) )
+    {
+        put_gfn(current->domain, sgfn);
+        put_gfn(current->domain, dgfn);
         return X86EMUL_UNHANDLEABLE;
+    }
 
     /* Adjust destination address for reverse copy. */
     if ( df )
@@ -727,7 +742,11 @@ static int hvmemul_rep_movs(
     /* Allocate temporary buffer. Fall back to slow emulation if this fails. */
     buf = xmalloc_bytes(bytes);
     if ( buf == NULL )
+    {
+        put_gfn(current->domain, sgfn);
+        put_gfn(current->domain, dgfn);
         return X86EMUL_UNHANDLEABLE;
+    }
 
     /*
      * We do a modicum of checking here, just for paranoia's sake and to
@@ -738,6 +757,8 @@ static int hvmemul_rep_movs(
     rc = hvm_copy_to_guest_phys(dgpa, buf, bytes);
     xfree(buf);
 
+    put_gfn(current->domain, sgfn);
+    put_gfn(current->domain, dgfn);
     if ( rc == HVMCOPY_gfn_paged_out )
         return X86EMUL_RETRY;
diff -r d4336e35f0bc -r 2dc0ef05662e xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3975,7 +3975,10 @@ long do_hvm_op(unsigned long op, XEN_GUE
         rc = -EINVAL;
         if ( is_hvm_domain(d) )
         {
-            get_gfn_unshare_unlocked(d, a.pfn, &t);
+            /* Use get_gfn query as we are interested in the current
+             * type, not in allocating or unsharing. That'll happen
+             * on access. */
+            get_gfn_query_unlocked(d, a.pfn, &t);
             if ( p2m_is_mmio(t) )
                 a.mem_type =  HVMMEM_mmio_dm;
             else if ( p2m_is_readonly(t) )
diff -r d4336e35f0bc -r 2dc0ef05662e xen/arch/x86/hvm/stdvga.c
--- a/xen/arch/x86/hvm/stdvga.c
+++ b/xen/arch/x86/hvm/stdvga.c
@@ -482,15 +482,19 @@ static int mmio_move(struct hvm_hw_stdvg
             if ( hvm_copy_to_guest_phys(data, &tmp, p->size) !=
                  HVMCOPY_okay )
             {
-                (void)get_gfn_unlocked(d, data >> PAGE_SHIFT, &p2mt);
+                (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
                 /*
                  * The only case we handle is vga_mem <-> vga_mem.
                  * Anything else disables caching and leaves it to qemu-dm.
                  */
                 if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                      ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
+                {
+                    put_gfn(d, data >> PAGE_SHIFT);
                     return 0;
+                }
                 stdvga_mem_write(data, tmp, p->size);
+                put_gfn(d, data >> PAGE_SHIFT);
             }
             data += sign * p->size;
             addr += sign * p->size;
@@ -504,11 +508,15 @@ static int mmio_move(struct hvm_hw_stdvg
             if ( hvm_copy_from_guest_phys(&tmp, data, p->size) !=
                  HVMCOPY_okay )
             {
-                (void)get_gfn_unlocked(d, data >> PAGE_SHIFT, &p2mt);
+                (void)get_gfn(d, data >> PAGE_SHIFT, &p2mt);
                 if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) ||
                      ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) )
+                {
+                    put_gfn(d, data >> PAGE_SHIFT);
                     return 0;
+                }
                 tmp = stdvga_mem_read(data, p->size);
+                put_gfn(d, data >> PAGE_SHIFT);
             }
             stdvga_mem_write(addr, tmp, p->size);
             data += sign * p->size;
diff -r d4336e35f0bc -r 2dc0ef05662e xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2112,7 +2112,7 @@ static void ept_handle_violation(unsigne
         return;
 
     /* Everything else is an error. */
-    mfn = get_gfn_guest_unlocked(d, gfn, &p2mt);
+    mfn = get_gfn_query_unlocked(d, gfn, &p2mt);
     gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), "
              "gpa %#"PRIpaddr", mfn %#lx, type %i.\n",
              qualification,
diff -r d4336e35f0bc -r 2dc0ef05662e xen/arch/x86/mm/mem_sharing.c
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -421,7 +421,7 @@ int mem_sharing_debug_gfn(struct domain
     p2m_type_t p2mt;
    mfn_t mfn;
 
-    mfn = get_gfn_unlocked(d, gfn, &p2mt);
+    mfn = get_gfn_query_unlocked(d, gfn, &p2mt);
 
     gdprintk(XENLOG_DEBUG, "Debug for domain=%d, gfn=%lx, ",
              d->domain_id,
Andres Lagar-Cavilla
2012-Jan-26 03:53 UTC
[PATCH 8 of 8] x86/mm: Avoid spurious deadlock panic trigger
 xen/arch/x86/mm/mm-locks.h |  3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


In the mm layer, if we take lock A, then lock B, and then recursively lock A,
the deadlock detector panics. This is not a deadlock risk because we already
'own' the outer lock (A), so we will not contend for that resource.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 2dc0ef05662e -r 409f0f368ae3 xen/arch/x86/mm/mm-locks.h
--- a/xen/arch/x86/mm/mm-locks.h
+++ b/xen/arch/x86/mm/mm-locks.h
@@ -61,7 +61,8 @@ do {                                      \
 
 static inline void _mm_lock(mm_lock_t *l, const char *func, int level, int rec)
 {
-    __check_lock_level(level);
+    if ( !((mm_locked_by_me(l)) && rec) )
+        __check_lock_level(level);
     spin_lock_recursive(&l->lock);
     if ( l->lock.recurse_cnt == 1 )
    {
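[Editorial note: to make the scenario concrete, a minimal sketch follows. lock_a and lock_b are hypothetical mm_lock_t instances with made-up ordering levels; only _mm_lock(), mm_locked_by_me() and __check_lock_level() are the real helpers touched by the diff above.]

```c
/* Sketch only: lock_a/lock_b and the levels 10/20 are hypothetical. */
static mm_lock_t lock_a, lock_b;

static void example(void)
{
    _mm_lock(&lock_a, __func__, 10, 1);   /* outer lock, level 10, recursive */
    _mm_lock(&lock_b, __func__, 20, 0);   /* level 20, ordering respected    */

    /* Re-take lock_a while already holding it.  Before this patch the
     * detector panicked: __check_lock_level(10) fails because the current
     * level is already 20.  With the patch the level check is skipped,
     * since mm_locked_by_me(&lock_a) is true and the lock is recursive,
     * so no new contention is possible. */
    _mm_lock(&lock_a, __func__, 10, 1);

    /* matching mm_unlock() calls omitted */
}
```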
On Wed, Jan 25, Andres Lagar-Cavilla wrote:

> When restoring a p2m entry in the paging_load path, we were not updating the
> m2p entry correctly.

That's a good one.

> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

Acked-by: Olaf Hering <olaf@aepfle.de>
Olaf Hering
2012-Jan-26 09:47 UTC
Re: [PATCH 4 of 8] x86/mm: Output domain count of paged pages in console
On Wed, Jan 25, Andres Lagar-Cavilla wrote:

>  xen/common/keyhandler.c |  6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

Acked-by: Olaf Hering <olaf@aepfle.de>
Olaf Hering
2012-Jan-26 09:54 UTC
Re: [PATCH 6 of 8] x86/mm: Properly account for paged out pages
On Wed, Jan 25, Andres Lagar-Cavilla wrote:

>  xen/arch/x86/mm/p2m.c |  3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
>
> If we hit the page after nominate but before paging it out, don't decrement
> the domain count of paged out pages.

The meaning of paged_pages is to track released pages.
In p2m_mem_paging_evict() it gets incremented once the page is removed
from the domain, and in p2m_mem_paging_prep() it should be decremented
right after the alloc_domheap_page() call.
So the patch should move the atomic_dec() call after successful page
allocation.

Olaf

> diff -r c41436e555cd -r d4336e35f0bc xen/arch/x86/mm/p2m.c
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -1041,7 +1041,8 @@ int p2m_mem_paging_prep(struct domain *d
>                    p2m_ram_rw, a);
>      set_gpfn_from_mfn(mfn_x(mfn), gfn);
>
> -    atomic_dec(&d->paged_pages);
> +    if ( !page_extant )
> +        atomic_dec(&d->paged_pages);
>
>      ret = 0;
>
Andres Lagar-Cavilla
2012-Jan-26 10:47 UTC
Re: [PATCH 6 of 8] x86/mm: Properly account for paged out pages
> On Wed, Jan 25, Andres Lagar-Cavilla wrote:
>
>>  xen/arch/x86/mm/p2m.c |  3 ++-
>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>
>>
>> If we hit the page after nominate but before paging it out, don't
>> decrement the
>> domain count of paged out pages.
>
> The meaning of paged_pages is to track released pages.
> In p2m_mem_paging_evict() it gets incremented once the page is removed
> from the domain, and in p2m_mem_paging_prep() it should be decremented
> right after the alloc_domheap_page() call.
> So the patch should move the atomic_dec() call after successful page
> allocation.
>
> Olaf

Hi Olaf,
there is a put_page in case things go wrong, after the alloc_domheap_page
call. So doing the decrement at alloc is a bit too soon.

Thanks
Andres

>> diff -r c41436e555cd -r d4336e35f0bc xen/arch/x86/mm/p2m.c
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -1041,7 +1041,8 @@ int p2m_mem_paging_prep(struct domain *d
>>                     p2m_ram_rw, a);
>>      set_gpfn_from_mfn(mfn_x(mfn), gfn);
>>
>> -    atomic_dec(&d->paged_pages);
>> +    if ( !page_extant )
>> +        atomic_dec(&d->paged_pages);
>>
>>      ret = 0;
>>
>
> On Wed, Jan 25, Andres Lagar-Cavilla wrote:
>
>> When restoring a p2m entry in the paging_load path, we were not updating
>> the
>> m2p entry correctly.
>
> That's a good one.
>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>
> Acked-by: Olaf Hering <olaf@aepfle.de>
>

Great!

Now, afaict, the p2m_ram_paging_in state is not needed anymore. Can you
provide feedback as to whether
1. remove p2m_ram_paging_in
2. rename p2m_ram_paging_in_start to p2m_ram_paging_in

sounds like a good plan?

Cheers!
Andres
On Thu, Jan 26, Andres Lagar-Cavilla wrote:

> Now, afaict, the p2m_ram_paging_in state is not needed anymore. Can you
> provide feedback as to whether
> 1. remove p2m_ram_paging_in
> 2. rename p2m_ram_paging_in_start to p2m_ram_paging_in
>
> sounds like a good plan?

In my opinion the common case is that evicted pages get populated, a
request is sent. Later a response is expected to make room in the ring.

If p2m_mem_paging_populate allocates a page for the guest, it can let
the pager know that it did so (or failed to allocate one).
If there is a page already, the pager can copy the gfn content into a
buffer, put a pointer to it in the response and let
p2m_mem_paging_resume() handle both the ring accounting (as it does now)
and also the copy_from_user.

If page allocation failed, the pager has to allocate one via
p2m_mem_paging_prep() as it is done now, as an intermediate step.

The buffer page handling in the pager is probably simple, it needs to
maintain RING_SIZE() buffers. There can't be more than that in flight
because that's the limit of requests as well. In other words, the pager
does not need to wait for p2m_mem_paging_resume() to run and pull the
buffer content.

If the "populate - allocate - put_request - get_request - fill_buffer -
put_response - resume get_response - copy_from_buffer - resume_vcpu"
cycle works, it would reduce the overall amount of work to be done
during paging, even if the hypercalls themselves are not the bottleneck.
It all depends on the possibility to allocate a page in the various
contexts where p2m_mem_paging_populate is called.

The resume part could be done via eventchannel and
XEN_DOMCTL_MEM_EVENT_OP_PAGING_RESUME could be removed.

Also the question is if freeing one p2mt is more important than reducing
the number of hypercalls to execute at runtime.

Olaf
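[Editorial note: a rough sketch of the cycle Olaf describes above, for readers following along. It is illustrative only: the helper names, the MEM_EVENT_FLAG_PAGE_ALLOCATED flag and the idea of carrying a buffer pointer in the response are assumptions made up for the sketch, not interfaces that exist in the tree.]

```c
/* Illustrative sketch of the proposed populate/resume cycle.  All names
 * marked "hypothetical" are invented here; only alloc_domheap_page() and
 * the mem_event request/response ring are real. */

/* Hypervisor side: populate allocates the guest page up front and tells
 * the pager whether that worked. */
static void proposed_populate(struct domain *d, unsigned long gfn)
{
    mem_event_request_t req = { .gfn = gfn };
    struct page_info *page = alloc_domheap_page(d, 0);

    if ( page != NULL )
        req.flags |= MEM_EVENT_FLAG_PAGE_ALLOCATED;   /* hypothetical flag */
    /* put the request on the ring and pause the faulting vcpu, as today */
}

/* Pager side: fill one of RING_SIZE() preallocated buffers and hand it
 * back in the response; only the allocation-failure path still needs the
 * separate prep/load step. */
static void proposed_pager_handle(mem_event_request_t *req)
{
    void *buf = grab_free_buffer();                   /* hypothetical */

    read_page_from_pagefile(req->gfn, buf);           /* hypothetical */
    if ( !(req->flags & MEM_EVENT_FLAG_PAGE_ALLOCATED) )
        prep_with_retry(req->gfn, buf);               /* hypothetical slow path */
    /* put a response carrying a pointer to buf (hypothetical field) and
     * kick the event channel; resume then does the copy_from_user, fixes
     * the p2m/m2p entries and unpauses the vcpu. */
}
```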
Olaf Hering
2012-Jan-26 12:11 UTC
Re: [PATCH 6 of 8] x86/mm: Properly account for paged out pages
On Thu, Jan 26, Andres Lagar-Cavilla wrote:

> there is a put_page in case things go wrong, after the alloc_domheap_page
> call. So doing the decrement at alloc is a bit too soon.

That's probably true.

But in the case of failure the guest will likely hang soon. Is the
copy_from_user() (and p2m_mem_paging_prep itself) restartable? If not,
the guest should probably be crashed. The way xenpaging_populate_page()
is written right now, it will just exit, EFAULT isn't caught.

Olaf
> On Thu, Jan 26, Andres Lagar-Cavilla wrote:
>
>> Now, afaict, the p2m_ram_paging_in state is not needed anymore. Can you
>> provide feedback as to whether
>> 1. remove p2m_ram_paging_in
>> 2. rename p2m_ram_paging_in_start to p2m_ram_paging_in
>>
>> sounds like a good plan?
>
> In my opinion the common case is that evicted pages get populated, a
> request is sent. Later a response is expected to make room in the ring.
>
> If p2m_mem_paging_populate allocates a page for the guest, it can let
> the pager know that it did so (or failed to allocate one).
> If there is a page already, the pager can copy the gfn content into a
> buffer, put a pointer to it in the response and let
> p2m_mem_paging_resume() handle both the ring accounting (as it does now)
> and also the copy_from_user.

So, this would bounce the page contents twice for the case when the page
hasn't yet been evicted?

> If page allocation failed, the pager has to allocate one via
> p2m_mem_paging_prep() as it is done now, as an intermediate step.

The issue of failed allocations is more pervasive. It also affects mem
sharing. And even PoD. What I'm trying to say is that even though your
solution seems to work (as long as the pager does dom0 ballooning to free
up some memory in between populate and prep!), we need a more generic
mechanism. Something along the lines of an ENOMEM mem_event ring, and a
matching dom0 daemon.

> The buffer page handling in the pager is probably simple, it needs to
> maintain RING_SIZE() buffers. There can't be more than that in flight
> because that's the limit of requests as well. In other words, the pager
> does not need to wait for p2m_mem_paging_resume() to run and pull the
> buffer content.
>
> If the "populate - allocate - put_request - get_request - fill_buffer -
> put_response - resume get_response - copy_from_buffer - resume_vcpu"
> cycle works, it would reduce the overall amount of work to be done
> during paging, even if the hypercalls themselves are not the bottleneck.
> It all depends on the possibility to allocate a page in the various
> contexts where p2m_mem_paging_populate is called.

The gist here is that the paging_load call would be removed?

I like the general direction, but one excellent property of paging_resume
is that it doesn't fail. This is particularly important since we already
do resumes via ring responses and event channel kicks (see below). So, if
resume needs to propagate failures back to the pager, things get icky.

(paging_load is restartable, see other email)

> The resume part could be done via eventchannel and
> XEN_DOMCTL_MEM_EVENT_OP_PAGING_RESUME could be removed.

This is already the case. I'm not eager to remove the domctl, but resuming
via event channel kick only is in place.

> Also the question is if freeing one p2mt is more important than reducing
> the number of hypercalls to execute at runtime.

Agreed. However, eliminating code complexity is also useful, and these two
ram_paging_in states cause everyone headaches.

Thanks
Andres

> Olaf
Andres Lagar-Cavilla
2012-Jan-26 12:26 UTC
Re: [PATCH 6 of 8] x86/mm: Properly account for paged out pages
> On Thu, Jan 26, Andres Lagar-Cavilla wrote:
>
>> there is a put_page in case things go wrong, after the
>> alloc_domheap_page
>> call. So doing the decrement at alloc is a bit too soon.
>
> That's probably true.
>
> But in the case of failure the guest will likely hang soon. Is the
> copy_from_user() (and p2m_mem_paging_prep itself) restartable? If not,
> the guest should probably be crashed. The way xenpaging_populate_page()
> is written right now, it will just exit, EFAULT isn't caught.

p2m_mem_paging_prep is restartable, afaict. It cleanly undoes its deeds.

xenpaging not handling EFAULT ... that's a different issue/patch altogether.

Andres

> Olaf
>
On Thu, Jan 26, Andres Lagar-Cavilla wrote:

> > In my opinion the common case is that evicted pages get populated, a
> > request is sent. Later a response is expected to make room in the ring.
> >
> > If p2m_mem_paging_populate allocates a page for the guest, it can let
> > the pager know that it did so (or failed to allocate one).
> > If there is a page already, the pager can copy the gfn content into a
> > buffer, put a pointer to it in the response and let
> > p2m_mem_paging_resume() handle both the ring accounting (as it does now)
> > and also the copy_from_user.
>
> So, this would bounce the page contents twice for the case when the page
> hasn't yet been evicted?

If it was not evicted, then nothing has to be done. In theory the page
could be resumed right away. But you wanted the notification, so a
kind-of dummy request has to be sent. The pager itself will figure out
quickly that the page was not evicted yet, so all it can do is to send
a dummy response.

I'm not sure what you mean by bouncing it twice. The pager itself has
likely written the page, so some IO happened. The hypervisor will find
a mfn for the gfn; right now it does nothing but p2mt changes.

> > If page allocation failed, the pager has to allocate one via
> > p2m_mem_paging_prep() as it is done now, as an intermediate step.
>
> The issue of failed allocations is more pervasive. It also affects mem
> sharing. And even PoD. What I'm trying to say is that even though your
> solution seems to work (as long as the pager does dom0 ballooning to free
> up some memory in between populate and prep!), we need a more generic
> mechanism. Something along the lines of an ENOMEM mem_event ring, and a
> matching dom0 daemon.

I'm not sure how all that relates to putting an alloc_domheap_page() into
p2m_mem_paging_populate() and letting the pager know about the results.

> > The buffer page handling in the pager is probably simple, it needs to
> > maintain RING_SIZE() buffers. There can't be more than that in flight
> > because that's the limit of requests as well. In other words, the pager
> > does not need to wait for p2m_mem_paging_resume() to run and pull the
> > buffer content.
> >
> > If the "populate - allocate - put_request - get_request - fill_buffer -
> > put_response - resume get_response - copy_from_buffer - resume_vcpu"
> > cycle works, it would reduce the overall amount of work to be done
> > during paging, even if the hypercalls themselves are not the bottleneck.
> > It all depends on the possibility to allocate a page in the various
> > contexts where p2m_mem_paging_populate is called.
>
> The gist here is that the paging_load call would be removed?

No, it is required if alloc_domheap_page() in p2m_mem_paging_populate()
fails. Then the pager has to take the "slow path" and try until it gets
a page, like it does now.

> I like the general direction, but one excellent property of paging_resume
> is that it doesn't fail. This is particularly important since we already
> do resumes via ring responses and event channel kicks (see below). So, if
> resume needs to propagate failures back to the pager, things get icky.
>
> (paging_load is restartable, see other email)

Once the request is sent, all that can happen is that the copy_from_user
fails. And in that case the guest is in an undefined state. Perhaps such
copy_from_user handling can be fixed to be reliable. In that case of
course my idea won't work.

> > The resume part could be done via eventchannel and
> > XEN_DOMCTL_MEM_EVENT_OP_PAGING_RESUME could be removed.
>
> This is already the case. I'm not eager to remove the domctl, but resuming
> via event channel kick only is in place.

It looks to me like resume is currently called twice in xenpaging, once
per ioctl and once per event channel.

> > Also the question is if freeing one p2mt is more important than reducing
> > the number of hypercalls to execute at runtime.
>
> Agreed. However, eliminating code complexity is also useful, and these two
> ram_paging_in states cause everyone headaches.

Well, it's not so alien to me after staring at and tracing it long enough.

Olaf
Olaf Hering
2012-Jan-26 13:08 UTC
Re: [PATCH 6 of 8] x86/mm: Properly account for paged out pages
On Thu, Jan 26, Andres Lagar-Cavilla wrote:

> xenpaging not handling EFAULT ... that's a different issue/patch altogether.

I'm not sure whether EFAULT is something an app can or should catch. I will
have a look.

Other than that, the patch is ok.

Olaf
At 22:53 -0500 on 25 Jan (1327532004), Andres Lagar-Cavilla wrote:

> This patch series aggregates a number of fixes:
>
> - Fix the paging load operation to correct the m2p entry, and promote the
>   p2m entry to the final guest-accessible type
> - Fix locking around p2m_teardown
> - Fix read-only mapping of shared pages
> - Output to the console the per-domain count of paged pages
> - Eliminate a stale var from a p2m audit debugtrace printk
> - Correct accounting of paged out pages when a paging-out is interrupted
>   by a guest access
> - Simplify (and in some cases eliminate) use of p2m get_gfn*_unlocked
> - Allow for recursive locking in the mm-locks.h deadlock detection scheme

All applied; thanks.

Tim.