Andres Lagar-Cavilla
2012-Apr-12 14:16 UTC
[PATCH 0 of 3] RFC: x86 memory sharing performance improvements
This is an RFC series. I haven't fully tested it yet, but I want the concept to
be known, as I intend this to be merged prior to the closing of the 4.2 window.

The sharing subsystem does not scale elegantly with high degrees of page
sharing. The culprit is a reverse map that each shared frame maintains,
resolving to all domain pages pointing to the shared frame. Because the rmap is
implemented with an O(n)-search linked list, CoW unsharing can result in
prolonged search times.

The place where this becomes most obvious is during domain destruction, during
which all shared p2m entries need to be unshared. Destroying a domain with a
lot of sharing could result in minutes of hypervisor freeze-up!

Solutions proposed:
- Make the p2m cleanup of shared entries part of the preemptible, synchronous
  domain_kill domctl (as opposed to executing monolithically in the finalize
  destruction RCU callback).
- When a shared frame exceeds an arbitrary ref count, mutate the rmap from a
  linked list to a hash table.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

 xen/arch/x86/domain.c             |   16 +++-
 xen/arch/x86/mm/mem_sharing.c     |   45 ++++++++++
 xen/arch/x86/mm/p2m.c             |    4 +
 xen/include/asm-arm/mm.h          |    4 +
 xen/include/asm-x86/domain.h      |    1 +
 xen/include/asm-x86/mem_sharing.h |   10 ++
 xen/include/asm-x86/p2m.h         |    4 +
 xen/arch/x86/mm/mem_sharing.c     |  142 +++++++++++++++++++++++--------
 xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/mem_sharing.h |   13 ++-
 10 files changed, 354 insertions(+), 55 deletions(-)
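To make the cost pattern concrete: with a list-based rmap, unsharing a single
<domain, gfn> means walking every entry the shared frame backs, so tearing down
all N mappings of one heavily-shared frame costs on the order of N^2/2
comparisons. The sketch below is illustrative only (plain C, not Xen code;
rmap_entry is a made-up stand-in for the gfn_info list kept in mem_sharing.c):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for a shared frame's rmap entry (not Xen's gfn_info_t). */
struct rmap_entry {
    unsigned long domain, gfn;
    struct rmap_entry *next;
};

int main(void)
{
    enum { N = 20000 };            /* number of <domain, gfn> tuples on one frame */
    struct rmap_entry *head = NULL, *e;
    unsigned long i, probes = 0;

    for ( i = 0; i < N; i++ )
    {
        e = malloc(sizeof(*e));
        e->domain = 1; e->gfn = i; e->next = head;
        head = e;
    }

    /* Destroying the domain unshares each gfn in turn; every unshare must
     * first find its entry with a linear scan, hence the quadratic total. */
    for ( i = 0; i < N; i++ )
        for ( e = head; e != NULL; e = e->next )
        {
            probes++;
            if ( e->domain == 1 && e->gfn == i )
                break;
        }

    printf("%d entries, %lu probes total\n", N, probes);
    return 0;
}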
Andres Lagar-Cavilla
2012-Apr-12 14:16 UTC
[PATCH 1 of 3] x86/mm/sharing: Clean ups for relinquishing shared pages on destroy
 xen/arch/x86/domain.c             |  16 ++++++++++++-
 xen/arch/x86/mm/mem_sharing.c     |  45 +++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c             |   4 +++
 xen/include/asm-arm/mm.h          |   4 +++
 xen/include/asm-x86/domain.h      |   1 +
 xen/include/asm-x86/mem_sharing.h |  10 ++++++++
 xen/include/asm-x86/p2m.h         |   4 +++
 7 files changed, 82 insertions(+), 2 deletions(-)


When a domain is destroyed, its pages are freed in relinquish_resources in a
preemptible mode, in the context of a synchronous domctl.

P2m entries pointing to shared pages are, however, released during p2m cleanup
in an RCU callback, and in non-preemptible mode.

This is an O(n) operation for a very large n, which may include actually
freeing shared pages for which the domain is the last holder.

To improve responsiveness, move this operation to the preemptible portion of
domain destruction, during the synchronous domain_kill hypercall, and remove
the bulk of the work from the RCU callback.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r c74fe64aafeb -r 9cbdf6b230dc xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2132,10 +2132,22 @@ int domain_relinquish_resources(struct d
             }
         }
 
-        d->arch.relmem = RELMEM_xen;
+        d->arch.relmem = RELMEM_shared;
         /* fallthrough */
 
-        /* Relinquish every page of memory. */
+    case RELMEM_shared:
+
+        if ( is_hvm_domain(d) )
+        {
+            /* If the domain has shared pages, relinquish them allowing
+             * for preemption. */
+            ret = relinquish_shared_pages(d);
+            if ( ret )
+                return ret;
+        }
+
+        d->arch.relmem = RELMEM_xen;
+        /* Fallthrough. Relinquish every page of memory. */
    case RELMEM_xen:
        ret = relinquish_memory(d, &d->xenpage_list, ~0UL);
        if ( ret )
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/arch/x86/mm/mem_sharing.c
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -33,6 +33,7 @@
 #include <asm/mem_event.h>
 #include <asm/atomic.h>
 #include <xen/rcupdate.h>
+#include <asm/event.h>
 
 #include "mm-locks.h"
 
@@ -1034,6 +1035,50 @@ private_page_found:
     return 0;
 }
 
+int relinquish_shared_pages(struct domain *d)
+{
+    int rc = 0;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    unsigned long gfn, count = 0;
+
+    if ( p2m == NULL )
+        return 0;
+
+    p2m_lock(p2m);
+    for (gfn = p2m->next_shared_gfn_to_relinquish;
+         gfn < p2m->max_mapped_pfn; gfn++ )
+    {
+        p2m_access_t a;
+        p2m_type_t t;
+        mfn_t mfn;
+        if ( atomic_read(&d->shr_pages) == 0 )
+            break;
+        mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL);
+        if ( mfn_valid(mfn) && (t == p2m_ram_shared) )
+        {
+            /* Does not fail with ENOMEM given the DESTROY flag */
+            BUG_ON(__mem_sharing_unshare_page(d, gfn,
+                    MEM_SHARING_DESTROY_GFN));
+            /* Clear out the p2m entry so no one else may try to
+             * unshare */
+            p2m->set_entry(p2m, gfn, _mfn(0), PAGE_ORDER_4K,
+                            p2m_invalid, p2m_access_rwx);
+            count++;
+        }
+
+        /* Preempt every 2MiB. Arbitrary */
+        if ( (count == 512) && hypercall_preempt_check() )
+        {
+            p2m->next_shared_gfn_to_relinquish = gfn + 1;
+            rc = -EAGAIN;
+            break;
+        }
+    }
+
+    p2m_unlock(p2m);
+    return rc;
+}
+
 int mem_sharing_memop(struct domain *d, xen_mem_sharing_op_t *mec)
 {
     int rc = 0;
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/arch/x86/mm/p2m.c
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -371,9 +371,13 @@ void p2m_teardown(struct p2m_domain *p2m
     p2m_lock(p2m);
 
 #ifdef __x86_64__
+    /* Try to unshare any remaining shared p2m entries. Safeguard
+     * Since relinquish_shared_pages should have done the work. */
     for ( gfn=0; gfn < p2m->max_mapped_pfn; gfn++ )
     {
         p2m_access_t a;
+        if ( atomic_read(&d->shr_pages) == 0 )
+            break;
         mfn = p2m->get_entry(p2m, gfn, &t, &a, 0, NULL);
         if ( mfn_valid(mfn) && (t == p2m_ram_shared) )
         {
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/include/asm-arm/mm.h
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -251,6 +251,10 @@ int get_page(struct page_info *page, st
 
 static inline void put_gfn(struct domain *d, unsigned long gfn) {}
 static inline void mem_event_cleanup(struct domain *d) {}
+static inline int relinquish_shared_pages(struct domain *d)
+{
+    return 0;
+}
 
 #define INVALID_MFN             (~0UL)
 
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/include/asm-x86/domain.h
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -296,6 +296,7 @@ struct arch_domain
     /* Continuable domain_relinquish_resources(). */
     enum {
         RELMEM_not_started,
+        RELMEM_shared,
         RELMEM_xen,
         RELMEM_l4,
         RELMEM_l3,
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/include/asm-x86/mem_sharing.h
--- a/xen/include/asm-x86/mem_sharing.h
+++ b/xen/include/asm-x86/mem_sharing.h
@@ -89,9 +89,19 @@ int mem_sharing_domctl(struct domain *d,
 int mem_sharing_audit(void);
 void mem_sharing_init(void);
 
+/* Scans the p2m and relinquishes any shared pages, destroying
+ * those for which this domain holds the final reference.
+ * Preemptible.
+ */
+int relinquish_shared_pages(struct domain *d);
+
 #else
 
 #define mem_sharing_init()  do { } while (0)
+static inline int relinquish_shared_pages(struct domain *d)
+{
+    return 0;
+}
 
 #endif /* __x86_64__ */
 
diff -r c74fe64aafeb -r 9cbdf6b230dc xen/include/asm-x86/p2m.h
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -260,6 +260,10 @@ struct p2m_domain {
     /* Highest guest frame that's ever been mapped in the p2m */
     unsigned long max_mapped_pfn;
 
+    /* When releasing shared gfn's in a preemptible manner, recall where
+     * to resume the search */
+    unsigned long next_shared_gfn_to_relinquish;
+
     /* Populate-on-demand variables
      * All variables are protected with the pod lock. We cannot rely on
      * the p2m lock if it's turned into a fine-grained lock.
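The key mechanism in patch 1 is the resume cursor plus the -EAGAIN contract:
each invocation does a bounded chunk of work, and the synchronous domctl path
simply re-invokes it until it returns 0. Below is a reduced, standalone sketch
of that shape; the names are hypothetical (relinquish_step() stands in for
relinquish_shared_pages(), preempt_wanted() for hypercall_preempt_check()),
and it is a user-space illustration rather than hypervisor code:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-domain teardown state. */
struct teardown_state {
    unsigned long next_gfn;    /* like p2m->next_shared_gfn_to_relinquish */
    unsigned long max_gfn;
};

/* Stand-in for hypercall_preempt_check(): pretend work is always pending. */
static bool preempt_wanted(void) { return true; }

/* One preemptible pass: handle up to 512 entries (2MiB of 4K pages), then
 * either finish or record the resume point and ask to be called again. */
static int relinquish_step(struct teardown_state *st)
{
    unsigned long gfn, count = 0;

    for ( gfn = st->next_gfn; gfn < st->max_gfn; gfn++ )
    {
        /* ... unshare/free whatever backs this gfn ... */
        count++;
        if ( (count == 512) && preempt_wanted() )
        {
            st->next_gfn = gfn + 1;   /* remember where to resume */
            return -EAGAIN;           /* caller re-issues the operation */
        }
    }
    return 0;
}

int main(void)
{
    struct teardown_state st = { .next_gfn = 0, .max_gfn = 1UL << 20 };
    unsigned long passes = 0;
    int rc;

    /* The synchronous domain_kill path retries on -EAGAIN, which is what
     * keeps the system (here: the caller) responsive. */
    while ( (rc = relinquish_step(&st)) == -EAGAIN )
        passes++;

    printf("finished after %lu resumptions, rc=%d\n", passes, rc);
    return 0;
}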
Andres Lagar-Cavilla
2012-Apr-12 14:16 UTC
[PATCH 2 of 3] x86/mem_sharing: modularize reverse map for shared frames
 xen/arch/x86/mm/mem_sharing.c |  142 ++++++++++++++++++++++++++++++-----------
 1 files changed, 103 insertions(+), 39 deletions(-)


Each shared frame maintains a reverse map of <domain, gfn> tuples, so we know
which tuples this shared frame is backing. This is useful for auditing and
sanity-checking, and necessary to update all relevant p2m entries when sharing.

The reverse map is maintained as a doubly linked list, but the interface is
open-coded throughout the mem_sharing.c subsystem. Bury it inside a level of
abstraction, so it can later support different (more scalable) rmap
implementations.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 9cbdf6b230dc -r 1730bff8fccf xen/arch/x86/mm/mem_sharing.c
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -69,7 +69,8 @@ static inline void audit_add_list(struct
     spin_unlock(&shr_audit_lock);
 }
 
-static inline void audit_del_list(struct page_info *page)
+/* Removes from the audit list and cleans up the page sharing metadata. */
+static inline void page_sharing_dispose(struct page_info *page)
 {
     spin_lock(&shr_audit_lock);
     list_del_rcu(&page->sharing->entry);
@@ -86,7 +87,7 @@ int mem_sharing_audit(void)
 }
 
 #define audit_add_list(p)  ((void)0)
-static inline void audit_del_list(struct page_info *page)
+static inline void page_sharing_dispose(struct page_info *page)
 {
     xfree(page->sharing);
 }
@@ -143,6 +144,7 @@ static inline shr_handle_t get_next_hand
 static atomic_t nr_saved_mfns   = ATOMIC_INIT(0);
 static atomic_t nr_shared_mfns  = ATOMIC_INIT(0);
 
+/** Reverse map **/
 typedef struct gfn_info
 {
     unsigned long gfn;
@@ -150,15 +152,77 @@ typedef struct gfn_info
     struct list_head list;
 } gfn_info_t;
 
-/* Returns true if list has only one entry. O(1) complexity. */
-static inline int list_has_one_entry(struct list_head *head)
+static inline void
+rmap_init(struct page_info *page)
 {
+    INIT_LIST_HEAD(&page->sharing->gfns);
+}
+
+static inline void
+rmap_del(gfn_info_t *gfn_info, struct page_info *page)
+{
+    list_del(&gfn_info->list);
+}
+
+static inline void
+rmap_add(gfn_info_t *gfn_info, struct page_info *page)
+{
+    INIT_LIST_HEAD(&gfn_info->list);
+    list_add(&gfn_info->list, &page->sharing->gfns);
+}
+
+static inline gfn_info_t *
+rmap_retrieve(uint16_t domain_id, unsigned long gfn,
+              struct page_info *page)
+{
+    gfn_info_t *gfn_info;
+    struct list_head *le;
+    list_for_each(le, &page->sharing->gfns)
+    {
+        gfn_info = list_entry(le, gfn_info_t, list);
+        if ( (gfn_info->gfn == gfn) && (gfn_info->domain == domain_id) )
+            return gfn_info;
+    }
+    return NULL;
+}
+
+/* Returns true if the rmap has only one entry. O(1) complexity. */
+static inline int rmap_has_one_entry(struct page_info *page)
+{
+    struct list_head *head = &page->sharing->gfns;
     return (head->next != head) && (head->next->next == head);
 }
 
-static inline gfn_info_t *gfn_get_info(struct list_head *list)
+/* Returns true if the rmap has any entries. O(1) complexity. */
+static inline int rmap_has_entries(struct page_info *page)
 {
-    return list_entry(list->next, gfn_info_t, list);
+    return (!list_empty(&page->sharing->gfns));
+}
+
+/* The iterator hides the details of how the rmap is implemented. This
+ * involves splitting the list_for_each_safe macro into two steps. */
+struct rmap_iterator {
+    struct list_head *curr;
+    struct list_head *next;
+};
+
+static inline void
+rmap_seed_iterator(struct page_info *page, struct rmap_iterator *ri)
+{
+    ri->curr = &page->sharing->gfns;
+    ri->next = ri->curr->next;
+}
+
+static inline gfn_info_t *
+rmap_iterate(struct page_info *page, struct rmap_iterator *ri)
+{
+    if ( ri->next == &page->sharing->gfns)
+        return NULL;
+
+    ri->curr = ri->next;
+    ri->next = ri->curr->next;
+
+    return list_entry(ri->curr, gfn_info_t, list);
 }
 
 static inline gfn_info_t *mem_sharing_gfn_alloc(struct page_info *page,
@@ -172,8 +236,8 @@ static inline gfn_info_t *mem_sharing_gf
 
     gfn_info->gfn = gfn;
     gfn_info->domain = d->domain_id;
-    INIT_LIST_HEAD(&gfn_info->list);
-    list_add(&gfn_info->list, &page->sharing->gfns);
+
+    rmap_add(gfn_info, page);
 
     /* Increment our number of shared pges. */
     atomic_inc(&d->shr_pages);
@@ -181,14 +245,15 @@ static inline gfn_info_t *mem_sharing_gf
     return gfn_info;
 }
 
-static inline void mem_sharing_gfn_destroy(struct domain *d,
+static inline void mem_sharing_gfn_destroy(struct page_info *page,
+                                           struct domain *d,
                                            gfn_info_t *gfn_info)
 {
     /* Decrement the number of pages. */
     atomic_dec(&d->shr_pages);
 
     /* Free the gfn_info structure. */
-    list_del(&gfn_info->list);
+    rmap_del(gfn_info, page);
     xfree(gfn_info);
 }
 
@@ -230,8 +295,9 @@ int mem_sharing_audit(void)
         struct page_sharing_info *pg_shared_info;
         unsigned long nr_gfns = 0;
         struct page_info *pg;
-        struct list_head *le;
         mfn_t mfn;
+        gfn_info_t *g;
+        struct rmap_iterator ri;
 
         pg_shared_info = list_entry(ae, struct page_sharing_info, entry);
         pg = pg_shared_info->pg;
@@ -272,7 +338,7 @@ int mem_sharing_audit(void)
         }
 
         /* Check we have a list */
-        if ( (!pg->sharing) || (list_empty(&pg->sharing->gfns)) )
+        if ( (!pg->sharing) || !rmap_has_entries(pg) )
         {
            MEM_SHARING_DEBUG("mfn %lx shared, but empty gfn list!\n",
                              mfn_x(mfn));
@@ -284,14 +350,13 @@ int mem_sharing_audit(void)
             count_found++;
 
         /* Check if all GFNs map to the MFN, and the p2m types */
-        list_for_each(le, &pg->sharing->gfns)
+        rmap_seed_iterator(pg, &ri);
+        while ( (g = rmap_iterate(pg, &ri)) != NULL )
         {
             struct domain *d;
             p2m_type_t t;
             mfn_t o_mfn;
-            gfn_info_t *g;
 
-            g = list_entry(le, gfn_info_t, list);
             d = get_domain_by_id(g->domain);
             if ( d == NULL )
             {
@@ -677,7 +742,7 @@ int mem_sharing_nominate_page(struct dom
         goto out;
     }
     page->sharing->pg = page;
-    INIT_LIST_HEAD(&page->sharing->gfns);
+    rmap_init(page);
 
     /* Create the handle */
     page->sharing->handle = get_next_handle();
@@ -698,7 +763,7 @@ int mem_sharing_nominate_page(struct dom
          * it a few lines above.
          * The mfn needs to revert back to rw type. This should never fail,
         * since no-one knew that the mfn was temporarily sharable */
-        mem_sharing_gfn_destroy(d, gfn_info);
+        mem_sharing_gfn_destroy(page, d, gfn_info);
         xfree(page->sharing);
         page->sharing = NULL;
         /* NOTE: We haven't yet added this to the audit list. */
@@ -726,13 +791,13 @@ int mem_sharing_share_pages(struct domai
                             struct domain *cd, unsigned long cgfn, shr_handle_t ch)
 {
     struct page_info *spage, *cpage, *firstpg, *secondpg;
-    struct list_head *le, *te;
     gfn_info_t *gfn;
     struct domain *d;
     int ret = -EINVAL;
     mfn_t smfn, cmfn;
     p2m_type_t smfn_type, cmfn_type;
     struct two_gfns tg;
+    struct rmap_iterator ri;
 
     get_two_gfns(sd, sgfn, &smfn_type, NULL, &smfn,
                  cd, cgfn, &cmfn_type, NULL, &cmfn,
@@ -799,15 +864,15 @@ int mem_sharing_share_pages(struct domai
     }
 
     /* Merge the lists together */
-    list_for_each_safe(le, te, &cpage->sharing->gfns)
+    rmap_seed_iterator(cpage, &ri);
+    while ( (gfn = rmap_iterate(cpage, &ri)) != NULL)
     {
-        gfn = list_entry(le, gfn_info_t, list);
         /* Get the source page and type, this should never fail:
         * we are under shr lock, and got a successful lookup */
         BUG_ON(!get_page_and_type(spage, dom_cow, PGT_shared_page));
         /* Move the gfn_info from client list to source list */
-        list_del(&gfn->list);
-        list_add(&gfn->list, &spage->sharing->gfns);
+        rmap_del(gfn, cpage);
+        rmap_add(gfn, spage);
         put_page_and_type(cpage);
         d = get_domain_by_id(gfn->domain);
         BUG_ON(!d);
@@ -817,7 +882,7 @@ int mem_sharing_share_pages(struct domai
     ASSERT(list_empty(&cpage->sharing->gfns));
 
     /* Clear the rest of the shared state */
-    audit_del_list(cpage);
+    page_sharing_dispose(cpage);
     cpage->sharing = NULL;
 
     mem_sharing_page_unlock(secondpg);
@@ -887,7 +952,7 @@ int mem_sharing_add_to_physmap(struct do
     if ( !ret )
     {
         ret = -ENOENT;
-        mem_sharing_gfn_destroy(cd, gfn_info);
+        mem_sharing_gfn_destroy(spage, cd, gfn_info);
         put_page_and_type(spage);
     } else {
         ret = 0;
@@ -929,7 +994,6 @@ int __mem_sharing_unshare_page(struct do
     void *s, *t;
     int last_gfn;
     gfn_info_t *gfn_info = NULL;
-    struct list_head *le;
 
     mfn = get_gfn(d, gfn, &p2mt);
 
@@ -947,34 +1011,35 @@ int __mem_sharing_unshare_page(struct do
         BUG();
     }
 
-    list_for_each(le, &page->sharing->gfns)
+    gfn_info = rmap_retrieve(d->domain_id, gfn, page);
+    if ( unlikely(gfn_info == NULL) )
     {
-        gfn_info = list_entry(le, gfn_info_t, list);
-        if ( (gfn_info->gfn == gfn) && (gfn_info->domain == d->domain_id) )
-            goto gfn_found;
+        gdprintk(XENLOG_ERR, "Could not find gfn_info for shared gfn: "
+                                "%lx\n", gfn);
+        BUG();
     }
-    gdprintk(XENLOG_ERR, "Could not find gfn_info for shared gfn: "
-                            "%lx\n", gfn);
-    BUG();
 
-gfn_found:
     /* Do the accounting first. If anything fails below, we have bigger
      * bigger fish to fry. First, remove the gfn from the list. */
-    last_gfn = list_has_one_entry(&page->sharing->gfns);
+    last_gfn = rmap_has_one_entry(page);
     if ( last_gfn )
     {
-        /* Clean up shared state */
-        audit_del_list(page);
+        /* Clean up shared state. Get rid of the <domid, gfn> tuple
+         * before destroying the rmap. */
+        mem_sharing_gfn_destroy(page, d, gfn_info);
+        page_sharing_dispose(page);
         page->sharing = NULL;
         atomic_dec(&nr_shared_mfns);
     }
     else
         atomic_dec(&nr_saved_mfns);
+
     /* If the GFN is getting destroyed drop the references to MFN
      * (possibly freeing the page), and exit early */
     if ( flags & MEM_SHARING_DESTROY_GFN )
     {
-        mem_sharing_gfn_destroy(d, gfn_info);
+        if ( !last_gfn )
+            mem_sharing_gfn_destroy(page, d, gfn_info);
         put_page_and_type(page);
         mem_sharing_page_unlock(page);
         if ( last_gfn &&
@@ -987,7 +1052,6 @@ gfn_found:
 
     if ( last_gfn )
     {
-        mem_sharing_gfn_destroy(d, gfn_info);
        /* Making a page private atomically unlocks it */
         BUG_ON(page_make_private(d, page) != 0);
         goto private_page_found;
@@ -1011,7 +1075,7 @@ gfn_found:
     unmap_domain_page(t);
 
     BUG_ON(set_shared_p2m_entry(d, gfn, page_to_mfn(page)) == 0);
-    mem_sharing_gfn_destroy(d, gfn_info);
+    mem_sharing_gfn_destroy(old_page, d, gfn_info);
     mem_sharing_page_unlock(old_page);
     put_page_and_type(old_page);
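The point of the iterator abstraction is worth spelling out: rmap_iterate()
caches the next pointer before handing back the current entry, so callers
(such as the list-merge loop in mem_sharing_share_pages) can delete the entry
they were just given without invalidating the walk, and without knowing how
the rmap is stored. Below is a standalone sketch of that pattern in plain C;
the names and the minimal list implementation are illustrative, not the Xen
list API:

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Minimal intrusive doubly-linked list, standing in for Xen's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}
static void list_del(struct list_head *n)
{
    n->prev->next = n->next; n->next->prev = n->prev;
}

struct entry { unsigned long gfn; struct list_head list; };
#define entry_of(le) ((struct entry *)((char *)(le) - offsetof(struct entry, list)))

/* Two-step iterator in the spirit of rmap_seed_iterator()/rmap_iterate():
 * the next pointer is cached before the current entry is returned, so the
 * caller may unlink and free that entry without breaking the walk. */
struct iter { struct list_head *head, *curr, *next; };

static void iter_seed(struct list_head *head, struct iter *it)
{
    it->head = head; it->curr = head; it->next = head->next;
}

static struct entry *iter_next(struct iter *it)
{
    if ( it->next == it->head )
        return NULL;
    it->curr = it->next;
    it->next = it->curr->next;
    return entry_of(it->curr);
}

int main(void)
{
    struct list_head rmap;
    struct iter it;
    struct entry *e;
    unsigned long i;

    list_init(&rmap);
    for ( i = 0; i < 4; i++ )
    {
        e = malloc(sizeof(*e));
        e->gfn = i;
        list_add(&e->list, &rmap);
    }

    /* Walk-and-delete, as the merge loop in mem_sharing_share_pages() does. */
    iter_seed(&rmap, &it);
    while ( (e = iter_next(&it)) != NULL )
    {
        printf("removing gfn %lu\n", e->gfn);
        list_del(&e->list);
        free(e);
    }
    return 0;
}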
Andres Lagar-Cavilla
2012-Apr-12 14:16 UTC
[PATCH 3 of 3] x86/mem_sharing: For shared pages with many references, use a hash table instead of a list
 xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/mem_sharing.h |   13 ++-
 2 files changed, 169 insertions(+), 14 deletions(-)


For shared frames that have many references, the doubly-linked list used to
store the rmap results in costly scans during unshare operations. To alleviate
the overhead, replace the linked list by a hash table. However, hash tables
are space-intensive, so only use them for pages that have "many" (arbitrary
threshold) references.

Unsharing is heavily exercised during domain destroy. In experimental testing,
for a domain that points over 100 thousand pages to the same shared frame,
domain destruction dropped from over 7 minutes(!) to less than two seconds.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

diff -r 1730bff8fccf -r 7b606c043208 xen/arch/x86/mm/mem_sharing.c
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -49,6 +49,17 @@ DEFINE_PER_CPU(pg_lock_data_t, __pld);
 #define MEM_SHARING_DEBUG(_f, _a...)                                  \
     debugtrace_printk("mem_sharing_debug: %s(): " _f, __func__, ##_a)
 
+/* Reverse map defines */
+#define RMAP_HASHTAB_ORDER   0
+#define RMAP_HASHTAB_SIZE   \
+        ((PAGE_SIZE << RMAP_HASHTAB_ORDER) / sizeof(struct list_head))
+#define RMAP_USES_HASHTAB(page) \
+        ((page)->sharing->hash_table.flag == NULL)
+#define RMAP_HEAVY_SHARED_PAGE   RMAP_HASHTAB_SIZE
+/* A bit of hysteresis. We don't want to be mutating between list and hash
+ * table constantly. */
+#define RMAP_LIGHT_SHARED_PAGE   (RMAP_HEAVY_SHARED_PAGE >> 2)
+
 #if MEM_SHARING_AUDIT
 
 static struct list_head shr_audit_list;
@@ -72,6 +83,11 @@ static inline void audit_add_list(struct
 /* Removes from the audit list and cleans up the page sharing metadata. */
 static inline void page_sharing_dispose(struct page_info *page)
 {
+    /* Unlikely given our thresholds, but we should be careful. */
+    if ( unlikely(RMAP_USES_HASHTAB(page)) )
+        free_xenheap_pages(page->sharing->hash_table.bucket,
+                            RMAP_HASHTAB_ORDER);
+
     spin_lock(&shr_audit_lock);
     list_del_rcu(&page->sharing->entry);
     spin_unlock(&shr_audit_lock);
@@ -89,6 +105,10 @@ int mem_sharing_audit(void)
 #define audit_add_list(p)  ((void)0)
 static inline void page_sharing_dispose(struct page_info *page)
 {
+    /* Unlikely given our thresholds, but we should be careful. */
+    if ( unlikely(RMAP_USES_HASHTAB(page)) )
+        free_xenheap_pages(page->sharing->hash_table.bucket,
+                            RMAP_HASHTAB_ORDER);
     xfree(page->sharing);
 }
 
@@ -145,6 +165,11 @@ static atomic_t nr_saved_mfns   = ATOMIC
 static atomic_t nr_shared_mfns  = ATOMIC_INIT(0);
 
 /** Reverse map **/
+/* Every shared frame keeps a reverse map (rmap) of <domain, gfn> tuples that
+ * this shared frame backs. For pages with a low degree of sharing, a O(n)
+ * search linked list is good enough. For pages with higher degree of sharing,
+ * we use a hash table instead. */
+
 typedef struct gfn_info
 {
     unsigned long gfn;
@@ -155,20 +180,109 @@ typedef struct gfn_info
 static inline void
 rmap_init(struct page_info *page)
 {
+    /* We always start off as a doubly linked list. */
     INIT_LIST_HEAD(&page->sharing->gfns);
 }
 
+/* Exceedingly simple "hash function" */
+#define HASH(domain, gfn)       \
+    (((gfn) + (domain)) % RMAP_HASHTAB_SIZE)
+
+/* Conversions. Tuned by the thresholds. Should only happen twice
+ * (once each) during the lifetime of a shared page */
+static inline int
+rmap_list_to_hash_table(struct page_info *page)
+{
+    unsigned int i;
+    struct list_head *pos, *tmp, *b =
+        alloc_xenheap_pages(RMAP_HASHTAB_ORDER, 0);
+
+    if ( b == NULL )
+        return -ENOMEM;
+
+    for (i = 0; i < RMAP_HASHTAB_SIZE; i++)
+        INIT_LIST_HEAD(b + i);
+
+    list_for_each_safe(pos, tmp, &page->sharing->gfns)
+    {
+        gfn_info_t *gfn_info = list_entry(pos, gfn_info_t, list);
+        struct list_head *bucket = b + HASH(gfn_info->domain, gfn_info->gfn);
+        list_del(pos);
+        list_add(pos, bucket);
+    }
+
+    page->sharing->hash_table.bucket = b;
+    page->sharing->hash_table.flag   = NULL;
+
+    return 0;
+}
+
 static inline void
-rmap_del(gfn_info_t *gfn_info, struct page_info *page)
+rmap_hash_table_to_list(struct page_info *page)
 {
+    unsigned int i;
+    struct list_head *bucket = page->sharing->hash_table.bucket;
+
+    INIT_LIST_HEAD(&page->sharing->gfns);
+
+    for (i = 0; i < RMAP_HASHTAB_SIZE; i++)
+    {
+        struct list_head *pos, *tmp, *head = bucket + i;
+        list_for_each_safe(pos, tmp, head)
+        {
+            list_del(pos);
+            list_add(pos, &page->sharing->gfns);
+        }
+    }
+
+    free_xenheap_pages(bucket, RMAP_HASHTAB_ORDER);
+}
+
+/* Generic accessors to the rmap */
+static inline unsigned long
+rmap_count(struct page_info *pg)
+{
+    unsigned long count;
+    unsigned long t = read_atomic(&pg->u.inuse.type_info);
+    count = t & PGT_count_mask;
+    if ( t & PGT_locked )
+        count--;
+    return count;
+}
+
+/* The page type count is always decreased after removing from the rmap.
+ * Use a convert flag to avoid mutating the rmap if in the middle of an
+ * iterator, or if the page will be soon destroyed anyways. */
+static inline void
+rmap_del(gfn_info_t *gfn_info, struct page_info *page, int convert)
+{
+    if ( RMAP_USES_HASHTAB(page) && convert &&
+         (rmap_count(page) <= RMAP_LIGHT_SHARED_PAGE) )
+        rmap_hash_table_to_list(page);
+
+    /* Regardless of rmap type, same removal operation */
     list_del(&gfn_info->list);
 }
 
+/* The page type count is always increased before adding to the rmap. */
 static inline void
 rmap_add(gfn_info_t *gfn_info, struct page_info *page)
 {
+    struct list_head *head;
+
+    if ( !RMAP_USES_HASHTAB(page) &&
+         (rmap_count(page) >= RMAP_HEAVY_SHARED_PAGE) )
+        /* The conversion may fail with ENOMEM. We'll be less efficient,
+         * but no reason to panic. */
+        (void)rmap_list_to_hash_table(page);
+
+    head = (RMAP_USES_HASHTAB(page)) ?
+        page->sharing->hash_table.bucket +
+                            HASH(gfn_info->domain, gfn_info->gfn) :
+        &page->sharing->gfns;
+
     INIT_LIST_HEAD(&gfn_info->list);
-    list_add(&gfn_info->list, &page->sharing->gfns);
+    list_add(&gfn_info->list, head);
 }
 
 static inline gfn_info_t *
@@ -176,27 +290,33 @@ rmap_retrieve(uint16_t domain_id, unsign
                 struct page_info *page)
 {
     gfn_info_t *gfn_info;
-    struct list_head *le;
-    list_for_each(le, &page->sharing->gfns)
+    struct list_head *le, *head;
+
+    head = (RMAP_USES_HASHTAB(page)) ?
+        page->sharing->hash_table.bucket + HASH(domain_id, gfn) :
+        &page->sharing->gfns;
+
+    list_for_each(le, head)
     {
         gfn_info = list_entry(le, gfn_info_t, list);
         if ( (gfn_info->gfn == gfn) && (gfn_info->domain == domain_id) )
            return gfn_info;
     }
+
+    /* Nothing was found */
     return NULL;
 }
 
 /* Returns true if the rmap has only one entry. O(1) complexity. */
 static inline int rmap_has_one_entry(struct page_info *page)
 {
-    struct list_head *head = &page->sharing->gfns;
-    return (head->next != head) && (head->next->next == head);
+    return (rmap_count(page) == 1);
 }
 
 /* Returns true if the rmap has any entries. O(1) complexity. */
 static inline int rmap_has_entries(struct page_info *page)
 {
-    return (!list_empty(&page->sharing->gfns));
+    return (rmap_count(page) != 0);
 }
 
 /* The iterator hides the details of how the rmap is implemented. This
@@ -204,20 +324,43 @@ static inline int rmap_has_entries(struc
 struct rmap_iterator {
     struct list_head *curr;
     struct list_head *next;
+    unsigned int bucket;
 };
 
 static inline void
 rmap_seed_iterator(struct page_info *page, struct rmap_iterator *ri)
 {
-    ri->curr = &page->sharing->gfns;
+    ri->curr = (RMAP_USES_HASHTAB(page)) ?
+                page->sharing->hash_table.bucket :
+                &page->sharing->gfns;
     ri->next = ri->curr->next;
+    ri->bucket = 0;
 }
 
 static inline gfn_info_t *
 rmap_iterate(struct page_info *page, struct rmap_iterator *ri)
 {
-    if ( ri->next == &page->sharing->gfns)
-        return NULL;
+    struct list_head *head = (RMAP_USES_HASHTAB(page)) ?
+                page->sharing->hash_table.bucket + ri->bucket :
+                &page->sharing->gfns;
+
+retry:
+    if ( ri->next == head)
+    {
+        if ( RMAP_USES_HASHTAB(page) )
+        {
+            ri->bucket++;
+            if ( ri->bucket >= RMAP_HASHTAB_SIZE )
+                /* No more hash table buckets */
+                return NULL;
+            head = page->sharing->hash_table.bucket + ri->bucket;
+            ri->curr = head;
+            ri->next = ri->curr->next;
+            goto retry;
+        } else
+            /* List exhausted */
+            return NULL;
+    }
 
     ri->curr = ri->next;
     ri->next = ri->curr->next;
@@ -253,7 +396,7 @@ static inline void mem_sharing_gfn_destr
     atomic_dec(&d->shr_pages);
 
     /* Free the gfn_info structure. */
-    rmap_del(gfn_info, page);
+    rmap_del(gfn_info, page, 1);
     xfree(gfn_info);
 }
 
@@ -870,8 +1013,9 @@ int mem_sharing_share_pages(struct domai
         /* Get the source page and type, this should never fail:
         * we are under shr lock, and got a successful lookup */
         BUG_ON(!get_page_and_type(spage, dom_cow, PGT_shared_page));
-        /* Move the gfn_info from client list to source list */
-        rmap_del(gfn, cpage);
+        /* Move the gfn_info from client list to source list.
+         * Don't change the type of rmap for the client page. */
+        rmap_del(gfn, cpage, 0);
         rmap_add(gfn, spage);
         put_page_and_type(cpage);
         d = get_domain_by_id(gfn->domain);
diff -r 1730bff8fccf -r 7b606c043208 xen/include/asm-x86/mem_sharing.h
--- a/xen/include/asm-x86/mem_sharing.h
+++ b/xen/include/asm-x86/mem_sharing.h
@@ -30,6 +30,13 @@
 
 typedef uint64_t shr_handle_t;
 
+typedef struct rmap_hashtab {
+    struct list_head *bucket;
+    /* Overlaps with prev pointer of list_head in union below.
+     * Unlike the prev pointer, this can be NULL. */
+    void *flag;
+} rmap_hashtab_t;
+
 struct page_sharing_info
 {
     struct page_info *pg;   /* Back pointer to the page. */
@@ -38,7 +45,11 @@ struct page_sharing_info
     struct list_head entry; /* List of all shared pages (entry). */
     struct rcu_head rcu_head; /* List of all shared pages (entry). */
 #endif
-    struct list_head gfns;  /* List of domains and gfns for this page (head). */
+    /* Reverse map of <domain,gfn> tuples for this shared frame. */
+    union {
+        struct list_head gfns;
+        rmap_hashtab_t hash_table;
+    };
 };
 
 #ifdef __x86_64__
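For a sense of the constants involved: assuming a 4KiB page and a 16-byte
struct list_head on x86-64 (assumptions of this note, not stated in the
patch), a zero-order xenheap page yields 256 buckets, so RMAP_HEAVY_SHARED_PAGE
is 256 and RMAP_LIGHT_SHARED_PAGE is 64, which is the hysteresis described
above. The quick standalone check below recomputes those numbers and shows how
the simple HASH() macro spreads entries across buckets:

#include <stdio.h>

int main(void)
{
    /* Assumed x86-64 values; not taken from the patch text itself. */
    const unsigned long page_size = 4096;
    const unsigned long sizeof_list_head = 2 * sizeof(void *);   /* 16 bytes */
    const unsigned int  order = 0;                               /* RMAP_HASHTAB_ORDER */

    unsigned long hashtab_size = (page_size << order) / sizeof_list_head;
    unsigned long heavy = hashtab_size;        /* RMAP_HEAVY_SHARED_PAGE */
    unsigned long light = heavy >> 2;          /* RMAP_LIGHT_SHARED_PAGE */

    printf("buckets = %lu, convert to hash at >= %lu entries, back to list at <= %lu\n",
           hashtab_size, heavy, light);

    /* HASH(domain, gfn) = (gfn + domain) % buckets: consecutive gfns from one
     * domain land in consecutive buckets, which suits this workload. */
    unsigned long domain = 7, gfn;
    for ( gfn = 0x1000; gfn < 0x1004; gfn++ )
        printf("HASH(%lu, 0x%lx) = %lu\n", domain, gfn, (gfn + domain) % hashtab_size);

    return 0;
}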
Andres Lagar-Cavilla
2012-Apr-13 16:11 UTC
Re: [PATCH 0 of 3] RFC: x86 memory sharing performance improvements
Please scratch the RFC tag. I'm done testing this, and now I'm asking for
inclusion in the tree.

Thanks,
Andres

-----------------------------------------------------------------------
This is an RFC series. I haven't fully tested it yet, but I want the concept to
be known, as I intend this to be merged prior to the closing of the 4.2 window.

The sharing subsystem does not scale elegantly with high degrees of page
sharing. The culprit is a reverse map that each shared frame maintains,
resolving to all domain pages pointing to the shared frame. Because the rmap is
implemented with an O(n)-search linked list, CoW unsharing can result in
prolonged search times.

The place where this becomes most obvious is during domain destruction, during
which all shared p2m entries need to be unshared. Destroying a domain with a
lot of sharing could result in minutes of hypervisor freeze-up!

Solutions proposed:
- Make the p2m cleanup of shared entries part of the preemptible, synchronous
  domain_kill domctl (as opposed to executing monolithically in the finalize
  destruction RCU callback).
- When a shared frame exceeds an arbitrary ref count, mutate the rmap from a
  linked list to a hash table.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

 xen/arch/x86/domain.c             |   16 +++-
 xen/arch/x86/mm/mem_sharing.c     |   45 ++++++++++
 xen/arch/x86/mm/p2m.c             |    4 +
 xen/include/asm-arm/mm.h          |    4 +
 xen/include/asm-x86/domain.h      |    1 +
 xen/include/asm-x86/mem_sharing.h |   10 ++
 xen/include/asm-x86/p2m.h         |    4 +
 xen/arch/x86/mm/mem_sharing.c     |  142 +++++++++++++++++++++++--------
 xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
 xen/include/asm-x86/mem_sharing.h |   13 ++-
 10 files changed, 354 insertions(+), 55 deletions(-)
Tim Deegan
2012-Apr-18 12:42 UTC
Re: [PATCH 1 of 3] x86/mm/sharing: Clean ups for relinquishing shared pages on destroy
At 10:16 -0400 on 12 Apr (1334225772), Andres Lagar-Cavilla wrote:
>  xen/arch/x86/domain.c             |  16 ++++++++++++-
>  xen/arch/x86/mm/mem_sharing.c     |  45 +++++++++++++++++++++++++++++++++++++++
>  xen/arch/x86/mm/p2m.c             |   4 +++
>  xen/include/asm-arm/mm.h          |   4 +++
>  xen/include/asm-x86/domain.h      |   1 +
>  xen/include/asm-x86/mem_sharing.h |  10 ++++++++
>  xen/include/asm-x86/p2m.h         |   4 +++
>  7 files changed, 82 insertions(+), 2 deletions(-)
>
>
> When a domain is destroyed, its pages are freed in relinquish_resources in a
> preemptible mode, in the context of a synchronous domctl.
>
> P2m entries pointing to shared pages are, however, released during p2m cleanup
> in an RCU callback, and in non-preemptible mode.
>
> This is an O(n) operation for a very large n, which may include actually
> freeing shared pages for which the domain is the last holder.
>
> To improve responsiveness, move this operation to the preemptible portion of
> domain destruction, during the synchronous domain_kill hypercall, and remove
> the bulk of the work from the RCU callback.
>
> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>

I've applied this as a bug-fix. The other two seem more like new
development, and I'm less happy about taking them before 4.2.

Cheers,

Tim.
Andres Lagar-Cavilla
2012-Apr-18 13:06 UTC
Re: [PATCH 1 of 3] x86/mm/sharing: Clean ups for relinquishing shared pages on destroy
> At 10:16 -0400 on 12 Apr (1334225772), Andres Lagar-Cavilla wrote:
>>  xen/arch/x86/domain.c             |  16 ++++++++++++-
>>  xen/arch/x86/mm/mem_sharing.c     |  45 +++++++++++++++++++++++++++++++++++++++
>>  xen/arch/x86/mm/p2m.c             |   4 +++
>>  xen/include/asm-arm/mm.h          |   4 +++
>>  xen/include/asm-x86/domain.h      |   1 +
>>  xen/include/asm-x86/mem_sharing.h |  10 ++++++++
>>  xen/include/asm-x86/p2m.h         |   4 +++
>>  7 files changed, 82 insertions(+), 2 deletions(-)
>>
>>
>> When a domain is destroyed, its pages are freed in relinquish_resources in a
>> preemptible mode, in the context of a synchronous domctl.
>>
>> P2m entries pointing to shared pages are, however, released during p2m cleanup
>> in an RCU callback, and in non-preemptible mode.
>>
>> This is an O(n) operation for a very large n, which may include actually
>> freeing shared pages for which the domain is the last holder.
>>
>> To improve responsiveness, move this operation to the preemptible portion of
>> domain destruction, during the synchronous domain_kill hypercall, and remove
>> the bulk of the work from the RCU callback.
>>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>
> I've applied this as a bug-fix. The other two seem more like new
> development, and I'm less happy about taking them before 4.2.

Thanks for applying this one. Can I ask you to reconsider the other two?

Right now I can freeze up the hypervisor for several minutes (without these
two patches). Soft lockups start popping up in dom0. I've had two instances
in which the block IO driver bails out and the dom0 file system becomes
read-only, forcing a host reboot.

It's a horrible performance regression this is fixing.

Thanks,
Andres

>
> Cheers,
>
> Tim.
>
Tim Deegan
2012-Apr-18 14:05 UTC
Re: [PATCH 2 of 3] x86/mem_sharing: modularize reverse map for shared frames
At 10:16 -0400 on 12 Apr (1334225773), Andres Lagar-Cavilla wrote:
>      /* Do the accounting first. If anything fails below, we have bigger
>       * bigger fish to fry. First, remove the gfn from the list. */
> -    last_gfn = list_has_one_entry(&page->sharing->gfns);
> +    last_gfn = rmap_has_one_entry(page);
>      if ( last_gfn )
>      {
> -        /* Clean up shared state */
> -        audit_del_list(page);
> +        /* Clean up shared state. Get rid of the <domid, gfn> tuple
> +         * before destroying the rmap. */
> +        mem_sharing_gfn_destroy(page, d, gfn_info);

Moving this mem_sharing_gfn_destroy() around seems like it's unrelated
to the rest of this patch, which is basically code motion. If so, can
you put it in a patch of its own?

Otherwise, pending patch 3/3 being OK too,

Acked-by: Tim Deegan <tim@xen.org>

> +        page_sharing_dispose(page);
>          page->sharing = NULL;
>          atomic_dec(&nr_shared_mfns);
>      }
>      else
>          atomic_dec(&nr_saved_mfns);
> +
>      /* If the GFN is getting destroyed drop the references to MFN
>       * (possibly freeing the page), and exit early */
>      if ( flags & MEM_SHARING_DESTROY_GFN )
>      {
> -        mem_sharing_gfn_destroy(d, gfn_info);
> +        if ( !last_gfn )
> +            mem_sharing_gfn_destroy(page, d, gfn_info);
>          put_page_and_type(page);
>          mem_sharing_page_unlock(page);
>          if ( last_gfn &&
> @@ -987,7 +1052,6 @@ gfn_found:
>
>      if ( last_gfn )
>      {
> -        mem_sharing_gfn_destroy(d, gfn_info);
>         /* Making a page private atomically unlocks it */
>          BUG_ON(page_make_private(d, page) != 0);
>          goto private_page_found;
> @@ -1011,7 +1075,7 @@ gfn_found:
>      unmap_domain_page(t);
>
>      BUG_ON(set_shared_p2m_entry(d, gfn, page_to_mfn(page)) == 0);
> -    mem_sharing_gfn_destroy(d, gfn_info);
> +    mem_sharing_gfn_destroy(old_page, d, gfn_info);
>      mem_sharing_page_unlock(old_page);
>      put_page_and_type(old_page);
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Andres Lagar-Cavilla
2012-Apr-18 14:19 UTC
Re: [PATCH 2 of 3] x86/mem_sharing: modularize reverse map for shared frames
> At 10:16 -0400 on 12 Apr (1334225773), Andres Lagar-Cavilla wrote:
>>      /* Do the accounting first. If anything fails below, we have bigger
>>       * bigger fish to fry. First, remove the gfn from the list. */
>> -    last_gfn = list_has_one_entry(&page->sharing->gfns);
>> +    last_gfn = rmap_has_one_entry(page);
>>      if ( last_gfn )
>>      {
>> -        /* Clean up shared state */
>> -        audit_del_list(page);
>> +        /* Clean up shared state. Get rid of the <domid, gfn> tuple
>> +         * before destroying the rmap. */
>> +        mem_sharing_gfn_destroy(page, d, gfn_info);
>
> Moving this mem_sharing_gfn_destroy() around seems like it's unrelated
> to the rest of this patch, which is basically code motion. If so, can
> you put it in a patch of its own?

Sure. I'll split it up, but will wait for feedback on 3/3 before re-submitting.

Andres

>
> Otherwise, pending patch 3/3 being OK too,
>
> Acked-by: Tim Deegan <tim@xen.org>
>
>> +        page_sharing_dispose(page);
>>          page->sharing = NULL;
>>          atomic_dec(&nr_shared_mfns);
>>      }
>>      else
>>          atomic_dec(&nr_saved_mfns);
>> +
>>      /* If the GFN is getting destroyed drop the references to MFN
>>       * (possibly freeing the page), and exit early */
>>      if ( flags & MEM_SHARING_DESTROY_GFN )
>>      {
>> -        mem_sharing_gfn_destroy(d, gfn_info);
>> +        if ( !last_gfn )
>> +            mem_sharing_gfn_destroy(page, d, gfn_info);
>>          put_page_and_type(page);
>>          mem_sharing_page_unlock(page);
>>          if ( last_gfn &&
>> @@ -987,7 +1052,6 @@ gfn_found:
>>
>>      if ( last_gfn )
>>      {
>> -        mem_sharing_gfn_destroy(d, gfn_info);
>>         /* Making a page private atomically unlocks it */
>>          BUG_ON(page_make_private(d, page) != 0);
>>          goto private_page_found;
>> @@ -1011,7 +1075,7 @@ gfn_found:
>>      unmap_domain_page(t);
>>
>>      BUG_ON(set_shared_p2m_entry(d, gfn, page_to_mfn(page)) == 0);
>> -    mem_sharing_gfn_destroy(d, gfn_info);
>> +    mem_sharing_gfn_destroy(old_page, d, gfn_info);
>>      mem_sharing_page_unlock(old_page);
>>      put_page_and_type(old_page);
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
Tim Deegan
2012-Apr-18 15:35 UTC
Re: [PATCH 3 of 3] x86/mem_sharing: For shared pages with many references, use a hash table instead of a list
At 10:16 -0400 on 12 Apr (1334225774), Andres Lagar-Cavilla wrote:
>  xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
>  xen/include/asm-x86/mem_sharing.h |   13 ++-
>  2 files changed, 169 insertions(+), 14 deletions(-)
>
>
> For shared frames that have many references, the doubly-linked list used to
> store the rmap results in costly scans during unshare operations. To alleviate
> the overhead, replace the linked list by a hash table. However, hash tables
> are space-intensive, so only use them for pages that have "many" (arbitrary
> threshold) references.
>
> Unsharing is heavily exercised during domain destroy. In experimental testing,
> for a domain that points over 100 thousand pages to the same shared frame,
> domain destruction dropped from over 7 minutes(!) to less than two seconds.

If you're adding a new data structure, would it be better to use a tree,
since the keys are easily sorted? Xen already has include/xen/rbtree.h.
It would have a higher per-entry overhead, but you wouldn't need to keep
the list data structure as well for the light-sharing case.

Tim.
Andres Lagar-Cavilla
2012-Apr-18 16:18 UTC
Re: [PATCH 3 of 3] x86/mem_sharing: For shared pages with many references, use a hash table instead of a list
> At 10:16 -0400 on 12 Apr (1334225774), Andres Lagar-Cavilla wrote:
>>  xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
>>  xen/include/asm-x86/mem_sharing.h |   13 ++-
>>  2 files changed, 169 insertions(+), 14 deletions(-)
>>
>>
>> For shared frames that have many references, the doubly-linked list used to
>> store the rmap results in costly scans during unshare operations. To alleviate
>> the overhead, replace the linked list by a hash table. However, hash tables
>> are space-intensive, so only use them for pages that have "many" (arbitrary
>> threshold) references.
>>
>> Unsharing is heavily exercised during domain destroy. In experimental testing,
>> for a domain that points over 100 thousand pages to the same shared frame,
>> domain destruction dropped from over 7 minutes(!) to less than two seconds.
>
> If you're adding a new data structure, would it be better to use a tree,
> since the keys are easily sorted? Xen already has include/xen/rbtree.h.
> It would have a higher per-entry overhead, but you wouldn't need to keep
> the list data structure as well for the light-sharing case.

My main concern is space. A regular case we deal with is four hundred thousand
shared frames, most backing roughly a hundred <domid,gfn> tuples, some backing
over a hundred thousand, and a few backing less than ten thousand. That's a lot
of heap memory for rb trees and nodes! I find O(n) on fewer than 256 entries
easily tolerable, as well as spending an extra page only when we're saving
thousands.

Nevertheless, I'll look into the rb tree, whose only consumer is tmem, if I'm
not mistaken. It doesn't seem like pool objects grow to contain so many pages,
but I could be wrong.

Andres
>
> Tim.
>
>
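A back-of-envelope version of that space argument, under stated assumptions
(an rb_node of three pointer-sized words, 24 bytes on x86-64, replacing the
16-byte list_head embedded in each gfn_info, versus one 4KiB bucket page only
for frames past the 256-entry threshold; the frame and entry counts are
illustrative figures in the spirit of the workload quoted above):

#include <stdio.h>

int main(void)
{
    /* Illustrative workload, loosely following the figures in the thread:
     * ~400k lightly-shared frames (~100 entries each) plus a handful of
     * heavily-shared frames (~100k entries each). */
    const unsigned long light_frames  = 400000, light_entries = 100;
    const unsigned long heavy_frames  = 100,    heavy_entries = 100000;

    /* Assumptions, not from the thread: rb_node = 3 pointers = 24 bytes,
     * list_head = 2 pointers = 16 bytes, hash table = one 4096-byte page. */
    const unsigned long rb_node_bytes = 24, list_head_bytes = 16;
    const unsigned long hashtab_bytes = 4096;

    unsigned long entries = light_frames * light_entries +
                            heavy_frames * heavy_entries;

    /* rb-tree: every entry pays the difference; hash table: only heavy
     * frames pay, and they pay a flat page each. */
    unsigned long rb_extra   = entries * (rb_node_bytes - list_head_bytes);
    unsigned long hash_extra = heavy_frames * hashtab_bytes;

    printf("%lu rmap entries in total\n", entries);
    printf("rb-tree extra:    %lu KiB\n", rb_extra >> 10);
    printf("hash-table extra: %lu KiB\n", hash_extra >> 10);
    return 0;
}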
Andres Lagar-Cavilla
2012-Apr-24 19:33 UTC
Re: [PATCH 3 of 3] x86/mem_sharing: For shared pages with many references, use a hash table instead of a list
On Apr 18, 2012, at 12:18 PM, Andres Lagar-Cavilla wrote:

>> At 10:16 -0400 on 12 Apr (1334225774), Andres Lagar-Cavilla wrote:
>>>  xen/arch/x86/mm/mem_sharing.c     |  170 +++++++++++++++++++++++++++++++++++--
>>>  xen/include/asm-x86/mem_sharing.h |   13 ++-
>>>  2 files changed, 169 insertions(+), 14 deletions(-)
>>>
>>>
>>> For shared frames that have many references, the doubly-linked list used
>>> to store the rmap results in costly scans during unshare operations. To
>>> alleviate the overhead, replace the linked list by a hash table. However,
>>> hash tables are space-intensive, so only use them for pages that have
>>> "many" (arbitrary threshold) references.
>>>
>>> Unsharing is heavily exercised during domain destroy. In experimental
>>> testing, for a domain that points over 100 thousand pages to the same
>>> shared frame, domain destruction dropped from over 7 minutes(!) to less
>>> than two seconds.
>>
>> If you're adding a new data structure, would it be better to use a tree,
>> since the keys are easily sorted? Xen already has include/xen/rbtree.h.
>> It would have a higher per-entry overhead, but you wouldn't need to keep
>> the list data structure as well for the light-sharing case.
>
> My main concern is space. A regular case we deal with is four hundred
> thousand shared frames, most backing roughly a hundred <domid,gfn> tuples,
> some backing over a hundred thousand, and a few backing less than ten
> thousand. That's a lot of heap memory for rb trees and nodes! I find O(n)
> on fewer than 256 entries easily tolerable, as well as spending an extra
> page only when we're saving thousands.

I've looked into rb and my initial opinion stands. I'm confident I'm getting
a better space/search-big-O tradeoff with my hash table than with an rb tree.

I have not, however, conducted thorough profiling, given the time constraints
for 4.2. That is certainly doable after that window closes.

I will resubmit the series taking into account your comment about splitting
the initial patch.

Thanks!
Andres

>
> Nevertheless, I'll look into the rb tree, whose only consumer is tmem, if
> I'm not mistaken. It doesn't seem like pool objects grow to contain so many
> pages, but I could be wrong.
>
> Andres
>>
>> Tim.
>>
>>
>
>