This series enables Xen to support up to 16Tb.

01: x86: introduce virt_to_xen_l1e()
02: x86: extend frame table virtual space
03: x86: re-introduce map_domain_page() et al
04: x86: properly use map_domain_page() when building Dom0
05: x86: consolidate initialization of PV guest L4 page tables
06: x86: properly use map_domain_page() during domain creation/destruction
07: x86: properly use map_domain_page() during page table manipulation
08: x86: properly use map_domain_page() in nested HVM code
09: x86: properly use map_domain_page() in miscellaneous places
10: tmem: partial adjustments for x86 16Tb support
11: x86: support up to 16Tb

As I don't have a 16Tb system around, I used the following debugging
patch to simulate the most critical aspect the changes above would
have on a system with this much memory: Not all of the 1:1 mapping
being accessible when in PV guest context. To do so, a command line
option to pull the split point down is being added. The patch is being
provided in the raw form I used it, but has pieces properly formatted
and not marked "//temp" which I would think might be worth considering
to add. The other pieces are likely less worthwhile, but if others
think differently I could certainly also put them into "normal" shape.

12: x86: debugging code for testing 16Tb support on smaller memory systems

Signed-off-by: Jan Beulich <jbeulich@suse.com>
... to allow frames for up to 16Tb.

At the same time, add the super page frame table coordinates to the
comment describing the address space layout.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -146,8 +146,7 @@ unsigned long max_page;
 unsigned long total_pages;
 
 unsigned long __read_mostly pdx_group_valid[BITS_TO_LONGS(
-    (FRAMETABLE_SIZE / sizeof(*frame_table) + PDX_GROUP_COUNT - 1)
-    / PDX_GROUP_COUNT)] = { [0] = 1 };
+    (FRAMETABLE_NR + PDX_GROUP_COUNT - 1) / PDX_GROUP_COUNT)] = { [0] = 1 };
 
 bool_t __read_mostly machine_to_phys_mapping_valid = 0;
 
@@ -218,7 +217,7 @@ static void __init init_spagetable(void)
     BUILD_BUG_ON(XEN_VIRT_END > SPAGETABLE_VIRT_START);
 
     init_frametable_chunk(spage_table,
-                          mem_hotplug ? (void *)SPAGETABLE_VIRT_END
+                          mem_hotplug ? spage_table + SPAGETABLE_NR
                                       : pdx_to_spage(max_pdx - 1) + 1);
 }
 
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -378,8 +378,8 @@ static void __init setup_max_pdx(void)
     if ( max_pdx > (DIRECTMAP_SIZE >> PAGE_SHIFT) )
         max_pdx = DIRECTMAP_SIZE >> PAGE_SHIFT;
 
-    if ( max_pdx > FRAMETABLE_SIZE / sizeof(*frame_table) )
-        max_pdx = FRAMETABLE_SIZE / sizeof(*frame_table);
+    if ( max_pdx > FRAMETABLE_NR )
+        max_pdx = FRAMETABLE_NR;
 
     max_page = pdx_to_pfn(max_pdx - 1) + 1;
 }
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -958,7 +958,7 @@ static int extend_frame_table(struct mem
     nidx = cidx = pfn_to_pdx(spfn)/PDX_GROUP_COUNT;
 
     ASSERT( pfn_to_pdx(epfn) <= (DIRECTMAP_SIZE >> PAGE_SHIFT) &&
-            (pfn_to_pdx(epfn) <= FRAMETABLE_SIZE / sizeof(struct page_info)) );
+            pfn_to_pdx(epfn) <= FRAMETABLE_NR );
 
     if ( test_bit(cidx, pdx_group_valid) )
         cidx = find_next_zero_bit(pdx_group_valid, eidx, cidx);
@@ -1406,7 +1406,7 @@ int mem_hotadd_check(unsigned long spfn,
     if ( (spfn >= epfn) )
         return 0;
 
-    if (pfn_to_pdx(epfn) > (FRAMETABLE_SIZE / sizeof(*frame_table)))
+    if (pfn_to_pdx(epfn) > FRAMETABLE_NR)
         return 0;
 
     if ( (spfn | epfn) & ((1UL << PAGETABLE_ORDER) - 1) )
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -152,9 +152,11 @@ extern unsigned char boot_edid_info[128]
  *    High read-only compatibility machine-to-phys translation table.
  *  0xffff82c480000000 - 0xffff82c4bfffffff [1GB,   2^30 bytes, PML4:261]
  *    Xen text, static data, bss.
- *  0xffff82c4c0000000 - 0xffff82f5ffffffff [197GB,             PML4:261]
+ *  0xffff82c4c0000000 - 0xffff82dffbffffff [109GB - 64MB,      PML4:261]
  *    Reserved for future use.
- *  0xffff82f600000000 - 0xffff82ffffffffff [40GB,  2^38 bytes, PML4:261]
+ *  0xffff82dffc000000 - 0xffff82dfffffffff [64MB,  2^26 bytes, PML4:261]
+ *    Super-page information array.
+ *  0xffff82e000000000 - 0xffff82ffffffffff [128GB, 2^37 bytes, PML4:261]
  *    Page-frame information array.
  *  0xffff830000000000 - 0xffff87ffffffffff [5TB, 5*2^40 bytes, PML4:262-271]
  *    1:1 direct mapping of all physical memory.
@@ -218,15 +220,17 @@ extern unsigned char boot_edid_info[128]
 /* Slot 261: xen text, static data and bss (1GB). */
 #define XEN_VIRT_START          (HIRO_COMPAT_MPT_VIRT_END)
 #define XEN_VIRT_END            (XEN_VIRT_START + GB(1))
-/* Slot 261: superpage information array (20MB). */
+/* Slot 261: superpage information array (64MB). */
 #define SPAGETABLE_VIRT_END     FRAMETABLE_VIRT_START
-#define SPAGETABLE_SIZE         ((DIRECTMAP_SIZE >> SUPERPAGE_SHIFT) * \
-                                 sizeof(struct spage_info))
-#define SPAGETABLE_VIRT_START   (SPAGETABLE_VIRT_END - SPAGETABLE_SIZE)
-/* Slot 261: page-frame information array (40GB). */
+#define SPAGETABLE_NR           (((FRAMETABLE_NR - 1) >> (SUPERPAGE_SHIFT - \
+                                                          PAGE_SHIFT)) + 1)
+#define SPAGETABLE_SIZE         (SPAGETABLE_NR * sizeof(struct spage_info))
+#define SPAGETABLE_VIRT_START   ((SPAGETABLE_VIRT_END - SPAGETABLE_SIZE) & \
+                                 (-1UL << SUPERPAGE_SHIFT))
+/* Slot 261: page-frame information array (128GB). */
 #define FRAMETABLE_VIRT_END     DIRECTMAP_VIRT_START
-#define FRAMETABLE_SIZE         ((DIRECTMAP_SIZE >> PAGE_SHIFT) * \
-                                 sizeof(struct page_info))
+#define FRAMETABLE_SIZE         GB(128)
+#define FRAMETABLE_NR           (FRAMETABLE_SIZE / sizeof(*frame_table))
 #define FRAMETABLE_VIRT_START   (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE)
 /* Slot 262-271: A direct 1:1 mapping of all of physical memory. */
 #define DIRECTMAP_VIRT_START    (PML4_ADDR(262))

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
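
To see why a fixed 128GB frame table corresponds to 16Tb of memory, the sizing arithmetic from the config.h hunk can be reproduced stand-alone. This is only a sketch under stated assumptions: 4KiB pages, 2MiB superpages, and entry sizes of 32 bytes for struct page_info and 8 bytes for struct spage_info. The entry sizes and the helper function names are assumptions made for this illustration, not taken from the patch.

```c
#include <stdint.h>

#define PAGE_SHIFT        12          /* 4KiB pages */
#define SUPERPAGE_SHIFT   21          /* 2MiB superpages */
#define GB(x)             ((uint64_t)(x) << 30)

/* Assumed entry sizes; the patch itself uses sizeof() on the real structs. */
#define PAGE_INFO_BYTES   32u
#define SPAGE_INFO_BYTES  8u

/* Mirrors FRAMETABLE_SIZE / FRAMETABLE_NR from the patch. */
#define FRAMETABLE_SIZE   GB(128)
#define FRAMETABLE_NR     (FRAMETABLE_SIZE / PAGE_INFO_BYTES)

/* Mirrors SPAGETABLE_NR / SPAGETABLE_SIZE from the patch. */
#define SPAGETABLE_NR     (((FRAMETABLE_NR - 1) >> \
                            (SUPERPAGE_SHIFT - PAGE_SHIFT)) + 1)
#define SPAGETABLE_SIZE   (SPAGETABLE_NR * SPAGE_INFO_BYTES)

/* Number of page frames the frame table can describe. */
static uint64_t frametable_nr(void)
{
    return FRAMETABLE_NR;
}

/* Bytes of memory covered by that many 4KiB frames. */
static uint64_t covered_bytes(void)
{
    return frametable_nr() << PAGE_SHIFT;
}

/* Bytes needed for the super-page information array. */
static uint64_t spagetable_size(void)
{
    return SPAGETABLE_SIZE;
}
```

Under these assumptions frametable_nr() is 2^32, covered_bytes() is 2^44 (16Tb), and spagetable_size() is 2^26 (64MB), which agrees with the sizes stated in the updated layout comment.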
This is being done mostly in the form previously used on x86-32,
utilizing the second L3 page table slot within the per-domain mapping
area for those mappings. It remains to be determined whether that
concept is really suitable, or whether instead re-implementing at least
the non-global variant from scratch would be better.

Also add the helpers {clear,copy}_domain_page() as well as initial uses
of them.

One question is whether, to exercise the non-trivial code paths, we
shouldn't make the trivial shortcuts conditional upon NDEBUG being
defined. See the debugging patch at the end of the series.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -19,6 +19,7 @@ obj-bin-y += dmi_scan.init.o
 obj-y += domctl.o
 obj-y += domain.o
 obj-bin-y += domain_build.init.o
+obj-y += domain_page.o
 obj-y += e820.o
 obj-y += extable.o
 obj-y += flushtlb.o
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -397,10 +397,14 @@ int vcpu_initialise(struct vcpu *v)
             return -ENOMEM;
         clear_page(page_to_virt(pg));
         perdomain_pt_page(d, idx) = pg;
-        d->arch.mm_perdomain_l2[l2_table_offset(PERDOMAIN_VIRT_START)+idx]
+        d->arch.mm_perdomain_l2[0][l2_table_offset(PERDOMAIN_VIRT_START)+idx]
             = l2e_from_page(pg, __PAGE_HYPERVISOR);
     }
 
+    rc = mapcache_vcpu_init(v);
+    if ( rc )
+        return rc;
+
     paging_vcpu_init(v);
 
     v->arch.perdomain_ptes = perdomain_ptes(d, v);
@@ -526,8 +530,8 @@ int arch_domain_create(struct domain *d,
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l2 = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l2);
+    d->arch.mm_perdomain_l2[0] = page_to_virt(pg);
+    clear_page(d->arch.mm_perdomain_l2[0]);
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
@@ -535,8 +539,10 @@ int arch_domain_create(struct domain *d,
     d->arch.mm_perdomain_l3 = page_to_virt(pg);
     clear_page(d->arch.mm_perdomain_l3);
     d->arch.mm_perdomain_l3[l3_table_offset(PERDOMAIN_VIRT_START)] =
-        l3e_from_page(virt_to_page(d->arch.mm_perdomain_l2),
-                      __PAGE_HYPERVISOR);
+        l3e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l2[0]),
+                     __PAGE_HYPERVISOR);
+
+    mapcache_domain_init(d);
 
     HYPERVISOR_COMPAT_VIRT_START(d) =
         is_hvm_domain(d) ? ~0u : __HYPERVISOR_COMPAT_VIRT_START;
@@ -609,8 +615,9 @@ int arch_domain_create(struct domain *d,
     free_xenheap_page(d->shared_info);
     if ( paging_initialised )
         paging_final_teardown(d);
-    if ( d->arch.mm_perdomain_l2 )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2));
+    mapcache_domain_exit(d);
+    if ( d->arch.mm_perdomain_l2[0] )
+        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
     if ( d->arch.mm_perdomain_l3 )
         free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
     if ( d->arch.mm_perdomain_pt_pages )
@@ -633,13 +640,15 @@ void arch_domain_destroy(struct domain *
 
     paging_final_teardown(d);
 
+    mapcache_domain_exit(d);
+
     for ( i = 0; i < PDPT_L2_ENTRIES; ++i )
     {
         if ( perdomain_pt_page(d, i) )
             free_domheap_page(perdomain_pt_page(d, i));
     }
     free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2));
+    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
     free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
     free_xenheap_page(d->shared_info);
 
--- /dev/null
+++ b/xen/arch/x86/domain_page.c
@@ -0,0 +1,471 @@
+/******************************************************************************
+ * domain_page.h
+ *
+ * Allow temporary mapping of domain pages.
+ *
+ * Copyright (c) 2003-2006, Keir Fraser <keir@xensource.com>
+ */
+
+#include <xen/domain_page.h>
+#include <xen/mm.h>
+#include <xen/perfc.h>
+#include <xen/pfn.h>
+#include <xen/sched.h>
+#include <asm/current.h>
+#include <asm/flushtlb.h>
+#include <asm/hardirq.h>
+
+static inline struct vcpu *mapcache_current_vcpu(void)
+{
+    /* In the common case we use the mapcache of the running VCPU. */
+    struct vcpu *v = current;
+
+    /*
+     * When current isn't properly set up yet, this is equivalent to
+     * running in an idle vCPU (callers must check for NULL).
+     */
+    if ( v == (struct vcpu *)0xfffff000 )
+        return NULL;
+
+    /*
+     * If guest_table is NULL, and we are running a paravirtualised guest,
+     * then it means we are running on the idle domain's page table and must
+     * therefore use its mapcache.
+     */
+    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && !is_hvm_vcpu(v) )
+    {
+        /* If we really are idling, perform lazy context switch now. */
+        if ( (v = idle_vcpu[smp_processor_id()]) == current )
+            sync_local_execstate();
+        /* We must now be running on the idle page table. */
+        ASSERT(read_cr3() == __pa(idle_pg_table));
+    }
+
+    return v;
+}
+
+#define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
+#define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
+#define DCACHE_L1ENT(dc, idx) \
+    ((dc)->l1tab[(idx) >> PAGETABLE_ORDER] \
+                [(idx) & ((1 << PAGETABLE_ORDER) - 1)])
+
+void *map_domain_page(unsigned long mfn)
+{
+    unsigned long flags;
+    unsigned int idx, i;
+    struct vcpu *v;
+    struct mapcache_domain *dcache;
+    struct mapcache_vcpu *vcache;
+    struct vcpu_maphash_entry *hashent;
+
+    if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return mfn_to_virt(mfn);
+
+    v = mapcache_current_vcpu();
+    if ( !v || is_hvm_vcpu(v) )
+        return mfn_to_virt(mfn);
+
+    dcache = &v->domain->arch.pv_domain.mapcache;
+    vcache = &v->arch.pv_vcpu.mapcache;
+    if ( !dcache->l1tab )
+        return mfn_to_virt(mfn);
+
+    perfc_incr(map_domain_page_count);
+
+    local_irq_save(flags);
+
+    hashent = &vcache->hash[MAPHASH_HASHFN(mfn)];
+    if ( hashent->mfn == mfn )
+    {
+        idx = hashent->idx;
+        ASSERT(idx < dcache->entries);
+        hashent->refcnt++;
+        ASSERT(hashent->refcnt);
+        ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, idx)) == mfn);
+        goto out;
+    }
+
+    spin_lock(&dcache->lock);
+
+    /* Has some other CPU caused a wrap? We must flush if so. */
+    if ( unlikely(dcache->epoch != vcache->shadow_epoch) )
+    {
+        vcache->shadow_epoch = dcache->epoch;
+        if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) )
+        {
+            perfc_incr(domain_page_tlb_flush);
+            flush_tlb_local();
+        }
+    }
+
+    idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
+    if ( unlikely(idx >= dcache->entries) )
+    {
+        unsigned long accum = 0;
+
+        /* /First/, clean the garbage map and update the inuse list. */
+        for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
+        {
+            dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
+            accum |= ~dcache->inuse[i];
+        }
+
+        if ( accum )
+            idx = find_first_zero_bit(dcache->inuse, dcache->entries);
+        else
+        {
+            /* Replace a hash entry instead. */
+            i = MAPHASH_HASHFN(mfn);
+            do {
+                hashent = &vcache->hash[i];
+                if ( hashent->idx != MAPHASHENT_NOTINUSE && !hashent->refcnt )
+                {
+                    idx = hashent->idx;
+                    ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, idx)) ==
+                           hashent->mfn);
+                    l1e_write(&DCACHE_L1ENT(dcache, idx), l1e_empty());
+                    hashent->idx = MAPHASHENT_NOTINUSE;
+                    hashent->mfn = ~0UL;
+                    break;
+                }
+                if ( ++i == MAPHASH_ENTRIES )
+                    i = 0;
+            } while ( i != MAPHASH_HASHFN(mfn) );
+        }
+        BUG_ON(idx >= dcache->entries);
+
+        /* /Second/, flush TLBs. */
+        perfc_incr(domain_page_tlb_flush);
+        flush_tlb_local();
+        vcache->shadow_epoch = ++dcache->epoch;
+        dcache->tlbflush_timestamp = tlbflush_current_time();
+    }
+
+    set_bit(idx, dcache->inuse);
+    dcache->cursor = idx + 1;
+
+    spin_unlock(&dcache->lock);
+
+    l1e_write(&DCACHE_L1ENT(dcache, idx),
+              l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
+
+ out:
+    local_irq_restore(flags);
+    return (void *)MAPCACHE_VIRT_START + pfn_to_paddr(idx);
+}
+
+void unmap_domain_page(const void *ptr)
+{
+    unsigned int idx;
+    struct vcpu *v;
+    struct mapcache_domain *dcache;
+    unsigned long va = (unsigned long)ptr, mfn, flags;
+    struct vcpu_maphash_entry *hashent;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return;
+
+    ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
+
+    v = mapcache_current_vcpu();
+    ASSERT(v && !is_hvm_vcpu(v));
+
+    dcache = &v->domain->arch.pv_domain.mapcache;
+    ASSERT(dcache->l1tab);
+
+    idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
+    mfn = l1e_get_pfn(DCACHE_L1ENT(dcache, idx));
+    hashent = &v->arch.pv_vcpu.mapcache.hash[MAPHASH_HASHFN(mfn)];
+
+    local_irq_save(flags);
+
+    if ( hashent->idx == idx )
+    {
+        ASSERT(hashent->mfn == mfn);
+        ASSERT(hashent->refcnt);
+        hashent->refcnt--;
+    }
+    else if ( !hashent->refcnt )
+    {
+        if ( hashent->idx != MAPHASHENT_NOTINUSE )
+        {
+            /* /First/, zap the PTE. */
+            ASSERT(l1e_get_pfn(DCACHE_L1ENT(dcache, hashent->idx)) ==
+                   hashent->mfn);
+            l1e_write(&DCACHE_L1ENT(dcache, hashent->idx), l1e_empty());
+            /* /Second/, mark as garbage. */
+            set_bit(hashent->idx, dcache->garbage);
+        }
+
+        /* Add newly-freed mapping to the maphash. */
+        hashent->mfn = mfn;
+        hashent->idx = idx;
+    }
+    else
+    {
+        /* /First/, zap the PTE. */
+        l1e_write(&DCACHE_L1ENT(dcache, idx), l1e_empty());
+        /* /Second/, mark as garbage. */
+        set_bit(idx, dcache->garbage);
+    }
+
+    local_irq_restore(flags);
+}
+
+void clear_domain_page(unsigned long mfn)
+{
+    void *ptr = map_domain_page(mfn);
+
+    clear_page(ptr);
+    unmap_domain_page(ptr);
+}
+
+void copy_domain_page(unsigned long dmfn, unsigned long smfn)
+{
+    const void *src = map_domain_page(smfn);
+    void *dst = map_domain_page(dmfn);
+
+    copy_page(dst, src);
+    unmap_domain_page(dst);
+    unmap_domain_page(src);
+}
+
+int mapcache_domain_init(struct domain *d)
+{
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    unsigned int i, bitmap_pages, memf = MEMF_node(domain_to_node(d));
+    unsigned long *end;
+
+    if ( is_hvm_domain(d) || is_idle_domain(d) )
+        return 0;
+
+    if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return 0;
+
+    dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
+    d->arch.mm_perdomain_l2[MAPCACHE_SLOT] = alloc_xenheap_pages(0, memf);
+    if ( !dcache->l1tab || !d->arch.mm_perdomain_l2[MAPCACHE_SLOT] )
+        return -ENOMEM;
+
+    clear_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
+    d->arch.mm_perdomain_l3[l3_table_offset(MAPCACHE_VIRT_START)] =
+        l3e_from_paddr(__pa(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]),
+                       __PAGE_HYPERVISOR);
+
+    BUILD_BUG_ON(MAPCACHE_VIRT_END + 3 +
+                 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)) >
+                 MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
+    bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
+    dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
+    dcache->garbage = dcache->inuse +
+                      (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
+    end = dcache->garbage + bitmap_pages * PAGE_SIZE / sizeof(long);
+
+    for ( i = l2_table_offset((unsigned long)dcache->inuse);
+          i <= l2_table_offset((unsigned long)(end - 1)); ++i )
+    {
+        ASSERT(i <= MAPCACHE_L2_ENTRIES);
+        dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
+        if ( !dcache->l1tab[i] )
+            return -ENOMEM;
+        clear_page(dcache->l1tab[i]);
+        d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
+            l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+    }
+
+    spin_lock_init(&dcache->lock);
+
+    return 0;
+}
+
+void mapcache_domain_exit(struct domain *d)
+{
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+
+    if ( is_hvm_domain(d) )
+        return;
+
+    if ( dcache->l1tab )
+    {
+        unsigned long i;
+
+        for ( i = (unsigned long)dcache->inuse; ; i += PAGE_SIZE )
+        {
+            l1_pgentry_t *pl1e;
+
+            if ( l2_table_offset(i) > MAPCACHE_L2_ENTRIES ||
+                 !dcache->l1tab[l2_table_offset(i)] )
+                break;
+
+            pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+            if ( l1e_get_flags(*pl1e) )
+                free_domheap_page(l1e_get_page(*pl1e));
+        }
+
+        for ( i = 0; i < MAPCACHE_L2_ENTRIES + 1; ++i )
+            free_xenheap_page(dcache->l1tab[i]);
+
+        xfree(dcache->l1tab);
+    }
+    free_xenheap_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
+}
+
+int mapcache_vcpu_init(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    unsigned long i;
+    unsigned int memf = MEMF_node(vcpu_to_node(v));
+
+    if ( is_hvm_vcpu(v) || !dcache->l1tab )
+        return 0;
+
+    while ( dcache->entries < d->max_vcpus * MAPCACHE_VCPU_ENTRIES )
+    {
+        unsigned int ents = dcache->entries + MAPCACHE_VCPU_ENTRIES;
+        l1_pgentry_t *pl1e;
+
+        /* Populate page tables. */
+        if ( !dcache->l1tab[i = mapcache_l2_entry(ents - 1)] )
+        {
+            dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
+            if ( !dcache->l1tab[i] )
+                return -ENOMEM;
+            clear_page(dcache->l1tab[i]);
+            d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
+                l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+        }
+
+        /* Populate bit maps. */
+        i = (unsigned long)(dcache->inuse + BITS_TO_LONGS(ents));
+        pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+        if ( !l1e_get_flags(*pl1e) )
+        {
+            struct page_info *pg = alloc_domheap_page(NULL, memf);
+
+            if ( !pg )
+                return -ENOMEM;
+            clear_domain_page(page_to_mfn(pg));
+            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+
+            i = (unsigned long)(dcache->garbage + BITS_TO_LONGS(ents));
+            pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
+            ASSERT(!l1e_get_flags(*pl1e));
+
+            pg = alloc_domheap_page(NULL, memf);
+            if ( !pg )
+                return -ENOMEM;
+            clear_domain_page(page_to_mfn(pg));
+            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+        }
+
+        dcache->entries = ents;
+    }
+
+    /* Mark all maphash entries as not in use. */
+    BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
+    for ( i = 0; i < MAPHASH_ENTRIES; i++ )
+    {
+        struct vcpu_maphash_entry *hashent = &v->arch.pv_vcpu.mapcache.hash[i];
+
+        hashent->mfn = ~0UL; /* never valid to map */
+        hashent->idx = MAPHASHENT_NOTINUSE;
+    }
+
+    return 0;
+}
+
+#define GLOBALMAP_BITS (GLOBALMAP_GBYTES << (30 - PAGE_SHIFT))
+static unsigned long inuse[BITS_TO_LONGS(GLOBALMAP_BITS)];
+static unsigned long garbage[BITS_TO_LONGS(GLOBALMAP_BITS)];
+static unsigned int inuse_cursor;
+static DEFINE_SPINLOCK(globalmap_lock);
+
+void *map_domain_page_global(unsigned long mfn)
+{
+    l1_pgentry_t *pl1e;
+    unsigned int idx, i;
+    unsigned long va;
+
+    ASSERT(!in_irq() && local_irq_is_enabled());
+
+    if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
+        return mfn_to_virt(mfn);
+
+    spin_lock(&globalmap_lock);
+
+    idx = find_next_zero_bit(inuse, GLOBALMAP_BITS, inuse_cursor);
+    va = GLOBALMAP_VIRT_START + pfn_to_paddr(idx);
+    if ( unlikely(va >= GLOBALMAP_VIRT_END) )
+    {
+        /* /First/, clean the garbage map and update the inuse list. */
+        for ( i = 0; i < ARRAY_SIZE(garbage); i++ )
+            inuse[i] &= ~xchg(&garbage[i], 0);
+
+        /* /Second/, flush all TLBs to get rid of stale garbage mappings. */
+        flush_tlb_all();
+
+        idx = find_first_zero_bit(inuse, GLOBALMAP_BITS);
+        va = GLOBALMAP_VIRT_START + pfn_to_paddr(idx);
+        if ( unlikely(va >= GLOBALMAP_VIRT_END) )
+        {
+            spin_unlock(&globalmap_lock);
+            return NULL;
+        }
+    }
+
+    set_bit(idx, inuse);
+    inuse_cursor = idx + 1;
+
+    spin_unlock(&globalmap_lock);
+
+    pl1e = virt_to_xen_l1e(va);
+    if ( !pl1e )
+        return NULL;
+    l1e_write(pl1e, l1e_from_pfn(mfn, __PAGE_HYPERVISOR));
+
+    return (void *)va;
+}
+
+void unmap_domain_page_global(const void *ptr)
+{
+    unsigned long va = (unsigned long)ptr;
+    l1_pgentry_t *pl1e;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return;
+
+    ASSERT(va >= GLOBALMAP_VIRT_START && va < GLOBALMAP_VIRT_END);
+
+    /* /First/, we zap the PTE. */
+    pl1e = virt_to_xen_l1e(va);
+    BUG_ON(!pl1e);
+    l1e_write(pl1e, l1e_empty());
+
+    /* /Second/, we add to the garbage map. */
+    set_bit(PFN_DOWN(va - GLOBALMAP_VIRT_START), garbage);
+}
+
+/* Translate a map-domain-page'd address to the underlying MFN */
+unsigned long domain_page_map_to_mfn(const void *ptr)
+{
+    unsigned long va = (unsigned long)ptr;
+    const l1_pgentry_t *pl1e;
+
+    if ( va >= DIRECTMAP_VIRT_START )
+        return virt_to_mfn(ptr);
+
+    if ( va >= GLOBALMAP_VIRT_START && va < GLOBALMAP_VIRT_END )
+    {
+        pl1e = virt_to_xen_l1e(va);
+        BUG_ON(!pl1e);
+    }
+    else
+    {
+        ASSERT(va >= MAPCACHE_VIRT_START && va < MAPCACHE_VIRT_END);
+        pl1e = &__linear_l1_table[l1_linear_offset(va)];
+    }
+
+    return l1e_get_pfn(*pl1e);
+}
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2661,9 +2661,6 @@ static inline int vcpumask_to_pcpumask(
     }
 }
 
-#define fixmap_domain_page(mfn) mfn_to_virt(mfn)
-#define fixunmap_domain_page(ptr) ((void)(ptr))
-
 long do_mmuext_op(
     XEN_GUEST_HANDLE_PARAM(mmuext_op_t) uops,
     unsigned int count,
@@ -2983,7 +2980,6 @@ long do_mmuext_op(
 
         case MMUEXT_CLEAR_PAGE: {
             struct page_info *page;
-            unsigned char *ptr;
 
             page = get_page_from_gfn(d, op.arg1.mfn, NULL, P2M_ALLOC);
             if ( !page || !get_page_type(page, PGT_writable_page) )
@@ -2998,9 +2994,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being cleared. */
             paging_mark_dirty(d, page_to_mfn(page));
 
-            ptr = fixmap_domain_page(page_to_mfn(page));
-            clear_page(ptr);
-            fixunmap_domain_page(ptr);
+            clear_domain_page(page_to_mfn(page));
 
             put_page_and_type(page);
             break;
@@ -3008,8 +3002,6 @@ long do_mmuext_op(
 
         case MMUEXT_COPY_PAGE:
         {
-            const unsigned char *src;
-            unsigned char *dst;
             struct page_info *src_page, *dst_page;
 
             src_page = get_page_from_gfn(d, op.arg2.src_mfn, NULL, P2M_ALLOC);
@@ -3034,11 +3026,7 @@ long do_mmuext_op(
             /* A page is dirtied when it's being copied to. */
             paging_mark_dirty(d, page_to_mfn(dst_page));
 
-            src = __map_domain_page(src_page);
-            dst = fixmap_domain_page(page_to_mfn(dst_page));
-            copy_page(dst, src);
-            fixunmap_domain_page(dst);
-            unmap_domain_page(src);
+            copy_domain_page(page_to_mfn(dst_page), page_to_mfn(src_page));
 
             put_page_and_type(dst_page);
             put_page(src_page);
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -27,6 +27,7 @@
 #define CONFIG_DISCONTIGMEM 1
 #define CONFIG_NUMA_EMU 1
 #define CONFIG_PAGEALLOC_MAX_ORDER (2 * PAGETABLE_ORDER)
+#define CONFIG_DOMAIN_PAGE 1
 
 /* Intel P4 currently has largest cache line (L2 line size is 128 bytes). */
 #define CONFIG_X86_L1_CACHE_SHIFT 7
@@ -147,12 +148,14 @@ extern unsigned char boot_edid_info[128]
  *  0xffff82c000000000 - 0xffff82c3ffffffff [16GB,  2^34 bytes, PML4:261]
  *    vmap()/ioremap()/fixmap area.
  *  0xffff82c400000000 - 0xffff82c43fffffff [1GB,   2^30 bytes, PML4:261]
- *    Compatibility machine-to-phys translation table.
+ *    Global domain page map area.
  *  0xffff82c440000000 - 0xffff82c47fffffff [1GB,   2^30 bytes, PML4:261]
- *    High read-only compatibility machine-to-phys translation table.
+ *    Compatibility machine-to-phys translation table.
  *  0xffff82c480000000 - 0xffff82c4bfffffff [1GB,   2^30 bytes, PML4:261]
+ *    High read-only compatibility machine-to-phys translation table.
+ *  0xffff82c4c0000000 - 0xffff82c4ffffffff [1GB,   2^30 bytes, PML4:261]
  *    Xen text, static data, bss.
- *  0xffff82c4c0000000 - 0xffff82dffbffffff [109GB - 64MB,      PML4:261]
+ *  0xffff82c500000000 - 0xffff82dffbffffff [108GB - 64MB,      PML4:261]
  *    Reserved for future use.
  *  0xffff82dffc000000 - 0xffff82dfffffffff [64MB,  2^26 bytes, PML4:261]
  *    Super-page information array.
@@ -201,18 +204,24 @@ extern unsigned char boot_edid_info[128]
 /* Slot 259: linear page table (shadow table). */
 #define SH_LINEAR_PT_VIRT_START (PML4_ADDR(259))
 #define SH_LINEAR_PT_VIRT_END   (SH_LINEAR_PT_VIRT_START + PML4_ENTRY_BYTES)
-/* Slot 260: per-domain mappings. */
+/* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
-#define PERDOMAIN_VIRT_END      (PERDOMAIN_VIRT_START + (PERDOMAIN_MBYTES<<20))
-#define PERDOMAIN_MBYTES        (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
+#define PERDOMAIN_SLOTS         2
+#define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
+                                 (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 261: machine-to-phys conversion table (256GB). */
 #define RDWR_MPT_VIRT_START     (PML4_ADDR(261))
 #define RDWR_MPT_VIRT_END       (RDWR_MPT_VIRT_START + MPT_VIRT_SIZE)
 /* Slot 261: vmap()/ioremap()/fixmap area (16GB). */
 #define VMAP_VIRT_START         RDWR_MPT_VIRT_END
 #define VMAP_VIRT_END           (VMAP_VIRT_START + GB(16))
+/* Slot 261: global domain page map area (1GB). */
+#define GLOBALMAP_GBYTES        1
+#define GLOBALMAP_VIRT_START    VMAP_VIRT_END
+#define GLOBALMAP_VIRT_END      (GLOBALMAP_VIRT_START + (GLOBALMAP_GBYTES<<30))
 /* Slot 261: compatibility machine-to-phys conversion table (1GB). */
-#define RDWR_COMPAT_MPT_VIRT_START VMAP_VIRT_END
+#define RDWR_COMPAT_MPT_VIRT_START GLOBALMAP_VIRT_END
 #define RDWR_COMPAT_MPT_VIRT_END (RDWR_COMPAT_MPT_VIRT_START + GB(1))
 /* Slot 261: high read-only compat machine-to-phys conversion table (1GB). */
 #define HIRO_COMPAT_MPT_VIRT_START RDWR_COMPAT_MPT_VIRT_END
@@ -279,9 +288,9 @@ extern unsigned long xen_phys_start;
 /* GDT/LDT shadow mapping area. The first per-domain-mapping sub-area. */
 #define GDT_LDT_VCPU_SHIFT       5
 #define GDT_LDT_VCPU_VA_SHIFT    (GDT_LDT_VCPU_SHIFT + PAGE_SHIFT)
-#define GDT_LDT_MBYTES           PERDOMAIN_MBYTES
+#define GDT_LDT_MBYTES           PERDOMAIN_SLOT_MBYTES
 #define MAX_VIRT_CPUS            (GDT_LDT_MBYTES << (20-GDT_LDT_VCPU_VA_SHIFT))
-#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_START
+#define GDT_LDT_VIRT_START       PERDOMAIN_VIRT_SLOT(0)
 #define GDT_LDT_VIRT_END         (GDT_LDT_VIRT_START + (GDT_LDT_MBYTES << 20))
 
 /* The address of a particular VCPU's GDT or LDT. */
@@ -290,8 +299,16 @@ extern unsigned long xen_phys_start;
 #define LDT_VIRT_START(v)    \
     (GDT_VIRT_START(v) + (64*1024))
 
+/* map_domain_page() map cache. The last per-domain-mapping sub-area. */
+#define MAPCACHE_VCPU_ENTRIES    (CONFIG_PAGING_LEVELS * CONFIG_PAGING_LEVELS)
+#define MAPCACHE_ENTRIES         (MAX_VIRT_CPUS * MAPCACHE_VCPU_ENTRIES)
+#define MAPCACHE_SLOT            (PERDOMAIN_SLOTS - 1)
+#define MAPCACHE_VIRT_START      PERDOMAIN_VIRT_SLOT(MAPCACHE_SLOT)
+#define MAPCACHE_VIRT_END        (MAPCACHE_VIRT_START + \
+                                  MAPCACHE_ENTRIES * PAGE_SIZE)
+
 #define PDPT_L1_ENTRIES       \
-    ((PERDOMAIN_VIRT_END - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
+    ((PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS - 1) - PERDOMAIN_VIRT_START) >> PAGE_SHIFT)
 #define PDPT_L2_ENTRIES       \
     ((PDPT_L1_ENTRIES + (1 << PAGETABLE_ORDER) - 1) >> PAGETABLE_ORDER)
 
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -39,7 +39,7 @@ struct trap_bounce {
 
 #define MAPHASH_ENTRIES 8
 #define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1))
-#define MAPHASHENT_NOTINUSE ((u16)~0U)
+#define MAPHASHENT_NOTINUSE ((u32)~0U)
 struct mapcache_vcpu {
     /* Shadow of mapcache_domain.epoch. */
     unsigned int shadow_epoch;
@@ -47,16 +47,15 @@ struct mapcache_vcpu {
     /* Lock-free per-VCPU hash of recently-used mappings. */
     struct vcpu_maphash_entry {
         unsigned long mfn;
-        uint16_t      idx;
-        uint16_t      refcnt;
+        uint32_t      idx;
+        uint32_t      refcnt;
     } hash[MAPHASH_ENTRIES];
 };
 
-#define MAPCACHE_ORDER   10
-#define MAPCACHE_ENTRIES (1 << MAPCACHE_ORDER)
 struct mapcache_domain {
     /* The PTEs that provide the mappings, and a cursor into the array. */
-    l1_pgentry_t *l1tab;
+    l1_pgentry_t **l1tab;
+    unsigned int entries;
     unsigned int cursor;
 
     /* Protects map_domain_page(). */
@@ -67,12 +66,13 @@ struct mapcache_domain {
     u32 tlbflush_timestamp;
 
     /* Which mappings are in use, and which are garbage to reap next epoch? */
-    unsigned long inuse[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
-    unsigned long garbage[BITS_TO_LONGS(MAPCACHE_ENTRIES)];
+    unsigned long *inuse;
+    unsigned long *garbage;
 };
 
-void mapcache_domain_init(struct domain *);
-void mapcache_vcpu_init(struct vcpu *);
+int mapcache_domain_init(struct domain *);
+void mapcache_domain_exit(struct domain *);
+int mapcache_vcpu_init(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);
@@ -229,6 +229,9 @@ struct pv_domain
      * unmask the event channel */
     bool_t auto_unmask;
 
+    /* map_domain_page() mapping cache. */
+    struct mapcache_domain mapcache;
+
     /* Pseudophysical e820 map (XENMEM_memory_map). */
     spinlock_t e820_lock;
     struct e820entry *e820;
@@ -238,7 +241,7 @@ struct pv_domain
 struct arch_domain
 {
     struct page_info **mm_perdomain_pt_pages;
-    l2_pgentry_t *mm_perdomain_l2;
+    l2_pgentry_t *mm_perdomain_l2[PERDOMAIN_SLOTS];
     l3_pgentry_t *mm_perdomain_l3;
 
     unsigned int hv_compat_vstart;
@@ -324,6 +327,9 @@ struct arch_domain
 
 struct pv_vcpu
 {
+    /* map_domain_page() mapping cache. */
+    struct mapcache_vcpu mapcache;
+
     struct trap_info *trap_ctxt;
 
     unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
--- a/xen/include/xen/domain_page.h
+++ b/xen/include/xen/domain_page.h
@@ -25,11 +25,16 @@ void *map_domain_page(unsigned long mfn)
  */
 void unmap_domain_page(const void *va);
 
+/*
+ * Clear a given page frame, or copy between two of them.
+ */
+void clear_domain_page(unsigned long mfn);
+void copy_domain_page(unsigned long dmfn, unsigned long smfn);
 
 /*
  * Given a VA from map_domain_page(), return its underlying MFN.
  */
-unsigned long domain_page_map_to_mfn(void *va);
+unsigned long domain_page_map_to_mfn(const void *va);
 
 /*
  * Similar to the above calls, except the mapping is accessible in all
@@ -107,6 +112,9 @@ domain_mmap_cache_destroy(struct domain_
 #define map_domain_page(mfn)                mfn_to_virt(mfn)
 #define __map_domain_page(pg)               page_to_virt(pg)
 #define unmap_domain_page(va)               ((void)(va))
+#define clear_domain_page(mfn)              clear_page(mfn_to_virt(mfn))
+#define copy_domain_page(dmfn, smfn)        copy_page(mfn_to_virt(dmfn), \
+                                                      mfn_to_virt(smfn))
 #define domain_page_map_to_mfn(va)          virt_to_mfn((unsigned long)(va))
 
 #define map_domain_page_global(mfn)         mfn_to_virt(mfn)
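
The slot recycling scheme used by map_domain_page() and map_domain_page_global() above (an "inuse" bitmap, a "garbage" bitmap that is folded back only when the cursor wraps, and an epoch bump at the point where the real code flushes TLBs) may be easier to follow in a stripped-down model. The following is a rough user-space sketch only: struct slot_cache, slot_alloc(), slot_free() and the 8-slot size are hypothetical names and values chosen for illustration, and locking, the per-vCPU maphash, PTE writes and the actual TLB flush are all left out.

```c
#include <stdint.h>

#define SLOTS 8u  /* hypothetical, deliberately tiny cache size */

struct slot_cache {
    uint32_t inuse;       /* bit set: slot handed out, or freed but not reaped */
    uint32_t garbage;     /* bit set: slot freed, waiting for the next reap */
    unsigned int cursor;  /* next index to try, like dcache->cursor */
    unsigned int epoch;   /* bumped per reap; real code flushes TLBs here */
};

/* Return a free slot index, or -1 if every slot is genuinely in use. */
static int slot_alloc(struct slot_cache *c)
{
    unsigned int i;

    /* Fast path: scan forward from the cursor without reaping. */
    for ( i = c->cursor; i < SLOTS; i++ )
        if ( !(c->inuse & (1u << i)) )
            goto found;

    /* Wrapped: first fold the garbage map into the inuse map ... */
    c->inuse &= ~c->garbage;
    c->garbage = 0;
    /* ... then account for the flush the real code would issue. */
    c->epoch++;

    for ( i = 0; i < SLOTS; i++ )
        if ( !(c->inuse & (1u << i)) )
            goto found;

    return -1;

 found:
    c->inuse |= 1u << i;
    c->cursor = i + 1;
    return (int)i;
}

/* Releasing only marks the slot as garbage; it is reused after a reap. */
static void slot_free(struct slot_cache *c, int idx)
{
    c->garbage |= 1u << idx;
}
```

The point of deferring reuse this way is that a freed slot may still have a stale translation cached, so it must not be handed out again until a flush has happened; batching that work at wrap time keeps the common allocation path cheap.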
Jan Beulich
2013-Jan-22 10:51 UTC
[PATCH 04/11] x86: properly use map_domain_page() when building Dom0
This requires a minor hack to allow the correct page tables to be used
while running on Dom0's page tables (as they can't be determined from
"current" at that time).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -621,8 +621,10 @@ int __init construct_dom0(
         maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
     }
-    copy_page(l4tab, idle_pg_table);
-    l4tab[0] = l4e_empty(); /* zap trampoline mapping */
+    clear_page(l4tab);
+    for ( i = l4_table_offset(HYPERVISOR_VIRT_START);
+          i < l4_table_offset(HYPERVISOR_VIRT_END); ++i )
+        l4tab[i] = idle_pg_table[i];
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_paddr(__pa(l4start), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
@@ -766,6 +768,7 @@ int __init construct_dom0(
 
     /* We run on dom0's page tables for the final part of the build process. */
     write_ptbase(v);
+    mapcache_override_current(v);
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest = (void*)vkern_start;
@@ -782,6 +785,7 @@ int __init construct_dom0(
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
+            mapcache_override_current(NULL);
             write_ptbase(current);
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -1;
@@ -811,6 +815,10 @@ int __init construct_dom0(
              elf_64bit(&elf) ? 64 : 32, parms.pae ? "p" : "");
 
     count = d->tot_pages;
+    l4start = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+    l3tab = NULL;
+    l2tab = NULL;
+    l1tab = NULL;
     /* Set up the phys->machine table if not part of the initial mapping. */
     if ( parms.p2m_base != UNSET_ADDR )
     {
@@ -825,6 +833,21 @@ int __init construct_dom0(
                                  >> PAGE_SHIFT) + 3 > nr_pages )
                 panic("Dom0 allocation too small for initial P->M table.\n");
 
+            if ( l1tab )
+            {
+                unmap_domain_page(l1tab);
+                l1tab = NULL;
+            }
+            if ( l2tab )
+            {
+                unmap_domain_page(l2tab);
+                l2tab = NULL;
+            }
+            if ( l3tab )
+            {
+                unmap_domain_page(l3tab);
+                l3tab = NULL;
+            }
             l4tab = l4start + l4_table_offset(va);
             if ( !l4e_get_intpte(*l4tab) )
             {
@@ -835,10 +858,11 @@ int __init construct_dom0(
                 page->count_info = PGC_allocated | 2;
                 page->u.inuse.type_info =
                     PGT_l3_page_table | PGT_validated | 1;
-                clear_page(page_to_virt(page));
+                l3tab = __map_domain_page(page);
+                clear_page(l3tab);
                 *l4tab = l4e_from_page(page, L4_PROT);
-            }
-            l3tab = page_to_virt(l4e_get_page(*l4tab));
+            } else
+                l3tab = map_domain_page(l4e_get_pfn(*l4tab));
             l3tab += l3_table_offset(va);
             if ( !l3e_get_intpte(*l3tab) )
             {
@@ -857,17 +881,16 @@ int __init construct_dom0(
                 }
                 if ( (page = alloc_domheap_page(d, 0)) == NULL )
                     break;
-                else
-                {
-                    /* No mapping, PGC_allocated + page-table page. */
-                    page->count_info = PGC_allocated | 2;
-                    page->u.inuse.type_info =
-                        PGT_l2_page_table | PGT_validated | 1;
-                    clear_page(page_to_virt(page));
-                    *l3tab = l3e_from_page(page, L3_PROT);
-                }
+                /* No mapping, PGC_allocated + page-table page. */
+                page->count_info = PGC_allocated | 2;
+                page->u.inuse.type_info =
+                    PGT_l2_page_table | PGT_validated | 1;
+                l2tab = __map_domain_page(page);
+                clear_page(l2tab);
+                *l3tab = l3e_from_page(page, L3_PROT);
             }
-            l2tab = page_to_virt(l3e_get_page(*l3tab));
+            else
+                l2tab = map_domain_page(l3e_get_pfn(*l3tab));
             l2tab += l2_table_offset(va);
             if ( !l2e_get_intpte(*l2tab) )
             {
@@ -887,17 +910,16 @@ int __init construct_dom0(
                 }
                 if ( (page = alloc_domheap_page(d, 0)) == NULL )
                     break;
-                else
-                {
-                    /* No mapping, PGC_allocated + page-table page. */
-                    page->count_info = PGC_allocated | 2;
-                    page->u.inuse.type_info =
-                        PGT_l1_page_table | PGT_validated | 1;
-                    clear_page(page_to_virt(page));
-                    *l2tab = l2e_from_page(page, L2_PROT);
-                }
+                /* No mapping, PGC_allocated + page-table page. */
+                page->count_info = PGC_allocated | 2;
+                page->u.inuse.type_info =
+                    PGT_l1_page_table | PGT_validated | 1;
+                l1tab = __map_domain_page(page);
+                clear_page(l1tab);
+                *l2tab = l2e_from_page(page, L2_PROT);
             }
-            l1tab = page_to_virt(l2e_get_page(*l2tab));
+            else
+                l1tab = map_domain_page(l2e_get_pfn(*l2tab));
             l1tab += l1_table_offset(va);
             BUG_ON(l1e_get_intpte(*l1tab));
             page = alloc_domheap_page(d, 0);
@@ -911,6 +933,14 @@ int __init construct_dom0(
             panic("Not enough RAM for DOM0 P->M table.\n");
     }
 
+    if ( l1tab )
+        unmap_domain_page(l1tab);
+    if ( l2tab )
+        unmap_domain_page(l2tab);
+    if ( l3tab )
+        unmap_domain_page(l3tab);
+    unmap_domain_page(l4start);
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
@@ -1000,6 +1030,7 @@ int __init construct_dom0(
     xlat_start_info(si, XLAT_start_info_console_dom0);
 
     /* Return to idle domain's page tables. */
+    mapcache_override_current(NULL);
     write_ptbase(current);
 
     update_domain_wallclock_time(d);
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -15,10 +15,12 @@
 #include <asm/flushtlb.h>
 #include <asm/hardirq.h>
 
+static struct vcpu *__read_mostly override;
+
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
     /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = current;
+    struct vcpu *v = override ?: current;
 
     /*
      * When current isn't properly set up yet, this is equivalent to
@@ -44,6 +46,11 @@ static inline struct vcpu *mapcache_curr
     return v;
 }
 
+void __init mapcache_override_current(struct vcpu *v)
+{
+    override = v;
+}
+
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
 #define MAPCACHE_L2_ENTRIES (mapcache_l2_entry(MAPCACHE_ENTRIES - 1) + 1)
 #define DCACHE_L1ENT(dc, idx) \
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -73,6 +73,7 @@ struct mapcache_domain {
 int mapcache_domain_init(struct domain *);
 void mapcache_domain_exit(struct domain *);
 int mapcache_vcpu_init(struct vcpu *);
+void mapcache_override_current(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
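
The override hack in this patch can be pictured with a small standalone model. The sketch below is not Xen code: `struct vcpu` is reduced to a single field, and `current_vcpu` and `override_vcpu` are assumed stand-ins for Xen's `current` and the new static `override`. The patch itself uses the GNU `?:` shorthand; the portable ternary here should be equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for Xen's struct vcpu; only identity matters here. */
struct vcpu { int id; };

static struct vcpu *current_vcpu;   /* models "current" */
static struct vcpu *override_vcpu;  /* models the patch's "override" */

/* Models mapcache_override_current(): pass NULL to drop the override. */
static void mapcache_override_current(struct vcpu *v)
{
    override_vcpu = v;
}

/* Models the lookup: prefer the override, fall back to the running vCPU. */
static struct vcpu *mapcache_current_vcpu(void)
{
    return override_vcpu ? override_vcpu : current_vcpu;
}
```

During the Dom0 build the page tables in use belong to Dom0's vCPU while "current" still names the idle vCPU, so the override is set right after `write_ptbase(v)` and cleared again before every `write_ptbase(current)`.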
Jan Beulich
2013-Jan-22 10:52 UTC
[PATCH 05/11] x86: consolidate initialization of PV guest L4 page tables
So far this has been repeated in 3 places, which required remembering
to update all of them whenever a change was made.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -290,13 +290,8 @@ static int setup_compat_l4(struct vcpu *
     pg->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
 
     l4tab = page_to_virt(pg);
-    copy_page(l4tab, idle_pg_table);
-    l4tab[0] = l4e_empty();
-    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_page(pg, __PAGE_HYPERVISOR);
-    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_paddr(__pa(v->domain->arch.mm_perdomain_l3),
-                       __PAGE_HYPERVISOR);
+    clear_page(l4tab);
+    init_guest_l4_table(l4tab, v->domain);
 
     v->arch.guest_table = pagetable_from_page(pg);
     v->arch.guest_table_user = v->arch.guest_table;
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -622,13 +622,7 @@ int __init construct_dom0(
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
     }
     clear_page(l4tab);
-    for ( i = l4_table_offset(HYPERVISOR_VIRT_START);
-          i < l4_table_offset(HYPERVISOR_VIRT_END); ++i )
-        l4tab[i] = idle_pg_table[i];
-    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_paddr(__pa(l4start), __PAGE_HYPERVISOR);
-    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_paddr(__pa(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+    init_guest_l4_table(l4tab, d);
     v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
     if ( is_pv_32on64_domain(d) )
         v->arch.guest_table_user = v->arch.guest_table;
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1315,6 +1315,18 @@ static int alloc_l3_table(struct page_in
     return rc > 0 ? 0 : rc;
 }
 
+void init_guest_l4_table(l4_pgentry_t l4tab[], const struct domain *d)
+{
+    /* Xen private mappings. */
+    memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
+           &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
+           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
+    l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
+        l4e_from_pfn(virt_to_mfn(l4tab), __PAGE_HYPERVISOR);
+    l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
+        l4e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+}
+
 static int alloc_l4_table(struct page_info *page, int preemptible)
 {
     struct domain *d = page_get_owner(page);
@@ -1358,15 +1370,7 @@ static int alloc_l4_table(struct page_in
         adjust_guest_l4e(pl4e[i], d);
     }
 
-    /* Xen private mappings. */
-    memcpy(&pl4e[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
-    pl4e[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_pfn(pfn, __PAGE_HYPERVISOR);
-    pl4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(virt_to_page(d->arch.mm_perdomain_l3),
-                      __PAGE_HYPERVISOR);
+    init_guest_l4_table(pl4e, d);
 
     return rc > 0 ? 0 : rc;
 }
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -316,6 +316,8 @@ static inline void *__page_to_virt(const
 int free_page_type(struct page_info *page, unsigned long type,
                    int preemptible);
 
+void init_guest_l4_table(l4_pgentry_t[], const struct domain *);
+
 int is_iomem_page(unsigned long mfn);
 
 void clear_superpage_mark(struct page_info *page);

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
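
To make the shape of the consolidated helper easier to see in isolation, here is a standalone sketch of the same three steps: copy the Xen-private range of slots from the idle table, then install the linear and per-domain entries. Everything prefixed `SK_` or `sk_` is hypothetical, including the slot numbers, and entries are plain 64-bit integers rather than real `l4_pgentry_t` values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout only; these slot numbers are assumptions. */
#define SK_L4_ENTRIES      512
#define SK_FIRST_XEN_SLOT  256
#define SK_XEN_SLOTS       16
#define SK_LINEAR_SLOT     258
#define SK_PERDOMAIN_SLOT  260

typedef uint64_t sk_l4e_t;

/* Stand-in for idle_pg_table. */
static sk_l4e_t sk_idle_l4[SK_L4_ENTRIES];

/*
 * Fill the hypervisor-owned part of a guest L4 table that the caller has
 * already cleared: shared Xen slots first, then the two per-table entries.
 */
static void sk_init_guest_l4(sk_l4e_t l4tab[], sk_l4e_t self,
                             sk_l4e_t perdomain)
{
    memcpy(&l4tab[SK_FIRST_XEN_SLOT], &sk_idle_l4[SK_FIRST_XEN_SLOT],
           SK_XEN_SLOTS * sizeof(sk_l4e_t));
    l4tab[SK_LINEAR_SLOT] = self;
    l4tab[SK_PERDOMAIN_SLOT] = perdomain;
}
```

Because only the Xen range is copied, slot 0 of the idle table (the trampoline mapping that the old code had to zap by hand) never reaches the guest table, provided the caller clears the page first as all three call sites in the patch do.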
Jan Beulich
2013-Jan-22 10:53 UTC
[PATCH 06/11] x86: properly use map_domain_page() during domain creation/destruction
This involves no longer storing virtual addresses of the per-domain
mapping L2 and L3 page tables.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -289,9 +289,10 @@ static int setup_compat_l4(struct vcpu *
     /* This page needs to look like a pagetable so that it can be shadowed */
     pg->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
 
-    l4tab = page_to_virt(pg);
+    l4tab = __map_domain_page(pg);
     clear_page(l4tab);
     init_guest_l4_table(l4tab, v->domain);
+    unmap_domain_page(l4tab);
 
     v->arch.guest_table = pagetable_from_page(pg);
     v->arch.guest_table_user = v->arch.guest_table;
@@ -383,17 +384,22 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.flags = TF_kernel_mode;
 
-    idx = perdomain_pt_pgidx(v);
-    if ( !perdomain_pt_page(d, idx) )
+    idx = perdomain_pt_idx(v);
+    if ( !d->arch.perdomain_pts[idx] )
     {
-        struct page_info *pg;
-        pg = alloc_domheap_page(NULL, MEMF_node(vcpu_to_node(v)));
-        if ( !pg )
+        void *pt;
+        l2_pgentry_t *l2tab;
+
+        pt = alloc_xenheap_pages(0, MEMF_node(vcpu_to_node(v)));
+        if ( !pt )
             return -ENOMEM;
-        clear_page(page_to_virt(pg));
-        perdomain_pt_page(d, idx) = pg;
-        d->arch.mm_perdomain_l2[0][l2_table_offset(PERDOMAIN_VIRT_START)+idx]
-            = l2e_from_page(pg, __PAGE_HYPERVISOR);
+        clear_page(pt);
+        d->arch.perdomain_pts[idx] = pt;
+
+        l2tab = __map_domain_page(d->arch.perdomain_l2_pg[0]);
+        l2tab[l2_table_offset(PERDOMAIN_VIRT_START) + idx]
+            = l2e_from_paddr(__pa(pt), __PAGE_HYPERVISOR);
+        unmap_domain_page(l2tab);
     }
 
     rc = mapcache_vcpu_init(v);
@@ -484,6 +490,7 @@ void vcpu_destroy(struct vcpu *v)
 int arch_domain_create(struct domain *d, unsigned int domcr_flags)
 {
     struct page_info *pg;
+    l3_pgentry_t *l3tab;
     int i, paging_initialised = 0;
     int rc = -ENOMEM;
 
@@ -514,28 +521,29 @@ int arch_domain_create(struct domain *d,
                d->domain_id);
     }
 
-    BUILD_BUG_ON(PDPT_L2_ENTRIES * sizeof(*d->arch.mm_perdomain_pt_pages)
+    BUILD_BUG_ON(PDPT_L2_ENTRIES * sizeof(*d->arch.perdomain_pts)
                  != PAGE_SIZE);
-    pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
-    if ( !pg )
+    d->arch.perdomain_pts =
+        alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
+    if ( !d->arch.perdomain_pts )
         goto fail;
-    d->arch.mm_perdomain_pt_pages = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_pt_pages);
+    clear_page(d->arch.perdomain_pts);
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l2[0] = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l2[0]);
+    d->arch.perdomain_l2_pg[0] = pg;
+    clear_domain_page(page_to_mfn(pg));
 
     pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
     if ( pg == NULL )
         goto fail;
-    d->arch.mm_perdomain_l3 = page_to_virt(pg);
-    clear_page(d->arch.mm_perdomain_l3);
-    d->arch.mm_perdomain_l3[l3_table_offset(PERDOMAIN_VIRT_START)] =
-        l3e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l2[0]),
-                     __PAGE_HYPERVISOR);
+    d->arch.perdomain_l3_pg = pg;
+    l3tab = __map_domain_page(pg);
+    clear_page(l3tab);
+    l3tab[l3_table_offset(PERDOMAIN_VIRT_START)] =
+        l3e_from_page(d->arch.perdomain_l2_pg[0], __PAGE_HYPERVISOR);
+    unmap_domain_page(l3tab);
 
     mapcache_domain_init(d);
 
@@ -611,12 +619,12 @@ int arch_domain_create(struct domain *d,
     if ( paging_initialised )
         paging_final_teardown(d);
     mapcache_domain_exit(d);
-    if ( d->arch.mm_perdomain_l2[0] )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
-    if ( d->arch.mm_perdomain_l3 )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
-    if ( d->arch.mm_perdomain_pt_pages )
-        free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
+    for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
+        if ( d->arch.perdomain_l2_pg[i] )
+            free_domheap_page(d->arch.perdomain_l2_pg[i]);
+    if ( d->arch.perdomain_l3_pg )
+        free_domheap_page(d->arch.perdomain_l3_pg);
+    free_xenheap_page(d->arch.perdomain_pts);
     return rc;
 }
 
@@ -638,13 +646,12 @@ void arch_domain_destroy(struct domain *
 
     mapcache_domain_exit(d);
 
     for ( i = 0; i < PDPT_L2_ENTRIES; ++i )
-    {
-        if ( perdomain_pt_page(d, i) )
-            free_domheap_page(perdomain_pt_page(d, i));
-    }
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_pt_pages));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l2[0]));
-    free_domheap_page(virt_to_page(d->arch.mm_perdomain_l3));
+        free_xenheap_page(d->arch.perdomain_pts[i]);
+    free_xenheap_page(d->arch.perdomain_pts);
+    for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
+        if ( d->arch.perdomain_l2_pg[i] )
+            free_domheap_page(d->arch.perdomain_l2_pg[i]);
+    free_domheap_page(d->arch.perdomain_l3_pg);
     free_xenheap_page(d->shared_info);
     cleanup_domain_irq_mapping(d);
@@ -810,9 +817,10 @@ int arch_set_info_guest(
                 fail |= xen_pfn_to_cr3(pfn) != c.nat->ctrlreg[1];
             }
         } else {
-            l4_pgentry_t *l4tab = __va(pfn_to_paddr(pfn));
+            l4_pgentry_t *l4tab = map_domain_page(pfn);
 
             pfn = l4e_get_pfn(*l4tab);
+            unmap_domain_page(l4tab);
             fail = compat_pfn_to_cr3(pfn) != c.cmp->ctrlreg[3];
         }
 
@@ -951,9 +959,10 @@ int arch_set_info_guest(
             return -EINVAL;
         }
 
-        l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+        l4tab = map_domain_page(pagetable_get_pfn(v->arch.guest_table));
         *l4tab = l4e_from_pfn(page_to_mfn(cr3_page),
             _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED);
+        unmap_domain_page(l4tab);
     }
 
     if ( v->vcpu_id == 0 )
@@ -1971,12 +1980,13 @@ static int relinquish_memory(
 static void vcpu_destroy_pagetables(struct vcpu *v)
 {
     struct domain *d = v->domain;
-    unsigned long pfn;
+    unsigned long pfn = pagetable_get_pfn(v->arch.guest_table);
 
     if ( is_pv_32on64_vcpu(v) )
     {
-        pfn = l4e_get_pfn(*(l4_pgentry_t *)
-                          __va(pagetable_get_paddr(v->arch.guest_table)));
+        l4_pgentry_t *l4tab = map_domain_page(pfn);
+
+        pfn = l4e_get_pfn(*l4tab);
 
         if ( pfn != 0 )
         {
@@ -1986,15 +1996,12 @@ static void vcpu_destroy_pagetables(stru
                 put_page_and_type(mfn_to_page(pfn));
         }
 
-        l4e_write(
-            (l4_pgentry_t *)__va(pagetable_get_paddr(v->arch.guest_table)),
-            l4e_empty());
+        l4e_write(l4tab, l4e_empty());
 
         v->arch.cr3 = 0;
         return;
     }
 
-    pfn = pagetable_get_pfn(v->arch.guest_table);
     if ( pfn != 0 )
     {
         if ( paging_mode_refcounts(d) )
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -241,6 +241,8 @@ void copy_domain_page(unsigned long dmfn
 int mapcache_domain_init(struct domain *d)
 {
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    l3_pgentry_t *l3tab;
+    l2_pgentry_t *l2tab;
     unsigned int i, bitmap_pages, memf = MEMF_node(domain_to_node(d));
     unsigned long *end;
 
@@ -251,14 +253,18 @@ int mapcache_domain_init(struct domain *
         return 0;
 
     dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
-    d->arch.mm_perdomain_l2[MAPCACHE_SLOT] = alloc_xenheap_pages(0, memf);
-    if ( !dcache->l1tab || !d->arch.mm_perdomain_l2[MAPCACHE_SLOT] )
+    d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf);
+    if ( !dcache->l1tab || !d->arch.perdomain_l2_pg[MAPCACHE_SLOT] )
         return -ENOMEM;
 
-    clear_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
-    d->arch.mm_perdomain_l3[l3_table_offset(MAPCACHE_VIRT_START)] =
-        l3e_from_paddr(__pa(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]),
-                       __PAGE_HYPERVISOR);
+    clear_domain_page(page_to_mfn(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]));
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    l3tab[l3_table_offset(MAPCACHE_VIRT_START)] =
+        l3e_from_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT],
+                      __PAGE_HYPERVISOR);
+    unmap_domain_page(l3tab);
+
+    l2tab = __map_domain_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]);
 
     BUILD_BUG_ON(MAPCACHE_VIRT_END + 3 +
                  2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)) >
@@ -275,12 +281,16 @@ int mapcache_domain_init(struct domain *
         ASSERT(i <= MAPCACHE_L2_ENTRIES);
         dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
         if ( !dcache->l1tab[i] )
+        {
+            unmap_domain_page(l2tab);
             return -ENOMEM;
+        }
         clear_page(dcache->l1tab[i]);
-        d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
-            l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+        l2tab[i] = l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
     }
 
+    unmap_domain_page(l2tab);
+
     spin_lock_init(&dcache->lock);
 
     return 0;
@@ -315,19 +325,21 @@ void mapcache_domain_exit(struct domain
 
         xfree(dcache->l1tab);
     }
-    free_xenheap_page(d->arch.mm_perdomain_l2[MAPCACHE_SLOT]);
 }
 
 int mapcache_vcpu_init(struct vcpu *v)
 {
     struct domain *d = v->domain;
     struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache;
+    l2_pgentry_t *l2tab;
     unsigned long i;
     unsigned int memf = MEMF_node(vcpu_to_node(v));
 
     if ( is_hvm_vcpu(v) || !dcache->l1tab )
         return 0;
 
+    l2tab = __map_domain_page(d->arch.perdomain_l2_pg[MAPCACHE_SLOT]);
+
     while ( dcache->entries < d->max_vcpus * MAPCACHE_VCPU_ENTRIES )
     {
         unsigned int ents = dcache->entries + MAPCACHE_VCPU_ENTRIES;
@@ -338,10 +350,13 @@ int mapcache_vcpu_init(struct vcpu *v)
         {
             dcache->l1tab[i] = alloc_xenheap_pages(0, memf);
             if ( !dcache->l1tab[i] )
+            {
+                unmap_domain_page(l2tab);
                 return -ENOMEM;
+            }
             clear_page(dcache->l1tab[i]);
-            d->arch.mm_perdomain_l2[MAPCACHE_SLOT][i] =
-                l2e_from_paddr(__pa(dcache->l1tab[i]), __PAGE_HYPERVISOR);
+            l2tab[i] = l2e_from_paddr(__pa(dcache->l1tab[i]),
+                                      __PAGE_HYPERVISOR);
         }
 
         /* Populate bit maps. */
@@ -351,18 +366,22 @@ int mapcache_vcpu_init(struct vcpu *v)
         {
             struct page_info *pg = alloc_domheap_page(NULL, memf);
 
+            if ( pg )
+            {
+                clear_domain_page(page_to_mfn(pg));
+                *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+                pg = alloc_domheap_page(NULL, memf);
+            }
             if ( !pg )
+            {
+                unmap_domain_page(l2tab);
                 return -ENOMEM;
-            clear_domain_page(page_to_mfn(pg));
-            *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
+            }
 
             i = (unsigned long)(dcache->garbage + BITS_TO_LONGS(ents));
             pl1e = &dcache->l1tab[l2_table_offset(i)][l1_table_offset(i)];
             ASSERT(!l1e_get_flags(*pl1e));
 
-            pg = alloc_domheap_page(NULL, memf);
-            if ( !pg )
-                return -ENOMEM;
             clear_domain_page(page_to_mfn(pg));
             *pl1e = l1e_from_page(pg, __PAGE_HYPERVISOR);
         }
@@ -370,6 +389,8 @@ int mapcache_vcpu_init(struct vcpu *v)
         dcache->entries = ents;
     }
 
+    unmap_domain_page(l2tab);
+
     /* Mark all maphash entries as not in use. */
     BUILD_BUG_ON(MAPHASHENT_NOTINUSE < MAPCACHE_ENTRIES);
     for ( i = 0; i < MAPHASH_ENTRIES; i++ )
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1322,9 +1322,9 @@ void init_guest_l4_table(l4_pgentry_t l4
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
-        l4e_from_pfn(virt_to_mfn(l4tab), __PAGE_HYPERVISOR);
+        l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_pfn(virt_to_mfn(d->arch.mm_perdomain_l3), __PAGE_HYPERVISOR);
+        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR);
 }
 
 static int alloc_l4_table(struct page_info *page, int preemptible)
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -369,7 +369,7 @@ static void hap_install_xen_entries_in_l
 
     /* Install the per-domain mappings for this domain */
     l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_pfn(mfn_x(page_to_mfn(virt_to_page(d->arch.mm_perdomain_l3))),
+        l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
                      __PAGE_HYPERVISOR);
 
     /* Install a linear mapping */
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1449,7 +1449,7 @@ void sh_install_xen_entries_in_l4(struct
 
     /* Install the per-domain mappings for this domain */
     sl4e[shadow_l4_table_offset(PERDOMAIN_VIRT_START)] =
-        shadow_l4e_from_mfn(page_to_mfn(virt_to_page(d->arch.mm_perdomain_l3)),
+        shadow_l4e_from_mfn(page_to_mfn(d->arch.perdomain_l3_pg),
                             __PAGE_HYPERVISOR);
 
     /* Shadow linear mapping for 4-level shadows.  N.B. for 3-level
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -823,9 +823,8 @@ void __init setup_idle_pagetable(void)
 {
     /* Install per-domain mappings for idle domain. */
     l4e_write(&idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)],
-              l4e_from_page(
-                  virt_to_page(idle_vcpu[0]->domain->arch.mm_perdomain_l3),
-                  __PAGE_HYPERVISOR));
+              l4e_from_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg,
+                            __PAGE_HYPERVISOR));
 }
 
 void __init zap_low_mappings(void)
@@ -850,21 +849,18 @@ void *compat_arg_xlat_virt_base(void)
 
 int setup_compat_arg_xlat(struct vcpu *v)
 {
     unsigned int order = get_order_from_bytes(COMPAT_ARG_XLAT_SIZE);
-    struct page_info *pg;
 
-    pg = alloc_domheap_pages(NULL, order, 0);
-    if ( pg == NULL )
-        return -ENOMEM;
+    v->arch.compat_arg_xlat = alloc_xenheap_pages(order,
+                                                  MEMF_node(vcpu_to_node(v)));
 
-    v->arch.compat_arg_xlat = page_to_virt(pg);
-    return 0;
+    return v->arch.compat_arg_xlat ? 0 : -ENOMEM;
 }
 
 void free_compat_arg_xlat(struct vcpu *v)
 {
     unsigned int order = get_order_from_bytes(COMPAT_ARG_XLAT_SIZE);
-    if ( v->arch.compat_arg_xlat != NULL )
-        free_domheap_pages(virt_to_page(v->arch.compat_arg_xlat), order);
+
+    free_xenheap_pages(v->arch.compat_arg_xlat, order);
     v->arch.compat_arg_xlat = NULL;
 }
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -241,9 +241,9 @@ struct pv_domain
 
 struct arch_domain
 {
-    struct page_info **mm_perdomain_pt_pages;
-    l2_pgentry_t *mm_perdomain_l2[PERDOMAIN_SLOTS];
-    l3_pgentry_t *mm_perdomain_l3;
+    void **perdomain_pts;
+    struct page_info *perdomain_l2_pg[PERDOMAIN_SLOTS];
+    struct page_info *perdomain_l3_pg;
 
     unsigned int hv_compat_vstart;
 
@@ -318,13 +318,11 @@ struct arch_domain
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
 #define has_arch_mmios(d)    (!rangeset_is_empty((d)->iomem_caps))
 
-#define perdomain_pt_pgidx(v) \
+#define perdomain_pt_idx(v) \
       ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT))
 #define perdomain_ptes(d, v) \
-    ((l1_pgentry_t *)page_to_virt((d)->arch.mm_perdomain_pt_pages \
-      [perdomain_pt_pgidx(v)]) + (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & \
-                                  (L1_PAGETABLE_ENTRIES - 1)))
-#define perdomain_pt_page(d, n) ((d)->arch.mm_perdomain_pt_pages[n])
+    ((l1_pgentry_t *)(d)->arch.perdomain_pts[perdomain_pt_idx(v)] + \
+     (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & (L1_PAGETABLE_ENTRIES - 1)))
 
 struct pv_vcpu
 {

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
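
Much of this patch is bookkeeping: once a table is reached through a temporary mapping instead of a stored virtual address, every early return has to drop that mapping first. The following standalone sketch models just that discipline. It is an illustration under assumptions, not Xen code: `sk_map_l2()`/`sk_unmap_l2()` stand in for `__map_domain_page()`/`unmap_domain_page()`, the `budget` parameter simulates an allocator that can run out, and the counter exists only so the balance can be checked.

```c
#include <assert.h>
#include <errno.h>

#define SK_L2_ENTRIES 8

/* Hypothetical backing store for one L2 table. */
static unsigned long sk_l2_frame[SK_L2_ENTRIES];
static int sk_live_maps;   /* temporary mappings currently outstanding */

static unsigned long *sk_map_l2(void)
{
    ++sk_live_maps;
    return sk_l2_frame;
}

static void sk_unmap_l2(const void *va)
{
    (void)va;
    --sk_live_maps;
}

/*
 * Fill the first 'count' L2 slots.  'budget' is how many allocations may
 * succeed before the simulated allocator fails.
 */
static int sk_populate(unsigned int count, unsigned int budget)
{
    unsigned long *l2tab;
    unsigned int i;

    if ( count > SK_L2_ENTRIES )
        return -EINVAL;

    l2tab = sk_map_l2();
    for ( i = 0; i < count; ++i )
    {
        if ( budget == 0 )
        {
            /* Error path: release the mapping before bailing out. */
            sk_unmap_l2(l2tab);
            return -ENOMEM;
        }
        --budget;
        l2tab[i] = ((i + 1UL) << 12) | 1;   /* fake "address | present" */
    }
    sk_unmap_l2(l2tab);

    return 0;
}
```

The same shape shows up in `mapcache_domain_init()` and `mapcache_vcpu_init()` above, where each `return -ENOMEM` gains a preceding `unmap_domain_page(l2tab)`.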
Jan Beulich
2013-Jan-22 10:55 UTC
[PATCH 07/11] x86: properly use map_domain_page() during page table manipulation
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/debug.c
+++ b/xen/arch/x86/debug.c
@@ -98,8 +98,9 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
 
     if ( pgd3val == 0 )
     {
-        l4t = mfn_to_virt(mfn);
+        l4t = map_domain_page(mfn);
         l4e = l4t[l4_table_offset(vaddr)];
+        unmap_domain_page(l4t);
         mfn = l4e_get_pfn(l4e);
         DBGP2("l4t:%p l4to:%lx l4e:%lx mfn:%lx\n", l4t,
               l4_table_offset(vaddr), l4e, mfn);
@@ -109,20 +110,23 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
             return INVALID_MFN;
         }
 
-        l3t = mfn_to_virt(mfn);
+        l3t = map_domain_page(mfn);
         l3e = l3t[l3_table_offset(vaddr)];
+        unmap_domain_page(l3t);
         mfn = l3e_get_pfn(l3e);
         DBGP2("l3t:%p l3to:%lx l3e:%lx mfn:%lx\n", l3t,
               l3_table_offset(vaddr), l3e, mfn);
-        if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
+        if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
+             (l3e_get_flags(l3e) & _PAGE_PSE) )
         {
             DBGP1("l3 PAGE not present. vaddr:%lx cr3:%lx\n", vaddr, cr3);
             return INVALID_MFN;
         }
     }
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(vaddr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     DBGP2("l2t:%p l2to:%lx l2e:%lx mfn:%lx\n", l2t, l2_table_offset(vaddr),
           l2e, mfn);
@@ -132,8 +136,9 @@ dbg_pv_va2mfn(dbgva_t vaddr, struct doma
         DBGP1("l2 PAGE not present. vaddr:%lx cr3:%lx\n", vaddr, cr3);
         return INVALID_MFN;
     }
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(vaddr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     DBGP2("l1t:%p l1to:%lx l1e:%lx mfn:%lx\n", l1t, l1_table_offset(vaddr),
           l1e, mfn);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1331,7 +1331,7 @@ static int alloc_l4_table(struct page_in
 {
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
-    l4_pgentry_t *pl4e = page_to_virt(page);
+    l4_pgentry_t *pl4e = map_domain_page(pfn);
     unsigned int i;
     int rc = 0, partial = page->partial_pte;
 
@@ -1365,12 +1365,16 @@ static int alloc_l4_table(struct page_in
                 put_page_from_l4e(pl4e[i], pfn, 0, 0);
         }
         if ( rc < 0 )
+        {
+            unmap_domain_page(pl4e);
             return rc;
+        }
 
         adjust_guest_l4e(pl4e[i], d);
     }
 
     init_guest_l4_table(pl4e, d);
+    unmap_domain_page(pl4e);
 
     return rc > 0 ? 0 : rc;
 }
@@ -1464,7 +1468,7 @@ static int free_l4_table(struct page_inf
 {
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
-    l4_pgentry_t *pl4e = page_to_virt(page);
+    l4_pgentry_t *pl4e = map_domain_page(pfn);
     int rc = 0, partial = page->partial_pte;
     unsigned int i = page->nr_validated_ptes - !partial;
 
@@ -1487,6 +1491,9 @@ static int free_l4_table(struct page_inf
         page->partial_pte = 0;
         rc = -EAGAIN;
     }
+
+    unmap_domain_page(pl4e);
+
     return rc > 0 ? 0 : rc;
 }
 
@@ -4983,15 +4990,23 @@ int mmio_ro_do_page_fault(struct vcpu *v
     return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
 }
 
-void free_xen_pagetable(void *v)
+void *alloc_xen_pagetable(void)
 {
-    if ( system_state == SYS_STATE_early_boot )
-        return;
+    if ( system_state != SYS_STATE_early_boot )
+    {
+        void *ptr = alloc_xenheap_page();
 
-    if ( is_xen_heap_page(virt_to_page(v)) )
+        BUG_ON(!dom0 && !ptr);
+        return ptr;
+    }
+
+    return mfn_to_virt(alloc_boot_pages(1, 1));
+}
+
+void free_xen_pagetable(void *v)
+{
+    if ( system_state != SYS_STATE_early_boot )
         free_xenheap_page(v);
-    else
-        free_domheap_page(virt_to_page(v));
 }
 
 /* Convert to from superpage-mapping flags for map_pages_to_xen(). */
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -180,6 +180,11 @@ static void show_guest_stack(struct vcpu
         printk(" %p", _p(addr));
         stack++;
     }
+    if ( mask == PAGE_SIZE )
+    {
+        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+        unmap_domain_page(stack);
+    }
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
--- a/xen/arch/x86/x86_64/compat/traps.c
+++ b/xen/arch/x86/x86_64/compat/traps.c
@@ -56,6 +56,11 @@ void compat_show_guest_stack(struct vcpu
         printk(" %08x", addr);
         stack++;
     }
+    if ( mask == PAGE_SIZE )
+    {
+        BUILD_BUG_ON(PAGE_SIZE == STACK_SIZE);
+        unmap_domain_page(stack);
+    }
     if ( i == 0 )
         printk("Stack empty.");
     printk("\n");
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -65,22 +65,6 @@ int __mfn_valid(unsigned long mfn)
                            pdx_group_valid));
 }
 
-void *alloc_xen_pagetable(void)
-{
-    unsigned long mfn;
-
-    if ( system_state != SYS_STATE_early_boot )
-    {
-        struct page_info *pg = alloc_domheap_page(NULL, 0);
-
-        BUG_ON(!dom0 && !pg);
-        return pg ? page_to_virt(pg) : NULL;
-    }
-
-    mfn = alloc_boot_pages(1, 1);
-    return mfn_to_virt(mfn);
-}
-
 l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
 {
     l4_pgentry_t *pl4e;
@@ -154,35 +138,45 @@ void *do_page_walk(struct vcpu *v, unsig
     if ( is_hvm_vcpu(v) )
         return NULL;
 
-    l4t = mfn_to_virt(mfn);
+    l4t = map_domain_page(mfn);
     l4e = l4t[l4_table_offset(addr)];
-    mfn = l4e_get_pfn(l4e);
+    unmap_domain_page(l4t);
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
         return NULL;
 
-    l3t = mfn_to_virt(mfn);
+    l3t = map_l3t_from_l4e(l4e);
     l3e = l3t[l3_table_offset(addr)];
+    unmap_domain_page(l3t);
     mfn = l3e_get_pfn(l3e);
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
     if ( (l3e_get_flags(l3e) & _PAGE_PSE) )
-        return mfn_to_virt(mfn) + (addr & ((1UL << L3_PAGETABLE_SHIFT) - 1));
+    {
+        mfn += PFN_DOWN(addr & ((1UL << L3_PAGETABLE_SHIFT) - 1));
+        goto ret;
+    }
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(addr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
     if ( (l2e_get_flags(l2e) & _PAGE_PSE) )
-        return mfn_to_virt(mfn) + (addr & ((1UL << L2_PAGETABLE_SHIFT) - 1));
+    {
+        mfn += PFN_DOWN(addr & ((1UL << L2_PAGETABLE_SHIFT) - 1));
+        goto ret;
+    }
 
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(addr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) || !mfn_valid(mfn) )
         return NULL;
 
-    return mfn_to_virt(mfn) + (addr & ~PAGE_MASK);
+ ret:
+    return map_domain_page(mfn) + (addr & ~PAGE_MASK);
 }
 
 void __init pfn_pdx_hole_setup(unsigned long mask)
@@ -519,10 +513,9 @@ static int setup_compat_m2p_table(struct
 static int setup_m2p_table(struct mem_hotadd_info *info)
 {
     unsigned long i, va, smap, emap;
-    unsigned int n, memflags;
+    unsigned int n;
     l2_pgentry_t *l2_ro_mpt = NULL;
     l3_pgentry_t *l3_ro_mpt = NULL;
-    struct page_info *l2_pg;
     int ret = 0;
 
     ASSERT(l4e_get_flags(idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)])
@@ -560,7 +553,6 @@ static int setup_m2p_table(struct mem_ho
         }
 
         va = RO_MPT_VIRT_START + i * sizeof(*machine_to_phys_mapping);
-        memflags = MEMF_node(phys_to_nid(i << PAGE_SHIFT));
 
         for ( n = 0; n < CNT; ++n)
             if ( mfn_valid(i + n * PDX_GROUP_COUNT) )
@@ -587,19 +579,18 @@ static int setup_m2p_table(struct mem_ho
                   l2_table_offset(va);
             else
             {
-                l2_pg = alloc_domheap_page(NULL, memflags);
-
-                if (!l2_pg)
+                l2_ro_mpt = alloc_xen_pagetable();
+                if ( !l2_ro_mpt )
                 {
                     ret = -ENOMEM;
                     goto error;
                 }
 
-                l2_ro_mpt = page_to_virt(l2_pg);
                 clear_page(l2_ro_mpt);
                 l3e_write(&l3_ro_mpt[l3_table_offset(va)],
-                          l3e_from_page(l2_pg, __PAGE_HYPERVISOR | _PAGE_USER));
-                l2_ro_mpt += l2_table_offset(va);
+                          l3e_from_paddr(__pa(l2_ro_mpt),
+                                         __PAGE_HYPERVISOR | _PAGE_USER));
+                l2_ro_mpt += l2_table_offset(va);
             }
 
             /* NB. Cannot be GLOBAL as shadow_mode_translate reuses this area. */
@@ -762,12 +753,12 @@ void __init paging_init(void)
                  l4_table_offset(HIRO_COMPAT_MPT_VIRT_START));
     l3_ro_mpt = l4e_to_l3e(idle_pg_table[l4_table_offset(
         HIRO_COMPAT_MPT_VIRT_START)]);
-    if ( (l2_pg = alloc_domheap_page(NULL, 0)) == NULL )
+    if ( (l2_ro_mpt = alloc_xen_pagetable()) == NULL )
         goto nomem;
-    compat_idle_pg_table_l2 = l2_ro_mpt = page_to_virt(l2_pg);
+    compat_idle_pg_table_l2 = l2_ro_mpt;
     clear_page(l2_ro_mpt);
     l3e_write(&l3_ro_mpt[l3_table_offset(HIRO_COMPAT_MPT_VIRT_START)],
-              l3e_from_page(l2_pg, __PAGE_HYPERVISOR));
+              l3e_from_paddr(__pa(l2_ro_mpt), __PAGE_HYPERVISOR));
     l2_ro_mpt += l2_table_offset(HIRO_COMPAT_MPT_VIRT_START);
     /* Allocate and map the compatibility mode machine-to-phys table. */
     mpt_size = (mpt_size >> 1) + (1UL << (L2_PAGETABLE_SHIFT - 1));
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -175,8 +175,9 @@ void show_page_walk(unsigned long addr)
 
     printk("Pagetable walk from %016lx:\n", addr);
 
-    l4t = mfn_to_virt(mfn);
+    l4t = map_domain_page(mfn);
     l4e = l4t[l4_table_offset(addr)];
+    unmap_domain_page(l4t);
     mfn = l4e_get_pfn(l4e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -186,8 +187,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l3t = mfn_to_virt(mfn);
+    l3t = map_domain_page(mfn);
     l3e = l3t[l3_table_offset(addr)];
+    unmap_domain_page(l3t);
     mfn = l3e_get_pfn(l3e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -199,8 +201,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l2t = mfn_to_virt(mfn);
+    l2t = map_domain_page(mfn);
     l2e = l2t[l2_table_offset(addr)];
+    unmap_domain_page(l2t);
     mfn = l2e_get_pfn(l2e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
@@ -212,8 +215,9 @@ void show_page_walk(unsigned long addr)
          !mfn_valid(mfn) )
         return;
 
-    l1t = mfn_to_virt(mfn);
+    l1t = map_domain_page(mfn);
     l1e = l1t[l1_table_offset(addr)];
+    unmap_domain_page(l1t);
     mfn = l1e_get_pfn(l1e);
     pfn = mfn_valid(mfn) && machine_to_phys_mapping_valid ?
           get_gpfn_from_mfn(mfn) : INVALID_M2P_ENTRY;
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -172,6 +172,10 @@ static inline l4_pgentry_t l4e_from_padd
 #define l3e_to_l2e(x)              ((l2_pgentry_t *)__va(l3e_get_paddr(x)))
 #define l4e_to_l3e(x)              ((l3_pgentry_t *)__va(l4e_get_paddr(x)))
 
+#define map_l1t_from_l2e(x)        ((l1_pgentry_t *)map_domain_page(l2e_get_pfn(x)))
+#define map_l2t_from_l3e(x)        ((l2_pgentry_t *)map_domain_page(l3e_get_pfn(x)))
+#define map_l3t_from_l4e(x)        ((l3_pgentry_t *)map_domain_page(l4e_get_pfn(x)))
+
 /* Given a virtual address, get an entry offset into a page table. */
 #define l1_table_offset(a)         \
     (((a) >> L1_PAGETABLE_SHIFT) & (L1_PAGETABLE_ENTRIES - 1))

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
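
The walkers converted in this patch all follow one pattern per level: map the table, copy the entry out by value, unmap, and only then inspect the copy. A toy two-level model of that pattern is sketched below. None of it is Xen code: the `sk_*` names, the four-entry tables and the single shared "window" are assumptions, and `sk_unmap()` deliberately poisons the window so that reading through a stale pointer would yield garbage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_ENTRIES  4            /* entries per table (tiny on purpose) */
#define SK_FRAMES   8            /* number of simulated page frames */
#define SK_PRESENT  1u
#define SK_INVALID  UINT32_MAX

typedef struct { uint32_t mfn; uint32_t flags; } sk_pte_t;

static sk_pte_t sk_frames[SK_FRAMES][SK_ENTRIES];  /* simulated memory */
static sk_pte_t sk_window[SK_ENTRIES];             /* one mapping slot */

/* Stand-in for map_domain_page(): expose a frame through the window. */
static const sk_pte_t *sk_map(uint32_t mfn)
{
    memcpy(sk_window, sk_frames[mfn], sizeof(sk_window));
    return sk_window;
}

/* Stand-in for unmap_domain_page(): the window contents become junk. */
static void sk_unmap(const sk_pte_t *va)
{
    (void)va;
    memset(sk_window, 0xff, sizeof(sk_window));
}

/* Two-level walk: returns the leaf frame number, or SK_INVALID. */
static uint32_t sk_walk(uint32_t root, unsigned int hi, unsigned int lo)
{
    const sk_pte_t *t;
    sk_pte_t e;

    if ( root >= SK_FRAMES || hi >= SK_ENTRIES || lo >= SK_ENTRIES )
        return SK_INVALID;

    t = sk_map(root);
    e = t[hi];                   /* copy the entry before unmapping */
    sk_unmap(t);
    if ( !(e.flags & SK_PRESENT) || e.mfn >= SK_FRAMES )
        return SK_INVALID;

    t = sk_map(e.mfn);
    e = t[lo];
    sk_unmap(t);
    if ( !(e.flags & SK_PRESENT) )
        return SK_INVALID;

    return e.mfn;
}
```

Because each level keeps only a by-value copy, at most one mapping is live at any point in the walk, which is what lets the converted functions return early without a matching unmap on those paths.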
Jan Beulich
2013-Jan-22 10:55 UTC
[PATCH 08/11] x86: properly use map_domain_page() in nested HVM code
This eliminates a couple of incorrect/inconsistent uses of map_domain_page() from VT-x code. Note that this does _not_ add error handling where none was present before, even though I think NULL returns from any of the mapping operations touched here need to properly be handled. I just don''t know this code well enough to figure out what the right action in each case would be. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -1966,7 +1966,8 @@ int hvm_virtual_to_linear_addr( /* On non-NULL return, we leave this function holding an additional * ref on the underlying mfn, if any */ -static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable) +static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable, + bool_t permanent) { void *map; p2m_type_t p2mt; @@ -1991,28 +1992,41 @@ static void *__hvm_map_guest_frame(unsig if ( writable ) paging_mark_dirty(d, page_to_mfn(page)); - map = __map_domain_page(page); + if ( !permanent ) + return __map_domain_page(page); + + map = __map_domain_page_global(page); + if ( !map ) + put_page(page); + return map; } -void *hvm_map_guest_frame_rw(unsigned long gfn) +void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent) { - return __hvm_map_guest_frame(gfn, 1); + return __hvm_map_guest_frame(gfn, 1, permanent); } -void *hvm_map_guest_frame_ro(unsigned long gfn) +void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent) { - return __hvm_map_guest_frame(gfn, 0); + return __hvm_map_guest_frame(gfn, 0, permanent); } -void hvm_unmap_guest_frame(void *p) +void hvm_unmap_guest_frame(void *p, bool_t permanent) { - if ( p ) - { - unsigned long mfn = domain_page_map_to_mfn(p); + unsigned long mfn; + + if ( !p ) + return; + + mfn = domain_page_map_to_mfn(p); + + if ( !permanent ) unmap_domain_page(p); - put_page(mfn_to_page(mfn)); - } + else + unmap_domain_page_global(p); + + put_page(mfn_to_page(mfn)); } static void *hvm_map_entry(unsigned long va) @@ 
-2038,7 +2052,7 @@ static void *hvm_map_entry(unsigned long if ( (pfec == PFEC_page_paged) || (pfec == PFEC_page_shared) ) goto fail; - v = hvm_map_guest_frame_rw(gfn); + v = hvm_map_guest_frame_rw(gfn, 0); if ( v == NULL ) goto fail; @@ -2051,7 +2065,7 @@ static void *hvm_map_entry(unsigned long static void hvm_unmap_entry(void *p) { - hvm_unmap_guest_frame(p); + hvm_unmap_guest_frame(p, 0); } static int hvm_load_segment_selector( --- a/xen/arch/x86/hvm/nestedhvm.c +++ b/xen/arch/x86/hvm/nestedhvm.c @@ -53,8 +53,7 @@ nestedhvm_vcpu_reset(struct vcpu *v) nv->nv_ioport80 = 0; nv->nv_ioportED = 0; - if (nv->nv_vvmcx) - hvm_unmap_guest_frame(nv->nv_vvmcx); + hvm_unmap_guest_frame(nv->nv_vvmcx, 1); nv->nv_vvmcx = NULL; nv->nv_vvmcxaddr = VMCX_EADDR; nv->nv_flushp2m = 0; --- a/xen/arch/x86/hvm/svm/nestedsvm.c +++ b/xen/arch/x86/hvm/svm/nestedsvm.c @@ -69,15 +69,14 @@ int nestedsvm_vmcb_map(struct vcpu *v, u struct nestedvcpu *nv = &vcpu_nestedhvm(v); if (nv->nv_vvmcx != NULL && nv->nv_vvmcxaddr != vmcbaddr) { - ASSERT(nv->nv_vvmcx != NULL); ASSERT(nv->nv_vvmcxaddr != VMCX_EADDR); - hvm_unmap_guest_frame(nv->nv_vvmcx); + hvm_unmap_guest_frame(nv->nv_vvmcx, 1); nv->nv_vvmcx = NULL; nv->nv_vvmcxaddr = VMCX_EADDR; } if (nv->nv_vvmcx == NULL) { - nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT); + nv->nv_vvmcx = hvm_map_guest_frame_rw(vmcbaddr >> PAGE_SHIFT, 1); if (nv->nv_vvmcx == NULL) return 0; nv->nv_vvmcxaddr = vmcbaddr; @@ -141,6 +140,8 @@ void nsvm_vcpu_destroy(struct vcpu *v) get_order_from_bytes(MSRPM_SIZE)); svm->ns_merged_msrpm = NULL; } + hvm_unmap_guest_frame(nv->nv_vvmcx, 1); + nv->nv_vvmcx = NULL; if (nv->nv_n2vmcx) { free_vmcb(nv->nv_n2vmcx); nv->nv_n2vmcx = NULL; @@ -358,11 +359,11 @@ static int nsvm_vmrun_permissionmap(stru svm->ns_oiomap_pa = svm->ns_iomap_pa; svm->ns_iomap_pa = ns_vmcb->_iopm_base_pa; - ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT); + ns_viomap = hvm_map_guest_frame_ro(svm->ns_iomap_pa >> PAGE_SHIFT, 0); 
ASSERT(ns_viomap != NULL); ioport_80 = test_bit(0x80, ns_viomap); ioport_ed = test_bit(0xed, ns_viomap); - hvm_unmap_guest_frame(ns_viomap); + hvm_unmap_guest_frame(ns_viomap, 0); svm->ns_iomap = nestedhvm_vcpu_iomap_get(ioport_80, ioport_ed); @@ -888,7 +889,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t break; } - io_bitmap = hvm_map_guest_frame_ro(gfn); + io_bitmap = hvm_map_guest_frame_ro(gfn, 0); if (io_bitmap == NULL) { gdprintk(XENLOG_ERR, "IOIO intercept: mapping of permission map failed\n"); @@ -896,7 +897,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t } enabled = test_bit(port, io_bitmap); - hvm_unmap_guest_frame(io_bitmap); + hvm_unmap_guest_frame(io_bitmap, 0); if (!enabled) return NESTEDHVM_VMEXIT_HOST; --- a/xen/arch/x86/hvm/vmx/vvmx.c +++ b/xen/arch/x86/hvm/vmx/vvmx.c @@ -569,18 +569,20 @@ void nvmx_update_exception_bitmap(struct static void nvmx_update_apic_access_address(struct vcpu *v) { struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v); - u64 apic_gpfn, apic_mfn; u32 ctrl; - void *apic_va; ctrl = __n2_secondary_exec_control(v); if ( ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES ) { + p2m_type_t p2mt; + unsigned long apic_gpfn; + struct page_info *apic_pg; + apic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, APIC_ACCESS_ADDR) >> PAGE_SHIFT; - apic_va = hvm_map_guest_frame_ro(apic_gpfn); - apic_mfn = virt_to_mfn(apic_va); - __vmwrite(APIC_ACCESS_ADDR, (apic_mfn << PAGE_SHIFT)); - hvm_unmap_guest_frame(apic_va); + apic_pg = get_page_from_gfn(v->domain, apic_gpfn, &p2mt, P2M_ALLOC); + ASSERT(apic_pg && !p2m_is_paging(p2mt)); + __vmwrite(APIC_ACCESS_ADDR, page_to_maddr(apic_pg)); + put_page(apic_pg); } else __vmwrite(APIC_ACCESS_ADDR, 0); @@ -589,18 +591,20 @@ static void nvmx_update_apic_access_addr static void nvmx_update_virtual_apic_address(struct vcpu *v) { struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v); - u64 vapic_gpfn, vapic_mfn; u32 ctrl; - void *vapic_va; ctrl = __n2_exec_control(v); if ( ctrl & CPU_BASED_TPR_SHADOW ) { + p2m_type_t p2mt; + unsigned long 
vapic_gpfn; + struct page_info *vapic_pg; + vapic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, VIRTUAL_APIC_PAGE_ADDR) >> PAGE_SHIFT; - vapic_va = hvm_map_guest_frame_ro(vapic_gpfn); - vapic_mfn = virt_to_mfn(vapic_va); - __vmwrite(VIRTUAL_APIC_PAGE_ADDR, (vapic_mfn << PAGE_SHIFT)); - hvm_unmap_guest_frame(vapic_va); + vapic_pg = get_page_from_gfn(v->domain, vapic_gpfn, &p2mt, P2M_ALLOC); + ASSERT(vapic_pg && !p2m_is_paging(p2mt)); + __vmwrite(VIRTUAL_APIC_PAGE_ADDR, page_to_maddr(vapic_pg)); + put_page(vapic_pg); } else __vmwrite(VIRTUAL_APIC_PAGE_ADDR, 0); @@ -641,9 +645,9 @@ static void __map_msr_bitmap(struct vcpu unsigned long gpa; if ( nvmx->msrbitmap ) - hvm_unmap_guest_frame (nvmx->msrbitmap); + hvm_unmap_guest_frame(nvmx->msrbitmap, 1); gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, MSR_BITMAP); - nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT); + nvmx->msrbitmap = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1); } static void __map_io_bitmap(struct vcpu *v, u64 vmcs_reg) @@ -654,9 +658,9 @@ static void __map_io_bitmap(struct vcpu index = vmcs_reg == IO_BITMAP_A ? 
0 : 1; if (nvmx->iobitmap[index]) - hvm_unmap_guest_frame (nvmx->iobitmap[index]); + hvm_unmap_guest_frame(nvmx->iobitmap[index], 1); gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, vmcs_reg); - nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT); + nvmx->iobitmap[index] = hvm_map_guest_frame_ro(gpa >> PAGE_SHIFT, 1); } static inline void map_io_bitmap_all(struct vcpu *v) @@ -673,17 +677,17 @@ static void nvmx_purge_vvmcs(struct vcpu __clear_current_vvmcs(v); if ( nvcpu->nv_vvmcxaddr != VMCX_EADDR ) - hvm_unmap_guest_frame(nvcpu->nv_vvmcx); + hvm_unmap_guest_frame(nvcpu->nv_vvmcx, 1); nvcpu->nv_vvmcx = NULL; nvcpu->nv_vvmcxaddr = VMCX_EADDR; for (i=0; i<2; i++) { if ( nvmx->iobitmap[i] ) { - hvm_unmap_guest_frame(nvmx->iobitmap[i]); + hvm_unmap_guest_frame(nvmx->iobitmap[i], 1); nvmx->iobitmap[i] = NULL; } } if ( nvmx->msrbitmap ) { - hvm_unmap_guest_frame(nvmx->msrbitmap); + hvm_unmap_guest_frame(nvmx->msrbitmap, 1); nvmx->msrbitmap = NULL; } } @@ -1289,7 +1293,7 @@ int nvmx_handle_vmptrld(struct cpu_user_ if ( nvcpu->nv_vvmcxaddr == VMCX_EADDR ) { - nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT); + nvcpu->nv_vvmcx = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 1); nvcpu->nv_vvmcxaddr = gpa; map_io_bitmap_all (v); __map_msr_bitmap(v); @@ -1350,10 +1354,10 @@ int nvmx_handle_vmclear(struct cpu_user_ else { /* Even if this VMCS isn''t the current one, we must clear it. 
*/ - vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT); + vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT, 0); if ( vvmcs ) __set_vvmcs(vvmcs, NVMX_LAUNCH_STATE, 0); - hvm_unmap_guest_frame(vvmcs); + hvm_unmap_guest_frame(vvmcs, 0); } vmreturn(regs, VMSUCCEED); --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -423,9 +423,9 @@ int hvm_virtual_to_linear_addr( unsigned int addr_size, unsigned long *linear_addr); -void *hvm_map_guest_frame_rw(unsigned long gfn); -void *hvm_map_guest_frame_ro(unsigned long gfn); -void hvm_unmap_guest_frame(void *p); +void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent); +void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent); +void hvm_unmap_guest_frame(void *p, bool_t permanent); static inline void hvm_set_info_guest(struct vcpu *v) { _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
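For illustration only (this is not code from the patch): the `permanent` flag passed when mapping has to be repeated when unmapping, because it selects between unmap_domain_page() and unmap_domain_page_global(). The sketch below models that convention with mock bookkeeping. `struct mock_frame`, `mock_map()` and `mock_unmap()` are hypothetical stand-ins, not Xen functions; they only mirror the shape of __hvm_map_guest_frame() and hvm_unmap_guest_frame() as shown in the diff above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one guest frame; not a Xen structure. */
struct mock_frame {
    int refs;         /* page references held on behalf of mappers */
    int temp_maps;    /* live map_domain_page()-style mappings */
    int global_maps;  /* live map_domain_page_global()-style mappings */
};

/* Shape of __hvm_map_guest_frame(): take a reference, then map. */
static void *mock_map(struct mock_frame *f, bool permanent)
{
    f->refs++;
    if ( permanent )
        f->global_maps++;
    else
        f->temp_maps++;
    return f;
}

/* Shape of hvm_unmap_guest_frame(): NULL is a no-op, flag picks the unmap. */
static void mock_unmap(void *p, bool permanent)
{
    struct mock_frame *f = p;

    if ( !f )
        return;
    if ( permanent )
        f->global_maps--;
    else
        f->temp_maps--;
    f->refs--;
}
```

In this model, unmapping a permanent mapping with `permanent` set to false would leave `global_maps` at 1 and drive `temp_maps` negative, which is the kind of imbalance the matching-flag convention is meant to avoid.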
Jan Beulich
2013-Jan-22 10:56 UTC
[PATCH 09/11] x86: properly use map_domain_page() in miscellaneous places
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -150,7 +150,7 @@ long arch_do_domctl(
                 ret = -ENOMEM;
                 break;
             }
-            arr = page_to_virt(page);
+            arr = __map_domain_page(page);
 
             for ( n = ret = 0; n < num; )
             {
@@ -220,7 +220,9 @@ long arch_do_domctl(
                 n += k;
             }
 
-            free_domheap_page(virt_to_page(arr));
+            page = mfn_to_page(domain_page_map_to_mfn(arr));
+            unmap_domain_page(arr);
+            free_domheap_page(page);
 
             break;
         }
@@ -1347,8 +1349,11 @@ void arch_get_info_guest(struct vcpu *v,
         }
         else
         {
-            l4_pgentry_t *l4e = __va(pagetable_get_paddr(v->arch.guest_table));
+            const l4_pgentry_t *l4e =
+                map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+
             c.cmp->ctrlreg[3] = compat_pfn_to_cr3(l4e_get_pfn(*l4e));
+            unmap_domain_page(l4e);
 
             /* Merge shadow DR7 bits into real DR7. */
             c.cmp->debugreg[7] |= c.cmp->debugreg[5];
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2538,14 +2538,18 @@ int new_guest_cr3(unsigned long mfn)
 
     if ( is_pv_32on64_domain(d) )
     {
+        unsigned long gt_mfn = pagetable_get_pfn(curr->arch.guest_table);
+        l4_pgentry_t *pl4e = map_domain_page(gt_mfn);
+
         okay = paging_mode_refcounts(d)
             ? 0 /* Old code was broken, but what should it be? */
             : mod_l4_entry(
-                    __va(pagetable_get_paddr(curr->arch.guest_table)),
+                    pl4e,
                     l4e_from_pfn(
                         mfn,
                         (_PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED)),
-                    pagetable_get_pfn(curr->arch.guest_table), 0, 0, curr) == 0;
+                    gt_mfn, 0, 0, curr) == 0;
+        unmap_domain_page(pl4e);
         if ( unlikely(!okay) )
         {
             MEM_LOG("Error while installing new compat baseptr %lx", mfn);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -3543,6 +3543,9 @@ int shadow_track_dirty_vram(struct domai
     }
     else
     {
+        unsigned long map_mfn = INVALID_MFN;
+        void *map_sl1p = NULL;
+
         /* Iterate over VRAM to track dirty bits. */
         for ( i = 0; i < nr; i++ ) {
             mfn_t mfn = get_gfn_query_unlocked(d, begin_pfn + i, &t);
@@ -3576,7 +3579,17 @@ int shadow_track_dirty_vram(struct domai
                     {
                         /* Hopefully the most common case: only one mapping,
                          * whose dirty bit we can use. */
-                        l1_pgentry_t *sl1e = maddr_to_virt(sl1ma);
+                        l1_pgentry_t *sl1e;
+                        unsigned long sl1mfn = paddr_to_pfn(sl1ma);
+
+                        if ( sl1mfn != map_mfn )
+                        {
+                            if ( map_sl1p )
+                                sh_unmap_domain_page(map_sl1p);
+                            map_sl1p = sh_map_domain_page(_mfn(sl1mfn));
+                            map_mfn = sl1mfn;
+                        }
+                        sl1e = map_sl1p + (sl1ma & ~PAGE_MASK);
 
                         if ( l1e_get_flags(*sl1e) & _PAGE_DIRTY )
                         {
@@ -3603,6 +3616,9 @@ int shadow_track_dirty_vram(struct domai
             }
         }
 
+        if ( map_sl1p )
+            sh_unmap_domain_page(map_sl1p);
+
         rc = -EFAULT;
         if ( copy_to_guest(dirty_bitmap, dirty_vram->dirty_bitmap, dirty_size) == 0 ) {
             memset(dirty_vram->dirty_bitmap, 0, dirty_size);
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2255,7 +2255,11 @@ static int emulate_privileged_op(struct
                 }
                 else
                 {
-                    mfn = l4e_get_pfn(*(l4_pgentry_t *)__va(pagetable_get_paddr(v->arch.guest_table)));
+                    l4_pgentry_t *pl4e =
+                        map_domain_page(pagetable_get_pfn(v->arch.guest_table));
+
+                    mfn = l4e_get_pfn(*pl4e);
+                    unmap_domain_page(pl4e);
                     *reg = compat_pfn_to_cr3(mfn_to_gmfn(
                         v->domain, mfn));
                 }
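For illustration only (not part of the patch): the shadow_track_dirty_vram() change keeps the most recently mapped shadow L1 page around and remaps only when the next machine address falls in a different frame. A minimal stand-alone sketch of that pattern follows. `mock_map()`, `mock_unmap()`, `fake_ram` and `sum_bytes()` are hypothetical names invented here in place of sh_map_domain_page() and sh_unmap_domain_page(), and the four-page `fake_ram` array is just a test fixture.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define INVALID_MFN (~0UL)

/* Hypothetical stand-ins for the map/unmap primitives; they only count. */
static unsigned int nr_maps, nr_unmaps;
static unsigned char fake_ram[4][PAGE_SIZE];

static void *mock_map(unsigned long mfn)
{
    ++nr_maps;
    return fake_ram[mfn];
}

static void mock_unmap(void *p)
{
    (void)p;
    ++nr_unmaps;
}

/* Read one byte at each machine address, remapping only on a frame change. */
static unsigned int sum_bytes(const uint64_t *maddrs, size_t n)
{
    unsigned long map_mfn = INVALID_MFN;
    unsigned char *map = NULL;
    unsigned int sum = 0;
    size_t i;

    for ( i = 0; i < n; i++ )
    {
        unsigned long mfn = (unsigned long)(maddrs[i] >> PAGE_SHIFT);

        if ( mfn != map_mfn )
        {
            if ( map )
                mock_unmap(map);
            map = mock_map(mfn);
            map_mfn = mfn;
        }
        /* Offset within the frame, like (sl1ma & ~PAGE_MASK) in the patch. */
        sum += map[maddrs[i] & (PAGE_SIZE - 1)];
    }

    /* Drop the last cached mapping, as the patch does after its loop. */
    if ( map )
        mock_unmap(map);

    return sum;
}
```

Consecutive addresses in the same frame share one mapping, so the number of map calls depends on how often the frame changes rather than on the number of entries visited.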
Jan Beulich
2013-Jan-22 10:57 UTC
[PATCH 10/11] tmem: partial adjustments for x86 16Tb support
Despite the changes below, tmem still has code assuming to be able to
directly access all memory, or mapping arbitrary amounts of not
directly accessible memory. I cannot see how to fix this without
converting _all_ its domheap allocations to xenheap ones. And even then
I wouldn't be certain about there not being other cases where the "all
memory is always mapped" assumption would be broken. Therefore, tmem
gets disabled by the next patch for the time being if the full 1:1
mapping isn't always visible.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/tmem_xen.c
+++ b/xen/common/tmem_xen.c
@@ -393,7 +393,8 @@ static void tmh_persistent_pool_page_put
     struct page_info *pi;
 
     ASSERT(IS_PAGE_ALIGNED(page_va));
-    pi = virt_to_page(page_va);
+    pi = mfn_to_page(domain_page_map_to_mfn(page_va));
+    unmap_domain_page(page_va);
     ASSERT(IS_VALID_PAGE(pi));
     _tmh_free_page_thispool(pi);
 }
@@ -441,39 +442,28 @@ static int cpu_callback(
     {
     case CPU_UP_PREPARE: {
         if ( per_cpu(dstmem, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_pages(0, dstmem_order, 0);
-            per_cpu(dstmem, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(dstmem, cpu) = alloc_xenheap_pages(dstmem_order, 0);
         if ( per_cpu(workmem, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_pages(0, workmem_order, 0);
-            per_cpu(workmem, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(workmem, cpu) = alloc_xenheap_pages(workmem_order, 0);
         if ( per_cpu(scratch_page, cpu) == NULL )
-        {
-            struct page_info *p = alloc_domheap_page(NULL, 0);
-            per_cpu(scratch_page, cpu) = p ? page_to_virt(p) : NULL;
-        }
+            per_cpu(scratch_page, cpu) = alloc_xenheap_page();
         break;
     }
     case CPU_DEAD:
     case CPU_UP_CANCELED: {
         if ( per_cpu(dstmem, cpu) != NULL )
         {
-            struct page_info *p = virt_to_page(per_cpu(dstmem, cpu));
-            free_domheap_pages(p, dstmem_order);
+            free_xenheap_pages(per_cpu(dstmem, cpu), dstmem_order);
             per_cpu(dstmem, cpu) = NULL;
         }
         if ( per_cpu(workmem, cpu) != NULL )
         {
-            struct page_info *p = virt_to_page(per_cpu(workmem, cpu));
-            free_domheap_pages(p, workmem_order);
+            free_xenheap_pages(per_cpu(workmem, cpu), workmem_order);
             per_cpu(workmem, cpu) = NULL;
         }
         if ( per_cpu(scratch_page, cpu) != NULL )
         {
-            free_domheap_page(virt_to_page(per_cpu(scratch_page, cpu)));
+            free_xenheap_page(per_cpu(scratch_page, cpu));
             per_cpu(scratch_page, cpu) = NULL;
         }
         break;
This mainly involves adjusting the number of L4 entries needing copying between page tables (which is now different between PV and HVM/idle domains), and changing the cutoff point and method when more than the supported amount of memory is found in a system. Since TMEM doesn''t currently cope with the full 1:1 map not always being visible, it gets forcefully disabled in that case. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/efi/boot.c +++ b/xen/arch/x86/efi/boot.c @@ -1591,7 +1591,7 @@ void __init efi_init_memory(void) /* Insert Xen mappings. */ for ( i = l4_table_offset(HYPERVISOR_VIRT_START); - i < l4_table_offset(HYPERVISOR_VIRT_END); ++i ) + i < l4_table_offset(DIRECTMAP_VIRT_END); ++i ) efi_l4_pgtable[i] = idle_pg_table[i]; #endif } --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -1320,7 +1320,7 @@ void init_guest_l4_table(l4_pgentry_t l4 /* Xen private mappings. */ memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT], &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT], - ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t)); + ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t)); l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR); l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -25,6 +25,7 @@ #include <xen/dmi.h> #include <xen/pfn.h> #include <xen/nodemask.h> +#include <xen/tmem_xen.h> /* for opt_tmem only */ #include <public/version.h> #include <compat/platform.h> #include <compat/xen.h> @@ -381,6 +382,11 @@ static void __init setup_max_pdx(void) if ( max_pdx > FRAMETABLE_NR ) max_pdx = FRAMETABLE_NR; +#ifdef PAGE_LIST_NULL + if ( max_pdx >= PAGE_LIST_NULL ) + max_pdx = PAGE_LIST_NULL - 1; +#endif + max_page = pdx_to_pfn(max_pdx - 1) + 1; } @@ -1031,9 +1037,23 @@ void __init __start_xen(unsigned long mb /* Create new mappings /before/ passing memory to the allocator. 
*/ if ( map_e < e ) { - map_pages_to_xen((unsigned long)__va(map_e), map_e >> PAGE_SHIFT, - (e - map_e) >> PAGE_SHIFT, PAGE_HYPERVISOR); - init_boot_pages(map_e, e); + uint64_t limit = __pa(HYPERVISOR_VIRT_END - 1) + 1; + uint64_t end = min(e, limit); + + if ( map_e < end ) + { + map_pages_to_xen((unsigned long)__va(map_e), PFN_DOWN(map_e), + PFN_DOWN(end - map_e), PAGE_HYPERVISOR); + init_boot_pages(map_e, end); + map_e = end; + } + } + if ( map_e < e ) + { + /* This range must not be passed to the boot allocator and + * must also not be mapped with _PAGE_GLOBAL. */ + map_pages_to_xen((unsigned long)__va(map_e), PFN_DOWN(map_e), + PFN_DOWN(e - map_e), __PAGE_HYPERVISOR); } if ( s < map_s ) { @@ -1104,6 +1124,34 @@ void __init __start_xen(unsigned long mb end_boot_allocator(); system_state = SYS_STATE_boot; + if ( max_page - 1 > virt_to_mfn(HYPERVISOR_VIRT_END - 1) ) + { + unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1); + uint64_t mask = PAGE_SIZE - 1; + + xenheap_max_mfn(limit); + + /* Pass the remaining memory to the allocator. 
*/ + for ( i = 0; i < boot_e820.nr_map; i++ ) + { + uint64_t s, e; + + s = (boot_e820.map[i].addr + mask) & ~mask; + e = (boot_e820.map[i].addr + boot_e820.map[i].size) & ~mask; + if ( PFN_DOWN(e) <= limit ) + continue; + if ( PFN_DOWN(s) <= limit ) + s = pfn_to_paddr(limit + 1); + init_domheap_pages(s, e); + } + + if ( opt_tmem ) + { + printk(XENLOG_WARNING "Forcing TMEM off\n"); + opt_tmem = 0; + } + } + vm_init(); vesa_init(); --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -1471,10 +1471,23 @@ int memory_add(unsigned long spfn, unsig return -EINVAL; } - ret = map_pages_to_xen((unsigned long)mfn_to_virt(spfn), spfn, - epfn - spfn, PAGE_HYPERVISOR); - if ( ret ) - return ret; + i = virt_to_mfn(HYPERVISOR_VIRT_END - 1) + 1; + if ( spfn < i ) + { + ret = map_pages_to_xen((unsigned long)mfn_to_virt(spfn), spfn, + min(epfn, i) - spfn, PAGE_HYPERVISOR); + if ( ret ) + return ret; + } + if ( i < epfn ) + { + if ( i < spfn ) + i = spfn; + ret = map_pages_to_xen((unsigned long)mfn_to_virt(i), i, + epfn - i, __PAGE_HYPERVISOR); + if ( ret ) + return ret; + } old_node_start = NODE_DATA(node)->node_start_pfn; old_node_span = NODE_DATA(node)->node_spanned_pages; --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -255,6 +255,9 @@ static unsigned long init_node_heap(int unsigned long needed = (sizeof(**_heap) + sizeof(**avail) * NR_ZONES + PAGE_SIZE - 1) >> PAGE_SHIFT; +#ifdef DIRECTMAP_VIRT_END + unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END); +#endif int i, j; if ( !first_node_initialised ) @@ -266,14 +269,14 @@ static unsigned long init_node_heap(int } #ifdef DIRECTMAP_VIRT_END else if ( *use_tail && nr >= needed && - (mfn + nr) <= (virt_to_mfn(DIRECTMAP_VIRT_END - 1) + 1) ) + (mfn + nr) <= (virt_to_mfn(eva - 1) + 1) ) { _heap[node] = mfn_to_virt(mfn + nr - needed); avail[node] = mfn_to_virt(mfn + nr - 1) + PAGE_SIZE - sizeof(**avail) * NR_ZONES; } else if ( nr >= needed && - (mfn + needed) <= (virt_to_mfn(DIRECTMAP_VIRT_END - 
1) + 1) ) + (mfn + needed) <= (virt_to_mfn(eva - 1) + 1) ) { _heap[node] = mfn_to_virt(mfn); avail[node] = mfn_to_virt(mfn + needed - 1) + @@ -1205,6 +1208,13 @@ void free_xenheap_pages(void *v, unsigne #else +static unsigned int __read_mostly xenheap_bits; + +void __init xenheap_max_mfn(unsigned long mfn) +{ + xenheap_bits = fls(mfn) + PAGE_SHIFT - 1; +} + void init_xenheap_pages(paddr_t ps, paddr_t pe) { init_domheap_pages(ps, pe); @@ -1217,6 +1227,11 @@ void *alloc_xenheap_pages(unsigned int o ASSERT(!in_irq()); + if ( xenheap_bits && (memflags >> _MEMF_bits) > xenheap_bits ) + memflags &= ~MEMF_bits(~0); + if ( !(memflags >> _MEMF_bits) ) + memflags |= MEMF_bits(xenheap_bits); + pg = alloc_domheap_pages(NULL, order, memflags); if ( unlikely(pg == NULL) ) return NULL; --- a/xen/include/asm-x86/config.h +++ b/xen/include/asm-x86/config.h @@ -163,8 +163,12 @@ extern unsigned char boot_edid_info[128] * Page-frame information array. * 0xffff830000000000 - 0xffff87ffffffffff [5TB, 5*2^40 bytes, PML4:262-271] * 1:1 direct mapping of all physical memory. - * 0xffff880000000000 - 0xffffffffffffffff [120TB, PML4:272-511] - * Guest-defined use. + * 0xffff880000000000 - 0xffffffffffffffff [120TB, PML4:272-511] + * PV: Guest-defined use. + * 0xffff880000000000 - 0xffffff7fffffffff [119.5TB, PML4:272-510] + * HVM/idle: continuation of 1:1 mapping + * 0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes PML4:511] + * HVM/idle: unused * * Compatibility guest area layout: * 0x0000000000000000 - 0x00000000f57fffff [3928MB, PML4:0] @@ -183,6 +187,8 @@ extern unsigned char boot_edid_info[128] #define ROOT_PAGETABLE_FIRST_XEN_SLOT 256 #define ROOT_PAGETABLE_LAST_XEN_SLOT 271 #define ROOT_PAGETABLE_XEN_SLOTS \ + (L4_PAGETABLE_ENTRIES - ROOT_PAGETABLE_FIRST_XEN_SLOT - 1) +#define ROOT_PAGETABLE_PV_XEN_SLOTS \ (ROOT_PAGETABLE_LAST_XEN_SLOT - ROOT_PAGETABLE_FIRST_XEN_SLOT + 1) /* Hypervisor reserves PML4 slots 256 to 271 inclusive. 
*/ @@ -241,9 +247,9 @@ extern unsigned char boot_edid_info[128] #define FRAMETABLE_SIZE GB(128) #define FRAMETABLE_NR (FRAMETABLE_SIZE / sizeof(*frame_table)) #define FRAMETABLE_VIRT_START (FRAMETABLE_VIRT_END - FRAMETABLE_SIZE) -/* Slot 262-271: A direct 1:1 mapping of all of physical memory. */ +/* Slot 262-271/510: A direct 1:1 mapping of all of physical memory. */ #define DIRECTMAP_VIRT_START (PML4_ADDR(262)) -#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES*10) +#define DIRECTMAP_SIZE (PML4_ENTRY_BYTES * (511 - 262)) #define DIRECTMAP_VIRT_END (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE) #ifndef __ASSEMBLY__ --- a/xen/include/xen/mm.h +++ b/xen/include/xen/mm.h @@ -43,6 +43,7 @@ void end_boot_allocator(void); /* Xen suballocator. These functions are interrupt-safe. */ void init_xenheap_pages(paddr_t ps, paddr_t pe); +void xenheap_max_mfn(unsigned long mfn); void *alloc_xenheap_pages(unsigned int order, unsigned int memflags); void free_xenheap_pages(void *v, unsigned int order); #define alloc_xenheap_page() (alloc_xenheap_pages(0,0)) @@ -111,7 +112,7 @@ struct page_list_head /* These must only have instances in struct page_info. */ # define page_list_entry -#define PAGE_LIST_NULL (~0) +# define PAGE_LIST_NULL ((typeof(((struct page_info){}).list.next))~0) # if !defined(pdx_to_page) && !defined(page_to_pdx) # if defined(__page_to_mfn) || defined(__mfn_to_page) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
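For illustration only (not part of the patch): the sizes in the config.h layout comment and the value computed by xenheap_max_mfn() can be checked with a few lines of arithmetic. The sketch below makes two assumptions that the patch text itself does not state: that sizeof(struct page_info) is 32 bytes on x86-64, and that no PDX compression is in effect, so the last frame below HYPERVISOR_VIRT_END is simply the last frame of a 5TB direct map. All helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT        12
#define PML4_ENTRY_BYTES  (1ULL << 39)   /* 512GB covered by one L4 slot */

/* Assumption: sizeof(struct page_info) is 32 bytes on x86-64. */
#define PAGE_INFO_BYTES   32ULL

/* Bytes spanned by PML4 slots first..last, inclusive. */
static uint64_t slots_bytes(unsigned int first, unsigned int last)
{
    return (uint64_t)(last - first + 1) * PML4_ENTRY_BYTES;
}

/* Amount of RAM a frame table of the given size can describe. */
static uint64_t frametable_covers(uint64_t frametable_bytes)
{
    return (frametable_bytes / PAGE_INFO_BYTES) << PAGE_SHIFT;
}

/* 1-based index of the highest set bit, 0 for x == 0 (fls() semantics). */
static unsigned int fls64_sketch(uint64_t x)
{
    unsigned int r = 0;

    for ( ; x; x >>= 1 )
        ++r;
    return r;
}

/* Same expression as in xenheap_max_mfn() above. */
static unsigned int xenheap_bits_for(uint64_t mfn)
{
    return fls64_sketch(mfn) + PAGE_SHIFT - 1;
}
```

Under these assumptions, slots 262-271 span 5TB, slots 262-510 span 124.5TB, and a 128GB frame table describes 16TB of RAM, which is where the series' limit comes from. For the last frame of a 5TB direct map, xenheap_bits_for() yields 42, so xenheap allocations would be confined below 4TB, comfortably inside the range that is always mapped.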
Jan Beulich
2013-Jan-22 10:58 UTC
[PATCH 12/11] x86: debugging code for testing 16Tb support on smaller memory systems
DO NOT APPLY AS IS. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- a/xen/arch/x86/domain_page.c +++ b/xen/arch/x86/domain_page.c @@ -66,8 +66,10 @@ void *map_domain_page(unsigned long mfn) struct mapcache_vcpu *vcache; struct vcpu_maphash_entry *hashent; +#ifdef NDEBUG if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) ) return mfn_to_virt(mfn); +#endif v = mapcache_current_vcpu(); if ( !v || is_hvm_vcpu(v) ) @@ -139,6 +141,14 @@ void *map_domain_page(unsigned long mfn) if ( ++i == MAPHASH_ENTRIES ) i = 0; } while ( i != MAPHASH_HASHFN(mfn) ); +if(idx >= dcache->entries) {//temp + mapcache_domain_dump(v->domain); + for(i = 0; i < ARRAY_SIZE(vcache->hash); ++i) + if(hashent->idx != MAPHASHENT_NOTINUSE) { + hashent = &vcache->hash[i]; + printk("vc[%u]: ref=%u idx=%04x mfn=%08lx\n", i, hashent->refcnt, hashent->idx, hashent->mfn); + } +} } BUG_ON(idx >= dcache->entries); @@ -249,8 +259,10 @@ int mapcache_domain_init(struct domain * if ( is_hvm_domain(d) || is_idle_domain(d) ) return 0; +#ifdef NDEBUG if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) ) return 0; +#endif dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1); d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf); @@ -418,8 +430,10 @@ void *map_domain_page_global(unsigned lo ASSERT(!in_irq() && local_irq_is_enabled()); +#ifdef NDEBUG if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) ) return mfn_to_virt(mfn); +#endif spin_lock(&globalmap_lock); @@ -497,3 +511,26 @@ unsigned long domain_page_map_to_mfn(con return l1e_get_pfn(*pl1e); } + +void mapcache_domain_dump(struct domain *d) {//temp + unsigned i, n = 0; + const struct mapcache_domain *dcache = &d->arch.pv_domain.mapcache; + const struct vcpu *v; + if(is_hvm_domain(d) || is_idle_domain(d)) + return; + for_each_vcpu(d, v) { + const struct mapcache_vcpu *vcache = &v->arch.pv_vcpu.mapcache; + for(i = 0; i < ARRAY_SIZE(vcache->hash); ++i) + n += (vcache->hash[i].idx != MAPHASHENT_NOTINUSE); 
+ } + printk("Dom%d mc (#=%u v=%u) [%p]:\n", d->domain_id, n, d->max_vcpus, __builtin_return_address(0)); + for(i = 0; i < BITS_TO_LONGS(dcache->entries); ++i) + printk("dcu[%02x]: %016lx\n", i, dcache->inuse[i]); + for(i = 0; i < BITS_TO_LONGS(dcache->entries); ++i) + printk("dcg[%02x]: %016lx\n", i, dcache->garbage[i]); + for(i = 0; i < dcache->entries; ++i) { + l1_pgentry_t l1e = DCACHE_L1ENT(dcache, i); + if((test_bit(i, dcache->inuse) && !test_bit(i, dcache->garbage)) || (l1e_get_flags(l1e) & _PAGE_PRESENT)) + printk("dc[%04x]: %"PRIpte"\n", i, l1e_get_intpte(l1e)); + } +} --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -250,6 +250,14 @@ void __init init_frametable(void) init_spagetable(); } +#ifndef NDEBUG +static unsigned int __read_mostly root_pgt_pv_xen_slots + = ROOT_PAGETABLE_PV_XEN_SLOTS; +static l4_pgentry_t __read_mostly split_l4e; +#else +#define root_pgt_pv_xen_slots ROOT_PAGETABLE_PV_XEN_SLOTS +#endif + void __init arch_init_memory(void) { unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn; @@ -344,6 +352,41 @@ void __init arch_init_memory(void) efi_init_memory(); mem_sharing_init(); + +#ifndef NDEBUG + if ( split_gb ) + { + paddr_t split_pa = split_gb * GB(1); + unsigned long split_va = (unsigned long)__va(split_pa); + + if ( split_va < HYPERVISOR_VIRT_END && + split_va - 1 == (unsigned long)__va(split_pa - 1) ) + { + root_pgt_pv_xen_slots = l4_table_offset(split_va) - + ROOT_PAGETABLE_FIRST_XEN_SLOT; + ASSERT(root_pgt_pv_xen_slots < ROOT_PAGETABLE_PV_XEN_SLOTS); + if ( l4_table_offset(split_va) == l4_table_offset(split_va - 1) ) + { + l3_pgentry_t *l3tab = alloc_xen_pagetable(); + + if ( l3tab ) + { + const l3_pgentry_t *l3idle + l4e_to_l3e(idle_pg_table[l4_table_offset(split_va)]); + + for ( i = 0; i < l3_table_offset(split_va); ++i ) + l3tab[i] = l3idle[i]; + for ( ; i <= L3_PAGETABLE_ENTRIES; ++i ) + l3tab[i] = l3e_empty(); + split_l4e = l4e_from_pfn(virt_to_mfn(l3tab), + __PAGE_HYPERVISOR); + } + else + 
++root_pgt_pv_xen_slots; + } + } + } +#endif } int page_is_ram_type(unsigned long mfn, unsigned long mem_type) @@ -1320,7 +1363,12 @@ void init_guest_l4_table(l4_pgentry_t l4 /* Xen private mappings. */ memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT], &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT], - ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t)); + root_pgt_pv_xen_slots * sizeof(l4_pgentry_t)); +#ifndef NDEBUG + if ( l4e_get_intpte(split_l4e) ) + l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT + root_pgt_pv_xen_slots] + split_l4e; +#endif l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR); l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu s8 __read_mostly xen_cpuidle = -1; boolean_param("cpuidle", xen_cpuidle); +#ifndef NDEBUG +unsigned int __initdata split_gb; +integer_param("split-gb", split_gb); +#endif + cpumask_t __read_mostly cpu_present_map; unsigned long __read_mostly xen_phys_start; @@ -789,6 +794,11 @@ void __init __start_xen(unsigned long mb modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end); bootstrap_map(NULL); +#ifndef split_gb /* Don''t allow split below 4Gb. */ + if ( split_gb < 4 ) + split_gb = 0; +#endif + for ( i = boot_e820.nr_map-1; i >= 0; i-- ) { uint64_t s, e, mask = (1UL << L2_PAGETABLE_SHIFT) - 1; @@ -917,6 +927,9 @@ void __init __start_xen(unsigned long mb /* Don''t overlap with other modules. 
*/ end = consider_modules(s, e, size, mod, mbi->mods_count, j); + if ( split_gb && end > split_gb * GB(1) ) + continue; + if ( s < end && (headroom || ((end - size) >> PAGE_SHIFT) > mod[j].mod_start) ) @@ -958,6 +971,8 @@ void __init __start_xen(unsigned long mb kexec_reserve_area(&boot_e820); setup_max_pdx(); + if ( split_gb ) + xenheap_max_mfn(split_gb << (30 - PAGE_SHIFT)); /* * Walk every RAM region and map it in its entirety (on x86/64, at least) @@ -1129,7 +1144,8 @@ void __init __start_xen(unsigned long mb unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1); uint64_t mask = PAGE_SIZE - 1; - xenheap_max_mfn(limit); + if ( !split_gb ) + xenheap_max_mfn(limit); /* Pass the remaining memory to the allocator. */ for ( i = 0; i < boot_e820.nr_map; i++ ) --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -45,6 +45,7 @@ #include <asm/flushtlb.h> #ifdef CONFIG_X86 #include <asm/p2m.h> +#include <asm/setup.h> /* for split_gb only */ #else #define p2m_pod_offline_or_broken_hit(pg) 0 #define p2m_pod_offline_or_broken_replace(pg) BUG_ON(pg != NULL) @@ -203,6 +204,25 @@ unsigned long __init alloc_boot_pages( pg = (r->e - nr_pfns) & ~(pfn_align - 1); if ( pg < r->s ) continue; + +#if defined(CONFIG_X86) && !defined(NDEBUG) + /* + * Filtering pfn_align == 1 since the only allocations using a bigger + * alignment are the ones used for setting up the frame table chunks. + * Those allocations get remapped anyway, i.e. them not having 1:1 + * mappings always accessible is not a problem. 
+ */ + if ( split_gb && pfn_align == 1 && + r->e > (split_gb << (30 - PAGE_SHIFT)) ) + { + pg = r->s; + if ( pg + nr_pfns > (split_gb << (30 - PAGE_SHIFT)) ) + continue; + r->s = pg + nr_pfns; + return pg; + } +#endif + _e = r->e; r->e = pg; bootmem_region_add(pg + nr_pfns, _e); --- a/xen/include/asm-x86/domain.h +++ b/xen/include/asm-x86/domain.h @@ -72,6 +72,7 @@ struct mapcache_domain { int mapcache_domain_init(struct domain *); void mapcache_domain_exit(struct domain *); +void mapcache_domain_dump(struct domain *);//temp int mapcache_vcpu_init(struct vcpu *); void mapcache_override_current(struct vcpu *); --- a/xen/include/asm-x86/setup.h +++ b/xen/include/asm-x86/setup.h @@ -43,4 +43,10 @@ void microcode_grab_module( extern uint8_t kbd_shift_flags; +#ifdef NDEBUG +# define split_gb 0 +#else +extern unsigned int split_gb; +#endif + #endif _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
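For illustration only (not part of the patch): the debugging option expresses the split point in gigabytes and converts it to a frame number with `split_gb << (30 - PAGE_SHIFT)`. The sketch below shows that conversion together with the "no split below 4Gb" rule from setup.c. `split_gb_to_mfn()` is a hypothetical helper, and the widening cast is a choice made here; the patch shifts the unsigned int directly.

```c
#define PAGE_SHIFT 12

/* Hypothetical helper: first MFN at or above a "split-gb=<n>" split point. */
static unsigned long split_gb_to_mfn(unsigned int split_gb)
{
    /* The patch ignores split points below 4GB (split_gb is reset to 0). */
    if ( split_gb < 4 )
        return 0;

    /* 1GB is 2^30 bytes, i.e. 2^(30 - PAGE_SHIFT) frames of 4KB each. */
    return (unsigned long)split_gb << (30 - PAGE_SHIFT);
}
```

For example, a split point of 8GB corresponds to MFN 0x200000, and a request for 3GB is treated as "no split".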
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
>
> Since TMEM doesn't currently cope with the full 1:1 map not always
> being visible, it gets forcefully disabled in that case.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I agree this is the correct short-term (and maybe mid-term)
answer. Anyone who can afford to fill their machine with
more than 5TiB of RAM is likely not very interested in
memory overcommit technologies :-) at least for the next
year or three. Cloud providers may be an exception but
I'd imagine most of those are buying small- to mid-range
machines to optimize cost/performance, rather than
behemoths that expand to 5TiB+ which are highly performant
but often not cost-effective.

Longer term, zcache in Linux (which is a tmem-based technology)
successfully uses kmap/kunmap to run on 32-bit Linux OS's
so I'd imagine a similar technique could be used in Xen?

In any case, thanks Jan for remembering to handle tmem.

One nit below...

Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

> +        if ( opt_tmem )
> +        {
> +            printk(XENLOG_WARNING "Forcing TMEM off\n");
> +            opt_tmem = 0;
> +        }
> +    }

Maybe a bit more descriptive? I.e. "TMEM physical RAM limit
exceeded, disabling TMEM".
>>> On 22.01.13 at 16:20, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Subject: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
>>
>> Since TMEM doesn't currently cope with the full 1:1 map not always
>> being visible, it gets forcefully disabled in that case.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> I agree this is the correct short-term (and maybe mid-term)
> answer. Anyone who can afford to fill their machine with
> more than 5TiB of RAM is likely not very interested in
> memory overcommit technologies :-) at least for the next
> year or three. Cloud providers may be an exception but
> I'd imagine most of those are buying small- to mid-range
> machines to optimize cost/performance, rather than
> behemoths that expand to 5TiB+ which are highly performant
> but often not cost-effective.
>
> Longer term, zcache in Linux (which is a tmem-based technology)
> successfully uses kmap/kunmap to run on 32-bit Linux OS's
> so I'd imagine a similar technique could be used in Xen?
>
> In any case, thanks Jan for remembering to handle tmem.
>
> One nit below...
>
> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>

Hmm, an ack on this patch is sort of unexpected from you; I
would have hoped you would ack patch 10...

>> +        if ( opt_tmem )
>> +        {
>> +            printk(XENLOG_WARNING "Forcing TMEM off\n");
>> +            opt_tmem = 0;
>> +        }
>> +    }
>
> Maybe a bit more descriptive? I.e. "TMEM physical RAM limit
> exceeded, disabling TMEM".

Fine with me, patch updated.

Jan
Dan Magenheimer
2013-Jan-22 17:55 UTC
Re: [PATCH 10/11] tmem: partial adjustments for x86 16Tb support
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 22, 2013 8:32 AM
> To: xen-devel; Dan Magenheimer
> Cc: Konrad Wilk
> Subject: RE: [Xen-devel] [PATCH 11/11] x86: support up to 16Tb
>
> > Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
>
> Hmm, an ack on this patch is sort of unexpected from you; I
> would have hoped you would ack patch 10...

Heh. I was intrigued by the new domain_page_map_to_mfn()
and wanted to look deeper before acking patch 10. So...

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: [PATCH 10/11] tmem: partial adjustments for x86 16Tb support
>
> Despite the changes below, tmem still has code assuming to be able to
> directly access all memory, or mapping arbitrary amounts of not
> directly accessible memory. I cannot see how to fix this without
> converting _all_ its domheap allocations to xenheap ones. And even then
> I wouldn't be certain about there not being other cases where the "all
> memory is always mapped" assumption would be broken. Therefore, tmem
> gets disabled by the next patch for the time being if the full 1:1
> mapping isn't always visible.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

IIUC, all the metadata will need to be allocated from the
xenheap and all "wholepage" accesses will need some kind of
wrapper. This will get messier with compression/deduplication,
but I'm thinking it will still be doable... sometime in
the future if/when users want/need memory overcommit on
huge RAM systems.

In any case...

Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
On 22/01/2013 10:45, "Jan Beulich" <JBeulich@suse.com> wrote:

> This series enables Xen to support up to 16Tb.
>
> 01: x86: introduce virt_to_xen_l1e()
> 02: x86: extend frame table virtual space
> 03: x86: re-introduce map_domain_page() et al
> 04: x86: properly use map_domain_page() when building Dom0
> 05: x86: consolidate initialization of PV guest L4 page tables
> 06: x86: properly use map_domain_page() during domain creation/destruction
> 07: x86: properly use map_domain_page() during page table manipulation
> 08: x86: properly use map_domain_page() in nested HVM code
> 09: x86: properly use map_domain_page() in miscellaneous places
> 10: tmem: partial adjustments for x86 16Tb support
> 11: x86: support up to 16Tb

I will take a look at these tomorrow.

 -- Keir

> As I don't have a 16Tb system around, I used the following
> debugging patch to simulate the most critical aspect the changes
> above would have on a system with this much memory: Not all of
> the 1:1 mapping being accessible when in PV guest context. To do
> so, a command line option to pull the split point down is being
> added. The patch is being provided in the raw form I used it, but
> has pieces properly formatted and not marked "//temp" which I
> would think might be worth considering to add. The other pieces
> are likely less worthwhile, but if others think differently I could
> certainly also put them into "normal" shape.
>
> 12: x86: debugging code for testing 16Tb support on smaller memory systems
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
On 22/01/2013 10:45, "Jan Beulich" <JBeulich@suse.com> wrote:

> This series enables Xen to support up to 16Tb.
>
> 01: x86: introduce virt_to_xen_l1e()
> 02: x86: extend frame table virtual space
> 03: x86: re-introduce map_domain_page() et al
> 04: x86: properly use map_domain_page() when building Dom0
> 05: x86: consolidate initialization of PV guest L4 page tables
> 06: x86: properly use map_domain_page() during domain creation/destruction
> 07: x86: properly use map_domain_page() during page table manipulation
> 08: x86: properly use map_domain_page() in nested HVM code
> 09: x86: properly use map_domain_page() in miscellaneous places
> 10: tmem: partial adjustments for x86 16Tb support
> 11: x86: support up to 16Tb

Acked-by: Keir Fraser <keir@xen.org>

There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
needed?

> As I don't have a 16Tb system around, I used the following
> debugging patch to simulate the most critical aspect the changes
> above would have on a system with this much memory: Not all of
> the 1:1 mapping being accessible when in PV guest context. To do
> so, a command line option to pull the split point down is being
> added. The patch is being provided in the raw form I used it, but
> has pieces properly formatted and not marked "//temp" which I
> would think might be worth considering to add. The other pieces
> are likely less worthwhile, but if others think differently I could
> certainly also put them into "normal" shape.
>
> 12: x86: debugging code for testing 16Tb support on smaller memory systems

Make split-gb a size_param, and rename to something more meaningful like
highmem_start.

 -- Keir

> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> On 23.01.13 at 10:33, Keir Fraser <keir@xen.org> wrote:
> There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
> needed?

That's there so that once we go beyond 16Tb the code won't need
to change. In particular, because of the implied growth of struct
page_info, I'm envisioning such support to become optional (to be
enabled at build time).

Jan
On 23/01/2013 09:56, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 23.01.13 at 10:33, Keir Fraser <keir@xen.org> wrote:
>> There's an 'ifdef PAGE_LIST_NULL' in patch 11 in x86/setup.c. Is that really
>> needed?
>
> That's there so that once we go beyond 16Tb the code won't need
> to change. In particular, because of the implied growth of struct
> page_info, I'm envisioning such support to become optional (to be
> enabled at build time).

Defer the ifdef until it's needed, then when it's added it's in the sensible
place (i.e. where PAGE_LIST_NULL really does become build-time optional).

 -- Keir

> Jan
Jan Beulich
2013-Jan-23 14:26 UTC
[PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Removed unwanted bits and switched to byte-granular "highmem-start"
    option.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -546,6 +546,12 @@ Paging (HAP).
 ### hvm\_port80
 > `= <boolean>`
 
+### highmem-start
+> `= <size>`
+
+Specify the memory boundary past which memory will be treated as highmem (x86
+debug hypervisor only).
+
 ### idle\_latency\_factor
 > `= <integer>`
 
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -66,8 +66,10 @@ void *map_domain_page(unsigned long mfn)
     struct mapcache_vcpu *vcache;
     struct vcpu_maphash_entry *hashent;
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     v = mapcache_current_vcpu();
     if ( !v || is_hvm_vcpu(v) )
@@ -249,8 +251,10 @@ int mapcache_domain_init(struct domain *
     if ( is_hvm_domain(d) || is_idle_domain(d) )
         return 0;
 
+#ifdef NDEBUG
     if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return 0;
+#endif
 
     dcache->l1tab = xzalloc_array(l1_pgentry_t *, MAPCACHE_L2_ENTRIES + 1);
     d->arch.perdomain_l2_pg[MAPCACHE_SLOT] = alloc_domheap_page(NULL, memf);
@@ -418,8 +422,10 @@ void *map_domain_page_global(unsigned lo
 
     ASSERT(!in_irq() && local_irq_is_enabled());
 
+#ifdef NDEBUG
     if ( mfn <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
         return mfn_to_virt(mfn);
+#endif
 
     spin_lock(&globalmap_lock);
 
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -250,6 +250,14 @@ void __init init_frametable(void)
         init_spagetable();
 }
 
+#ifndef NDEBUG
+static unsigned int __read_mostly root_pgt_pv_xen_slots
+    = ROOT_PAGETABLE_PV_XEN_SLOTS;
+static l4_pgentry_t __read_mostly split_l4e;
+#else
+#define root_pgt_pv_xen_slots ROOT_PAGETABLE_PV_XEN_SLOTS
+#endif
+
 void __init arch_init_memory(void)
 {
     unsigned long i, pfn, rstart_pfn, rend_pfn, iostart_pfn, ioend_pfn;
@@ -344,6 +352,40 @@ void __init arch_init_memory(void)
     efi_init_memory();
 
     mem_sharing_init();
+
+#ifndef NDEBUG
+    if ( highmem_start )
+    {
+        unsigned long split_va = (unsigned long)__va(highmem_start);
+
+        if ( split_va < HYPERVISOR_VIRT_END &&
+             split_va - 1 == (unsigned long)__va(highmem_start - 1) )
+        {
+            root_pgt_pv_xen_slots = l4_table_offset(split_va) -
+                                    ROOT_PAGETABLE_FIRST_XEN_SLOT;
+            ASSERT(root_pgt_pv_xen_slots < ROOT_PAGETABLE_PV_XEN_SLOTS);
+            if ( l4_table_offset(split_va) == l4_table_offset(split_va - 1) )
+            {
+                l3_pgentry_t *l3tab = alloc_xen_pagetable();
+
+                if ( l3tab )
+                {
+                    const l3_pgentry_t *l3idle =
+                        l4e_to_l3e(idle_pg_table[l4_table_offset(split_va)]);
+
+                    for ( i = 0; i < l3_table_offset(split_va); ++i )
+                        l3tab[i] = l3idle[i];
+                    for ( ; i < L3_PAGETABLE_ENTRIES; ++i )
+                        l3tab[i] = l3e_empty();
+                    split_l4e = l4e_from_pfn(virt_to_mfn(l3tab),
+                                             __PAGE_HYPERVISOR);
+                }
+                else
+                    ++root_pgt_pv_xen_slots;
+            }
+        }
+    }
+#endif
 }
 
 int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
@@ -1320,7 +1362,12 @@ void init_guest_l4_table(l4_pgentry_t l4
     /* Xen private mappings. */
     memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
-           ROOT_PAGETABLE_PV_XEN_SLOTS * sizeof(l4_pgentry_t));
+           root_pgt_pv_xen_slots * sizeof(l4_pgentry_t));
+#ifndef NDEBUG
+    if ( l4e_get_intpte(split_l4e) )
+        l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT + root_pgt_pv_xen_slots] =
+            split_l4e;
+#endif
     l4tab[l4_table_offset(LINEAR_PT_VIRT_START)] =
         l4e_from_pfn(domain_page_map_to_mfn(l4tab), __PAGE_HYPERVISOR);
     l4tab[l4_table_offset(PERDOMAIN_VIRT_START)] =
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
 s8 __read_mostly xen_cpuidle = -1;
 boolean_param("cpuidle", xen_cpuidle);
 
+#ifndef NDEBUG
+unsigned long __initdata highmem_start;
+size_param("highmem-start", highmem_start);
+#endif
+
 cpumask_t __read_mostly cpu_present_map;
 
 unsigned long __read_mostly xen_phys_start;
@@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
     modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
     bootstrap_map(NULL);
 
+#ifndef highmem_start
+    /* Don't allow split below 4Gb. */
+    if ( highmem_start < GB(4) )
+        highmem_start = 0;
+    else /* align to L3 entry boundary */
+        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
+#endif
+
     for ( i = boot_e820.nr_map-1; i >= 0; i-- )
     {
         uint64_t s, e, mask = (1UL << L2_PAGETABLE_SHIFT) - 1;
@@ -915,6 +928,9 @@ void __init __start_xen(unsigned long mb
             /* Don't overlap with other modules. */
             end = consider_modules(s, e, size, mod, mbi->mods_count, j);
 
+            if ( highmem_start && end > highmem_start )
+                continue;
+
             if ( s < end &&
                  (headroom ||
                   ((end - size) >> PAGE_SHIFT) > mod[j].mod_start) )
@@ -956,6 +972,8 @@ void __init __start_xen(unsigned long mb
         kexec_reserve_area(&boot_e820);
 
     setup_max_pdx();
+    if ( highmem_start )
+        xenheap_max_mfn(PFN_DOWN(highmem_start));
 
     /*
      * Walk every RAM region and map it in its entirety (on x86/64, at least)
@@ -1127,7 +1145,8 @@ void __init __start_xen(unsigned long mb
         unsigned long limit = virt_to_mfn(HYPERVISOR_VIRT_END - 1);
         uint64_t mask = PAGE_SIZE - 1;
 
-        xenheap_max_mfn(limit);
+        if ( !highmem_start )
+            xenheap_max_mfn(limit);
 
         /* Pass the remaining memory to the allocator. */
         for ( i = 0; i < boot_e820.nr_map; i++ )
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -45,6 +45,7 @@
 #include <asm/flushtlb.h>
 #ifdef CONFIG_X86
 #include <asm/p2m.h>
+#include <asm/setup.h> /* for highmem_start only */
 #else
 #define p2m_pod_offline_or_broken_hit(pg) 0
 #define p2m_pod_offline_or_broken_replace(pg) BUG_ON(pg != NULL)
@@ -203,6 +204,25 @@ unsigned long __init alloc_boot_pages(
         pg = (r->e - nr_pfns) & ~(pfn_align - 1);
         if ( pg < r->s )
             continue;
+
+#if defined(CONFIG_X86) && !defined(NDEBUG)
+        /*
+         * Filtering pfn_align == 1 since the only allocations using a bigger
+         * alignment are the ones used for setting up the frame table chunks.
+         * Those allocations get remapped anyway, i.e. them not having 1:1
+         * mappings always accessible is not a problem.
+         */
+        if ( highmem_start && pfn_align == 1 &&
+             r->e > PFN_DOWN(highmem_start) )
+        {
+            pg = r->s;
+            if ( pg + nr_pfns > PFN_DOWN(highmem_start) )
+                continue;
+            r->s = pg + nr_pfns;
+            return pg;
+        }
+#endif
+
         _e = r->e;
         r->e = pg;
         bootmem_region_add(pg + nr_pfns, _e);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -43,4 +43,10 @@ void microcode_grab_module(
 
 extern uint8_t kbd_shift_flags;
 
+#ifdef NDEBUG
+# define highmem_start 0
+#else
+extern unsigned long highmem_start;
+#endif
+
 #endif
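The page table part of the patch is the least obvious one, so here is a small stand-alone sketch of it. It assumes x86-64 4-level paging (9 index bits per level, one L3 entry covering 1 GiB) and models page table entries as plain 64-bit integers with 0 meaning "empty"; build_split_l3() and split_needs_partial_l3() are hypothetical names, not functions from the patch. When the split address falls inside an L4 slot rather than on its boundary, a private L3 table is needed that keeps the entries below the split and leaves the rest empty.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64 4-level paging constants. */
#define PAGETABLE_ORDER      9
#define L3_PAGETABLE_SHIFT   30
#define L4_PAGETABLE_SHIFT   39
#define L3_PAGETABLE_ENTRIES (1u << PAGETABLE_ORDER)
#define L4_PAGETABLE_ENTRIES (1u << PAGETABLE_ORDER)

static unsigned int l3_table_offset(uint64_t va)
{
    return (unsigned int)((va >> L3_PAGETABLE_SHIFT) &
                          (L3_PAGETABLE_ENTRIES - 1));
}

static unsigned int l4_table_offset(uint64_t va)
{
    return (unsigned int)((va >> L4_PAGETABLE_SHIFT) &
                          (L4_PAGETABLE_ENTRIES - 1));
}

/* Non-zero if the split lies inside an L4 slot, not on its boundary. */
static int split_needs_partial_l3(uint64_t split_va)
{
    return l4_table_offset(split_va) == l4_table_offset(split_va - 1);
}

/*
 * Fill l3tab (L3_PAGETABLE_ENTRIES entries) from l3idle: entries mapping
 * addresses below split_va are copied, all others are left empty.
 */
static void build_split_l3(uint64_t *l3tab, const uint64_t *l3idle,
                           uint64_t split_va)
{
    unsigned int i;

    for ( i = 0; i < l3_table_offset(split_va); ++i )
        l3tab[i] = l3idle[i];
    for ( ; i < L3_PAGETABLE_ENTRIES; ++i )
        l3tab[i] = 0;
}
```

For example, a split 8 GiB into an L4 slot copies L3 entries 0 to 7 and clears entries 8 to 511, while a split exactly 512 GiB into the direct map lands on an L4 boundary and needs no partial table.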
Keir Fraser
2013-Jan-23 15:18 UTC
Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
On 23/01/2013 14:26, "Jan Beulich" <JBeulich@suse.com> wrote:

> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Keir Fraser <keir@xen.org>
Tim Deegan
2013-Jan-24 11:36 UTC
Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
>  s8 __read_mostly xen_cpuidle = -1;
>  boolean_param("cpuidle", xen_cpuidle);
>
> +#ifndef NDEBUG
> +unsigned long __initdata highmem_start;
> +size_param("highmem-start", highmem_start);
> +#endif
> +
>  cpumask_t __read_mostly cpu_present_map;
>
>  unsigned long __read_mostly xen_phys_start;
> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
>      bootstrap_map(NULL);
>
> +#ifndef highmem_start
> +    /* Don't allow split below 4Gb. */
> +    if ( highmem_start < GB(4) )
> +        highmem_start = 0;
> +    else /* align to L3 entry boundary */
> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
> +#endif

DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
macro is strictly correct but it seems more vulnerable to later changes,
esp. since this:

> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -43,4 +43,10 @@ void microcode_grab_module(
>
>  extern uint8_t kbd_shift_flags;
>
> +#ifdef NDEBUG
> +# define highmem_start 0
> +#else
> +extern unsigned long highmem_start;
> +#endif

happens so far away.

Tim.
Jan Beulich
2013-Jan-24 12:23 UTC
Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
>>> On 24.01.13 at 12:36, Tim Deegan <tim@xen.org> wrote:
> At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
>>  s8 __read_mostly xen_cpuidle = -1;
>>  boolean_param("cpuidle", xen_cpuidle);
>>
>> +#ifndef NDEBUG
>> +unsigned long __initdata highmem_start;
>> +size_param("highmem-start", highmem_start);
>> +#endif
>> +
>>  cpumask_t __read_mostly cpu_present_map;
>>
>>  unsigned long __read_mostly xen_phys_start;
>> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
>>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
>>      bootstrap_map(NULL);
>>
>> +#ifndef highmem_start
>> +    /* Don't allow split below 4Gb. */
>> +    if ( highmem_start < GB(4) )
>> +        highmem_start = 0;
>> +    else /* align to L3 entry boundary */
>> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
>> +#endif
>
> DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
> macro is strictly correct

I intended it to be that way, because there could be other uses
for having the symbol #define-d/real.

> but it seems more vulnerable to later changes,
> esp. since this:
>
>> --- a/xen/include/asm-x86/setup.h
>> +++ b/xen/include/asm-x86/setup.h
>> @@ -43,4 +43,10 @@ void microcode_grab_module(
>>
>>  extern uint8_t kbd_shift_flags;
>>
>> +#ifdef NDEBUG
>> +# define highmem_start 0
>> +#else
>> +extern unsigned long highmem_start;
>> +#endif
>
> happens so far away.

I realize that, but these getting out of sync is no problem the
way it is coded now. The distance of the two would really be
more of a problem imo if the condition here got changed (which
would then require to also change it up there).

Jan
Tim Deegan
2013-Jan-24 12:36 UTC
Re: [PATCH v2] x86: debugging code for testing 16Tb support on smaller memory systems
At 12:23 +0000 on 24 Jan (1359030221), Jan Beulich wrote:
> >>> On 24.01.13 at 12:36, Tim Deegan <tim@xen.org> wrote:
> > At 14:26 +0000 on 23 Jan (1358951188), Jan Beulich wrote:
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -82,6 +82,11 @@ boolean_param("noapic", skip_ioapic_setu
> >>  s8 __read_mostly xen_cpuidle = -1;
> >>  boolean_param("cpuidle", xen_cpuidle);
> >>
> >> +#ifndef NDEBUG
> >> +unsigned long __initdata highmem_start;
> >> +size_param("highmem-start", highmem_start);
> >> +#endif
> >> +
> >>  cpumask_t __read_mostly cpu_present_map;
> >>
> >>  unsigned long __read_mostly xen_phys_start;
> >> @@ -787,6 +792,14 @@ void __init __start_xen(unsigned long mb
> >>      modules_headroom = bzimage_headroom(bootstrap_map(mod), mod->mod_end);
> >>      bootstrap_map(NULL);
> >>
> >> +#ifndef highmem_start
> >> +    /* Don't allow split below 4Gb. */
> >> +    if ( highmem_start < GB(4) )
> >> +        highmem_start = 0;
> >> +    else /* align to L3 entry boundary */
> >> +        highmem_start &= ~((1UL << L3_PAGETABLE_SHIFT) - 1);
> >> +#endif
> >
> > DYM #ifndef NDEBUG ?  I can see that checking for highmem_start being a
> > macro is strictly correct
>
> I intended it to be that way, because there could be other uses
> for having the symbol #define-d/real.

Yes - but if it ever ends up being a #define _and_ user-settable, these
checks will silently disappear.  Since there's no indication in the
places where you might make it a #define that doing so will remove these
checks, I'd be inclined to leave it gated on NDEBUG so it'll fail in an
obvious way.

Or add a #define CONFIG_HIGHMEM_START (default to == !NDEBUG), and gate
everything on that?

Tim.
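To make the last suggestion concrete, here is a rough sketch of what gating on an explicit switch could look like. Only the name CONFIG_HIGHMEM_START and its default of !NDEBUG come from the message above; the apply_highmem_start() helper, the GB() definition and the use of unsigned long long are assumptions for illustration and do not reflect any posted patch. The point is that the validation is keyed on the config symbol, so it cannot vanish merely because highmem_start happens to be a macro.

```c
#include <assert.h>

/* Hypothetical build-time switch, defaulting to "debug builds only". */
#ifndef CONFIG_HIGHMEM_START
# ifdef NDEBUG
#  define CONFIG_HIGHMEM_START 0
# else
#  define CONFIG_HIGHMEM_START 1
# endif
#endif

#define GB(n)              ((unsigned long long)(n) << 30)
#define L3_PAGETABLE_SHIFT 30 /* assumption: one L3 entry maps 1 GiB */

#if CONFIG_HIGHMEM_START
static unsigned long long highmem_start; /* would come from the command line */
#else
# define highmem_start 0ULL
#endif

/*
 * Validate a requested split and return the effective value: 0 if the
 * feature is compiled out or the request is below 4 GiB, otherwise the
 * request rounded down to an L3 entry boundary.
 */
static unsigned long long apply_highmem_start(unsigned long long requested)
{
#if CONFIG_HIGHMEM_START
    if ( requested < GB(4) )
        requested = 0;
    else
        requested &= ~((1ULL << L3_PAGETABLE_SHIFT) - 1);
    assert((requested & ((1ULL << L3_PAGETABLE_SHIFT) - 1)) == 0);
    highmem_start = requested;
#else
    (void)requested;
#endif
    return highmem_start;
}
```

Every other user of the feature would then test CONFIG_HIGHMEM_START as well, which keeps the declaration in the header and the checks in setup.c tied to the same condition.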