Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH] Boot PV guests with more than 128GB (v2) for 3.7
Changelog:
Since v1: [http://lists.xen.org/archives/html/xen-devel/2012-07/msg01561.html]
 - added more comments, and #ifdefs
 - squashed the L4 and the L3 and L2 recycle patches together
 - Added Acked-by's

The explanation of these patches is exactly what v1 had:

The details of this problem are nicely explained in:

 [PATCH 4/6] xen/p2m: Add logic to revector a P2M tree to use __va
 [PATCH 5/6] xen/mmu: Copy and revector the P2M tree.
 [PATCH 6/6] xen/mmu: Remove from __ka space PMD entries for

and the supporting patches are just nice optimizations. Pasting in
what those patches mentioned:

During bootup Xen supplies us with a P2M array. It sticks it right after
the ramdisk, as can be seen with a 128GB PV guest (certain parts removed
for clarity):

xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and virtual (using __START_KERNEL_map addresses)
layout looks as so:

   phys                             __ka
 /------------\                   /-------------------\
 | 0          | empty             | 0xffffffff80000000|
 | ..         |                   | ..                |
 | 16MB       | <= kernel starts  | 0xffffffff81000000|
 | ..         |                   |                   |
 | 30MB       | <= kernel ends => | 0xffffffff81e43000|
 | ..         |  & ramdisk starts | ..                |
 | 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
 | ..         |  & P2M starts     | ..                |
 | ..         |                   | ..                |
 | 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
 | ..         | start_info        | 0xffffffffa25c7000|
 | ..         | xenstore          | 0xffffffffa25c8000|
 | ..         | console           | 0xffffffffa25c9000|
 | 549MB      | <= page tables => | 0xffffffffa25ca000|
 | ..         |                   |                   |
 | 550MB      | <= PGT end     => | 0xffffffffa26e1000|
 | ..         | boot stack        |                   |
 \------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables are taking a bit of __ka
address space. Which is a problem, since MODULES_VADDR starts at
0xffffffffa0000000 - and the P2M sits right in there! This results during
bootup in the inability to load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and cleanup_highmap
rids __ka of the ramdisk mapping, what we want to do is similar - get rid
of the P2M in the __ka address space. There are two ways of fixing this:

 1) All P2M lookups would use the __va address instead of the __ka one.
    This means we can safely erase from __ka space the PMD pointers that
    point to the PFNs for the P2M array and be OK.

 2) Allocate a new array, copy the existing P2M into it, revector the P2M
    tree to use that, and return the old P2M to the memory allocator. This
    has the advantage that it sets the stage for using the
    XEN_ELF_NOTE_INIT_P2M feature. That feature allows us to set the exact
    virtual address space we want for the P2M - and allows us to boot as
    initial domain on large machines.

So we pick option 2). This patch only lays the groundwork in the P2M code.
The patch that modifies the MMU is called "xen/mmu: Copy and revector the
P2M tree."

--
xen/mmu: Copy and revector the P2M tree:

The 'xen_revector_p2m_tree()' function allocates a new P2M tree, copies
the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old P2M tree was
using) is partially disassembled. The cleanup_highmap has removed the PMD
entries from 0-16MB and anything past _brk_end up to the max_pfn_mapped
(which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well) to
use shiny new __va addresses to new MFNs. The xen_start_info has been
taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so we are free
to roam and delete PMD entries - which is exactly what we are going to do.
We rip out the __ka for the old P2M array.

--
xen/mmu: Remove from __ka space PMD entries for pagetables:

At this stage, the __ka address space (which is what the old P2M tree was
using) is partially disassembled. The cleanup_highmap has removed the PMD
entries from 0-16MB and anything past _brk_end up to the max_pfn_mapped
(which is the end of the ramdisk).

The xen_remove_p2m_tree and the code around it have ripped out the __ka
for the old P2M array. Here we continue on doing it for where the Xen
page-tables were. It is safe to do so, as the page-tables are addressed
using __va. For good measure we delete anything that is within
MODULES_VADDR and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start of the
kernel up to __brk.
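(For orientation, here is a minimal sketch - an illustration, not the
kernel's actual get_phys_to_machine() - of the pfn-to-mfn lookup that walks
the three-level P2M tree. It assumes the p2m_top array and the
p2m_top_index/p2m_mid_index/p2m_index helpers that already exist in
arch/x86/xen/p2m.c. Once the leaf pages it dereferences are reached through
__va addresses, as option 2) above arranges, the __ka alias of the P2M
array is no longer needed and its PMD entries can be torn down.)

    /* Sketch only: three-level pfn -> mfn lookup over the P2M tree.
     * The real lookup also handles missing and identity leafs. */
    static unsigned long p2m_lookup_sketch(unsigned long pfn)
    {
            unsigned topidx = p2m_top_index(pfn);
            unsigned mididx = p2m_mid_index(pfn);
            unsigned idx    = p2m_index(pfn);

            /* Each level is a page-sized array of pointers; the leaf
             * page holds the MFNs themselves. */
            return p2m_top[topidx][mididx][idx];
    }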
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 1/6] xen/mmu: use copy_page instead of memcpy.
After all, this is what it is there for.

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 6ba6100..7247e5a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1754,14 +1754,14 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
-	memcpy(level2_ident_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_ident_pgt, l2);
 	/* Graft it onto L4[511][511] */
-	memcpy(level2_kernel_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_kernel_pgt, l2);
 
 	/* Get [511][510] and graft that in level2_fixmap_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	memcpy(level2_fixmap_pgt, l2, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(level2_fixmap_pgt, l2);
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
@@ -1821,8 +1821,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
 	 */
 	swapper_kernel_pmd =
 		extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
-	memcpy(swapper_kernel_pmd, initial_kernel_pmd,
-	       sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(swapper_kernel_pmd, initial_kernel_pmd);
 	swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
 	set_page_prot(swapper_kernel_pmd, PAGE_KERNEL_RO);
@@ -1851,11 +1850,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 			  512*1024);
 
 	kernel_pmd = m2v(pgd[KERNEL_PGD_BOUNDARY].pgd);
-	memcpy(initial_kernel_pmd, kernel_pmd, sizeof(pmd_t) * PTRS_PER_PMD);
+	copy_page(initial_kernel_pmd, kernel_pmd);
 
 	xen_map_identity_early(initial_kernel_pmd, max_pfn);
 
-	memcpy(initial_page_table, pgd, sizeof(pgd_t) * PTRS_PER_PGD);
+	copy_page(initial_page_table, pgd);
 	initial_page_table[KERNEL_PGD_BOUNDARY] =
 		__pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
 
-- 
1.7.7.6
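(A quick arithmetic note, illustration only and not part of the patch: on
x86-64 a PMD page holds 512 entries of 8 bytes each, so the memcpy length
being replaced was already exactly one page, which is why copy_page() is a
straight drop-in. A compile-time check would be:)

    /* Illustration: 512 * 8 = 4096 = PAGE_SIZE on x86-64. */
    BUILD_BUG_ON(sizeof(pmd_t) * PTRS_PER_PMD != PAGE_SIZE);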
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 2/6] xen/mmu: For 64-bit do not call xen_map_identity_early
Because we do not need it. During startup Xen provides us with all of the
memory mapped that we need to function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7247e5a..a59070b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -84,6 +84,7 @@
  */
 DEFINE_SPINLOCK(xen_reservation_lock);
 
+#ifdef CONFIG_X86_32
 /*
  * Identity map, in addition to plain kernel map.  This needs to be
  * large enough to allocate page table pages to allocate the rest.
@@ -91,7 +92,7 @@ DEFINE_SPINLOCK(xen_reservation_lock);
  */
 #define LEVEL1_IDENT_ENTRIES	(PTRS_PER_PTE * 4)
 static RESERVE_BRK_ARRAY(pte_t, level1_ident_pgt, LEVEL1_IDENT_ENTRIES);
-
+#endif
 #ifdef CONFIG_X86_64
 /* l3 pud for userspace vsyscall mapping */
 static pud_t level3_user_vsyscall[PTRS_PER_PUD] __page_aligned_bss;
@@ -1628,7 +1629,7 @@ static void set_page_prot(void *addr, pgprot_t prot)
 	if (HYPERVISOR_update_va_mapping((unsigned long)addr, pte, 0))
 		BUG();
 }
-
+#ifdef CONFIG_X86_32
 static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 {
 	unsigned pmdidx, pteidx;
@@ -1679,7 +1680,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 
 	set_page_prot(pmd, PAGE_KERNEL_RO);
 }
-
+#endif
 void __init xen_setup_machphys_mapping(void)
 {
 	struct xen_machphys_mapping mapping;
@@ -1765,14 +1766,12 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Note that we don't do anything with level1_fixmap_pgt which
 	 * we don't need. */
 
-	/* Set up identity map */
-	xen_map_identity_early(level2_ident_pgt, max_pfn);
-
 	/* Make pagetable pieces RO */
 	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level3_user_vsyscall, PAGE_KERNEL_RO);
+	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
 
-- 
1.7.7.6
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 3/6] xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
As we are not using them. We end up only using the L1 pagetables and
grafting those to our page-tables.

[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   40 +++++++++++++++++++++++++++++++++-------
 1 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a59070b..de4b8fd 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1708,7 +1708,20 @@ static void convert_pfn_mfn(void *v)
 	for (i = 0; i < PTRS_PER_PTE; i++)
 		pte[i] = xen_make_pte(pte[i].pte);
 }
-
+static __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
+			    unsigned long addr)
+{
+	if (pt_base == PFN_DOWN(__pa(addr))) {
+		set_page_prot((void *)addr, PAGE_KERNEL);
+		clear_page((void *)addr);
+		*pt_base++;
+	}
+	if (pt_end == PFN_DOWN(__pa(addr))) {
+		set_page_prot((void *)addr, PAGE_KERNEL);
+		clear_page((void *)addr);
+		*pt_end--;
+	}
+}
 /*
  * Set up the initial kernel pagetable.
  *
@@ -1724,6 +1737,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
 	pud_t *l3;
 	pmd_t *l2;
+	unsigned long addr[3];
+	unsigned long pt_base, pt_end;
+	unsigned i;
 
 	/* max_pfn_mapped is the last pfn mapped in the initial memory
 	 * mappings. Considering that on Xen after the kernel mappings we
@@ -1731,6 +1747,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	 * set max_pfn_mapped to the last real pfn mapped. */
 	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
 
+	pt_base = PFN_DOWN(__pa(xen_start_info->pt_base));
+	pt_end = PFN_DOWN(__pa(xen_start_info->pt_base + (xen_start_info->nr_pt_frames * PAGE_SIZE)));
+
 	/* Zap identity mapping */
 	init_level4_pgt[0] = __pgd(0);
 
@@ -1749,6 +1768,9 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
 	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
 
+	addr[0] = (unsigned long)pgd;
+	addr[1] = (unsigned long)l3;
+	addr[2] = (unsigned long)l2;
 	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
 	 * Both L4[272][0] and L4[511][511] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
@@ -1782,20 +1804,24 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	/* Unpin Xen-provided one */
 	pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
-	/* Switch over */
-	pgd = init_level4_pgt;
-
 	/*
 	 * At this stage there can be no user pgd, and no page
 	 * structure to attach it to, so make sure we just set kernel
 	 * pgd.
 	 */
 	xen_mc_batch();
-	__xen_write_cr3(true, __pa(pgd));
+	__xen_write_cr3(true, __pa(init_level4_pgt));
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
 
-	memblock_reserve(__pa(xen_start_info->pt_base),
-			 xen_start_info->nr_pt_frames * PAGE_SIZE);
+	/* We can't that easily rip out L3 and L2, as the Xen pagetables are
+	 * set out this way: [L4], [L1], [L2], [L3], [L1], [L1] ...  for
+	 * the initial domain. For guests using the toolstack, they are in:
+	 * [L4], [L3], [L2], [L1], [L1], order .. */
+	for (i = 0; i < ARRAY_SIZE(addr); i++)
+		check_pt_base(&pt_base, &pt_end, addr[i]);
+
+	/* Our (by three pages) smaller Xen pagetable that we are using */
+	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 4/6] xen/p2m: Add logic to revector a P2M tree to use __va leafs.
During bootup Xen supplies us with a P2M array. It sticks it right after
the ramdisk, as can be seen with a 128GB PV guest (certain parts removed
for clarity):

xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and virtual (using __START_KERNEL_map addresses)
layout looks as so:

   phys                             __ka
 /------------\                   /-------------------\
 | 0          | empty             | 0xffffffff80000000|
 | ..         |                   | ..                |
 | 16MB       | <= kernel starts  | 0xffffffff81000000|
 | ..         |                   |                   |
 | 30MB       | <= kernel ends => | 0xffffffff81e43000|
 | ..         |  & ramdisk starts | ..                |
 | 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
 | ..         |  & P2M starts     | ..                |
 | ..         |                   | ..                |
 | 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
 | ..         | start_info        | 0xffffffffa25c7000|
 | ..         | xenstore          | 0xffffffffa25c8000|
 | ..         | console           | 0xffffffffa25c9000|
 | 549MB      | <= page tables => | 0xffffffffa25ca000|
 | ..         |                   |                   |
 | 550MB      | <= PGT end     => | 0xffffffffa26e1000|
 | ..         | boot stack        |                   |
 \------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables are taking a bit of __ka
address space. Which is a problem, since MODULES_VADDR starts at
0xffffffffa0000000 - and the P2M sits right in there! This results during
bootup in the inability to load modules, with this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
 [<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
 [<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
 [<ffffffff81130c4d>] map_vm_area+0x2d/0x50
 [<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c6186>] ? load_module+0x66/0x19c0
 [<ffffffff8105cadc>] module_alloc+0x5c/0x60
 [<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
 [<ffffffff810c70c3>] load_module+0xfa3/0x19c0
 [<ffffffff812491f6>] ? security_file_permission+0x86/0x90
 [<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
 [<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2

Since the __va and __ka are 1:1 up to MODULES_VADDR and cleanup_highmap
rids __ka of the ramdisk mapping, what we want to do is similar - get rid
of the P2M in the __ka address space. There are two ways of fixing this:

 1) All P2M lookups would use the __va address instead of the __ka one.
    This means we can safely erase from __ka space the PMD pointers that
    point to the PFNs for the P2M array and be OK.

 2) Allocate a new array, copy the existing P2M into it, revector the P2M
    tree to use that, and return the old P2M to the memory allocator. This
    has the advantage that it sets the stage for using the
    XEN_ELF_NOTE_INIT_P2M feature. That feature allows us to set the exact
    virtual address space we want for the P2M - and allows us to boot as
    initial domain on large machines.

So we pick option 2). This patch only lays the groundwork in the P2M code.
The patch that modifies the MMU is called "xen/mmu: Copy and revector the
P2M tree."

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/xen/xen-ops.h |    1 +
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 6a2bfa4..bbfd085 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -394,7 +394,77 @@ void __init xen_build_dynamic_phys_to_machine(void)
 	 * Xen provided pagetable). Do it later in xen_reserve_internals.
 	 */
 }
+#ifdef CONFIG_X86_64
+#include <linux/bootmem.h>
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	unsigned long va_start;
+	unsigned long va_end;
+	unsigned long pfn;
+	unsigned long *mfn_list = NULL;
+	unsigned long size;
+
+	va_start = xen_start_info->mfn_list;
+	/*We copy in increments of P2M_PER_PAGE * sizeof(unsigned long),
+	 * so make sure it is rounded up to that */
+	size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+	va_end = va_start + size;
+
+	/* If we were revectored already, don't do it again. */
+	if (va_start <= __START_KERNEL_map && va_start >= __PAGE_OFFSET)
+		return 0;
+
+	mfn_list = alloc_bootmem_align(size, PAGE_SIZE);
+	if (!mfn_list) {
+		pr_warn("Could not allocate space for a new P2M tree!\n");
+		return xen_start_info->mfn_list;
+	}
+	/* Fill it out with INVALID_P2M_ENTRY value */
+	memset(mfn_list, 0xFF, size);
+
+	for (pfn = 0; pfn < ALIGN(MAX_DOMAIN_PAGES, P2M_PER_PAGE); pfn += P2M_PER_PAGE) {
+		unsigned topidx = p2m_top_index(pfn);
+		unsigned mididx;
+		unsigned long *mid_p;
+
+		if (!p2m_top[topidx])
+			continue;
+
+		if (p2m_top[topidx] == p2m_mid_missing)
+			continue;
+
+		mididx = p2m_mid_index(pfn);
+		mid_p = p2m_top[topidx][mididx];
+		if (!mid_p)
+			continue;
+		if ((mid_p == p2m_missing) || (mid_p == p2m_identity))
+			continue;
+
+		if ((unsigned long)mid_p == INVALID_P2M_ENTRY)
+			continue;
+
+		/* The old va. Rebase it on mfn_list */
+		if (mid_p >= (unsigned long *)va_start && mid_p <= (unsigned long *)va_end) {
+			unsigned long *new;
+
+			new = &mfn_list[pfn];
+
+			copy_page(new, mid_p);
+			p2m_top[topidx][mididx] = &mfn_list[pfn];
+			p2m_top_mfn_p[topidx][mididx] = virt_to_mfn(&mfn_list[pfn]);
+		}
+		/* This should be the leafs allocated for identity from _brk. */
+	}
+	return (unsigned long)mfn_list;
+
+}
+#else
+unsigned long __init xen_revector_p2m_tree(void)
+{
+	return 0;
+}
+#endif
 unsigned long get_phys_to_machine(unsigned long pfn)
 {
 	unsigned topidx, mididx, idx;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 2230f57..bb5a810 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -45,6 +45,7 @@ void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
 void __init xen_build_dynamic_phys_to_machine(void);
+unsigned long __init xen_revector_p2m_tree(void);
 
 void xen_init_irq_ops(void);
 void xen_setup_timer(int cpu);
-- 
1.7.7.6
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 5/6] xen/mmu: Copy and revector the P2M tree.
Please first read the description in the "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.

The 'xen_revector_p2m_tree()' function allocates a new P2M tree, copies
the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old P2M tree was
using) is partially disassembled. The cleanup_highmap has removed the PMD
entries from 0-16MB and anything past _brk_end up to the max_pfn_mapped
(which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well) to
use shiny new __va addresses to new MFNs. The xen_start_info has been
taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so we are free
to roam and delete PMD entries - which is exactly what we are going to do.
We rip out the __ka for the old P2M array.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index de4b8fd..9358b75 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1183,9 +1183,64 @@ static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
 
 static void xen_post_allocator_init(void);
 
+#ifdef CONFIG_X86_64
+void __init xen_cleanhighmap(unsigned long vaddr, unsigned long vaddr_end)
+{
+	unsigned long kernel_end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
+	pmd_t *pmd = level2_kernel_pgt + pmd_index(vaddr);
+
+	/* NOTE: The loop is more greedy than the cleanup_highmap variant.
+	 * We include the PMD passed in on _both_ boundaries. */
+	for (; vaddr <= vaddr_end && (pmd < (level2_kernel_pgt + PAGE_SIZE));
+			pmd++, vaddr += PMD_SIZE) {
+		if (pmd_none(*pmd))
+			continue;
+		if (vaddr < (unsigned long) _text || vaddr > kernel_end)
+			set_pmd(pmd, __pmd(0));
+	}
+	/* In case we did something silly, we should crash in this function
+	 * instead of somewhere later and be confusing. */
+	xen_mc_flush();
+}
+#endif
 static void __init xen_pagetable_setup_done(pgd_t *base)
 {
+#ifdef CONFIG_X86_64
+	unsigned long size;
+	unsigned long addr;
+#endif
+
 	xen_setup_shared_info();
+#ifdef CONFIG_X86_64
+	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+		unsigned long new_mfn_list;
+
+		size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+
+		new_mfn_list = xen_revector_p2m_tree();
+
+		/* On 32-bit, we get zero so this never gets executed. */
+		if (new_mfn_list && new_mfn_list != xen_start_info->mfn_list) {
+			/* using __ka address! */
+			memset((void *)xen_start_info->mfn_list, 0, size);
+
+			/* We should be in __ka space. */
+			BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
+			addr = xen_start_info->mfn_list;
+			size = PAGE_ALIGN(xen_start_info->nr_pages * sizeof(unsigned long));
+			/* We roundup to the PMD, which means that if anybody at this stage is
+			 * using the __ka address of xen_start_info or xen_start_info->shared_info
+			 * they are in going to crash. Fortunatly we have already revectored
+			 * in xen_setup_kernel_pagetable and in xen_setup_shared_info. */
+			size = roundup(size, PMD_SIZE);
+			xen_cleanhighmap(addr, addr + size);
+
+			memblock_free(__pa(xen_start_info->mfn_list), size);
+			/* And revector! Bye bye old array */
+			xen_start_info->mfn_list = new_mfn_list;
+		}
+	}
+#endif
 	xen_post_allocator_init();
 }
 
@@ -1822,6 +1877,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
 	/* Our (by three pages) smaller Xen pagetable that we are using */
 	memblock_reserve(PFN_PHYS(pt_base), (pt_end - pt_base) * PAGE_SIZE);
+	/* Revector the xen_start_info */
+	xen_start_info = (struct start_info *)__va(__pa(xen_start_info));
 }
 #else	/* !CONFIG_X86_64 */
 static RESERVE_BRK_ARRAY(pmd_t, initial_kernel_pmd, PTRS_PER_PMD);
-- 
1.7.7.6
Konrad Rzeszutek Wilk
2012-Jul-31 14:43 UTC
[PATCH 6/6] xen/mmu: Remove from __ka space PMD entries for pagetables.
Please first read the description in "xen/mmu: Copy and revector the
P2M tree."

At this stage, the __ka address space (which is what the old P2M tree was
using) is partially disassembled. The cleanup_highmap has removed the PMD
entries from 0-16MB and anything past _brk_end up to the max_pfn_mapped
(which is the end of the ramdisk).

The xen_remove_p2m_tree and the code around it have ripped out the __ka
for the old P2M array. Here we continue on doing it for where the Xen
page-tables were. It is safe to do so, as the page-tables are addressed
using __va. For good measure we delete anything that is within
MODULES_VADDR and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start of the
kernel up to __brk.

[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/mmu.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 9358b75..fa4d208 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1240,6 +1240,25 @@ static void __init xen_pagetable_setup_done(pgd_t *base)
 			xen_start_info->mfn_list = new_mfn_list;
 		}
 	}
+	/* At this stage, cleanup_highmap has already cleaned __ka space
+	 * from _brk_limit way up to the max_pfn_mapped (which is the end of
+	 * the ramdisk). We continue on, erasing PMD entries that point to page
+	 * tables - do note that they are accessible at this stage via __va.
+	 * For good measure we also round up to the PMD - which means that if
+	 * anybody is using __ka address to the initial boot-stack - and try
+	 * to use it - they are going to crash. The xen_start_info has been
+	 * taken care of already in xen_setup_kernel_pagetable. */
+	addr = xen_start_info->pt_base;
+	size = roundup(xen_start_info->nr_pt_frames * PAGE_SIZE, PMD_SIZE);
+
+	xen_cleanhighmap(addr, addr + size);
+	xen_start_info->pt_base = (unsigned long)__va(__pa(xen_start_info->pt_base));
+#ifdef DEBUG
+	/* This is superflous and shouldn't be neccessary, but you know what
+	 * lets do it. The MODULES_VADDR -> MODULES_END should be clear of
+	 * anything at this stage. */
+	xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
+#endif
 #endif
 	xen_post_allocator_init();
 }
-- 
1.7.7.6
Konrad Rzeszutek Wilk
2012-Aug-01 15:50 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Tue, Jul 31, 2012 at 10:43:18AM -0400, Konrad Rzeszutek Wilk wrote:
> Changelog:
> Since v1: [http://lists.xen.org/archives/html/xen-devel/2012-07/msg01561.html]
>  - added more comments, and #ifdefs
>  - squashed the L4 and the L3 and L2 recycle patches together
>  - Added Acked-by's
> 
> The explanation of these patches is exactly what v1 had:
> 
> The details of this problem are nicely explained in:
> 
>  [PATCH 4/6] xen/p2m: Add logic to revector a P2M tree to use __va
>  [PATCH 5/6] xen/mmu: Copy and revector the P2M tree.
>  [PATCH 6/6] xen/mmu: Remove from __ka space PMD entries for
> 
> and the supporting patches are just nice optimizations. Pasting in
> what those patches mentioned:

With these patches I've gotten it to boot up to 384GB. Around that area
something weird happens - mainly the pagetables that the toolstack
allocated seem to have missing data. I haven't looked into details, but
this is what the domain builder tells me:

xc_dom_alloc_segment:   ramdisk      : 0xffffffff82278000 -> 0xffffffff930b4000  (pfn 0x2278 + 0x10e3c pages)
xc_dom_malloc            : 1621 kB
xc_dom_pfn_to_ptr: domU mapping: pfn 0x2278+0x10e3c at 0x7fb0853a2000
xc_dom_do_gunzip: unzip ok, 0x4ba831c -> 0x10e3be10
xc_dom_alloc_segment:   phys2mach    : 0xffffffff930b4000 -> 0xffffffffc30b4000  (pfn 0x130b4 + 0x30000 pages)
xc_dom_malloc            : 4608 kB
xc_dom_pfn_to_ptr: domU mapping: pfn 0x130b4+0x30000 at 0x7fb0553a2000
xc_dom_alloc_page   :   start info   : 0xffffffffc30b4000 (pfn 0x430b4)
xc_dom_alloc_page   :   xenstore     : 0xffffffffc30b5000 (pfn 0x430b5)
xc_dom_alloc_page   :   console      : 0xffffffffc30b6000 (pfn 0x430b6)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffffffffff, 2 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffc33fffff, 538 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffc30b7000 -> 0xffffffffc32d5000  (pfn 0x430b7 + 0x21e pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x430b7+0x21e at 0x7fb055184000
xc_dom_alloc_page   :   boot stack   : 0xffffffffc32d5000 (pfn 0x432d5)
xc_dom_build_image  : virt_alloc_end : 0xffffffffc32d6000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffc3400000

Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
(L3[510]) and in level2_fixmap_pgt territory (L3[511]).

The hypervisor tells me:

(XEN) Pagetable walk from ffffffffc32d5ff8:
(XEN)  L4[0x1ff] = 000000b9804d9067 00000000000430b8
(XEN)  L3[0x1ff] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 13 (vcpu#0) crashed on cpu#121:
(XEN) ----[ Xen-4.1.2-OVM  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    121
(XEN) RIP:    e033:[<ffffffff818a4200>]
(XEN) RFLAGS: 0000000000010202   EM: 1   CONTEXT: pv guest
(XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: 0000000000000000
(XEN) rdx: 0000000000000000   rsi: ffffffffc30b4000   rdi: 0000000000000000
(XEN) rbp: 0000000000000000   rsp: ffffffffc32d6000   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000b9804da000   cr2: ffffffffc32d5ff8
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffffc32d6000:
(XEN)   Fault while accessing guest memory.

And that EIP translates to ffffffff818a4200 T startup_xen, which does:

ENTRY(startup_xen)
	cld
ffffffff818a4200:       fc                      cld
#ifdef CONFIG_X86_32
	mov %esi,xen_start_info
	mov $init_thread_union+THREAD_SIZE,%esp
#else
	mov %rsi,xen_start_info
ffffffff818a4201:       48 89 34 25 48 92 94    mov    %rsi,0xffffffff81949248
ffffffff818a4208:       81

At that stage we are still operating using the Xen provided pagetables -
which look to have L4[511][511] empty! Which sounds to me like a Xen
tool-stack problem? Jan, have you seen something similar to this?
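(To follow the index arithmetic above - an illustration only, not from the
original thread - the L4/L3 slots for an address such as 0xffffffffc30b4000
can be recomputed with a few shifts; both come out as 0x1ff (511), the same
slots the hypervisor's pagetable walk reports.)

    /* Standalone illustration: x86-64 4-level paging uses 9 index bits
     * per level. */
    #include <stdio.h>

    int main(void)
    {
            unsigned long addr = 0xffffffffc30b4000UL;

            printf("L4 index: 0x%lx\n", (addr >> 39) & 0x1ff); /* 0x1ff */
            printf("L3 index: 0x%lx\n", (addr >> 30) & 0x1ff); /* 0x1ff */
            return 0;
    }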
Jan Beulich
2012-Aug-02 09:05 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 01.08.12 at 17:50, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> With these patches I've gotten it to boot up to 384GB. Around that area
> something weird happens - mainly the pagetables that the toolstack
> allocated seem to have missing data. I haven't looked into details, but
> this is what the domain builder tells me:
> 
> xc_dom_alloc_segment:   ramdisk      : 0xffffffff82278000 -> 0xffffffff930b4000  (pfn 0x2278 + 0x10e3c pages)
> xc_dom_malloc            : 1621 kB
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x2278+0x10e3c at 0x7fb0853a2000
> xc_dom_do_gunzip: unzip ok, 0x4ba831c -> 0x10e3be10
> xc_dom_alloc_segment:   phys2mach    : 0xffffffff930b4000 -> 0xffffffffc30b4000  (pfn 0x130b4 + 0x30000 pages)
> xc_dom_malloc            : 4608 kB
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x130b4+0x30000 at 0x7fb0553a2000
> xc_dom_alloc_page   :   start info   : 0xffffffffc30b4000 (pfn 0x430b4)
> xc_dom_alloc_page   :   xenstore     : 0xffffffffc30b5000 (pfn 0x430b5)
> xc_dom_alloc_page   :   console      : 0xffffffffc30b6000 (pfn 0x430b6)
> nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
> nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
> nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffffffffff, 2 table(s)
> nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffc33fffff, 538 table(s)
> xc_dom_alloc_segment:   page tables  : 0xffffffffc30b7000 -> 0xffffffffc32d5000  (pfn 0x430b7 + 0x21e pages)
> xc_dom_pfn_to_ptr: domU mapping: pfn 0x430b7+0x21e at 0x7fb055184000
> xc_dom_alloc_page   :   boot stack   : 0xffffffffc32d5000 (pfn 0x432d5)
> xc_dom_build_image  : virt_alloc_end : 0xffffffffc32d6000
> xc_dom_build_image  : virt_pgtab_end : 0xffffffffc3400000
> 
> Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
> (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
> 
> At that stage we are still operating using the Xen provided pagetables -
> which look to have L4[511][511] empty! Which sounds to me like a Xen
> tool-stack problem? Jan, have you seen something similar to this?

No we haven't, but I also don't think anyone tried to create as
big a DomU. I was, however, under the impression that DomU-s
this big had been created at Oracle before. Or was that only up
to 256Gb perhaps?

In any case, setup_pgtables_x86_64() indeed looks flawed
to me: While the clearing of l1tab looks right, l[23]tab get
cleared (and hence a new table allocated) too early. l2tab
should really get cleared only when l1tab gets cleared _and_
the L2 clearing condition is true. Similarly for l3tab then, and
of course - even though it would unlikely ever matter -
setup_pgtables_x86_32_pae() is broken in the same way.

Afaict this got broken with the domain build re-write between
3.0.4 and 3.1 (the old code looks alright).

Jan
Konrad Rzeszutek Wilk
2012-Aug-02 14:17 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Thu, Aug 02, 2012 at 10:05:27AM +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 17:50, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > With these patches I've gotten it to boot up to 384GB. Around that area
> > something weird happens - mainly the pagetables that the toolstack
> > allocated seem to have missing data. I haven't looked into details, but
> > this is what the domain builder tells me:
> > 
> > xc_dom_alloc_segment:   ramdisk      : 0xffffffff82278000 -> 0xffffffff930b4000  (pfn 0x2278 + 0x10e3c pages)
> > xc_dom_malloc            : 1621 kB
> > xc_dom_pfn_to_ptr: domU mapping: pfn 0x2278+0x10e3c at 0x7fb0853a2000
> > xc_dom_do_gunzip: unzip ok, 0x4ba831c -> 0x10e3be10
> > xc_dom_alloc_segment:   phys2mach    : 0xffffffff930b4000 -> 0xffffffffc30b4000  (pfn 0x130b4 + 0x30000 pages)
> > xc_dom_malloc            : 4608 kB
> > xc_dom_pfn_to_ptr: domU mapping: pfn 0x130b4+0x30000 at 0x7fb0553a2000
> > xc_dom_alloc_page   :   start info   : 0xffffffffc30b4000 (pfn 0x430b4)
> > xc_dom_alloc_page   :   xenstore     : 0xffffffffc30b5000 (pfn 0x430b5)
> > xc_dom_alloc_page   :   console      : 0xffffffffc30b6000 (pfn 0x430b6)
> > nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
> > nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
> > nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffffffffff, 2 table(s)
> > nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffc33fffff, 538 table(s)
> > xc_dom_alloc_segment:   page tables  : 0xffffffffc30b7000 -> 0xffffffffc32d5000  (pfn 0x430b7 + 0x21e pages)
> > xc_dom_pfn_to_ptr: domU mapping: pfn 0x430b7+0x21e at 0x7fb055184000
> > xc_dom_alloc_page   :   boot stack   : 0xffffffffc32d5000 (pfn 0x432d5)
> > xc_dom_build_image  : virt_alloc_end : 0xffffffffc32d6000
> > xc_dom_build_image  : virt_pgtab_end : 0xffffffffc3400000
> > 
> > Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
> > (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
> > 
> > At that stage we are still operating using the Xen provided pagetables -
> > which look to have L4[511][511] empty! Which sounds to me like a Xen
> > tool-stack problem? Jan, have you seen something similar to this?
> 
> No we haven't, but I also don't think anyone tried to create as
> big a DomU. I was, however, under the impression that DomU-s
> this big had been created at Oracle before. Or was that only up
> to 256Gb perhaps?

Mukesh do you recall? Was it with OVM2.2.2 which was 3.4 based?
It might be that we did not have the 1TB hardware at that time yet.
Or perhaps I am missing some bug-fix from the old product..

> 
> In any case, setup_pgtables_x86_64() indeed looks flawed
> to me: While the clearing of l1tab looks right, l[23]tab get
> cleared (and hence a new table allocated) too early. l2tab
> should really get cleared only when l1tab gets cleared _and_
> the L2 clearing condition is true. Similarly for l3tab then, and
> of course - even though it would unlikely ever matter -
> setup_pgtables_x86_32_pae() is broken in the same way.
> 
> Afaict this got broken with the domain build re-write between
> 3.0.4 and 3.1 (the old code looks alright).

Oh wow. Long time ago. Thanks for the pointer - will look at this once
I am through with some of the current bug log.

> 
> Jan
Mukesh Rathor
2012-Aug-02 23:04 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Thu, 2 Aug 2012 10:17:10 -0400 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Thu, Aug 02, 2012 at 10:05:27AM +0100, Jan Beulich wrote:
> > >>> On 01.08.12 at 17:50, Konrad Rzeszutek Wilk
> > >>> <konrad.wilk@oracle.com> wrote:
> > > With these patches I've gotten it to boot up to 384GB. Around that
> > > area something weird happens - mainly the pagetables that the
> > > toolstack allocated seem to have missing data. I haven't looked into
> > > details, but this is what the domain builder tells me:
> > > 
> > > xc_dom_alloc_segment:   ramdisk      : 0xffffffff82278000 -> 0xffffffff930b4000  (pfn 0x2278 + 0x10e3c pages)
> > > xc_dom_malloc            : 1621 kB
> > > xc_dom_pfn_to_ptr: domU mapping: pfn 0x2278+0x10e3c at 0x7fb0853a2000
> > > xc_dom_do_gunzip: unzip ok, 0x4ba831c -> 0x10e3be10
> > > xc_dom_alloc_segment:   phys2mach    : 0xffffffff930b4000 -> 0xffffffffc30b4000  (pfn 0x130b4 + 0x30000 pages)
> > > xc_dom_malloc            : 4608 kB
> > > xc_dom_pfn_to_ptr: domU mapping: pfn 0x130b4+0x30000 at 0x7fb0553a2000
> > > xc_dom_alloc_page   :   start info   : 0xffffffffc30b4000 (pfn 0x430b4)
> > > xc_dom_alloc_page   :   xenstore     : 0xffffffffc30b5000 (pfn 0x430b5)
> > > xc_dom_alloc_page   :   console      : 0xffffffffc30b6000 (pfn 0x430b6)
> > > nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
> > > nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
> > > nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffffffffff, 2 table(s)
> > > nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffc33fffff, 538 table(s)
> > > xc_dom_alloc_segment:   page tables  : 0xffffffffc30b7000 -> 0xffffffffc32d5000  (pfn 0x430b7 + 0x21e pages)
> > > xc_dom_pfn_to_ptr: domU mapping: pfn 0x430b7+0x21e at 0x7fb055184000
> > > xc_dom_alloc_page   :   boot stack   : 0xffffffffc32d5000 (pfn 0x432d5)
> > > xc_dom_build_image  : virt_alloc_end : 0xffffffffc32d6000
> > > xc_dom_build_image  : virt_pgtab_end : 0xffffffffc3400000
> > > 
> > > Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
> > > (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
> > > 
> > > At that stage we are still operating using the Xen provided
> > > pagetables - which look to have L4[511][511] empty! Which
> > > sounds to me like a Xen tool-stack problem? Jan, have you seen
> > > something similar to this?
> > 
> > No we haven't, but I also don't think anyone tried to create as
> > big a DomU. I was, however, under the impression that DomU-s
> > this big had been created at Oracle before. Or was that only up
> > to 256Gb perhaps?
> 
> Mukesh do you recall? Was it with OVM2.2.2 which was 3.4 based?
> It might be that we did not have the 1TB hardware at that time yet.

Yes, in ovm2.x, I debugged/booted upto 500GB domU. So something
got broken after it looks like. I can debug later if it becomes hot.

thanks,
Mukesh
Konrad Rzeszutek Wilk
2012-Aug-03 13:30 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
> > > > Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
> > > > (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
> > > > 
> > > > At that stage we are still operating using the Xen provided
> > > > pagetables - which look to have L4[511][511] empty! Which
> > > > sounds to me like a Xen tool-stack problem? Jan, have you seen
> > > > something similar to this?
> > > 
> > > No we haven't, but I also don't think anyone tried to create as
> > > big a DomU. I was, however, under the impression that DomU-s
> > > this big had been created at Oracle before. Or was that only up
> > > to 256Gb perhaps?
> > 
> > Mukesh do you recall? Was it with OVM2.2.2 which was 3.4 based?
> > It might be that we did not have the 1TB hardware at that time yet.
> 
> Yes, in ovm2.x, I debugged/booted upto 500GB domU. So something
> got broken after it looks like. I can debug later if it becomes hot.

I got the kernel part fixed, but it's the toolstack that has bugs in it.
If you recall - were there any patches in the toolstack for this, or did
you just concentrate on the kernel?

Thanks!
Jan Beulich
2012-Aug-03 13:54 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 03.08.12 at 15:30, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> > > > Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
>> > > > (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
>> > > > 
>> > > > At that stage we are still operating using the Xen provided
>> > > > pagetables - which look to have L4[511][511] empty! Which
>> > > > sounds to me like a Xen tool-stack problem? Jan, have you seen
>> > > > something similar to this?
>> > > 
>> > > No we haven't, but I also don't think anyone tried to create as
>> > > big a DomU. I was, however, under the impression that DomU-s
>> > > this big had been created at Oracle before. Or was that only up
>> > > to 256Gb perhaps?
>> > 
>> > Mukesh do you recall? Was it with OVM2.2.2 which was 3.4 based?
>> > It might be that we did not have the 1TB hardware at that time yet.
>> 
>> Yes, in ovm2.x, I debugged/booted upto 500GB domU. So something
>> got broken after it looks like. I can debug later if it becomes hot.
> 
> I got the kernel part fixed, but it's the toolstack that has bugs in it.

So did you try the suggested fix? Or are you waiting for me to
put this in patch form?

Jan

> If you recall - were there any patches in the toolstack for this, or did
> you just concentrate on the kernel?
> Thanks!
Mukesh Rathor
2012-Aug-03 18:37 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Fri, 3 Aug 2012 09:30:01 -0400 Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> > > > > Note it is 0xffffffffc30b4000 - so already past the level2_kernel_pgt
> > > > > (L3[510]) and in level2_fixmap_pgt territory (L3[511]).
> > > > > 
> > > > > At that stage we are still operating using the Xen provided
> > > > > pagetables - which look to have L4[511][511] empty! Which
> > > > > sounds to me like a Xen tool-stack problem? Jan, have you seen
> > > > > something similar to this?
> > > > 
> > > > No we haven't, but I also don't think anyone tried to create as
> > > > big a DomU. I was, however, under the impression that DomU-s
> > > > this big had been created at Oracle before. Or was that only up
> > > > to 256Gb perhaps?
> > > 
> > > Mukesh do you recall? Was it with OVM2.2.2 which was 3.4 based?
> > > It might be that we did not have the 1TB hardware at that time yet.
> > 
> > Yes, in ovm2.x, I debugged/booted upto 500GB domU. So something
> > got broken after it looks like. I can debug later if it becomes hot.
> 
> I got the kernel part fixed, but it's the toolstack that has bugs in it.
> If you recall - were there any patches in the toolstack for this, or did
> you just concentrate on the kernel?

Ah, I remember, it was an issue in the tool stack, xm, so I punted it to
the tools experts. They were busy, so we hoped xl would fix it.
Jan Beulich
2012-Aug-13 07:54 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> Didn't get to it yet. Sorry for top posting. If you have a patch ready I
> can test it on Monday - travelling now.

So here's what I was thinking of (compile tested only).

Jan

--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -241,7 +241,7 @@ static int setup_pgtables_x86_32_pae(str
     l3_pgentry_64_t *l3tab;
     l2_pgentry_64_t *l2tab = NULL;
     l1_pgentry_64_t *l1tab = NULL;
-    unsigned long l3off, l2off, l1off;
+    unsigned long l3off, l2off = 0, l1off;
     xen_vaddr_t addr;
     xen_pfn_t pgpfn;
     xen_pfn_t l3mfn = xc_dom_p2m_guest(dom, l3pfn);
@@ -283,8 +283,6 @@ static int setup_pgtables_x86_32_pae(str
             l2off = l2_table_offset_pae(addr);
             l2tab[l2off] =
                 pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
-            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
-                l2tab = NULL;
             l1pfn++;
         }
 
@@ -296,8 +294,13 @@ static int setup_pgtables_x86_32_pae(str
         if ( (addr >= dom->pgtables_seg.vstart) &&
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
+
         if ( l1off == (L1_PAGETABLE_ENTRIES_PAE - 1) )
+        {
             l1tab = NULL;
+            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
+                l2tab = NULL;
+        }
     }
 
     if ( dom->virt_pgtab_end <= 0xc0000000 )
@@ -340,7 +343,7 @@ static int setup_pgtables_x86_64(struct
     l3_pgentry_64_t *l3tab = NULL;
     l2_pgentry_64_t *l2tab = NULL;
     l1_pgentry_64_t *l1tab = NULL;
-    uint64_t l4off, l3off, l2off, l1off;
+    uint64_t l4off, l3off = 0, l2off = 0, l1off;
     uint64_t addr;
     xen_pfn_t pgpfn;
 
@@ -364,8 +367,6 @@ static int setup_pgtables_x86_64(struct
             l3off = l3_table_offset_x86_64(addr);
             l3tab[l3off] =
                 pfn_to_paddr(xc_dom_p2m_guest(dom, l2pfn)) | L3_PROT;
-            if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
-                l3tab = NULL;
             l2pfn++;
         }
 
@@ -376,8 +377,6 @@ static int setup_pgtables_x86_64(struct
             l2off = l2_table_offset_x86_64(addr);
             l2tab[l2off] =
                 pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
-            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
-                l2tab = NULL;
             l1pfn++;
         }
 
@@ -389,8 +388,17 @@ static int setup_pgtables_x86_64(struct
         if ( (addr >= dom->pgtables_seg.vstart) &&
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
+
         if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
+        {
             l1tab = NULL;
+            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
+            {
+                l2tab = NULL;
+                if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
+                    l3tab = NULL;
+            }
+        }
     }
     return 0;
 }
Jan Beulich
2012-Sep-03 06:33 UTC
Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 13.08.12 at 09:54, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> Didn't get to it yet. Sorry for top posting. If you have a patch ready I
>> can test it on Monday - travelling now.
> 
> So here's what I was thinking of (compile tested only).

Obviously, if this works, I'd like to see this included in 4.2 (and
4.1-testing).

Jan

> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -241,7 +241,7 @@ static int setup_pgtables_x86_32_pae(str
>      l3_pgentry_64_t *l3tab;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    unsigned long l3off, l2off, l1off;
> +    unsigned long l3off, l2off = 0, l1off;
>      xen_vaddr_t addr;
>      xen_pfn_t pgpfn;
>      xen_pfn_t l3mfn = xc_dom_p2m_guest(dom, l3pfn);
> @@ -283,8 +283,6 @@ static int setup_pgtables_x86_32_pae(str
>              l2off = l2_table_offset_pae(addr);
>              l2tab[l2off] =
>                  pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> -                l2tab = NULL;
>              l1pfn++;
>          }
> 
> @@ -296,8 +294,13 @@ static int setup_pgtables_x86_32_pae(str
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_PAE - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> +                l2tab = NULL;
> +        }
>      }
> 
>      if ( dom->virt_pgtab_end <= 0xc0000000 )
> @@ -340,7 +343,7 @@ static int setup_pgtables_x86_64(struct
>      l3_pgentry_64_t *l3tab = NULL;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    uint64_t l4off, l3off, l2off, l1off;
> +    uint64_t l4off, l3off = 0, l2off = 0, l1off;
>      uint64_t addr;
>      xen_pfn_t pgpfn;
> 
> @@ -364,8 +367,6 @@ static int setup_pgtables_x86_64(struct
>              l3off = l3_table_offset_x86_64(addr);
>              l3tab[l3off] =
>                  pfn_to_paddr(xc_dom_p2m_guest(dom, l2pfn)) | L3_PROT;
> -            if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> -                l3tab = NULL;
>              l2pfn++;
>          }
> 
> @@ -376,8 +377,6 @@ static int setup_pgtables_x86_64(struct
>              l2off = l2_table_offset_x86_64(addr);
>              l2tab[l2off] =
>                  pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> -                l2tab = NULL;
>              l1pfn++;
>          }
> 
> @@ -389,8 +388,17 @@ static int setup_pgtables_x86_64(struct
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> +            {
> +                l2tab = NULL;
> +                if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> +                    l3tab = NULL;
> +            }
> +        }
>      }
>      return 0;
> }
Konrad Rzeszutek Wilk
2012-Sep-06 21:03 UTC
Re: Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Mon, Sep 03, 2012 at 07:33:24AM +0100, Jan Beulich wrote:
> >>> On 13.08.12 at 09:54, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> >> Didn't get to it yet. Sorry for top posting. If you have a patch ready I
> >> can test it on Monday - travelling now.
> > 
> > So here's what I was thinking of (compile tested only).
> 
> Obviously, if this works, I'd like to see this included in 4.2 (and
> 4.1-testing).

No luck. I still get:

(XEN) Pagetable walk from ffff8800443da070:
(XEN)  L4[0x110] = 0000009342f95067 0000000000001a0c
(XEN)  L3[0x001] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 61 (vcpu#0) crashed on cpu#97:
(XEN) ----[ Xen-4.1.2-OVM  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    97
(XEN) RIP:    e033:[<ffffffff81abf971>]
(XEN) RFLAGS: 0000000000000246   EM: 1   CONTEXT: pv guest
(XEN) rax: ffff8800443da000   rbx: 0000000000000000   rcx: 0000000000000001
(XEN) rdx: ffffffff81f76000   rsi: 0000000000000000   rdi: 0000000000000006
(XEN) rbp: ffffffff81a01ff8   rsp: ffffffff81a01f70   r8:  0000000000000000
(XEN) r9:  00000000443e0000   r10: 0000000000225000   r11: 0000008b00fc2067
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000009342f96000   cr2: ffff8800443da070
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81a01f70:
(XEN)    0000000000000001 0000008b00fc2067 0000000000000000 ffffffff81abf971
(XEN)    000000010000e030 0000000000010046 ffffffff81a01fb8 000000000000e02b
(XEN)    ffffffff81abf918 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 80b822011f898975 000206e537200800 0000000000000001
(XEN)    0000000000000000 0000000000000000 0f00000060c0c748 ccccccccccccc305
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
(XEN)    cccccccccccccccc cccccccccccccccc cccccccccccccccc cccccccccccccccc
Jan Beulich
2012-Sep-07 09:01 UTC
Re: Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 06.09.12 at 23:03, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> On Mon, Sep 03, 2012 at 07:33:24AM +0100, Jan Beulich wrote:
>> >>> On 13.08.12 at 09:54, "Jan Beulich" <JBeulich@suse.com> wrote:
>> >>>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> >> Didn't get to it yet. Sorry for top posting. If you have a patch ready I
>> >> can test it on Monday - travelling now.
>> > 
>> > So here's what I was thinking of (compile tested only).
>> 
>> Obviously, if this works, I'd like to see this included in 4.2 (and
>> 4.1-testing).
> 
> No luck. I still get:
> 
> (XEN) Pagetable walk from ffff8800443da070:
> (XEN)  L4[0x110] = 0000009342f95067 0000000000001a0c
> (XEN)  L3[0x001] = 0000000000000000 ffffffffffffffff

And I can't see why. I wasn't able to track down the original
stack trace you saw on the archives - was that identical to
this one (i.e. nothing changed at all)? If so (please forgive
that I'm asking, I just know that I happen to fall into this trap
once in a while myself), did you indeed build and install the
patched tools? In that case, adding some logging to the code
in question is presumably the only alternative, short of
anyone else seeing anything further wrong with that code.

Jan
Konrad Rzeszutek Wilk
2012-Sep-07 13:39 UTC
Re: Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Fri, Sep 07, 2012 at 10:01:36AM +0100, Jan Beulich wrote:
> >>> On 06.09.12 at 23:03, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
> > On Mon, Sep 03, 2012 at 07:33:24AM +0100, Jan Beulich wrote:
> >> >>> On 13.08.12 at 09:54, "Jan Beulich" <JBeulich@suse.com> wrote:
> >> >>>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> >> >> Didn't get to it yet. Sorry for top posting. If you have a patch ready I
> >> >> can test it on Monday - travelling now.
> >> > 
> >> > So here's what I was thinking of (compile tested only).
> >> 
> >> Obviously, if this works, I'd like to see this included in 4.2 (and
> >> 4.1-testing).
> > 
> > No luck. I still get:
> > 
> > (XEN) Pagetable walk from ffff8800443da070:
> > (XEN)  L4[0x110] = 0000009342f95067 0000000000001a0c
> > (XEN)  L3[0x001] = 0000000000000000 ffffffffffffffff
> 
> And I can't see why. I wasn't able to track down the original
> stack trace you saw on the archives - was that identical to
> this one (i.e. nothing changed at all)? If so (please forgive

It does look identical.

> that I'm asking, I just know that I happen to fall into this trap
> once in a while myself), did you indeed build and install the

I know. I did double check - as I couldn't install wholesale the
new RPM (the owner of the box needed the old version of it), instead
I did this bit of a hack:

xend stop
cd /konrad
rpm2cpio xen-*konrad* | cpio -id
tar -czvf /xen.orig.tgz /usr/lib64/*xen*
rm -Rf /usr/lib64/*xen*
mv /usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen.old
ln -s /konrad/usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen
export PATH=/konrad/usr/bin:/konrad/usr/sbin:$PATH
export LD_LIBRARY_PATH=/konrad/usr/lib64

xend start
xm create /konrad/test.xm

Which _should_ have taken care of everything in the toolstack.

> patched tools? In that case, adding some logging to the code
> in question is presumably the only alternative, short of
> anyone else seeing anything further wrong with that code.

That was my next thought too.. also that would verify that my
hac^H^H^Hinstallation worked properly.

> 
> Jan
Jan Beulich
2012-Sep-07 14:09 UTC
Re: Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 07.09.12 at 15:39, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Fri, Sep 07, 2012 at 10:01:36AM +0100, Jan Beulich wrote:
>> that I'm asking, I just know that I happen to fall into this trap
>> once in a while myself), did you indeed build and install the
>
> I know. I did double check - as I couldn't install wholesale the
> new RPM (owner of the box needed the old version of it), instead
> I did this bit of hack:
>
> xend stop
> cd /konrad
> rpm2cpio xen-*konrad* | cpio -id
> tar -czvf /xen.orig.tgz /usr/lib64/*xen*
> rm -Rf /usr/lib64/*xen*

So here you removed the old libraries. But where did you drop in
the new ones? Did you just forget to list this here?

> mv /usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen.old
> ln -s /konrad/usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen
> export PATH=/konrad/usr/bin:/konrad/usr/sbin:$PATH
> export LD_LIBRARY_PATH=/konrad/usr/lib64
>
> xend start
> xm create /konrad/test.xm

Jan
Konrad Rzeszutek Wilk
2012-Sep-07 14:11 UTC
Re: Ping: Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Fri, Sep 07, 2012 at 03:09:00PM +0100, Jan Beulich wrote:
> >>> On 07.09.12 at 15:39, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Fri, Sep 07, 2012 at 10:01:36AM +0100, Jan Beulich wrote:
> >> that I'm asking, I just know that I happen to fall into this trap
> >> once in a while myself), did you indeed build and install the
> >
> > I know. I did double check - as I couldn't install wholesale the
> > new RPM (owner of the box needed the old version of it), instead
> > I did this bit of hack:
> >
> > xend stop
> > cd /konrad
> > rpm2cpio xen-*konrad* | cpio -id
> > tar -czvf /xen.orig.tgz /usr/lib64/*xen*
> > rm -Rf /usr/lib64/*xen*
>
> So here you removed the old libraries. But where did you drop in
> the new ones? Did you just forget to list this here?

There was no need since the LD_LIBRARY_PATH did the override. This
was to make double sure that the old libs wouldn't be called.

>
> > mv /usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen.old
> > ln -s /konrad/usr/lib/python2.4/site-packages/xen /usr/lib/python2.4/site-packages/xen
> > export PATH=/konrad/usr/bin:/konrad/usr/sbin:$PATH
> > export LD_LIBRARY_PATH=/konrad/usr/lib64
> >
> > xend start
> > xm create /konrad/test.xm
>
> Jan
>
Konrad Rzeszutek Wilk
2013-Aug-27 20:34 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Mon, Aug 13, 2012 at 08:54:47AM +0100, Jan Beulich wrote:
> >>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> > Didn't get to it yet. Sorry for top posting. If you have a patch ready I
> > can test it on Monday - travelling now.
>
> So here's what I was thinking of (compile tested only).

Wow. It took me a whole year to get back to this.

Anyhow I did test it and it worked rather nicely for 64-bit guests. I didn't
even try to boot 32-bit guests as the pvops changes I did were only for 64-bit
guests. But if you have a specific kernel for a 32-bit guest I still have
the 1TB machine for a week and can boot it up there.

>
> Jan
>
> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -241,7 +241,7 @@ static int setup_pgtables_x86_32_pae(str
>      l3_pgentry_64_t *l3tab;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    unsigned long l3off, l2off, l1off;
> +    unsigned long l3off, l2off = 0, l1off;
>      xen_vaddr_t addr;
>      xen_pfn_t pgpfn;
>      xen_pfn_t l3mfn = xc_dom_p2m_guest(dom, l3pfn);
> @@ -283,8 +283,6 @@ static int setup_pgtables_x86_32_pae(str
>          l2off = l2_table_offset_pae(addr);
>          l2tab[l2off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -        if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> -            l2tab = NULL;
>          l1pfn++;
>      }
>
> @@ -296,8 +294,13 @@ static int setup_pgtables_x86_32_pae(str
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_PAE - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> +                l2tab = NULL;
> +        }
>      }
>
>      if ( dom->virt_pgtab_end <= 0xc0000000 )
> @@ -340,7 +343,7 @@ static int setup_pgtables_x86_64(struct
>      l3_pgentry_64_t *l3tab = NULL;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    uint64_t l4off, l3off, l2off, l1off;
> +    uint64_t l4off, l3off = 0, l2off = 0, l1off;
>      uint64_t addr;
>      xen_pfn_t pgpfn;
>
> @@ -364,8 +367,6 @@ static int setup_pgtables_x86_64(struct
>          l3off = l3_table_offset_x86_64(addr);
>          l3tab[l3off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l2pfn)) | L3_PROT;
> -        if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> -            l3tab = NULL;
>          l2pfn++;
>      }
>
> @@ -376,8 +377,6 @@ static int setup_pgtables_x86_64(struct
>          l2off = l2_table_offset_x86_64(addr);
>          l2tab[l2off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -        if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> -            l2tab = NULL;
>          l1pfn++;
>      }
>
> @@ -389,8 +388,17 @@ static int setup_pgtables_x86_64(struct
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> +            {
> +                l2tab = NULL;
> +                if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> +                    l3tab = NULL;
> +            }
> +        }
>      }
>      return 0;
>  }
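As a rough back-of-the-envelope (a sketch that assumes the memory= value is
in MiB, 4 KiB pages and 8-byte p2m entries; the numbers are not from the
thread), a guest of this size makes the domain builder map a p2m of several
hundred MiB, so the page-table setup loops have to step from one L1
directory to the next hundreds of times. That is presumably why only huge
guests ever exercised the broken switch-over path.

/* Editorial back-of-the-envelope, assumptions as stated above: estimate the
 * p2m size of a 'memory=440000' guest and how many 2 MiB-sized L1 page
 * tables the builder needs to cover it. */
#include <stdio.h>

int main(void)
{
    unsigned long long mem_mib  = 440000ULL;              /* memory=440000 */
    unsigned long long pages    = (mem_mib << 20) >> 12;  /* 4 KiB pages   */
    unsigned long long p2m_size = pages * 8;              /* bytes         */
    unsigned long long l1_tabs  =
        (p2m_size + (2ULL << 20) - 1) / (2ULL << 20);

    /* Prints roughly: p2m ~859 MiB -> ~430 L1 page tables. */
    printf("p2m ~%llu MiB -> ~%llu L1 page tables\n",
           p2m_size >> 20, l1_tabs);
    return 0;
}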
Jan Beulich
2013-Aug-28 07:55 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
>>> On 27.08.13 at 22:34, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Mon, Aug 13, 2012 at 08:54:47AM +0100, Jan Beulich wrote:
>> >>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> > Didn't get to it yet. Sorry for top posting. If you have a patch ready I
>> > can test it on Monday - travelling now.
>>
>> So here's what I was thinking of (compile tested only).
>
> Wow. It took me a whole year to get back to this.
>
> Anyhow I did test it and it worked rather nicely for 64-bit guests. I didn't
> even try to boot 32-bit guests as the pvops changes I did were only for 64-bit
> guests. But if you have a specific kernel for a 32-bit guest I still have
> the 1TB machine for a week and can boot it up there.

Considering that you had also attached a debug patch - did it
work without that, i.e. just with the patch that I had handed
you? If so, I'd then finally be in the position to submit this,
putting your Tested-by (and perhaps Reported-by) underneath.

And no, I'm not really concerned about the 32-bit case. The
analogy with the 64-bit code is sufficient to tell that the change
(even if just cosmetic) should also be done to the 32-bit variant.

Jan
Konrad Rzeszutek Wilk
2013-Aug-28 14:44 UTC
Re: [PATCH] Boot PV guests with more than 128GB (v2) for 3.7
On Wed, Aug 28, 2013 at 08:55:39AM +0100, Jan Beulich wrote:
> >>> On 27.08.13 at 22:34, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Mon, Aug 13, 2012 at 08:54:47AM +0100, Jan Beulich wrote:
> >> >>> On 03.08.12 at 16:46, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> >> > Didn't get to it yet. Sorry for top posting. If you have a patch ready I
> >> > can test it on Monday - travelling now.
> >>
> >> So here's what I was thinking of (compile tested only).
> >
> > Wow. It took me a whole year to get back to this.
> >
> > Anyhow I did test it and it worked rather nicely for 64-bit guests. I didn't
> > even try to boot 32-bit guests as the pvops changes I did were only for 64-bit
> > guests. But if you have a specific kernel for a 32-bit guest I still have
> > the 1TB machine for a week and can boot it up there.
>
> Considering that you had also attached a debug patch - did it
> work without that, i.e. just with the patch that I had handed
> you? If so, I'd then finally be in the position to submit this,
> putting your Tested-by (and perhaps Reported-by) underneath.

Yes it did with the 'memory=440000' guest config. I developed the debug
patch just to make sure I could see the failing case (fix=0) and working
case (fix=1) without having to reboot this monster machine.

Interestingly enough if I boot with a 486GB guest I end up with:

[root@ca-test111 konrad]# xl dmesg | tail -300
(XEN) d8:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffff880043e75070:
(XEN)  L4[0x110] = 00000080ba854067 0000000000001a0d
(XEN)  L3[0x001] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 8 (vcpu#0) crashed on cpu#16:
(XEN) ----[ Xen-4.4-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    16
(XEN) RIP:    e033:[<ffffffff81acd29e>]
(XEN) RFLAGS: 0000000000000246   EM: 1   CONTEXT: pv guest
(XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: ffffffff8219e000
(XEN) rdx: 0000000000000000   rsi: ffff880043e75000   rdi: 00000000deadbeef
(XEN) rbp: ffffffff81a01ff8   rsp: ffffffff81a01f00   r8:  0000000043e7a000
(XEN) r9:  0000000043e7b000   r10: 0000000000223000   r11: 000000a0a66b6067
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 000000400fcb6000   cr2: ffff880043e75070
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81a01f00:
(XEN)    ffffffff8219e000 000000a0a66b6067 0000000000000000 ffffffff81acd29e
(XEN)    000000010000e030 0000000000010046 ffffffff81a01f48 000000000000e02b
(XEN)    ffffffff81acd267 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 809822011f898975
(XEN)    000206e501200800 0000000000000001 0000000000000000 0000000000000000
(XEN)    0f00000060c0c748 ccccccccccccc305 cccccccccccccccc cccccccccccccccc
(XEN)    [remaining stack entries are all cccccccccccccccc]

(this is with the debug patch and the guest having 'fix=1' enabled,
meaning it uses the new code path).

Though looking at the stack more, I see that ffffffff81acd29e is:

0xffffffff81acd280 <xen_start_kernel+935>:   mov    $0xffffffff81931558,%rdi
0xffffffff81acd287 <xen_start_kernel+942>:   xor    %eax,%eax
0xffffffff81acd289 <xen_start_kernel+944>:   callq  0xffffffff813f5340 <xen_raw_printk>
0xffffffff81acd28e <xen_start_kernel+949>:   mov    0x1a6f53(%rip),%rsi    # 0xffffffff81c741e8 <xen_start_info>
0xffffffff81acd295 <xen_start_kernel+956>:   movb   $0x90,0x1aa454(%rip)   # 0xffffffff81c776f0 <boot_params+528>
0xffffffff81acd29c <xen_start_kernel+963>:   xor    %edx,%edx
0xffffffff81acd29e <xen_start_kernel+965>:   mov    0x70(%rsi),%rax

which implies that we copied from the xen_start_info something (pt_base?
mod_start?) which has the __va address instead of the __ka one.

So the bootup pagetables creation I think we are OK with and indeed you
can put the 'Tested-by' tag on it. I will dig in this a bit more.

>
> And no, I'm not really concerned about the 32-bit case. The
> analogy with the 64-bit code is sufficient to tell that the change
> (even if just cosmetic) should also be done to the 32-bit variant.

Right.

>
> Jan
>
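To make the __va versus __ka distinction concrete: assuming the standard
x86-64 layout of kernels of that era (direct map at PAGE_OFFSET =
0xffff880000000000, kernel mapping starting at __START_KERNEL_map =
0xffffffff80000000; an assumption for illustration, not something taken
from the patches), the faulting address sits squarely in the direct-map
(__va) region, which fits the suspicion that a pointer derived from
xen_start_info held a __va value where a __ka one was expected.

/* Editorial sketch, layout assumptions as stated above: show which mapping
 * the faulting address belongs to and the physical address it implies. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET_DIRECTMAP  0xffff880000000000ULL  /* __va base */
#define START_KERNEL_MAP       0xffffffff80000000ULL  /* __ka base */

int main(void)
{
    uint64_t fault = 0xffff880043e75070ULL;          /* cr2 from the dump */
    uint64_t phys  = fault - PAGE_OFFSET_DIRECTMAP;  /* ~0x43e75070, ~1 GiB */

    printf("fault 0x%llx = __va of phys 0x%llx (a direct-map address)\n",
           (unsigned long long)fault, (unsigned long long)phys);
    printf("__ka addresses would instead start at 0x%llx\n",
           (unsigned long long)START_KERNEL_MAP);
    return 0;
}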
Jan Beulich
2013-Aug-28 14:58 UTC
[PATCH] libxc/x86: fix page table creation for huge guests
The switch-over logic from one page directory to the next was wrong;
it needs to be deferred until we actually reach the last page within
a given region, instead of being done when the last entry of a page
directory gets started with.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -251,7 +251,7 @@ static int setup_pgtables_x86_32_pae(str
     l3_pgentry_64_t *l3tab;
     l2_pgentry_64_t *l2tab = NULL;
     l1_pgentry_64_t *l1tab = NULL;
-    unsigned long l3off, l2off, l1off;
+    unsigned long l3off, l2off = 0, l1off;
     xen_vaddr_t addr;
     xen_pfn_t pgpfn;
     xen_pfn_t l3mfn = xc_dom_p2m_guest(dom, l3pfn);
@@ -299,8 +299,6 @@ static int setup_pgtables_x86_32_pae(str
         l2off = l2_table_offset_pae(addr);
         l2tab[l2off] =
             pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
-        if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
-            l2tab = NULL;
         l1pfn++;
     }

@@ -312,8 +310,13 @@ static int setup_pgtables_x86_32_pae(str
         if ( (addr >= dom->pgtables_seg.vstart) &&
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
+
         if ( l1off == (L1_PAGETABLE_ENTRIES_PAE - 1) )
+        {
             l1tab = NULL;
+            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
+                l2tab = NULL;
+        }
     }

     if ( dom->virt_pgtab_end <= 0xc0000000 )
@@ -360,7 +363,7 @@ static int setup_pgtables_x86_64(struct
     l3_pgentry_64_t *l3tab = NULL;
     l2_pgentry_64_t *l2tab = NULL;
     l1_pgentry_64_t *l1tab = NULL;
-    uint64_t l4off, l3off, l2off, l1off;
+    uint64_t l4off, l3off = 0, l2off = 0, l1off;
     uint64_t addr;
     xen_pfn_t pgpfn;

@@ -391,8 +394,6 @@ static int setup_pgtables_x86_64(struct
         l3off = l3_table_offset_x86_64(addr);
         l3tab[l3off] =
             pfn_to_paddr(xc_dom_p2m_guest(dom, l2pfn)) | L3_PROT;
-        if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
-            l3tab = NULL;
         l2pfn++;
     }

@@ -405,8 +406,6 @@ static int setup_pgtables_x86_64(struct
         l2off = l2_table_offset_x86_64(addr);
         l2tab[l2off] =
             pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
-        if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
-            l2tab = NULL;
         l1pfn++;
     }

@@ -418,8 +417,17 @@ static int setup_pgtables_x86_64(struct
         if ( (addr >= dom->pgtables_seg.vstart) &&
              (addr < dom->pgtables_seg.vend) )
             l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
+
         if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
+        {
             l1tab = NULL;
+            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
+            {
+                l2tab = NULL;
+                if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
+                    l3tab = NULL;
+            }
+        }
     }
     return 0;
 }
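To illustrate the rule stated in the commit message, here is a minimal
standalone sketch (an editorial illustration, not the libxc code; the names
and counts are invented): the current directory is only abandoned once its
last slot has actually been filled, and a higher level rolls over only in
the same step as the level below it, which is exactly the nesting the hunks
above introduce.

/* Editorial sketch of the deferred, nested switch-over rule; not libxc code.
 * ENTRIES and the loop bound are made-up illustration values. */
#include <stdio.h>

#define ENTRIES 512u            /* slots per page directory on x86-64 */

int main(void)
{
    /* A region that does not end exactly on a directory boundary. */
    unsigned long nr_pages = 2ul * ENTRIES * ENTRIES + 3;
    unsigned long l1_rollovers = 0, l2_rollovers = 0;

    for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
        unsigned l1off = pfn % ENTRIES;
        unsigned l2off = (pfn / ENTRIES) % ENTRIES;

        /* ... the real code writes the L1 entry for this page here ... */

        /* Deferred switch-over, as in the hunks above: only move on from
         * the current L1 table after its last slot was written, and only
         * move on from the current L2 table when that coincides with the
         * last L2 slot. */
        if (l1off == ENTRIES - 1) {
            l1_rollovers++;
            if (l2off == ENTRIES - 1)
                l2_rollovers++;
        }
    }

    printf("L1 directories completed: %lu, L2 directories completed: %lu\n",
           l1_rollovers, l2_rollovers);
    return 0;
}

The counters only make the roll-over points visible; in the real code each
roll-over is where the next directory page would get mapped in.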
Jan Beulich
2013-Sep-09 08:37 UTC
Ping: [PATCH] libxc/x86: fix page table creation for huge guests
Ping?

>>> On 28.08.13 at 16:58, "Jan Beulich" <JBeulich@suse.com> wrote:
> The switch-over logic from one page directory to the next was wrong;
> it needs to be deferred until we actually reach the last page within
> a given region, instead of being done when the last entry of a page
> directory gets started with.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -251,7 +251,7 @@ static int setup_pgtables_x86_32_pae(str
>      l3_pgentry_64_t *l3tab;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    unsigned long l3off, l2off, l1off;
> +    unsigned long l3off, l2off = 0, l1off;
>      xen_vaddr_t addr;
>      xen_pfn_t pgpfn;
>      xen_pfn_t l3mfn = xc_dom_p2m_guest(dom, l3pfn);
> @@ -299,8 +299,6 @@ static int setup_pgtables_x86_32_pae(str
>          l2off = l2_table_offset_pae(addr);
>          l2tab[l2off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -        if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> -            l2tab = NULL;
>          l1pfn++;
>      }
>
> @@ -312,8 +310,13 @@ static int setup_pgtables_x86_32_pae(str
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_PAE - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_PAE - 1) )
> +                l2tab = NULL;
> +        }
>      }
>
>      if ( dom->virt_pgtab_end <= 0xc0000000 )
> @@ -360,7 +363,7 @@ static int setup_pgtables_x86_64(struct
>      l3_pgentry_64_t *l3tab = NULL;
>      l2_pgentry_64_t *l2tab = NULL;
>      l1_pgentry_64_t *l1tab = NULL;
> -    uint64_t l4off, l3off, l2off, l1off;
> +    uint64_t l4off, l3off = 0, l2off = 0, l1off;
>      uint64_t addr;
>      xen_pfn_t pgpfn;
>
> @@ -391,8 +394,6 @@ static int setup_pgtables_x86_64(struct
>          l3off = l3_table_offset_x86_64(addr);
>          l3tab[l3off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l2pfn)) | L3_PROT;
> -        if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> -            l3tab = NULL;
>          l2pfn++;
>      }
>
> @@ -405,8 +406,6 @@ static int setup_pgtables_x86_64(struct
>          l2off = l2_table_offset_x86_64(addr);
>          l2tab[l2off] =
>              pfn_to_paddr(xc_dom_p2m_guest(dom, l1pfn)) | L2_PROT;
> -        if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> -            l2tab = NULL;
>          l1pfn++;
>      }
>
> @@ -418,8 +417,17 @@ static int setup_pgtables_x86_64(struct
>          if ( (addr >= dom->pgtables_seg.vstart) &&
>               (addr < dom->pgtables_seg.vend) )
>              l1tab[l1off] &= ~_PAGE_RW; /* page tables are r/o */
> +
>          if ( l1off == (L1_PAGETABLE_ENTRIES_X86_64 - 1) )
> +        {
>              l1tab = NULL;
> +            if ( l2off == (L2_PAGETABLE_ENTRIES_X86_64 - 1) )
> +            {
> +                l2tab = NULL;
> +                if ( l3off == (L3_PAGETABLE_ENTRIES_X86_64 - 1) )
> +                    l3tab = NULL;
> +            }
> +        }
>      }
>      return 0;
>
Ian Jackson
2013-Sep-12 15:38 UTC
Re: Ping: [PATCH] libxc/x86: fix page table creation for huge guests
Jan Beulich writes ("Ping: [PATCH] libxc/x86: fix page table creation for huge guests"):
> Ping?
>
> >>> On 28.08.13 at 16:58, "Jan Beulich" <JBeulich@suse.com> wrote:
> > The switch-over logic from one page directory to the next was wrong;
> > it needs to be deferred until we actually reach the last page within
> > a given region, instead of being done when the last entry of a page
> > directory gets started with.
> >
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.