Konrad Rzeszutek Wilk
2012-Jul-31 10:42 UTC
[PATCH] extend_brk and fixes to users of extend_brk (v1).
In v3.5 I've added some patches that during boot-time add more entries to
the P2M tree:

http://lists.xen.org/archives/html/xen-devel/2012-04/msg01152.html

which worked great on all my machines. But there were some folks who
reported that it caused them at bootup to hit this:

(XEN) domain_crash_sync called from entry.S
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff818aad3b>]
(XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff81a7c000   rbx: 000000000000003d   rcx: 0000000000001000
(XEN) rdx: ffffffff81a7b000   rsi: 0000000000001000   rdi: 0000000000001000
(XEN) rbp: ffffffff81801cd8   rsp: ffffffff81801c98   r8:  0000000000100000
(XEN) r9:  ffffffff81a7a000   r10: 0000000000000001   r11: 0000000000000003
(XEN) r12: 0000000000000004   r13: 0000000000000004   r14: 000000000000003d
(XEN) r15: 00000000000001e8   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000125803000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81801c98:

The failure was that my patches depended on the __brk_base being large.
On v3.5, due to some other patches (not sure which ones), that space is
quite large (around ~4MB), while in previous kernels it is ~344kB. But
depending on the compiler to fix my mistake is not really the best way.
So patch:

[PATCH 2/3] xen/p2m: Reserve 4MB of _brk space for P2M leafs when
populating back

fixes this, while

[PATCH 1/3] xen/mmu/p2m: Check extend_brk for NULL

gets in line with the DMI code to check extend_brk for NULL, and lastly:

[PATCH 3/3] x86: Let extend_brk return a NULL pointer instead of
just BUG_ON

makes extend_brk not do BUG_ON and instead just return NULL. This allows
us to handle the bootup issues more gracefully.

Note, I've also posted another variant of this:

http://lists.xen.org/archives/html/xen-devel/2012-07/msg01401.html

which Ian pointed out is a bit fragile.

Looking forward to your comments.
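[Editorial note: the mechanism under discussion, for readers unfamiliar with it, is a boot-time bump allocator. The sketch below is a userspace illustration only: the names (`sketch_extend_brk`, `brk_area`), the 4096-byte pool size, and the NULL-on-exhaustion behaviour (which patch 3/3 proposes and hpa later NAKs) are assumptions for the example, not the kernel's actual implementation.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fixed reservation (the analogue of RESERVE_BRK/__brk_limit) handed
 * out by a bump pointer (the analogue of extend_brk). */
static unsigned char brk_area[4096];  /* stands in for __brk_base..__brk_limit */
static uintptr_t brk_end;             /* offset form of _brk_end */

static void *sketch_extend_brk(size_t size, size_t align)
{
	uintptr_t mask = align - 1;
	uintptr_t aligned;

	assert((align & mask) == 0);           /* align must be a power of two */

	aligned = (brk_end + mask) & ~mask;    /* align the cursor up */
	if (aligned + size > sizeof(brk_area))
		return NULL;                   /* exhausted: report, don't BUG */

	brk_end = aligned + size;
	return &brk_area[aligned];
}
```

Callers that reserved enough space up front never see NULL; a caller that outgrows its reservation gets NULL back instead of a crash, which is exactly the behavioural change debated later in this thread.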
Konrad Rzeszutek Wilk
2012-Jul-31 10:42 UTC
[PATCH 1/3] xen/mmu/p2m: Check extend_brk for NULL
Which allows us to be a bit smarter in case we exhaust the reserved
virtual space.

[v1: Suggested by Ian Campbell]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/enlighten.c |    2 ++
 arch/x86/xen/mmu.c       |    5 ++++-
 arch/x86/xen/p2m.c       |   35 ++++++++++++++++++++++++-----------
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 993e2a5..923d98e 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1519,6 +1519,8 @@ void __ref xen_hvm_init_shared_info(void)
 	if (!shared_info_page)
 		shared_info_page = (struct shared_info *)
 			extend_brk(PAGE_SIZE, PAGE_SIZE);
+	if (!shared_info_page)
+		return;
 	xatp.domid = DOMID_SELF;
 	xatp.idx = 0;
 	xatp.space = XENMAPSPACE_shared_info;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 993ba07..d7a2044 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1711,6 +1711,8 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 
 	level1_ident_pgt = extend_brk(sizeof(pte_t) * LEVEL1_IDENT_ENTRIES,
 				      PAGE_SIZE);
+	if (!level_ident_pgt)
+		goto out;
 
 	ident_pte = 0;
 	pfn = 0;
@@ -1750,7 +1752,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 	for (pteidx = 0; pteidx < ident_pte; pteidx += PTRS_PER_PTE)
 		set_page_prot(&level1_ident_pgt[pteidx], PAGE_KERNEL_RO);
 
-
+out:
 	set_page_prot(pmd, PAGE_KERNEL_RO);
 }
 #endif
@@ -1948,6 +1950,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	initial_kernel_pmd =
 		extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
+	BUG_ON(!initial_kernel_pmd);
 
 	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
 				  xen_start_info->nr_pt_frames * PAGE_SIZE +
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index bbfd085..1658af0 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -258,6 +258,13 @@ static void p2m_init(unsigned long *p2m)
 		p2m[i] = INVALID_P2M_ENTRY;
 }
 
+static __always_inline __init void *brk_alloc_page(void)
+{
+	void *p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	/* So early that printk does not function. */
+	BUG_ON(p == NULL);
+	return p;
+}
 /*
  * Build the parallel p2m_top_mfn and p2m_mid_mfn structures
  *
@@ -274,13 +281,13 @@ void __ref xen_build_mfn_list_list(void)
 
 	/* Pre-initialize p2m_top_mfn to be completely missing */
 	if (p2m_top_mfn == NULL) {
-		p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
+		p2m_mid_missing_mfn = brk_alloc_page();
 		p2m_mid_mfn_init(p2m_mid_missing_mfn);
 
-		p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+		p2m_top_mfn_p = brk_alloc_page();
 		p2m_top_mfn_p_init(p2m_top_mfn_p);
 
-		p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
+		p2m_top_mfn = brk_alloc_page();
 		p2m_top_mfn_init(p2m_top_mfn);
 	} else {
 		/* Reinitialise, mfn's all change after migration */
@@ -312,10 +319,10 @@ void __ref xen_build_mfn_list_list(void)
 			/*
 			 * XXX boot-time only!  We should never find
 			 * missing parts of the mfn tree after
-			 * runtime.  extend_brk() will BUG if we call
+			 * runtime.  brk_alloc_page() will BUG if we call
 			 * it too late.
 			 */
-			mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+			mid_mfn_p = brk_alloc_page();
 			p2m_mid_mfn_init(mid_mfn_p);
 
 			p2m_top_mfn_p[topidx] = mid_mfn_p;
@@ -344,16 +351,16 @@ void __init xen_build_dynamic_phys_to_machine(void)
 
 	xen_max_p2m_pfn = max_pfn;
 
-	p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	p2m_missing = brk_alloc_page();
 	p2m_init(p2m_missing);
 
-	p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	p2m_mid_missing = brk_alloc_page();
 	p2m_mid_init(p2m_mid_missing);
 
-	p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	p2m_top = brk_alloc_page();
 	p2m_top_init(p2m_top);
 
-	p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	p2m_identity = brk_alloc_page();
 	p2m_init(p2m_identity);
 
 	/*
@@ -366,7 +373,7 @@ void __init xen_build_dynamic_phys_to_machine(void)
 		unsigned mididx = p2m_mid_index(pfn);
 
 		if (p2m_top[topidx] == p2m_mid_missing) {
-			unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
+			unsigned long **mid = brk_alloc_page();
 			p2m_mid_init(mid);
 
 			p2m_top[topidx] = mid;
@@ -600,6 +607,8 @@ static bool __init early_alloc_p2m_middle(unsigned long pfn, bool check_boundary
 
 	/* Boundary cross-over for the edges: */
 	p2m = extend_brk(PAGE_SIZE, PAGE_SIZE);
+	if (!p2m)
+		return false;
 
 	p2m_init(p2m);
 
@@ -626,7 +635,8 @@ static bool __init early_alloc_p2m(unsigned long pfn)
 	mid_mfn_p = p2m_top_mfn_p[topidx];
 	if (mid == p2m_mid_missing) {
 		mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
-
+		if (!mid)
+			return false;
 		p2m_mid_init(mid);
 
 		p2m_top[topidx] = mid;
@@ -636,6 +646,8 @@ static bool __init early_alloc_p2m(unsigned long pfn)
 	/* And the save/restore P2M tables.. */
 	if (mid_mfn_p == p2m_mid_missing_mfn) {
 		mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+		if (!mid_mfn_p)
+			return false;
 		p2m_mid_mfn_init(mid_mfn_p);
 
 		p2m_top_mfn_p[topidx] = mid_mfn_p;
@@ -762,6 +774,7 @@ static void __init m2p_override_init(void)
 	m2p_overrides = extend_brk(sizeof(*m2p_overrides) * M2P_OVERRIDE_HASH,
 				   sizeof(unsigned long));
+	BUG_ON(!m2p_overrides);
 
 	for (i = 0; i < M2P_OVERRIDE_HASH; i++)
 		INIT_LIST_HEAD(&m2p_overrides[i]);
-- 
1.7.7.6
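[Editorial note: the patch above walks a three-level P2M tree (`p2m_top` → mid page → leaf page). A userspace sketch of how a pfn splits into the three indices may help follow the `topidx`/`mididx` arithmetic. The 512-entries-per-page constants assume a 64-bit build with 4KB pages; they are stated locally here, not taken from the kernel headers.]

```c
#include <assert.h>

/* With 4KB pages and 8-byte slots, each leaf and each mid page holds
 * 512 entries, so one top slot covers 512 * 512 = 262144 pfns. */
#define P2M_PER_PAGE      512UL   /* PAGE_SIZE / sizeof(unsigned long) */
#define P2M_MID_PER_PAGE  512UL   /* PAGE_SIZE / sizeof(unsigned long *) */

static unsigned long p2m_top_idx(unsigned long pfn)
{
	return pfn / (P2M_MID_PER_PAGE * P2M_PER_PAGE);
}

static unsigned long p2m_mid_idx(unsigned long pfn)
{
	return (pfn / P2M_PER_PAGE) % P2M_MID_PER_PAGE;
}

static unsigned long p2m_idx(unsigned long pfn)
{
	return pfn % P2M_PER_PAGE;
}
```

For example, pfn 0x9f800 (the start of the big released range in patch 2/3) lands at top slot 2, mid slot 252, leaf slot 0.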
Konrad Rzeszutek Wilk
2012-Jul-31 10:42 UTC
[PATCH 2/3] xen/p2m: Reserve 4MB of _brk space for P2M leafs when populating back.
When we release pages back during bootup:

Freeing  9d-100 pfn range: 99 pages freed
Freeing  9cf36-9d0d2 pfn range: 412 pages freed
Freeing  9f6bd-9f6bf pfn range: 2 pages freed
Freeing  9f714-9f7bf pfn range: 171 pages freed
Freeing  9f7e0-9f7ff pfn range: 31 pages freed
Freeing  9f800-100000 pfn range: 395264 pages freed
Released 395979 pages of unused memory

We then try to populate those pages back. In the P2M tree however the
space for those leafs must be reserved - as such we use extend_brk.
We reserve 4MB of _brk space, which means we can fit over 524288 PFNs -
which should be more than enough to cover those released pages. If we
do run over that number, the code is smart enough to stop and not
allocate any more leafs.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/xen/p2m.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 1658af0..0aa61ef 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -194,6 +194,12 @@ RESERVE_BRK(p2m_mid_mfn, PAGE_SIZE * (MAX_DOMAIN_PAGES / (P2M_PER_PAGE * P2M_MID
  * boundary violation will require three middle nodes. */
 RESERVE_BRK(p2m_mid_identity, PAGE_SIZE * 2 * 3);
 
+/* When we populate back during bootup, the amount of pages can vary. The
+ * max we have seen is 395979, but that does not mean it can't be more.
+ * So reserve enough for 524288 pages (2G of I/O and E820 holes). Note:
+ * We are smart enough to stop exhausting the _brk if we are close to
+ * exhaustion. */
+RESERVE_BRK(p2m_populated, PMD_SIZE * 2);
 
 static inline unsigned p2m_top_index(unsigned long pfn)
 {
 	BUG_ON(pfn >= MAX_P2M_PFN);
-- 
1.7.7.6
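[Editorial note: the sizing in the patch above can be checked with a small arithmetic sketch. The 2MB `PMD_SIZE` and 8-byte entry size assume an x86-64 build; the constants are restated locally for the example rather than taken from the kernel headers.]

```c
#include <assert.h>

#define SKETCH_PMD_SIZE  (2UL * 1024 * 1024)  /* PMD_SIZE on x86-64: 2MB */
#define ENTRY_SIZE       8UL                  /* sizeof(unsigned long) on 64-bit */

/* How many P2M leaf entries fit in the PMD_SIZE * 2 (= 4MB) reservation. */
static unsigned long p2m_populated_entries(void)
{
	return (SKETCH_PMD_SIZE * 2) / ENTRY_SIZE;
}
```

That gives 524288 entries, comfortably above the 395979 pages observed in the boot log above.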
Konrad Rzeszutek Wilk
2012-Jul-31 10:42 UTC
[PATCH 3/3] x86: Let extend_brk return a NULL pointer instead of just BUG_ON.
The callers of this function (dmi_alloc) and the various Xen related
ones can deal with a NULL pointer.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/kernel/setup.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16be6dc..13d6c51 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -268,9 +268,10 @@ void * __init extend_brk(size_t size, size_t align)
 	BUG_ON(_brk_start == 0);
 	BUG_ON(align & mask);
 
-	_brk_end = (_brk_end + mask) & ~mask;
-	BUG_ON((char *)(_brk_end + size) > __brk_limit);
+	if ((char *)(((_brk_end + mask) & ~mask) + size) > __brk_limit)
+		return NULL;
 
+	_brk_end = (_brk_end + mask) & ~mask;
 	ret = (void *)_brk_end;
 	_brk_end += size;
-- 
1.7.7.6
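[Editorial note: a subtle point of the patch above is ordering - the limit is checked on a temporary before `_brk_end` is advanced, so a failed allocation leaves the allocator state untouched. The sketch below reproduces that behaviour in userspace with made-up numbers; names and values are illustrative assumptions, not kernel code.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uintptr_t brk_end   = 100;  /* current end of the brk area (illustrative) */
static uintptr_t brk_limit = 128;  /* stands in for __brk_limit (illustrative) */

static void *try_extend(size_t size, size_t align)
{
	uintptr_t mask = align - 1;

	/* Check against the limit first, on a temporary... */
	if (((brk_end + mask) & ~mask) + size > brk_limit)
		return NULL;               /* ...so brk_end is untouched on failure */

	/* ...and only commit the aligned cursor once the check passes. */
	brk_end = (brk_end + mask) & ~mask;
	brk_end += size;
	return (void *)(brk_end - size);
}
```

The pre-patch code aligned `_brk_end` before the `BUG_ON`, so even a failing request moved the cursor; the reordering avoids that side effect.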
Konrad Rzeszutek Wilk
2012-Jul-31 13:48 UTC
Re: [PATCH 1/3] xen/mmu/p2m: Check extend_brk for NULL
On Tue, Jul 31, 2012 at 06:42:54AM -0400, Konrad Rzeszutek Wilk wrote:
> Which allows us to be a bit smarter in case we exhaust the reserved
> virtual space.
> 
> [v1: Suggested by Ian Campbell]
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  arch/x86/xen/enlighten.c |    2 ++
>  arch/x86/xen/mmu.c       |    5 ++++-
>  arch/x86/xen/p2m.c       |   35 ++++++++++++++++++++++++-----------
>  3 files changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 993e2a5..923d98e 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1519,6 +1519,8 @@ void __ref xen_hvm_init_shared_info(void)
>  	if (!shared_info_page)
>  		shared_info_page = (struct shared_info *)
>  			extend_brk(PAGE_SIZE, PAGE_SIZE);
> +	if (!shared_info_page)
> +		return;
>  	xatp.domid = DOMID_SELF;
>  	xatp.idx = 0;
>  	xatp.space = XENMAPSPACE_shared_info;
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 993ba07..d7a2044 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1711,6 +1711,8 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
> 
>  	level1_ident_pgt = extend_brk(sizeof(pte_t) * LEVEL1_IDENT_ENTRIES,
>  				      PAGE_SIZE);
> +	if (!level_ident_pgt)
> +		goto out;
> 

And this is what I get for posting patches in the wee mornings without
coffee.. It obviously should have been level1_ident_pgt.
H. Peter Anvin
2012-Jul-31 15:49 UTC
Re: [PATCH] extend_brk and fixes to users of extend_brk (v1).
On 07/31/2012 03:42 AM, Konrad Rzeszutek Wilk wrote:
> 
> [PATCH 3/3] x86: Let extend_brk return a NULL pointer instead of
> 
> makes extend_brk not do BUG_ON and instead just return NULL. This allows
> us to handle the bootup issues more gracefully.
> 

NAK.  The whole point of the brk allocator is that users specify the
upper limit on what they may need and stick to it. Hence it is a fatal
code bug if that is ever exceeded. We want to catch those errors, not
"handle" them.

This means you're either abusing the brk allocator to do something it
is not meant to do... which may mean you can cause a failure in *other*
code, or you have a bug in your code that you haven't fixed.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Konrad Rzeszutek Wilk
2012-Jul-31 16:15 UTC
Re: [PATCH] extend_brk and fixes to users of extend_brk (v1).
On Tue, Jul 31, 2012 at 08:49:21AM -0700, H. Peter Anvin wrote:
> On 07/31/2012 03:42 AM, Konrad Rzeszutek Wilk wrote:
> >
> > [PATCH 3/3] x86: Let extend_brk return a NULL pointer instead of
> >
> > makes extend_brk not do BUG_ON and instead just return NULL. This allows
> > us to handle the bootup issues more gracefully.
> >
> 
> NAK.  The whole point of the brk allocator is that users specify the
> upper limit on what they may need and stick to it. Hence it is a fatal
> code bug if that is ever exceeded. We want to catch those errors, not
> "handle" them.

OK, thanks for pointing that out.

> This means you're either abusing the brk allocator to do something it
> is not meant to do... which may mean you can cause a failure in *other*
> code, or you have a bug in your code that you haven't fixed.

It's the latter - and one of the patches in this patchset fixes that.
Will drop the one that alters extend_brk.

Thanks!
Jan Beulich
2012-Jul-31 16:41 UTC
Re: [Xen-devel] [PATCH] extend_brk and fixes to users of extend_brk (v1).
>>> On 31.07.12 at 18:15, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Jul 31, 2012 at 08:49:21AM -0700, H. Peter Anvin wrote:
>> This means you're either abusing the brk allocator to do something it
>> is not meant to do... which may mean you can cause a failure in *other*
>> code, or you have a bug in your code that you haven't fixed.
> 
> It's the latter - and one of the patches in this patchset fixes that.

But then you also can't assume that fitting a 2Gb MMIO hole will
suffice; I have a machine here that I can configure to have a 3Gb
hole, so I think you really need to be on the safe side and allow to
cover all the way up to 4Gb with the space you reserve.

Jan
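[Editorial note: Jan's sizing concern can be put in numbers with a short sketch. Covering holes reaching all the way to 4Gb means up to 4GB/4KB = 1048576 pfns, each needing an 8-byte P2M entry - 8MB of brk, double the 4MB reserved in patch 2/3. The 4KB page size and 8-byte entry size are assumptions for a 64-bit build, restated locally rather than taken from kernel headers.]

```c
#include <assert.h>

/* Bytes of brk space needed to hold one 8-byte P2M entry per 4KB page
 * of a hole of the given size. */
static unsigned long long brk_bytes_for_hole(unsigned long long hole_bytes)
{
	const unsigned long long page_size = 4096;  /* 4KB pages */
	const unsigned long long entry_size = 8;    /* sizeof(unsigned long), 64-bit */

	return (hole_bytes / page_size) * entry_size;
}
```

A 2Gb hole needs 4MB of entries (matching the patch's reservation); a 4Gb hole needs 8MB.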
Konrad Rzeszutek Wilk
2012-Jul-31 17:29 UTC
Re: [Xen-devel] [PATCH] extend_brk and fixes to users of extend_brk (v1).
On Tue, Jul 31, 2012 at 05:41:51PM +0100, Jan Beulich wrote:
> >>> On 31.07.12 at 18:15, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > On Tue, Jul 31, 2012 at 08:49:21AM -0700, H. Peter Anvin wrote:
> >> This means you're either abusing the brk allocator to do something it
> >> is not meant to do... which may mean you can cause a failure in *other*
> >> code, or you have a bug in your code that you haven't fixed.
> >
> > It's the latter - and one of the patches in this patchset fixes that.
> 
> But then you also can't assume that fitting a 2Gb MMIO hole will
> suffice; I have a machine here that I can configure to have a 3Gb
> hole, so I think you really need to be on the safe side and allow to
> cover all the way up to 4Gb with the space you reserve.

I have a patch to address that where the P2M leafs are re-used (the
ones that are either full of 1:1 PFNs or INVALID_P2M_ENTRY). But
somehow it's hitting a bug..

> Jan