Tomasz Wroblewski
2013-Aug-26 10:03 UTC
[PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified. This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer is NULL before attempting to load from it - it''s a global which is set to non-NULL when policy module is detected. Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> --- xen/xsm/flask/hooks.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c index fa0589a..cfa2929 100644 --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -1585,7 +1585,8 @@ static __init int flask_init(void) if ( register_xsm(&flask_ops) ) panic("Flask: Unable to register with XSM.\n"); - ret = security_load_policy(policy_buffer, policy_size); + if ( policy_buffer ) + ret = security_load_policy(policy_buffer, policy_size); if ( flask_enforcing ) printk("Flask: Starting in enforcing mode.\n"); -- 1.7.10.4
Andrew Cooper
2013-Aug-26 10:52 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 26/08/2013 11:03, Tomasz Wroblewski wrote:> Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified. > This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer > is NULL before attempting to load from it - it''s a global which is set to non-NULL when > policy module is detected. > > Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>CCing Daniel De Graaf, as the maintainer of this code. However FWIW, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>> --- > xen/xsm/flask/hooks.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c > index fa0589a..cfa2929 100644 > --- a/xen/xsm/flask/hooks.c > +++ b/xen/xsm/flask/hooks.c > @@ -1585,7 +1585,8 @@ static __init int flask_init(void) > if ( register_xsm(&flask_ops) ) > panic("Flask: Unable to register with XSM.\n"); > > - ret = security_load_policy(policy_buffer, policy_size); > + if ( policy_buffer ) > + ret = security_load_policy(policy_buffer, policy_size); > > if ( flask_enforcing ) > printk("Flask: Starting in enforcing mode.\n");
Jan Beulich
2013-Aug-26 11:12 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
>>> On 26.08.13 at 12:03, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote: > Xen crashes on boot of xsm/flask enabled builds, if policy module is not > specified. > This seems to have worked on 4.1 at least.Looking at the code (4.1.5) I can''t see what would prevent the same NULL pointer deref. Care to explain?> Can be fixed by testing whether > policy_buffer > is NULL before attempting to load from it - it''s a global which is set to > non-NULL when > policy module is detected. > > Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> > --- > xen/xsm/flask/hooks.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c > index fa0589a..cfa2929 100644 > --- a/xen/xsm/flask/hooks.c > +++ b/xen/xsm/flask/hooks.c > @@ -1585,7 +1585,8 @@ static __init int flask_init(void) > if ( register_xsm(&flask_ops) ) > panic("Flask: Unable to register with XSM.\n"); > > - ret = security_load_policy(policy_buffer, policy_size); > + if ( policy_buffer ) > + ret = security_load_policy(policy_buffer, policy_size);Question is whether policy_buffer == NULL really isn''t supposed to result in a -E... return value (as in fact flask initialization failed). Jan
Tomasz Wroblewski
2013-Aug-26 12:24 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/26/2013 01:12 PM, Jan Beulich wrote:>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >> specified. >> This seems to have worked on 4.1 at least. > Looking at the code (4.1.5) I can''t see what would prevent the > same NULL pointer deref. Care to explain?The crash doesn''t happen at the NULL pointer dereference site though, but a bit later, when xen tries to flush tlbs for first time I believe, which happens during page allocation for the initial domain structure. I traced it to the following ASSERT in smp.c (so yes I should add this particular crash likely is limited to debug builds then) void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags) { ASSERT(local_irq_is_enabled()); ... The actual crash message is unhelpful since it''s basically only ... (XEN) Using scheduler: SMP Credit Scheduler (credit) (XEN) Unknown interrupt (cr2=0000000000000000) Either removing the assert (which is obviously bad), or checking for the null pointer deref as in the submitted patch seems to be fixing it. I''m suspecting it was always broken somehow but just was hidden or had different side effects on 4.1 than it does now. I do lack for a good explanation why fiddling with null addresses breaks up this assert, though.>> Can be fixed by testing whether >> policy_buffer >> is NULL before attempting to load from it - it''s a global which is set to >> non-NULL when >> policy module is detected. >> >> Signed-off-by: Tomasz Wroblewski<tomasz.wroblewski@citrix.com> >> --- >> xen/xsm/flask/hooks.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c >> index fa0589a..cfa2929 100644 >> --- a/xen/xsm/flask/hooks.c >> +++ b/xen/xsm/flask/hooks.c >> @@ -1585,7 +1585,8 @@ static __init int flask_init(void) >> if ( register_xsm(&flask_ops) ) >> panic("Flask: Unable to register with XSM.\n"); >> >> - ret = security_load_policy(policy_buffer, policy_size); >> + if ( policy_buffer ) >> + ret = security_load_policy(policy_buffer, policy_size); > Question is whether policy_buffer == NULL really isn''t supposed > to result in a -E... return value (as in fact flask initialization failed). > > Jan >
Andrew Cooper
2013-Aug-26 12:41 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 26/08/2013 13:24, Tomasz Wroblewski wrote:> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>> On 26.08.13 at 12:03, Tomasz >>>>> Wroblewski<tomasz.wroblewski@citrix.com> wrote: >>> Xen crashes on boot of xsm/flask enabled builds, if policy module is >>> not >>> specified. >>> This seems to have worked on 4.1 at least. >> Looking at the code (4.1.5) I can''t see what would prevent the >> same NULL pointer deref. Care to explain? > The crash doesn''t happen at the NULL pointer dereference site though, > but a bit later, when xen tries to flush tlbs for first time I > believe, which happens during page allocation for the initial domain > structure. I traced it to the following ASSERT in smp.c (so yes I > should add this particular crash likely is limited to debug builds then) > > void flush_area_mask(const cpumask_t *mask, const void *va, unsigned > int flags) > { > ASSERT(local_irq_is_enabled()); > ... > > The actual crash message is unhelpful since it''s basically only > > ... > (XEN) Using scheduler: SMP Credit Scheduler (credit) > (XEN) Unknown interrupt (cr2=0000000000000000) > > > Either removing the assert (which is obviously bad), or checking for > the null pointer deref as in the submitted patch seems to be fixing > it. I''m suspecting it was always broken somehow but just was hidden or > had different side effects on 4.1 than it does now. I do lack for a > good explanation why fiddling with null addresses breaks up this > assert, though.Do you have any more information than that? Stack trace or even a stack dump? I cant spot how those two would be connected. ~Andrew
Jan Beulich
2013-Aug-26 13:00 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
>>> On 26.08.13 at 14:24, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote: > On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>> specified. >>> This seems to have worked on 4.1 at least. >> Looking at the code (4.1.5) I can''t see what would prevent the >> same NULL pointer deref. Care to explain? > The crash doesn''t happen at the NULL pointer dereference site though,Then does it deref the NULL pointer, or does it not? If it does and merely doesn''t crash because something happens to be mapped there, that''s still a bug.> but a bit later, when xen tries to flush tlbs for first time I believe, > which happens during page allocation for the initial domain structure. I > traced it to the following ASSERT in smp.c (so yes I should add this > particular crash likely is limited to debug builds then) > > void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int > flags) > { > ASSERT(local_irq_is_enabled()); > ... > > The actual crash message is unhelpful since it''s basically only > > ... > (XEN) Using scheduler: SMP Credit Scheduler (credit) > (XEN) Unknown interrupt (cr2=0000000000000000) > > > Either removing the assert (which is obviously bad), or checking for theThe assertion is in no way bad. It''s the too early use of the function that is the problem here.> null pointer deref as in the submitted patch seems to be fixing it. I''m > suspecting it was always broken somehow but just was hidden or had > different side effects on 4.1 than it does now. I do lack for a good > explanation why fiddling with null addresses breaks up this assert, though.Also, you didn''t show the call trace that made things get here (yes, you may need to construct this manually). I''m in no way convinced that there''s a NULL pointer involved here at all - the fact that CR2 is zero doesn''t mean a page fault occurred in the first place. Jan
Daniel De Graaf
2013-Aug-26 13:27 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/26/2013 06:52 AM, Andrew Cooper wrote:> On 26/08/2013 11:03, Tomasz Wroblewski wrote: >> Xen crashes on boot of xsm/flask enabled builds, if policy module is not specified. >> This seems to have worked on 4.1 at least. Can be fixed by testing whether policy_buffer >> is NULL before attempting to load from it - it''s a global which is set to non-NULL when >> policy module is detected. >> >> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> > > CCing Daniel De Graaf, as the maintainer of this code. > > However FWIW, > Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> > >> --- >> xen/xsm/flask/hooks.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c >> index fa0589a..cfa2929 100644 >> --- a/xen/xsm/flask/hooks.c >> +++ b/xen/xsm/flask/hooks.c >> @@ -1585,7 +1585,8 @@ static __init int flask_init(void) >> if ( register_xsm(&flask_ops) ) >> panic("Flask: Unable to register with XSM.\n"); >> >> - ret = security_load_policy(policy_buffer, policy_size); >> + if ( policy_buffer ) >> + ret = security_load_policy(policy_buffer, policy_size); >> >> if ( flask_enforcing ) >> printk("Flask: Starting in enforcing mode.\n"); >While this change is not wrong, I also don''t see how it could fix anything. The security_load_policy function will not dereference policy_buffer if policy_size is zero (it''ll return -EINVAL first), and the only location that sets policy_size (xsm_policy_init) also sets policy_buffer. If this function is setting the NULL pointer, it also does a dereference - and it will be visible in the printk. Also, on 08/26/2013 07:12 AM, Jan Beulich wrote:> Question is whether policy_buffer == NULL really isn''t supposed > to result in a -E... return value (as in fact flask initialization failed).The return value of flask_init isn''t ever checked. A failure to load the policy at boot is identical to no policy - waiting for a flask_disable or "xl loadpolicy" hypercall from dom0. -- Daniel De Graaf National Security Agency
Tomasz Wroblewski
2013-Aug-26 13:32 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/26/2013 03:27 PM, Daniel De Graaf wrote:> On 08/26/2013 06:52 AM, Andrew Cooper wrote: >> On 26/08/2013 11:03, Tomasz Wroblewski wrote: >>> Xen crashes on boot of xsm/flask enabled builds, if policy module is >>> not specified. >>> This seems to have worked on 4.1 at least. Can be fixed by testing >>> whether policy_buffer >>> is NULL before attempting to load from it - it''s a global which is >>> set to non-NULL when >>> policy module is detected. >>> >>> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> >> >> CCing Daniel De Graaf, as the maintainer of this code. >> >> However FWIW, >> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> >> >>> --- >>> xen/xsm/flask/hooks.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c >>> index fa0589a..cfa2929 100644 >>> --- a/xen/xsm/flask/hooks.c >>> +++ b/xen/xsm/flask/hooks.c >>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void) >>> if ( register_xsm(&flask_ops) ) >>> panic("Flask: Unable to register with XSM.\n"); >>> >>> - ret = security_load_policy(policy_buffer, policy_size); >>> + if ( policy_buffer ) >>> + ret = security_load_policy(policy_buffer, policy_size); >>> >>> if ( flask_enforcing ) >>> printk("Flask: Starting in enforcing mode.\n"); >> > > While this change is not wrong, I also don''t see how it could fix > anything. The security_load_policy function will not dereference > policy_buffer if policy_size is zero (it''ll return -EINVAL first), > and the only location that sets policy_size (xsm_policy_init) also > sets policy_buffer. If this function is setting the NULL pointer, > it also does a dereference - and it will be visible in the printk. >Right. I''m trying to sift through security_load_policy to figure out why induces the error later, if policy_buffer is null, because yes you are right it does not seem to deref the pointer at first glance, but it definitely causes the mentioned error later on if it''s called. Guess this needs further investigation..> Also, on 08/26/2013 07:12 AM, Jan Beulich wrote: >> Question is whether policy_buffer == NULL really isn''t supposed >> to result in a -E... return value (as in fact flask initialization >> failed). > > The return value of flask_init isn''t ever checked. A failure to > load the policy at boot is identical to no policy - waiting for a > flask_disable or "xl loadpolicy" hypercall from dom0. >
Tomasz Wroblewski
2013-Aug-26 13:34 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>>> specified. >>>> This seems to have worked on 4.1 at least. >>> Looking at the code (4.1.5) I can''t see what would prevent the >>> same NULL pointer deref. Care to explain? >> The crash doesn''t happen at the NULL pointer dereference site though, > Then does it deref the NULL pointer, or does it not? If it does and > merely doesn''t crash because something happens to be mapped > there, that''s still a bug. > >> but a bit later, when xen tries to flush tlbs for first time I believe, >> which happens during page allocation for the initial domain structure. I >> traced it to the following ASSERT in smp.c (so yes I should add this >> particular crash likely is limited to debug builds then) >> >> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int >> flags) >> { >> ASSERT(local_irq_is_enabled()); >> ... >> >> The actual crash message is unhelpful since it''s basically only >> >> ... >> (XEN) Using scheduler: SMP Credit Scheduler (credit) >> (XEN) Unknown interrupt (cr2=0000000000000000) >> >> >> Either removing the assert (which is obviously bad), or checking for the > The assertion is in no way bad. It''s the too early use of the > function that is the problem here.Aye I meant removing the assertion is bad, not the assert. Looks like this needs bit more investigation to see what exact bits inside security_load_policy causes this, so I''ll do that.>> null pointer deref as in the submitted patch seems to be fixing it. I''m >> suspecting it was always broken somehow but just was hidden or had >> different side effects on 4.1 than it does now. I do lack for a good >> explanation why fiddling with null addresses breaks up this assert, though. > Also, you didn''t show the call trace that made things get here (yes, > you may need to construct this manually). I''m in no way convinced > that there''s a NULL pointer involved here at all - the fact that CR2 > is zero doesn''t mean a page fault occurred in the first place.Yeah will investigate more> Jan >
Tomasz Wroblewski
2013-Aug-26 17:00 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>>> specified. >>>> This seems to have worked on 4.1 at least. >>> Looking at the code (4.1.5) I can''t see what would prevent the >>> same NULL pointer deref. Care to explain? >> The crash doesn''t happen at the NULL pointer dereference site though, > Then does it deref the NULL pointer, or does it not? If it does and > merely doesn''t crash because something happens to be mapped > there, that''s still a bug.So after bit more tracing it looks like the issue is not a null pointer deref (it is avoided as Daniel pointed out), but rather some xmalloc/xfree pairs which happen inside security_load_policy even if pointer is null. security_load_policy calls policydb_init, then tries to read the policy which fails since length of is 0, so it calls policydb_destroy. Inside these functions there is some hashtable construction/destruction happening, and a reasonably sizable ones too (there are 7 or so hashtables, biggest one uses an array of 512 pointers, they total to use 768 pointers). I''ve verified that replacing security_load_policy call completely with the following allocation/deallocaiton is enough to cause this crash: //ret = security_load_policy(policy_buffer, policy_size); { void ** p = xmalloc_array(void*, 768); xfree(p); } Note that this allocation succeeds, and also if you would not call xfree (which is not called if say a policy was succesfully loaded), there is no crash. So yeah my original patch accidentaly fixes it by just avoiding the alloc/free completely. The shaky manually constructed call graph for the assertion failure: setup.c: init_idle_domain schedule.c: scheduler_init domain.c: domain_create domain.c: alloc_domain_struct domain.c: alloc_xenheap_pages .. page_alloc.c: alloc_heap_pages flushtlb.h: flush_tlb_mask flushtlb.h: flush_mask smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here I apparently can''t get a real stacktrace because adding dump_execution_state in flush_area_mask just causes the "Unknown interrupt" error, similarily to what hitting the ASSERT fail does. I printed the assert condition manually to verify it tho and interrupts are disabled there so its bound to fail. So it looks like the flush_tlb is called too early (with interrupts disabled) due to large deallocations in the policy loader code? On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >> On 08/26/2013 01:12 PM, Jan Beulich wrote: >>>>>> On 26.08.13 at 12:03, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >>>> Xen crashes on boot of xsm/flask enabled builds, if policy module is not >>>> specified. >>>> This seems to have worked on 4.1 at least. >>> Looking at the code (4.1.5) I can''t see what would prevent the >>> same NULL pointer deref. Care to explain? >> The crash doesn''t happen at the NULL pointer dereference site though,>> but a bit later, when xen tries to flush tlbs for first time I believe, >> which happens during page allocation for the initial domain structure. I >> traced it to the following ASSERT in smp.c (so yes I should add this >> particular crash likely is limited to debug builds then) >> >> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int >> flags) >> { >> ASSERT(local_irq_is_enabled()); >> ... >> >> The actual crash message is unhelpful since it''s basically only >> >> ... >> (XEN) Using scheduler: SMP Credit Scheduler (credit) >> (XEN) Unknown interrupt (cr2=0000000000000000) >> >> >> Either removing the assert (which is obviously bad), or checking for the > The assertion is in no way bad. It''s the too early use of the > function that is the problem here. > >> null pointer deref as in the submitted patch seems to be fixing it. I''m >> suspecting it was always broken somehow but just was hidden or had >> different side effects on 4.1 than it does now. I do lack for a good >> explanation why fiddling with null addresses breaks up this assert, though. > Also, you didn''t show the call trace that made things get here (yes, > you may need to construct this manually). I''m in no way convinced > that there''s a NULL pointer involved here at all - the fact that CR2 > is zero doesn''t mean a page fault occurred in the first place. > > Jan >
Jan Beulich
2013-Aug-27 07:13 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
>>> On 26.08.13 at 19:00, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote: > I''ve verified that replacing security_load_policy call > completely with the following allocation/deallocaiton is enough to cause > this crash: > > //ret = security_load_policy(policy_buffer, policy_size); > { > void ** p = xmalloc_array(void*, 768); > xfree(p); > } > > Note that this allocation succeeds, and also if you would not call xfree > (which is not called if say a policy was succesfully loaded), there is > no crash. So yeah my original patch accidentaly fixes it by just > avoiding the alloc/free completely.But I then understand, together with the below, that the crash isn''t down the xfree() path, ...> The shaky manually constructed call graph for the assertion failure: > > setup.c: init_idle_domain > schedule.c: scheduler_init > domain.c: domain_create > domain.c: alloc_domain_struct > domain.c: alloc_xenheap_pages > .. > page_alloc.c: alloc_heap_pages > flushtlb.h: flush_tlb_mask > flushtlb.h: flush_mask > smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here... but instead is on a _subsequent_ allocation. Hence the prior free gets a heap page into a state that makes in non-suitable for re-use. But you certainly noticed that free_heap_pages() sets a page''s u.free.need_tlbflush only if the page had an owner, which shouldn''t be the case for Xen-internal allocations. With that, I think I can see where the bug really is: The owner field (v.inuse._domain) is in a union with the order field re-used by the xmalloc() implementation for whole page allocations. The fix therefore ought to be as simple as the patch below. Jan --- a/xen/common/xmalloc_tlsf.c +++ a/xen/common/xmalloc_tlsf.c @@ -629,6 +629,7 @@ void xfree(void *p) unsigned int i, order = get_order_from_pages(size); BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1)); + PFN_ORDER(virt_to_page(p)) = 0; for ( i = 0; ; ++i ) { if ( !(size & (1 << i)) )
Tomasz Wroblewski
2013-Aug-27 07:23 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 08/27/2013 09:13 AM, Jan Beulich wrote:>>>> On 26.08.13 at 19:00, Tomasz Wroblewski<tomasz.wroblewski@citrix.com> wrote: >> I''ve verified that replacing security_load_policy call >> completely with the following allocation/deallocaiton is enough to cause >> this crash: >> >> //ret = security_load_policy(policy_buffer, policy_size); >> { >> void ** p = xmalloc_array(void*, 768); >> xfree(p); >> } >> >> Note that this allocation succeeds, and also if you would not call xfree >> (which is not called if say a policy was succesfully loaded), there is >> no crash. So yeah my original patch accidentaly fixes it by just >> avoiding the alloc/free completely. > But I then understand, together with the below, that the crash isn''t > down the xfree() path, ... > >> The shaky manually constructed call graph for the assertion failure: >> >> setup.c: init_idle_domain >> schedule.c: scheduler_init >> domain.c: domain_create >> domain.c: alloc_domain_struct >> domain.c: alloc_xenheap_pages >> .. >> page_alloc.c: alloc_heap_pages >> flushtlb.h: flush_tlb_mask >> flushtlb.h: flush_mask >> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here > ... but instead is on a _subsequent_ allocation. Hence the prior > free gets a heap page into a state that makes in non-suitable for > re-use. But you certainly noticed that free_heap_pages() sets a > page''s u.free.need_tlbflush only if the page had an owner, > which shouldn''t be the case for Xen-internal allocations. > > With that, I think I can see where the bug really is: The owner > field (v.inuse._domain) is in a union with the order field re-used > by the xmalloc() implementation for whole page allocations. The > fix therefore ought to be as simple as the patch below.Just tested that fix, works! Thanks> Jan > > --- a/xen/common/xmalloc_tlsf.c > +++ a/xen/common/xmalloc_tlsf.c > @@ -629,6 +629,7 @@ void xfree(void *p) > unsigned int i, order = get_order_from_pages(size); > > BUG_ON((unsigned long)p& ((PAGE_SIZE<< order) - 1)); > + PFN_ORDER(virt_to_page(p)) = 0; > for ( i = 0; ; ++i ) > { > if ( !(size& (1<< i)) ) > >
Jan Beulich
2013-Aug-27 07:47 UTC
[PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()
Not doing this was found to cause problems with sequences of allocation (multi-page), freeing, and then again allocation of the same page upon boot when interrupts are still disabled (causing the owner field to be non-zero, thus making the allocator attempt a TLB flush and, in its processing, triggering an assertion). Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> --- a/xen/common/xmalloc_tlsf.c +++ b/xen/common/xmalloc_tlsf.c @@ -629,6 +629,7 @@ void xfree(void *p) unsigned int i, order = get_order_from_pages(size); BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1)); + PFN_ORDER(virt_to_page(p)) = 0; for ( i = 0; ; ++i ) { if ( !(size & (1 << i)) ) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Andrew Cooper
2013-Aug-27 08:50 UTC
Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present
On 26/08/2013 18:00, Tomasz Wroblewski wrote:> > > The shaky manually constructed call graph for the assertion failure: > > setup.c: init_idle_domain > schedule.c: scheduler_init > domain.c: domain_create > domain.c: alloc_domain_struct > domain.c: alloc_xenheap_pages > .. > page_alloc.c: alloc_heap_pages > flushtlb.h: flush_tlb_mask > flushtlb.h: flush_mask > smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here > > I apparently can''t get a real stacktrace because adding > dump_execution_state in flush_area_mask just causes the "Unknown > interrupt" error, similarily to what hitting the ASSERT fail does. I > printed the assert condition manually to verify it tho and interrupts > are disabled there so its bound to fail. >Just for reference here, as I went digging in the code. dump_execution_state() makes use of run_in_exception_handler() which makes use of the ud2 instruction to get its hands on an exception frame, for the purpose of dumping the state. The "Unknown Interrupt" means that the CPU is still running on the early boot IDT, set up in xen/arch/x86/boot/x86_64.S which does a blanket ignore on all interrupts, including exceptions. We should probably see about setting up and using the arch traps earlier in boot. Failing that, an early_invalid_opcode() handler could at least hint that it might have hit an early bug, and give some details. ~Andrew
Keir Fraser
2013-Sep-09 11:14 UTC
Re: [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()
On 27/08/2013 08:47, "Jan Beulich" <JBeulich@suse.com> wrote:> Not doing this was found to cause problems with sequences of allocation > (multi-page), freeing, and then again allocation of the same page upon > boot when interrupts are still disabled (causing the owner field to be > non-zero, thus making the allocator attempt a TLB flush and, in its > processing, triggering an assertion). > > Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com> > Signed-off-by: Jan Beulich <jbeulich@suse.com> > Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>Acked-by: Keir Fraser <keir@xen.org>> --- a/xen/common/xmalloc_tlsf.c > +++ b/xen/common/xmalloc_tlsf.c > @@ -629,6 +629,7 @@ void xfree(void *p) > unsigned int i, order = get_order_from_pages(size); > > BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1)); > + PFN_ORDER(virt_to_page(p)) = 0; > for ( i = 0; ; ++i ) > { > if ( !(size & (1 << i)) ) > > >