thr3ads.net - Xen devel - [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present [Aug 2013]

If this information is useful, please help other people find it:
Share via:

Tomasz Wroblewski

2013-Aug-26 10:03 UTC

[PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Xen crashes on boot of xsm/flask enabled builds, if policy module is not
specified.
This seems to have worked on 4.1 at least. Can be fixed by testing whether
policy_buffer
is NULL before attempting to load from it - it''s a global which is set
to non-NULL when
policy module is detected.

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
---
 xen/xsm/flask/hooks.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index fa0589a..cfa2929 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1585,7 +1585,8 @@ static __init int flask_init(void)
     if ( register_xsm(&flask_ops) )
         panic("Flask: Unable to register with XSM.\n");
 
-    ret = security_load_policy(policy_buffer, policy_size);
+    if ( policy_buffer )
+        ret = security_load_policy(policy_buffer, policy_size);
 
     if ( flask_enforcing )
         printk("Flask:  Starting in enforcing mode.\n");
-- 
1.7.10.4

Andrew Cooper

2013-Aug-26 10:52 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 26/08/2013 11:03, Tomasz Wroblewski wrote:> Xen crashes on boot of xsm/flask enabled builds, if policy module is not
specified.
> This seems to have worked on 4.1 at least. Can be fixed by testing whether
policy_buffer
> is NULL before attempting to load from it - it''s a global which is
set to non-NULL when
> policy module is detected.
>
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
CCing Daniel De Graaf, as the maintainer of this code.

However FWIW,
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/xsm/flask/hooks.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index fa0589a..cfa2929 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>      if ( register_xsm(&flask_ops) )
>          panic("Flask: Unable to register with XSM.\n");
>  
> -    ret = security_load_policy(policy_buffer, policy_size);
> +    if ( policy_buffer )
> +        ret = security_load_policy(policy_buffer, policy_size);
>  
>      if ( flask_enforcing )
>          printk("Flask:  Starting in enforcing mode.\n");

Jan Beulich

2013-Aug-26 11:12 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

>>> On 26.08.13 at 12:03, Tomasz Wroblewski
<tomasz.wroblewski@citrix.com> wrote:
> Xen crashes on boot of xsm/flask enabled builds, if policy module is not 
> specified.
> This seems to have worked on 4.1 at least.
Looking at the code (4.1.5) I can''t see what would prevent the
same NULL pointer deref. Care to explain?
> Can be fixed by testing whether 
> policy_buffer
> is NULL before attempting to load from it - it''s a global which is
set to
> non-NULL when
> policy module is detected.
> 
> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
> ---
>  xen/xsm/flask/hooks.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index fa0589a..cfa2929 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>      if ( register_xsm(&flask_ops) )
>          panic("Flask: Unable to register with XSM.\n");
>  
> -    ret = security_load_policy(policy_buffer, policy_size);
> +    if ( policy_buffer )
> +        ret = security_load_policy(policy_buffer, policy_size);
Question is whether policy_buffer == NULL really isn''t supposed
to result in a -E... return value (as in fact flask initialization failed).

Jan

Tomasz Wroblewski

2013-Aug-26 12:24 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/26/2013 01:12 PM, Jan Beulich wrote:>>>> On 26.08.13 at 12:03, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> Xen crashes on boot of xsm/flask enabled builds, if policy module is
not
>> specified.
>> This seems to have worked on 4.1 at least.
> Looking at the code (4.1.5) I can''t see what would prevent the
> same NULL pointer deref. Care to explain?The crash doesn''t happen at the NULL pointer dereference site though, 
but a bit later, when xen tries to flush tlbs for first time I believe, 
which happens during page allocation for the initial domain structure. I 
traced it to the following ASSERT in smp.c (so yes I should add this 
particular crash likely is limited to debug builds then)

void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int 
flags)
{
     ASSERT(local_irq_is_enabled());
     ...

The actual crash message is unhelpful since it''s basically only

...
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Unknown interrupt (cr2=0000000000000000)


Either removing the assert (which is obviously bad), or checking for the 
null pointer deref as in the submitted patch seems to be fixing it. I''m
suspecting it was always broken somehow but just was hidden or had 
different side effects on 4.1 than it does now. I do lack for a good 
explanation why fiddling with null addresses breaks up this assert, though.
>> Can be fixed by testing whether
>> policy_buffer
>> is NULL before attempting to load from it - it''s a global
which is set to
>> non-NULL when
>> policy module is detected.
>>
>> Signed-off-by: Tomasz Wroblewski<tomasz.wroblewski@citrix.com>
>> ---
>>   xen/xsm/flask/hooks.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>> index fa0589a..cfa2929 100644
>> --- a/xen/xsm/flask/hooks.c
>> +++ b/xen/xsm/flask/hooks.c
>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>       if ( register_xsm(&flask_ops) )
>>           panic("Flask: Unable to register with XSM.\n");
>>
>> -    ret = security_load_policy(policy_buffer, policy_size);
>> +    if ( policy_buffer )
>> +        ret = security_load_policy(policy_buffer, policy_size);
> Question is whether policy_buffer == NULL really isn''t supposed
> to result in a -E... return value (as in fact flask initialization failed).
>
> Jan
>

Andrew Cooper

2013-Aug-26 12:41 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 26/08/2013 13:24, Tomasz Wroblewski wrote:> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 12:03, Tomasz
>>>>> Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module
is
>>> not
>>> specified.
>>> This seems to have worked on 4.1 at least.
>> Looking at the code (4.1.5) I can''t see what would prevent the
>> same NULL pointer deref. Care to explain?
> The crash doesn''t happen at the NULL pointer dereference site
though,
> but a bit later, when xen tries to flush tlbs for first time I
> believe, which happens during page allocation for the initial domain
> structure. I traced it to the following ASSERT in smp.c (so yes I
> should add this particular crash likely is limited to debug builds then)
>
> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned
> int flags)
> {
>     ASSERT(local_irq_is_enabled());
>     ...
>
> The actual crash message is unhelpful since it''s basically only
>
> ...
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Unknown interrupt (cr2=0000000000000000)
>
>
> Either removing the assert (which is obviously bad), or checking for
> the null pointer deref as in the submitted patch seems to be fixing
> it. I''m suspecting it was always broken somehow but just was
hidden or
> had different side effects on 4.1 than it does now. I do lack for a
> good explanation why fiddling with null addresses breaks up this
> assert, though.
Do you have any more information than that?  Stack trace or even a stack
dump?

I cant spot how those two would be connected.

~Andrew

Jan Beulich

2013-Aug-26 13:00 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

>>> On 26.08.13 at 14:24, Tomasz Wroblewski
<tomasz.wroblewski@citrix.com> wrote:
> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>> On 26.08.13 at 12:03, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module
is not
>>> specified.
>>> This seems to have worked on 4.1 at least.
>> Looking at the code (4.1.5) I can''t see what would prevent the
>> same NULL pointer deref. Care to explain?
> The crash doesn''t happen at the NULL pointer dereference site
though,
Then does it deref the NULL pointer, or does it not? If it does and
merely doesn''t crash because something happens to be mapped
there, that''s still a bug.
> but a bit later, when xen tries to flush tlbs for first time I believe, 
> which happens during page allocation for the initial domain structure. I 
> traced it to the following ASSERT in smp.c (so yes I should add this 
> particular crash likely is limited to debug builds then)
> 
> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int 
> flags)
> {
>      ASSERT(local_irq_is_enabled());
>      ...
> 
> The actual crash message is unhelpful since it''s basically only
> 
> ...
> (XEN) Using scheduler: SMP Credit Scheduler (credit)
> (XEN) Unknown interrupt (cr2=0000000000000000)
> 
> 
> Either removing the assert (which is obviously bad), or checking for the 
The assertion is in no way bad. It''s the too early use of the
function that is the problem here.
> null pointer deref as in the submitted patch seems to be fixing it.
I''m
> suspecting it was always broken somehow but just was hidden or had 
> different side effects on 4.1 than it does now. I do lack for a good 
> explanation why fiddling with null addresses breaks up this assert, though.
Also, you didn''t show the call trace that made things get here (yes,
you may need to construct this manually). I''m in no way convinced
that there''s a NULL pointer involved here at all - the fact that CR2
is zero doesn''t mean a page fault occurred in the first place.

Jan

Daniel De Graaf

2013-Aug-26 13:27 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/26/2013 06:52 AM, Andrew Cooper wrote:> On 26/08/2013 11:03, Tomasz Wroblewski wrote:
>> Xen crashes on boot of xsm/flask enabled builds, if policy module is
not specified.
>> This seems to have worked on 4.1 at least. Can be fixed by testing
whether policy_buffer
>> is NULL before attempting to load from it - it''s a global
which is set to non-NULL when
>> policy module is detected.
>>
>> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
>
> CCing Daniel De Graaf, as the maintainer of this code.
>
> However FWIW,
> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
>> ---
>>   xen/xsm/flask/hooks.c |    3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>> index fa0589a..cfa2929 100644
>> --- a/xen/xsm/flask/hooks.c
>> +++ b/xen/xsm/flask/hooks.c
>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>       if ( register_xsm(&flask_ops) )
>>           panic("Flask: Unable to register with XSM.\n");
>>
>> -    ret = security_load_policy(policy_buffer, policy_size);
>> +    if ( policy_buffer )
>> +        ret = security_load_policy(policy_buffer, policy_size);
>>
>>       if ( flask_enforcing )
>>           printk("Flask:  Starting in enforcing mode.\n");
>
While this change is not wrong, I also don''t see how it could fix
anything. The security_load_policy function will not dereference
policy_buffer if policy_size is zero (it''ll return -EINVAL first),
and the only location that sets policy_size (xsm_policy_init) also
sets policy_buffer. If this function is setting the NULL pointer,
it also does a dereference - and it will be visible in the printk.

Also, on 08/26/2013 07:12 AM, Jan Beulich wrote:> Question is whether policy_buffer == NULL really isn''t supposed
> to result in a -E... return value (as in fact flask initialization failed).
The return value of flask_init isn''t ever checked. A failure to
load the policy at boot is identical to no policy - waiting for a
flask_disable or "xl loadpolicy" hypercall from dom0.

-- 
Daniel De Graaf
National Security Agency

Tomasz Wroblewski

2013-Aug-26 13:32 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/26/2013 03:27 PM, Daniel De Graaf wrote:> On 08/26/2013 06:52 AM, Andrew Cooper wrote:
>> On 26/08/2013 11:03, Tomasz Wroblewski wrote:
>>> Xen crashes on boot of xsm/flask enabled builds, if policy module
is
>>> not specified.
>>> This seems to have worked on 4.1 at least. Can be fixed by testing 
>>> whether policy_buffer
>>> is NULL before attempting to load from it - it''s a global
which is
>>> set to non-NULL when
>>> policy module is detected.
>>>
>>> Signed-off-by: Tomasz Wroblewski
<tomasz.wroblewski@citrix.com>
>>
>> CCing Daniel De Graaf, as the maintainer of this code.
>>
>> However FWIW,
>> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>>> ---
>>>   xen/xsm/flask/hooks.c |    3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
>>> index fa0589a..cfa2929 100644
>>> --- a/xen/xsm/flask/hooks.c
>>> +++ b/xen/xsm/flask/hooks.c
>>> @@ -1585,7 +1585,8 @@ static __init int flask_init(void)
>>>       if ( register_xsm(&flask_ops) )
>>>           panic("Flask: Unable to register with XSM.\n");
>>>
>>> -    ret = security_load_policy(policy_buffer, policy_size);
>>> +    if ( policy_buffer )
>>> +        ret = security_load_policy(policy_buffer, policy_size);
>>>
>>>       if ( flask_enforcing )
>>>           printk("Flask:  Starting in enforcing
mode.\n");
>>
>
> While this change is not wrong, I also don''t see how it could fix
> anything. The security_load_policy function will not dereference
> policy_buffer if policy_size is zero (it''ll return -EINVAL first),
> and the only location that sets policy_size (xsm_policy_init) also
> sets policy_buffer. If this function is setting the NULL pointer,
> it also does a dereference - and it will be visible in the printk.
>Right. I''m trying to sift through security_load_policy to figure out
why
induces the error later, if policy_buffer is null, because yes you are 
right it does not seem to deref the pointer at first glance, but it 
definitely causes the mentioned error later on if it''s called. Guess 
this needs further investigation..

> Also, on 08/26/2013 07:12 AM, Jan Beulich wrote:
>> Question is whether policy_buffer == NULL really isn''t
supposed
>> to result in a -E... return value (as in fact flask initialization 
>> failed).
>
> The return value of flask_init isn''t ever checked. A failure to
> load the policy at boot is identical to no policy - waiting for a
> flask_disable or "xl loadpolicy" hypercall from dom0.
>

Tomasz Wroblewski

2013-Aug-26 13:34 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy
module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can''t see what would prevent
the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn''t happen at the NULL pointer dereference site
though,
> Then does it deref the NULL pointer, or does it not? If it does and
> merely doesn''t crash because something happens to be mapped
> there, that''s still a bug.
>
>> but a bit later, when xen tries to flush tlbs for first time I believe,
>> which happens during page allocation for the initial domain structure.
I
>> traced it to the following ASSERT in smp.c (so yes I should add this
>> particular crash likely is limited to debug builds then)
>>
>> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned
int
>> flags)
>> {
>>       ASSERT(local_irq_is_enabled());
>>       ...
>>
>> The actual crash message is unhelpful since it''s basically
only
>>
>> ...
>> (XEN) Using scheduler: SMP Credit Scheduler (credit)
>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>
>>
>> Either removing the assert (which is obviously bad), or checking for
the
> The assertion is in no way bad. It''s the too early use of the
> function that is the problem here.
Aye I meant removing the assertion is bad, not the assert. Looks like 
this needs bit more investigation to see what exact bits inside 
security_load_policy causes this, so I''ll do that.
>> null pointer deref as in the submitted patch seems to be fixing it.
I''m
>> suspecting it was always broken somehow but just was hidden or had
>> different side effects on 4.1 than it does now. I do lack for a good
>> explanation why fiddling with null addresses breaks up this assert,
though.
> Also, you didn''t show the call trace that made things get here
(yes,
> you may need to construct this manually). I''m in no way convinced
> that there''s a NULL pointer involved here at all - the fact that
CR2
> is zero doesn''t mean a page fault occurred in the first place.
Yeah will investigate more> Jan
>

Tomasz Wroblewski

2013-Aug-26 17:00 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy
module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can''t see what would prevent
the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn''t happen at the NULL pointer dereference site
though,
> Then does it deref the NULL pointer, or does it not? If it does and
> merely doesn''t crash because something happens to be mapped
> there, that''s still a bug.So after bit more tracing it looks like the issue is not a null pointer 
deref (it is avoided as Daniel pointed out), but rather some 
xmalloc/xfree pairs which happen inside security_load_policy even if 
pointer is null.

security_load_policy calls policydb_init, then tries to read the policy 
which fails since length of is 0, so it calls policydb_destroy. Inside 
these functions there is some hashtable construction/destruction 
happening, and a reasonably sizable ones too (there are 7 or so 
hashtables, biggest one uses an array of 512 pointers, they total to use 
768 pointers). I''ve verified that replacing security_load_policy call 
completely with the following allocation/deallocaiton is enough to cause 
this crash:

     //ret = security_load_policy(policy_buffer, policy_size);
     {
             void ** p = xmalloc_array(void*, 768);
             xfree(p);
     }

Note that this allocation succeeds, and also if you would not call xfree 
(which is not called if say a policy was succesfully loaded), there is 
no crash. So yeah my original patch accidentaly fixes it by just 
avoiding the alloc/free completely.

The shaky manually constructed call graph for the assertion failure:

setup.c: init_idle_domain
schedule.c: scheduler_init
domain.c: domain_create
domain.c: alloc_domain_struct
domain.c: alloc_xenheap_pages
..
page_alloc.c: alloc_heap_pages
flushtlb.h: flush_tlb_mask
flushtlb.h: flush_mask
smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here

I apparently can''t get a real stacktrace because adding 
dump_execution_state in flush_area_mask just causes the "Unknown 
interrupt" error, similarily to what hitting the ASSERT fail does. I 
printed the assert condition manually to verify it tho and interrupts 
are disabled there so its bound to fail.

So it looks like the flush_tlb is called too early (with interrupts 
disabled) due to large deallocations in the policy loader code?

On 08/26/2013 03:00 PM, Jan Beulich wrote:>>>> On 26.08.13 at 14:24, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> On 08/26/2013 01:12 PM, Jan Beulich wrote:
>>>>>> On 26.08.13 at 12:03, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>   wrote:
>>>> Xen crashes on boot of xsm/flask enabled builds, if policy
module is not
>>>> specified.
>>>> This seems to have worked on 4.1 at least.
>>> Looking at the code (4.1.5) I can''t see what would prevent
the
>>> same NULL pointer deref. Care to explain?
>> The crash doesn''t happen at the NULL pointer dereference site
though,
>> but a bit later, when xen tries to flush tlbs for first time I believe,
>> which happens during page allocation for the initial domain structure.
I
>> traced it to the following ASSERT in smp.c (so yes I should add this
>> particular crash likely is limited to debug builds then)
>>
>> void flush_area_mask(const cpumask_t *mask, const void *va, unsigned
int
>> flags)
>> {
>>       ASSERT(local_irq_is_enabled());
>>       ...
>>
>> The actual crash message is unhelpful since it''s basically
only
>>
>> ...
>> (XEN) Using scheduler: SMP Credit Scheduler (credit)
>> (XEN) Unknown interrupt (cr2=0000000000000000)
>>
>>
>> Either removing the assert (which is obviously bad), or checking for
the
> The assertion is in no way bad. It''s the too early use of the
> function that is the problem here.
>
>> null pointer deref as in the submitted patch seems to be fixing it.
I''m
>> suspecting it was always broken somehow but just was hidden or had
>> different side effects on 4.1 than it does now. I do lack for a good
>> explanation why fiddling with null addresses breaks up this assert,
though.
> Also, you didn''t show the call trace that made things get here
(yes,
> you may need to construct this manually). I''m in no way convinced
> that there''s a NULL pointer involved here at all - the fact that
CR2
> is zero doesn''t mean a page fault occurred in the first place.
>
> Jan
>

Jan Beulich

2013-Aug-27 07:13 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

>>> On 26.08.13 at 19:00, Tomasz Wroblewski
<tomasz.wroblewski@citrix.com> wrote:
> I''ve verified that replacing security_load_policy call 
> completely with the following allocation/deallocaiton is enough to cause 
> this crash:
> 
>      //ret = security_load_policy(policy_buffer, policy_size);
>      {
>              void ** p = xmalloc_array(void*, 768);
>              xfree(p);
>      }
> 
> Note that this allocation succeeds, and also if you would not call xfree 
> (which is not called if say a policy was succesfully loaded), there is 
> no crash. So yeah my original patch accidentaly fixes it by just 
> avoiding the alloc/free completely.
But I then understand, together with the below, that the crash isn''t
down the xfree() path, ...
> The shaky manually constructed call graph for the assertion failure:
> 
> setup.c: init_idle_domain
> schedule.c: scheduler_init
> domain.c: domain_create
> domain.c: alloc_domain_struct
> domain.c: alloc_xenheap_pages
> ..
> page_alloc.c: alloc_heap_pages
> flushtlb.h: flush_tlb_mask
> flushtlb.h: flush_mask
> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here
... but instead is on a _subsequent_ allocation. Hence the prior
free gets a heap page into a state that makes in non-suitable for
re-use. But you certainly noticed that free_heap_pages() sets a
page''s u.free.need_tlbflush only if the page had an owner,
which shouldn''t be the case for Xen-internal allocations.

With that, I think I can see where the bug really is: The owner
field (v.inuse._domain) is in a union with the order field re-used
by the xmalloc() implementation for whole page allocations. The
fix therefore ought to be as simple as the patch below.

Jan

--- a/xen/common/xmalloc_tlsf.c
+++ a/xen/common/xmalloc_tlsf.c
@@ -629,6 +629,7 @@ void xfree(void *p)
         unsigned int i, order = get_order_from_pages(size);
 
         BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
+        PFN_ORDER(virt_to_page(p)) = 0;
         for ( i = 0; ; ++i )
         {
             if ( !(size & (1 << i)) )

Tomasz Wroblewski

2013-Aug-27 07:23 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 08/27/2013 09:13 AM, Jan Beulich wrote:>>>> On 26.08.13 at 19:00, Tomasz
Wroblewski<tomasz.wroblewski@citrix.com>  wrote:
>> I''ve verified that replacing security_load_policy call
>> completely with the following allocation/deallocaiton is enough to
cause
>> this crash:
>>
>>       //ret = security_load_policy(policy_buffer, policy_size);
>>       {
>>               void ** p = xmalloc_array(void*, 768);
>>               xfree(p);
>>       }
>>
>> Note that this allocation succeeds, and also if you would not call
xfree
>> (which is not called if say a policy was succesfully loaded), there is
>> no crash. So yeah my original patch accidentaly fixes it by just
>> avoiding the alloc/free completely.
> But I then understand, together with the below, that the crash
isn''t
> down the xfree() path, ...
>
>> The shaky manually constructed call graph for the assertion failure:
>>
>> setup.c: init_idle_domain
>> schedule.c: scheduler_init
>> domain.c: domain_create
>> domain.c: alloc_domain_struct
>> domain.c: alloc_xenheap_pages
>> ..
>> page_alloc.c: alloc_heap_pages
>> flushtlb.h: flush_tlb_mask
>> flushtlb.h: flush_mask
>> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled
here
> ... but instead is on a _subsequent_ allocation. Hence the prior
> free gets a heap page into a state that makes in non-suitable for
> re-use. But you certainly noticed that free_heap_pages() sets a
> page''s u.free.need_tlbflush only if the page had an owner,
> which shouldn''t be the case for Xen-internal allocations.
>
> With that, I think I can see where the bug really is: The owner
> field (v.inuse._domain) is in a union with the order field re-used
> by the xmalloc() implementation for whole page allocations. The
> fix therefore ought to be as simple as the patch below.
Just tested that fix, works! Thanks> Jan
>
> --- a/xen/common/xmalloc_tlsf.c
> +++ a/xen/common/xmalloc_tlsf.c
> @@ -629,6 +629,7 @@ void xfree(void *p)
>           unsigned int i, order = get_order_from_pages(size);
>
>           BUG_ON((unsigned long)p&  ((PAGE_SIZE<<  order) - 1));
> +        PFN_ORDER(virt_to_page(p)) = 0;
>           for ( i = 0; ; ++i )
>           {
>               if ( !(size&  (1<<  i)) )
>
>

Jan Beulich

2013-Aug-27 07:47 UTC

head link

[PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()

Not doing this was found to cause problems with sequences of allocation
(multi-page), freeing, and then again allocation of the same page upon
boot when interrupts are still disabled (causing the owner field to be
non-zero, thus making the allocator attempt a TLB flush and, in its
processing, triggering an assertion).

Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>

--- a/xen/common/xmalloc_tlsf.c
+++ b/xen/common/xmalloc_tlsf.c
@@ -629,6 +629,7 @@ void xfree(void *p)
         unsigned int i, order = get_order_from_pages(size);
 
         BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
+        PFN_ORDER(virt_to_page(p)) = 0;
         for ( i = 0; ; ++i )
         {
             if ( !(size & (1 << i)) )





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Andrew Cooper

2013-Aug-27 08:50 UTC

head link

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

On 26/08/2013 18:00, Tomasz Wroblewski wrote:>
>
> The shaky manually constructed call graph for the assertion failure:
>
> setup.c: init_idle_domain
> schedule.c: scheduler_init
> domain.c: domain_create
> domain.c: alloc_domain_struct
> domain.c: alloc_xenheap_pages
> ..
> page_alloc.c: alloc_heap_pages
> flushtlb.h: flush_tlb_mask
> flushtlb.h: flush_mask
> smp.c: flush_area_mask - hits ASSERT because interrupts are disabled here
>
> I apparently can''t get a real stacktrace because adding
> dump_execution_state in flush_area_mask just causes the "Unknown
> interrupt" error, similarily to what hitting the ASSERT fail does. I
> printed the assert condition manually to verify it tho and interrupts
> are disabled there so its bound to fail.
>
Just for reference here, as I went digging in the code.
dump_execution_state() makes use of run_in_exception_handler() which
makes use of the ud2 instruction to get its hands on an exception frame,
for the purpose of dumping the state.

The "Unknown Interrupt" means that the CPU is still running on the
early
boot IDT, set up in xen/arch/x86/boot/x86_64.S which does a blanket
ignore on all interrupts, including exceptions.

We should probably see about setting up and using the arch traps earlier
in boot.  Failing that, an early_invalid_opcode() handler could at least
hint that it might have hit an early bug, and give some details.

~Andrew

Keir Fraser

2013-Sep-09 11:14 UTC

head link

Re: [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()

On 27/08/2013 08:47, "Jan Beulich" <JBeulich@suse.com> wrote:
> Not doing this was found to cause problems with sequences of allocation
> (multi-page), freeing, and then again allocation of the same page upon
> boot when interrupts are still disabled (causing the owner field to be
> non-zero, thus making the allocator attempt a TLB flush and, in its
> processing, triggering an assertion).
> 
> Reported-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Tested-by: Tomasz Wroblewski <tomasz.wroblewski@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
> --- a/xen/common/xmalloc_tlsf.c
> +++ b/xen/common/xmalloc_tlsf.c
> @@ -629,6 +629,7 @@ void xfree(void *p)
>          unsigned int i, order = get_order_from_pages(size);
>  
>          BUG_ON((unsigned long)p & ((PAGE_SIZE << order) - 1));
> +        PFN_ORDER(virt_to_page(p)) = 0;
>          for ( i = 0; ; ++i )
>          {
>              if ( !(size & (1 << i)) )
> 
> 
>

Xen devel - Aug 2013 - [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

[PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

[PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()

Re: [PATCH] Fix boot crash on xsm/flask enabled builds when no policy module is present

Re: [PATCH] xmalloc: make whole pages xfree() clear the order field (ab)used by xmalloc()