thr3ads.net - Xen devel - [Xen on ARM] Possible unhandled SGI bug. [Apr 2013]

Sander Bogaert

2013-Apr-28 19:02 UTC

[Xen on ARM] Possible unhandled SGI bug.

Hi,

all previous information can be found in this thread:
http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html

I've been trying to reproduce this behaviour for the last 2 days,
crashme has been running on the Arndale board for a total of at least
20 hours. I restarted the process once in a while with the seed I saw
crashing Xen ( 'crashme +2000.4 666 50 2:00:00 2' ).

The version of crashme is 2.4, the one from the Debian Wheezy
repository. The last seed logged ( needs a SD card write so I don't
know when the last sync was before the crash ) was 43166.

I have not been able to reproduce the crash. However I'm quite sure I
wasn't imagining things, I really did see Xen crash with the "SGI 2
Unhandled" error when I was running crashme from dom0 userspace.

This seems like a big deal and not being able to reproduce it is kind
of frustrating. So I was wondering if there were any ideas on how this
could have happened? When it did happen I just rebooted the board so
it was in a 'clean' state.

Maybe some speculations on a cause could help me reproduce it? A small
explanation of when exactly it should issue SGIs? I would really
really like to get to the bottom of this :-)

Thanks,
Sander

Ian Campbell

2013-Apr-29 09:39 UTC

Re: [Xen on ARM] Possible unhandled SGI bug.

On Sun, 2013-04-28 at 20:02 +0100, Sander Bogaert wrote:
> Hi,
> 
> all previous information can be found in this thread:
> http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html
> 
> I've been trying to reproduce this behaviour for the last 2 days,
> crashme has been running on the Arndale board for a total of at least
> 20 hours. I restarted the process once in a while with the seed I saw
> crashing Xen ( 'crashme +2000.4 666 50 2:00:00 2' ).
> 
> The version of crashme is 2.4, the one from the Debian Wheezy
> repository. The last seed logged ( needs a SD card write so I don't
> know when the last sync was before the crash ) was 43166
> 
> I have not been able to reproduce the crash. However I'm quite sure I
> wasn't imagining things, I really did see Xen crash with the "SGI 2
> Unhandled" error when I was running crashme from dom0 userspace.

It could be that running crashme was just incidental, and the crash just
happened independently. There really ought to be no way for a guest to
directly generate a host level SGI and certainly no way for it to
generate one with a number of its choosing.

> This seems like a big deal and not being able to reproduce it is kind
> of frustrating. So I was wondering if there were any ideas on how this
> could have happened? When it did happen I just rebooted the board so
> it was in a 'clean' state.
> 
> Maybe some speculations on a cause could help me reproduce it? A small
> explanation of when exactly it should issue SGIs? I would really
> really like to get to the bottom of this :-)

The xen.git hypervisor uses two SGIs, GIC_SGI_EVENT_CHECK (==0) and
GIC_SGI_DUMP_STATE (==1). Both are issued only via calls to one of
send_SGI_{mask,self,allbutself} (or their various wrappers). In practice
this means smp_send_event_check_mask() or smp_send_state_dump(). You can
verify this by looking at the call chains that lead to one of the small
number of writes to GICD[GICD_SGIR].

Julien added a new SGI in his Arndale tree to call a function on another
CPU (not sure what he called it without looking it up, it's #2 though),
this would be exercised via smp_call_function() and friends.

My only theory for how you could have seen a spurious host-level
SGI==2 is a partial rebuild error -- i.e. make b0rked the build and you
got the new version of smp_call_function et al but not the new version
of do_sgi(). Unless of course Julien's tree temporarily had code with
that behaviour (i.e. added the smp_call stuff before the handler)?

TBH, there probably isn't going to be much we can do about this until we
get a repro, so I'd be tempted to ignore it and move on and hope we
never see it again.

About the only useful things we could do in case it does happen again
would be to print othercpu in the panic from do_sgi and to add asserts
to send_SGI_* to assert it is sending an SGI which we have defined (not
just one which the hardware defines, as it asserts now). Could you whip
up a patch to do those?

Ian.
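
A rough, standalone sketch of the send-side check suggested above, for
illustration only. It is not the xen.git code: the GIC_SGI_MAX sentinel
and the sgi_is_defined() and sgir_encode_mask() helpers are hypothetical
names, and the function returns the value that would be written to
GICD_SGIR instead of touching hardware. The field positions follow the
GICv2 layout of GICD_SGIR.

```c
#include <assert.h>
#include <stdint.h>

/* SGI numbers mentioned in the thread. The enum layout is an assumption. */
enum gic_sgi {
    GIC_SGI_EVENT_CHECK   = 0,
    GIC_SGI_DUMP_STATE    = 1,
    GIC_SGI_CALL_FUNCTION = 2,
    GIC_SGI_MAX           /* hypothetical sentinel: one past the last defined SGI */
};

/* GICv2 GICD_SGIR fields. */
#define SGIR_TARGET_LIST_SHIFT 16  /* CPUTargetList, bits [23:16] */
#define SGIR_FILTER_SHIFT      24  /* TargetListFilter, bits [25:24] */
#define SGIR_FILTER_LIST       0u  /* deliver to the CPUs in CPUTargetList */

/* True only for SGIs the software defines, not all 16 the hardware allows. */
static int sgi_is_defined(unsigned int sgi)
{
    return sgi < GIC_SGI_MAX;
}

/* Build the GICD_SGIR value for "send sgi to the CPUs in cpu_mask". */
static uint32_t sgir_encode_mask(unsigned int sgi, uint8_t cpu_mask)
{
    /* Stricter than a plain "sgi < 16" hardware-range check. */
    assert(sgi_is_defined(sgi));

    return (SGIR_FILTER_LIST << SGIR_FILTER_SHIFT)
         | ((uint32_t)cpu_mask << SGIR_TARGET_LIST_SHIFT)
         | (uint32_t)sgi;
}
```

With a check like this, a caller that tries to send an SGI number the
tree does not define would trip the assert at the sender, rather than
surfacing later as an unhandled SGI on the receiving CPU.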

Julien Grall

2013-Apr-29 12:27 UTC

Re: [Xen on ARM] Possible unhandled SGI bug.

On 04/29/2013 10:39 AM, Ian Campbell wrote:
> On Sun, 2013-04-28 at 20:02 +0100, Sander Bogaert wrote:
>> Hi,
>>
>> all previous information can be found in this thread:
>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html
>>
>> I've been trying to reproduce this behaviour for the last 2 days,
>> crashme has been running on the Arndale board for a total of at least
>> 20 hours. I restarted the process once in a while with the seed I saw
>> crashing Xen ( 'crashme +2000.4 666 50 2:00:00 2' ).
>>
>> The version of crashme is 2.4, the one from the Debian Wheezy
>> repository. The last seed logged ( needs a SD card write so I don't
>> know when the last sync was before the crash ) was 43166
>>
>> I have not been able to reproduce the crash. However I'm quite sure I
>> wasn't imagining things, I really did see Xen crash with the "SGI 2
>> Unhandled" error when I was running crashme from dom0 userspace.
> 
> It could be that running crashme was just incidental, and the crash just
> happened independently. There really ought to be no way for a guest to
> directly generate a host level SGI and certainly no way for it to
> generate one with a number of its choosing.
> 
>> This seems like a big deal and not being able to reproduce it is kind
>> of frustrating. So I was wondering if there were any ideas on how this
>> could have happened? When it did happen I just rebooted the board so
>> it was in a 'clean' state.
>>
>> Maybe some speculations on a cause could help me reproduce it? A small
>> explanation of when exactly it should issue SGIs? I would really
>> really like to get to the bottom of this :-)
> 
> The xen.git hypervisor uses two SGIs, GIC_SGI_EVENT_CHECK (==0) and
> GIC_SGI_DUMP_STATE (==1). Both are issued only via calls to one of
> send_SGI_{mask,self,allbutself} (or their various wrappers). In practice
> this means smp_send_event_check_mask() or smp_send_state_dump(). You can
> verify this by looking at the call chains that lead to one of the small
> number of writes to GICD[GICD_SGIR].
> 
> Julien added a new SGI in his Arndale tree to call a function on another
> CPU (not sure what he called it without looking it up, it's #2 though),
> this would be exercised via smp_call_function() and friends.
> 
> My only theory for how you could have seen a spurious host-level
> SGI==2 is a partial rebuild error -- i.e. make b0rked the build and you
> got the new version of smp_call_function et al but not the new version
> of do_sgi(). Unless of course Julien's tree temporarily had code with
> that behaviour (i.e. added the smp_call stuff before the handler)?

All this functionality is implemented in a single commit, and I don't
see this commit in your tree (commit 5ce4118f5768c6137d58888d57972bdfdf4c9aba).

GIC_SGI_CALL_FUNCTION is sent by on_selected_cpus, which is used for:
   - halting a physical CPU
   - gdb
   - the read clocks keyhandler
-- 
Julien Grall
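
To make the partial-rebuild hypothesis and the "print othercpu"
suggestion from earlier in the thread concrete, here is a small
standalone sketch. It is not the real do_sgi(): do_sgi_sketch(), the
nr_known_sgis parameter and the message text are hypothetical, and the
function formats a message instead of panicking. The bit layout is the
GICv2 GICC_IAR format, where an SGI carries the source CPU in bits
[12:10].

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* GICv2 GICC_IAR: bits [9:0] interrupt ID, bits [12:10] source CPU (SGIs). */
#define IAR_ID_MASK   0x3ffu
#define IAR_CPU_SHIFT 10
#define IAR_CPU_MASK  0x7u

static unsigned int iar_irq(uint32_t iar)
{
    return iar & IAR_ID_MASK;
}

static unsigned int iar_othercpu(uint32_t iar)
{
    return (iar >> IAR_CPU_SHIFT) & IAR_CPU_MASK;
}

/*
 * nr_known_sgis models which handler got compiled in: 2 for a handler
 * that only knows EVENT_CHECK and DUMP_STATE, 3 for one that also knows
 * CALL_FUNCTION. Returns 0 if the SGI is handled, -1 otherwise, in which
 * case msg describes the SGI number and the CPU that sent it.
 */
static int do_sgi_sketch(uint32_t iar, unsigned int nr_known_sgis,
                         char *msg, size_t len)
{
    unsigned int sgi = iar_irq(iar);

    assert(msg != NULL && len > 0);

    if (sgi < nr_known_sgis)
        return 0;

    snprintf(msg, len, "Unhandled SGI %u from CPU%u", sgi, iar_othercpu(iar));
    return -1;
}
```

An IAR value of 0x402 (SGI 2 sent by CPU 1) is accepted when
nr_known_sgis is 3 and rejected when it is 2. The rejected case is the
sender/handler mismatch a partial rebuild would produce, and the message
then names the sending CPU.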

Sander Bogaert

2013-Apr-29 12:36 UTC

Re: [Xen on ARM] Possible unhandled SGI bug.

On 29-04-13 14:27, Julien Grall wrote:
> On 04/29/2013 10:39 AM, Ian Campbell wrote:
> 
>> On Sun, 2013-04-28 at 20:02 +0100, Sander Bogaert wrote:
>>> Hi,
>>> 
>>> all previous information can be found in this thread: 
>>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html
>>>
>>>
>>> I've been trying to reproduce this behaviour for the last 2 days,
>>> crashme has been running on the Arndale board for a total of at
>>> least 20 hours. I restarted the process once in a while with
>>> the seed I saw crashing Xen ( 'crashme +2000.4 666 50 2:00:00 2' ).
>>> 
>>> The version of crashme is 2.4, the one from the Debian Wheezy
>>> repository. The last seed logged ( needs a SD card write so I
>>> don't know when the last sync was before the crash ) was 43166
>>> 
>>> I have not been able to reproduce the crash. However I'm quite
>>> sure I wasn't imagining things, I really did see Xen crash with
>>> the "SGI 2 Unhandled" error when I was running crashme from
>>> dom0 userspace.
>> 
>> It could be that running crashme was just incidental, and the
>> crash just happened independently. There really ought to be no
>> way for a guest to directly generate a host level SGI and
>> certainly no way for it to generate one with a number of its
>> choosing.
>> 
>>> This seems like a big deal and not being able to reproduce it
>>> is kind of frustrating. So I was wondering if there were any
>>> ideas on how this could have happened? When it did happen I
>>> just rebooted the board so it was in a 'clean' state.
>>> 
>>> Maybe some speculations on a cause could help me reproduce it?
>>> A small explanation of when exactly it should issue SGIs? I
>>> would really really like to get to the bottom of this :-)
>> 
>> The xen.git hypervisor uses two SGIs, GIC_SGI_EVENT_CHECK (==0)
>> and GIC_SGI_DUMP_STATE (==1). Both are issued only via calls to
>> one of send_SGI_{mask,self,allbutself} (or their various
>> wrappers). In practice this means smp_send_event_check_mask() or
>> smp_send_state_dump(). You can verify this by looking at the
>> call chains that lead to one of the small number of writes to
>> GICD[GICD_SGIR].
>> 
>> Julien added a new SGI in his Arndale tree to call a function on
>> another CPU (not sure what he called it without looking it up,
>> it's #2 though), this would be exercised via smp_call_function()
>> and friends.
>> 
>> My only theory for how you could have seen a spurious host-level
>> SGI==2 is a partial rebuild error -- i.e. make b0rked the
>> build and you got the new version of smp_call_function et al but
>> not the new version of do_sgi(). Unless of course Julien's tree
>> temporarily had code with that behaviour (i.e. added the smp_call
>> stuff before the handler)?
> 
> All this functionality is implemented in a single commit, and I
> don't see this commit in your tree (commit
> 5ce4118f5768c6137d58888d57972bdfdf4c9aba).
> 
> GIC_SGI_CALL_FUNCTION is sent by on_selected_cpus, which is used
> for:
>    - halting a physical CPU
>    - gdb
>    - the read clocks keyhandler
> 

I understand I'm using an older version. The reason I'm still using it
is because I hope to reproduce this. I really don't think I 'b0rked'
my build, it's a clean pull & build. So if SGI 2 was sent it wasn't
because of this functionality. I will rerun the test from time to time;
maybe it pops up again.

Sander
