Hi,

all previous information can be found in this thread:
http://lists.xen.org/archives/html/xen-devel/2013-04/msg02772.html

I've been trying to reproduce this behaviour for the last 2 days;
crashme has been running on the Arndale board for a total of at least
20 hours. I restarted the process once in a while with the seed I saw
crashing Xen ('crashme +2000.4 666 50 2:00:00 2').

The version of crashme is 2.4, the one from the Debian Wheezy
repository. The last seed logged (logging needs an SD card write, so I
don't know when the last sync was before the crash) was 43166.

I have not been able to reproduce the crash. However, I'm quite sure I
wasn't imagining things: I really did see Xen crash with the "SGI 2
Unhandled" error when I was running crashme from dom0 userspace.

This seems like a big deal, and not being able to reproduce it is kind
of frustrating. So I was wondering if there were any ideas on how this
could have happened? When it did happen I just rebooted the board so
it was in a 'clean' state.

Maybe some speculation on a cause could help me reproduce it? A small
explanation of when exactly Xen should issue SGIs? I would really,
really like to get to the bottom of this :-)

Thanks,
Sander
On Sun, 2013-04-28 at 20:02 +0100, Sander Bogaert wrote:
> I have not been able to reproduce the crash. However I'm quite sure I
> wasn't imagining things, I really did see Xen crash with the "SGI 2
> Unhandled" error when I was running crashme from dom0 userspace.

It could be that running crashme was just incidental, and the crash
happened independently. There really ought to be no way for a guest to
directly generate a host level SGI and certainly no way for it to
generate one with a number of its choosing.

> This seems like a big deal and not being able to reproduce it is kind
> of frustrating. So I was wondering if there were any ideas on how this
> could have happened? When it did happen I just rebooted the board so
> it was in a 'clean' state.
>
> Maybe some speculations on a cause could help me reproduce it? A small
> explanation on when exactly it should issue sgi's? I would really
> really like to get to the bottom of this :-)

The xen.git hypervisor uses two SGIs, GIC_SGI_EVENT_CHECK (==0) and
GIC_SGI_DUMP_STATE (==1). Both are issued only via calls to one of
send_SGI_{mask,self,allbutself} (or their various wrappers). In practice
this means smp_send_event_check_mask() or smp_send_state_dump(). You can
verify this by looking at the call chains that lead to one of the small
number of writes to GICD[GICD_SGIR].
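To make the three send variants concrete, here is a minimal sketch of
the value each would write to GICD_SGIR. The field layout is the GICv2
one (target filter in bits [25:24], CPU target list in bits [23:16],
SGI number in bits [3:0]). The sgir_*() helper names and SGIR_* macros
are hypothetical and only illustrate the encoding; they are not the
functions in xen.git.

```c
#include <assert.h>
#include <stdint.h>

/* The two SGI numbers named above. */
enum gic_sgi {
    GIC_SGI_EVENT_CHECK = 0,
    GIC_SGI_DUMP_STATE  = 1,
};

/* GICv2 GICD_SGIR fields (macro names are illustrative). */
#define SGIR_TARGET_LIST    (0u << 24) /* CPUs listed in bits [23:16] */
#define SGIR_TARGET_OTHERS  (1u << 24) /* every CPU except the sender */
#define SGIR_TARGET_SELF    (2u << 24) /* only the sending CPU */
#define SGIR_CPU_LIST_SHIFT 16
#define SGIR_INTID_MASK     0xfu       /* SGI numbers are 0..15 */

/* Hypothetical helper: value for a send to an explicit CPU mask. */
static uint32_t sgir_mask(uint8_t cpu_mask, unsigned int sgi)
{
    assert(sgi <= SGIR_INTID_MASK);
    return SGIR_TARGET_LIST |
           ((uint32_t)cpu_mask << SGIR_CPU_LIST_SHIFT) | sgi;
}

/* Hypothetical helper: value for a send to all CPUs but the sender. */
static uint32_t sgir_allbutself(unsigned int sgi)
{
    assert(sgi <= SGIR_INTID_MASK);
    return SGIR_TARGET_OTHERS | sgi;
}

/* Hypothetical helper: value for a send to the sending CPU only. */
static uint32_t sgir_self(unsigned int sgi)
{
    assert(sgi <= SGIR_INTID_MASK);
    return SGIR_TARGET_SELF | sgi;
}
```

The SGI number occupies the low four bits of every value, so an SGI 2
can only reach the distributor if some caller passes 2 into one of
these paths.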
Julien added a new SGI in his Arndale tree to call a function on another
CPU (not sure what he called it without looking it up, it's #2 though);
this would be exercised via smp_call_function() and friends.

About the only theory I have for how you could have seen a spurious host
level SGI==2 is a partial rebuild error -- i.e. make b0rked the build
and you got the new version of smp_call_function et al. but not the new
version of do_sgi(). Unless of course Julien's tree temporarily had code
with that behaviour (i.e. added the smp_call stuff before the handler)?

TBH, there probably isn't going to be much we can do about this until we
get a repro, so I'd be tempted to ignore it, move on and hope we never
see it again.

About the only useful things we could do in case it does happen again
would be to print othercpu in the panic from do_sgi and to add asserts
to send_SGI_* to check that it is sending an SGI which we have defined
(not just one which the hardware defines, as it asserts now).

Could you whip up a patch to do those?

Ian.
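The two suggestions above could look roughly like the following sketch.
It assumes the GICv2 GICC_IAR layout (interrupt ID in bits [9:0] and,
for SGIs, the source CPU in bits [12:10]). GIC_SGI_MAX, sgi_is_defined(),
check_sgi_before_send(), the iar_*() helpers and the message format are
all hypothetical names used for illustration, not the code in xen.git.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* SGIs defined by software; GIC_SGI_MAX is a hypothetical sentinel. */
enum gic_sgi {
    GIC_SGI_EVENT_CHECK = 0,
    GIC_SGI_DUMP_STATE  = 1,
    GIC_SGI_MAX,
};

/* True only for SGIs we have defined, not all 16 the hardware allows. */
static bool sgi_is_defined(unsigned int sgi)
{
    return sgi < GIC_SGI_MAX;
}

/* What each send path could assert before writing GICD_SGIR. */
static void check_sgi_before_send(unsigned int sgi)
{
    assert(sgi_is_defined(sgi));
}

/* Interrupt ID field of a GICC_IAR value. */
static unsigned int iar_irq(uint32_t iar)
{
    return iar & 0x3ffu;
}

/* Source CPU field of a GICC_IAR value (meaningful for SGIs only). */
static unsigned int iar_source_cpu(uint32_t iar)
{
    return (iar >> 10) & 0x7u;
}

/* Build a panic message that also names the sending CPU. */
static int format_unhandled_sgi(char *buf, size_t len, uint32_t iar)
{
    return snprintf(buf, len, "Unhandled SGI %u from CPU%u",
                    iar_irq(iar), iar_source_cpu(iar));
}
```

With the sender recorded in the panic, a repeat of the crash would at
least show which CPU issued the unexpected SGI.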
On 04/29/2013 10:39 AM, Ian Campbell wrote:
> Julien added a new SGI in his Arndale tree to call a function on another
> CPU (not sure what he called it without looking it up, it's #2 though),
> this would be exercised via smp_call_function() and friends.
>
> About my only theory about how you can have seen a spurious host level
> SGI==2 is a partial rebuild error -- i.e. make b0rked the build and you
> got the new version of smp_call_function et al but not the new version
> of do_sgi(). Unless of course Julien's tree temporarily had code with
> that behaviour (i.e. added the smp_call stuff before the handler)?

All this functionality is implemented in a single commit, and I don't
see this commit in your tree (commit
5ce4118f5768c6137d58888d57972bdfdf4c9aba).

GIC_SGI_CALL_FUNCTION is sent by on_selected_cpus, which is used for:
  - halting a physical CPU
  - gdb
  - the read clocks keyhandler

-- 
Julien Grall
On 29-04-13 14:27, Julien Grall wrote:
> All this functionality is implemented in a single commit and I
> don't see this commit on your tree (commit
> 5ce4118f5768c6137d58888d57972bdfdf4c9aba).
>
> GIC_SGI_CALL_FUNCTION is called by on_selected_cpus which is used
> for:
>   - halt a physical cpu
>   - gdb
>   - read clocks keyhandler

I understand I'm using an older version. The reason I'm still using it
is that I hope to reproduce this.

I really don't think I 'b0rked' my build; it's a clean pull & build. So
if SGI 2 was sent, it wasn't because of this functionality.

I will rerun the test from time to time; maybe it pops up again.

Sander