Hi all,

I've had a report of a host starting to give this output:

Oct 7 19:36:37 kernel: INFO: rcu_sched self-detected stall on CPU { 2} (t=273561 jiffies g=17919 c=17918 q=9688)
Oct 7 19:36:37 kernel: sending NMI to all CPUs:
Oct 7 19:36:37 kernel: xen: vector 0x2 is not implemented

Kernel is 3.11.2.
Xen is 4.2.3.

I've seen various random reports of this, but never a follow-up on what the issue seemed to be, or a fix. The latest was on the ARM platform, but that may be a different issue.

Has anyone come across these before and found a solution?

Xen 4.2.2 seems unaffected.

Bug Report:
http://xen.crc.id.au/bugs/view.php?id=21

--
Steven Haigh
Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
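For anyone triaging similar reports, the fields in the stall line can be pulled apart with a few lines of Python. This is only a sketch, not something posted in the thread: the function name `parse_stall` and the `hz` parameter are made up for illustration, and the tick rate (CONFIG_HZ) of the affected kernel is not stated in the report, so the default of 1000 is an assumption. As far as I understand the kernel's RCU stall-warning documentation, `g` is the current grace-period number, `c` the last completed one, and `q` the number of queued RCU callbacks.

```python
import re

# Matches the "self-detected stall" line format quoted in the report above.
STALL_RE = re.compile(
    r"rcu_sched self-detected stall on CPU \{\s*(?P<cpu>\d+)\}\s*"
    r"\(t=(?P<t>\d+) jiffies g=(?P<g>\d+) c=(?P<c>\d+) q=(?P<q>\d+)\)"
)


def parse_stall(line, hz=1000):
    """Return the stall fields as integers, or None if the line does not match.

    hz is the kernel tick rate (CONFIG_HZ). It is NOT given in the report,
    so 1000 is only an assumed default; pass the real value if you know it.
    """
    match = STALL_RE.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # Convert the stall duration from jiffies to seconds under the assumed HZ.
    fields["stall_seconds"] = fields["t"] / hz
    return fields


if __name__ == "__main__":
    sample = (
        "Oct 7 19:36:37 kernel: INFO: rcu_sched self-detected stall on CPU { 2} "
        "(t=273561 jiffies g=17919 c=17918 q=9688)"
    )
    print(parse_stall(sample))
```

If the assumed tick rate is wrong the seconds figure scales accordingly, so treat it as a rough indication of how long the CPU had been stuck rather than an exact duration.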
On Tue, 2013-10-08 at 10:50 +1100, Steven Haigh wrote:
> I've had a report of a host starting to give this output:
> Oct 7 19:36:37 kernel: INFO: rcu_sched self-detected stall on CPU { 2}
> (t=273561 jiffies g=17919 c=17918 q=9688)
> Oct 7 19:36:37 kernel: sending NMI to all CPUs:
> Oct 7 19:36:37 kernel: xen: vector 0x2 is not implemented

This bit is just a symptom triggered by the initial rcu stall, which is your real problem.

That said, I thought the NMI thing was fixed recently, which might have gotten you better debugging on the rcu problem.

It's a kernel issue rather than a Xen one, BTW.

Ian.
On 10/08/2013 08:42 AM, Ian Campbell wrote:
> That said I thought the NMI thing was fixed recently, which might have
> gotten you better debugging on the rcu problem.

This went into v3.12-rc1 (commit 6efa20e).

Steven, can you try your test with a newer kernel that has this fix? With it we should be able to see where the stall is happening.

-boris
On 10/09/2013 01:31 AM, Boris Ostrovsky wrote:
> This went into v3.12-rc1 (commit 6efa20e).
>
> Steven, can you try your test with newer kernels that have this fix? With
> it we should be able to see where the stall is happening.

Sadly I can't easily rebuild this to 3.12. Is it possible to get a version / patch / commit that will work with 3.11.x?
On 10/08/2013 10:41 AM, Steven Haigh wrote:
> Sadly I can't easily rebuild this to 3.12. Is it possible to get a
> version / patch / commit that will work with 3.11.x?

I am not sure I understand what you are asking for. A backport of the patch to 3.11.x, so that you can apply it on top of your sources and build it yourself?

-boris
On 9/10/2013 2:56 AM, Boris Ostrovsky wrote:
> I am not sure I understand what you are asking for. A backport of the
> patch to 3.11.x so that you apply it on top of your sources and build it
> yourself?

Correct. I have a buildroot set up for building 3.11.x into RPMs that would take a lot of work to change to 3.12. As this is a problem with 3.11 as well (and I don't run or provide 3.12 anywhere), I'd like to test the fix on 3.11.

Eventually, it will need to be fixed in the 3.11 series as well.

--
Steven Haigh
----- netwiz@crc.id.au wrote:
> Correct. I have a buildroot set up for building 3.11.x into RPMs that
> would take a lot of work to change to 3.12. As this is a problem with
> 3.11 as well (as I don't run or provide 3.12 anywhere), I'd like to
> test the fix on 3.11.

I am attaching the patch against 3.11.4 (which is exactly the same as the one that went into 3.12, btw). I have only compile-tested it to make sure that it builds.

> Eventually, it will need to be fixed in the 3.11 series as well.

It's unlikely to go into 3.11 since it's really a new feature and not a bug fix.

-boris
On 9/10/2013 12:53 PM, Boris Ostrovsky wrote:
> I am attaching the patch to 3.11.4 (which is exactly the same as the one
> that went into 3.12 btw). I only compile-tested it to make sure that it builds.
>
> It's unlikely to go into 3.11 since it's really a new feature and not a bug fix.

Thanks, I'll test this and see what I can turn up. The problems are happening in 3.11.2 at this stage: the system becomes unresponsive with a load of 45+, and network traffic stops.

As such, it should probably be looked at as a bugfix for 3.11.

Will try to get further results and report back when I can provide more feedback.

--
Steven Haigh
----- netwiz@crc.id.au wrote:
> Thanks, I'll test this and see what I can turn up. The problems are
> happening in 3.11.2 at this stage - which causes the system to become
> unresponsive and a load of 45+. It also causes network traffic to
> stop.
>
> As such, it should probably be looked at as a bugfix for 3.11.

Just to be clear: this patch is not meant to make your hang go away. It is a way for the kernel to produce a stack trace that will (hopefully) show where the hang is. So don't get your hopes up that it will fix your problem ;-)

-boris
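As a rough aid for checking whether a given log shows the same symptom (an NMI broadcast that Xen then reports as unimplemented, so no backtrace is produced), something along these lines might help. It is a sketch only, not part of the thread: the function name `nmi_backtrace_missing` is hypothetical, and the two message strings are simply copied from the log excerpt in the original report, so they may differ on other kernel or Xen versions.

```python
def nmi_backtrace_missing(log_lines):
    """Return True if an NMI broadcast is later followed by the
    'vector 0x2 is not implemented' notice from the original report.

    Both message strings are taken verbatim from the reported log and are
    assumed, not guaranteed, to be stable across kernel versions.
    """
    saw_nmi_broadcast = False
    for line in log_lines:
        if "sending NMI to all CPUs" in line:
            saw_nmi_broadcast = True
        elif saw_nmi_broadcast and "xen: vector 0x2 is not implemented" in line:
            return True
    return False


if __name__ == "__main__":
    report = [
        "Oct 7 19:36:37 kernel: INFO: rcu_sched self-detected stall on CPU { 2} "
        "(t=273561 jiffies g=17919 c=17918 q=9688)",
        "Oct 7 19:36:37 kernel: sending NMI to all CPUs:",
        "Oct 7 19:36:37 kernel: xen: vector 0x2 is not implemented",
    ]
    print(nmi_backtrace_missing(report))
```

A True result only says the backtrace could not be delivered on that kernel; it says nothing about what caused the stall in the first place.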