David Vrabel
2013-Nov-06 14:49 UTC
[PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
The series (for Xen 4.4) improves the kexec hypercall by making Xen
responsible for loading and relocating the image. This allows kexec
to be usable by pv-ops kernels and should allow kexec to be usable
from an HVM or PVH privileged domain.

I have now tested this with a Linux kernel image using the VGA console,
which was what was causing problems in v9 (this turned out to be a
kexec-tools bug).

The required patch series for kexec-tools will be posted shortly and
is available from the xen-v7 branch of:

http://xenbits.xen.org/gitweb/?p=people/dvrabel/kexec-tools.git;a=summary

Changes in v10:

- Document host state on exec.
- Fix kimage_alloc() error path (double free, crash on zero kimage->head).
- Check for segment before expanding it in load_v1.
- Move kexec_lock define into kexec_swap_images().

Changes in v9:

- Update comments to correctly say 4.4.
- Minor updates to the kexec_reloc assembly to improve maintainability
  a bit.

Changes in v8:

- Use #defines for compat ABI structures.
- Tweak link time check for kexec_reloc.

Changes in v7:

- No longer use GUEST_HANDLE_64(); get a uniform ABI by using unions
  and explicit padding.
- Only map the segments and not all of RAM.
- Add a mechanism to create mappings for use by the exec'd image (a
  segment with a NULL buf handle).
- Fix a bug where a crash image's code page would be placed at machine
  address 0 (instead of inside the crash region).

Changes in v6:

- Fix double free in KEXEC_load_v1 failure path.
- Only copy the relocation code and not the whole page.
- Add myself as the kexec maintainer.

Changes in v5 (not posted to the list):

- _rsvd -> _pad in one of the public ABI structures.
- Fix bug where trailing pages were not zeroed. This fixes loading a
  64-bit Linux kernel using a more recent version of kexec-tools.
- Check that the relocation code fits into a page at link time.

Changes in v4:

- Use paddr_t and page_to_maddr() etc. for portability.
- Add explicit padding to hypercall structures where required.
- Minor cleanup of the kexec_reloc assembly.
- Print a message before exec'ing a crash image.
- Style fixes (tabs, trailing whitespace) and typos.
- Fix a bug where using the V1 interface and unloading an image may crash.

Changes in v3:

- Provide old struct xen_kexec_load if __XEN_INTERFACE_VERSION__ < 4.3.
- Adjust new struct xen_kexec_load to avoid unnecessary padding.
- Use domheap pages for the image and control pages.
- Remove the DBG() macros from the reloc code.

David
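The v7 changelog entry about a NULL buf handle describes roughly the
following shape of load segment (a sketch only; the field names here are
illustrative and not the exact public ABI from the series):

    /* A load segment whose buffer handle is NULL asks Xen to create a
     * mapping for use by the exec'd image without copying data into it. */
    struct kexec_segment_sketch {
        void     *buf;         /* NULL => mapping only, nothing to copy */
        uint64_t  buf_size;    /* 0 when buf is NULL */
        uint64_t  dest_maddr;  /* machine address of the destination */
        uint64_t  dest_size;   /* size of the destination region */
    };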
Daniel Kiper
2013-Nov-07 21:16 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image. This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from an HVM or PVH privileged domain.
>
> I have now tested this with a Linux kernel image using the VGA console,
> which was what was causing problems in v9 (this turned out to be a
> kexec-tools bug).
>
> The required patch series for kexec-tools will be posted shortly and
> is available from the xen-v7 branch of:

In general it works. However, quite often I am not able to execute the
panic kernel. The machine hangs with the following message:

(XEN) Domain 0 crashed: Executing crash image

gdb shows:

(gdb) bt
#0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
#1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
#2  0x0000000000000000 in ?? ()
(gdb)

Especially the second bt line scares me... ;-)))

I have not been able to identify why the NMI was activated because the
stack is completely cleared. I tried to record execution in gdb but it
stops with the following message:

cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
    at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
108             clear_bit(cpumask_check(cpu), dstp->bits);
Process record: failed to record execution log.

Do you know how to find out why the NMI was activated?

I am almost always able to reproduce this issue by doing this:
- boot Xen,
- load the panic kernel,
- echo c > /proc/sysrq-trigger,
- reboot from the command line,
- boot Xen,
- load the panic kernel,
- echo c > /proc/sysrq-trigger.

Additionally, my compiler fails because it detects an unused result
variable in xen/common/kimage.c:kimage_crash_alloc().

Daniel
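Daniel's reproduction steps correspond to a dom0 command sequence along
these lines (a sketch; the kernel/initrd paths and append string are
assumptions, not taken from the thread):

    # Load a panic (crash) kernel with kexec-tools:
    kexec -p /boot/vmlinuz-crash --initrd=/boot/initrd-crash.img \
        --append="..."

    # Trigger a dom0 crash via magic sysrq, which should exec the
    # loaded crash image:
    echo c > /proc/sysrq-trigger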
Andrew Cooper
2013-Nov-07 21:25 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 07/11/13 21:16, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>> responsible for loading and relocating the image. This allows kexec
>> to be usable by pv-ops kernels and should allow kexec to be usable
>> from an HVM or PVH privileged domain.
>>
>> [...]
>
> In general it works. However, quite often I am not able to execute the
> panic kernel. The machine hangs with the following message:
>
> (XEN) Domain 0 crashed: Executing crash image
>
> gdb shows:
>
> (gdb) bt
> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> #2  0x0000000000000000 in ?? ()
> (gdb)
>
> Especially the second bt line scares me... ;-)))

Why? This is completely normal. If you look in crash.c at that line, it
is a for (;;) halt(); loop.

How are you hooking gdb up?

> I have not been able to identify why the NMI was activated because the
> stack is completely cleared. I tried to record execution in gdb but it
> stops with the following message:

NMIs are used for cpu shootdown of the non-crashing cpus. Again, this
is not touched by the series.

~Andrew

> cpumask_clear_cpu (dstp=0xffff82d0802f7f78 <call_data+24>, cpu=0)
>     at /srv/dev/xen/xen_20130413_20131107.kexec/xen/include/xen/cpumask.h:108
> 108             clear_bit(cpumask_check(cpu), dstp->bits);
> Process record: failed to record execution log.
>
> Do you know how to find out why the NMI was activated?
>
> I am almost always able to reproduce this issue by doing this:
> - boot Xen,
> - load the panic kernel,
> - echo c > /proc/sysrq-trigger,
> - reboot from the command line,
> - boot Xen,
> - load the panic kernel,
> - echo c > /proc/sysrq-trigger.
>
> Additionally, my compiler fails because it detects an unused result
> variable in xen/common/kimage.c:kimage_crash_alloc().
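The handler Andrew refers to has essentially the following shape (a
minimal sketch based on his description, not the exact Xen source;
halt() stands for Xen's wrapper around hlt):

    /* Non-crashing CPUs are sent an NMI and park here until the
     * machine is reset by the crash image, which is why frame #0 of
     * the backtrace above sits in do_nmi_crash() indefinitely. */
    void do_nmi_crash(struct cpu_user_regs *regs)
    {
        /* ... per-CPU crash state is saved first ... */
        for ( ;; )
            halt();
    }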
Daniel Kiper
2013-Nov-07 21:41 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Thu, Nov 07, 2013 at 09:25:33PM +0000, Andrew Cooper wrote:
> On 07/11/13 21:16, Daniel Kiper wrote:
>> [...]
>>
>> (gdb) bt
>> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
>> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>> #2  0x0000000000000000 in ?? ()
>> (gdb)
>>
>> Especially the second bt line scares me... ;-)))
>
> Why? This is completely normal. If you look in crash.c at that line, it
> is a for (;;) halt(); loop.

I thought more about this:

#1  0xffff82d0802281d9 in nmi_crash () at entry.S:666

Look at the end of this line... ;-)))

> How are you hooking gdb up?

I am doing tests in QEMU and using QEMU's -gdb option.

>> I have not been able to identify why the NMI was activated because the
>> stack is completely cleared. I tried to record execution in gdb but it
>> stops with the following message:
>
> NMIs are used for cpu shootdown of the non-crashing cpus. Again, this
> is not touched by the series.

Ahh... It makes sense. However, why does the machine hang at this stage?
Hmmm... Does the CPU sending the NMIs receive one itself and, instead of
ignoring it, halt itself?

Daniel
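For readers unfamiliar with the setup Daniel describes, attaching gdb to
a Xen instance running under QEMU looks roughly like this (the remaining
QEMU arguments and the symbol-file path are placeholders):

    # Start QEMU with a gdb stub listening on TCP port 1234:
    qemu-system-x86_64 -gdb tcp::1234 ...

    # In another terminal, point gdb at the hypervisor's debug
    # symbols and attach to the stub:
    gdb xen-syms
    (gdb) target remote :1234
    (gdb) bt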
Andrew Cooper
2013-Nov-07 21:57 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 07/11/2013 21:41, Daniel Kiper wrote:
> On Thu, Nov 07, 2013 at 09:25:33PM +0000, Andrew Cooper wrote:
>> [...]
>>
>> Why? This is completely normal. If you look in crash.c at that line, it
>> is a for (;;) halt(); loop.
>
> I thought more about this:
>
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>
> Look at the end of this line... ;-)))

Which line, and what about it? In current master, that is a SAVE_ALL,
but as the call to do_nmi_crash has happened, I presume
0xffff82d0802281d9 is a ud2 instruction in your tree?

>> How are you hooking gdb up?
>
> I am doing tests in QEMU and using QEMU's -gdb option.
>
>>> I have not been able to identify why the NMI was activated because the
>>> stack is completely cleared. I tried to record execution in gdb but it
>>> stops with the following message:
>>
>> NMIs are used for cpu shootdown of the non-crashing cpus. Again, this
>> is not touched by the series.
>
> Ahh... It makes sense. However, why does the machine hang at this stage?
> Hmmm... Does the CPU sending the NMIs receive one itself and, instead of
> ignoring it, halt itself?
>
> Daniel

No - there is very clear protection against racing down the crash path.
The crashing CPU forces all other cpus into nmi_crash(), where they will
stay until reset. It is the one cpu which is not executing nmi_crash()
which will end up executing the crash image.

~Andrew
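The "one CPU wins, the rest park" protocol Andrew describes can be
illustrated with a small self-contained program (the names and structure
are invented for the demonstration; the real logic lives in Xen's crash
path, not here):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int crashing_cpu = -1;

    static void *cpu(void *arg)
    {
        int id = (int)(long)arg;
        int expected = -1;

        /* Only the first "CPU" to reach the crash path claims it;
         * every other one is parked, as in nmi_crash(). */
        if (atomic_compare_exchange_strong(&crashing_cpu, &expected, id))
            printf("cpu%d: executing crash image\n", id);
        else
            printf("cpu%d: parked until reset\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, cpu, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Build with -pthread; exactly one thread prints "executing crash image",
no matter how many race down the path concurrently.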
David Vrabel
2013-Nov-08 13:13 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
Keir,

Sorry, I forgot to CC you on this series.

Can we have your opinion on whether this kexec series can be merged?
And if not, what further work and/or testing is required?

On 07/11/13 21:16, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>> The series (for Xen 4.4) improves the kexec hypercall by making Xen
>> responsible for loading and relocating the image. This allows kexec
>> to be usable by pv-ops kernels and should allow kexec to be usable
>> from an HVM or PVH privileged domain.
>>
>> [...]
>
> In general it works. However, quite often I am not able to execute the
> panic kernel. The machine hangs with the following message:

I cannot reproduce any failures, either on my dev box or on any of the
automated XenServer tests that run on a range of different hardware
platforms. I find kexec to be very reliable, and an earlier version of
this series has been in production within XenServer for a while now and
has seen real use in the field.

None of the issues reported so far have been regressions, but rather
failures in specific uses of the new support for pv-ops kernels.

I really can't see what else I can do to make this series acceptable
for merging. In my opinion, the current implementation is so broken[1]
and useless[2] that anything that even vaguely looks like it might work
is a significant improvement, and something that is deployed usefully
in production should definitely be merged.

[1] Uses code provided by the guest to jump out of Xen into the image,
which works only through luck. Does not (and has never) worked reliably
with a 32-bit dom0.

[2] Does not work at all (and will never work) with upstream kernels.

> (XEN) Domain 0 crashed: Executing crash image
>
> gdb shows:
>
> (gdb) bt
> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
> #2  0x0000000000000000 in ?? ()
> (gdb)
>
> Especially the second bt line scares me... ;-)))
>
> I have not been able to identify why the NMI was activated because the
> stack is completely cleared.

All you have described here is correct and expected behavior, which,
quite frankly, you should have been able to see with even the most
cursory look at the code.

> Additionally, my compiler fails because it detects an unused result
> variable in xen/common/kimage.c:kimage_crash_alloc().

Yes, sorry about that. That was fallout from a last minute trivial
cleanup. I've posted an updated patch correcting this.

David
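The build failure David acknowledges is the usual set-but-unused
pattern, fatal when building with -Werror as the hypervisor does; a
made-up illustration of the class of problem (not the real
kimage_crash_alloc() code):

    #include <stdio.h>

    static int allocate(void) { return 42; }

    int demo(void)
    {
        int result;            /* set but never read: with -Werror,
                                * -Wunused-but-set-variable stops the
                                * build here */
        result = allocate();
        return 0;              /* fix: return result, or drop it */
    }

    int main(void)
    {
        printf("%d\n", demo());
        return 0;
    }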
Jan Beulich
2013-Nov-08 13:19 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
> Keir,
>
> Sorry, I forgot to CC you on this series.
>
> Can we have your opinion on whether this kexec series can be merged?
> And if not, what further work and/or testing is required?

Just to clarify - unless I missed something, there was still no review
of this from Daniel or someone else known to be familiar with the
subject. If Keir gave his ack, formally this could go in, but I wouldn't
feel too well with that (the more so given that, apart from not having
reviewed it, Daniel seems to continue to have problems with it).

Jan
David Vrabel
2013-Nov-08 13:20 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 07/11/13 21:41, Daniel Kiper wrote:
> I am doing tests in QEMU and using QEMU's -gdb option.

Er. I'm not sure this is a very interesting real world use case. I
would suggest the failure here is more likely to be bugs in qemu's
emulation.

David
Daniel Kiper
2013-Nov-08 13:48 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:
> Keir,
>
> Sorry, I forgot to CC you on this series.
>
> Can we have your opinion on whether this kexec series can be merged?
> And if not, what further work and/or testing is required?
>
> On 07/11/13 21:16, Daniel Kiper wrote:
>> [...]
>>
>> In general it works. However, quite often I am not able to execute the
>> panic kernel. The machine hangs with the following message:
>
> I cannot reproduce any failures, either on my dev box or on any of the
> automated XenServer tests that run on a range of different hardware
> platforms. I find kexec to be very reliable, and an earlier version of
> this series has been in production within XenServer for a while now and
> has seen real use in the field.
>
> None of the issues reported so far have been regressions, but rather
> failures in specific uses of the new support for pv-ops kernels.
>
> I really can't see what else I can do to make this series acceptable
> for merging.

I think that in general it is OK. However, we must solve the discovered
issues or confirm that they are not a problem of the current
implementation. That is all. I hope that we can finally do that next
week (FYI, Monday is a public holiday in Poland).

Additionally, we agreed that shortly after applying this patch series we
would decide whether registers should be cleared before jumping into the
new image or not. I think that this will be done quickly too.

Daniel
Andrew Cooper
2013-Nov-08 14:01 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 08/11/13 13:19, Jan Beulich wrote:
>>>> On 08.11.13 at 14:13, David Vrabel <david.vrabel@citrix.com> wrote:
>> Keir,
>>
>> Sorry, I forgot to CC you on this series.
>>
>> Can we have your opinion on whether this kexec series can be merged?
>> And if not, what further work and/or testing is required?
>
> Just to clarify - unless I missed something, there was still no review
> of this from Daniel or someone else known to be familiar with the
> subject. If Keir gave his ack, formally this could go in, but I wouldn't
> feel too well with that (the more so given that, apart from not having
> reviewed it, Daniel seems to continue to have problems with it).
>
> Jan

Can I have myself deemed to be familiar with the subject as far as this
is concerned?

A noticeable quantity of my contributions to Xen have been in the kexec
/ crash areas, and I am the author of the xen-crashdump-analyser.

I do realise that I am certainly not impartial as far as this series is
concerned, being a co-developer.

David's statement that "the current implementation is so broken[1] and
useless[2] that..." is completely accurate. It is frankly a miracle
that the current code ever worked at all (and from XenServer's point of
view, it failed far more often than it worked).

For reference, XenServer 6.2 shipped with approximately v7 of this
series, and an appropriate kexec-tools and xen-crashdump-analyser.
Since we put the code in, we have not had a single failure-to-kexec in
automated testing (both in specific crash tests and from unexpected host
crashes), whereas we were seeing reliable failures of the crash path on
most of our test infrastructure.

In stark contrast to previous versions of XenServer, we have not had a
single customer-reported host crash where the kexec path has failed.
There was one systematic failure where the HPSA driver was unhappy with
the state of the hardware, resulting in no root filesystem to write logs
to, and a repeated panic and a Xen deadlock in the queued invalidation
codepath.

~Andrew
Andrew Cooper
2013-Nov-08 14:01 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 08/11/13 13:48, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:
>> [...]
>>
>> I cannot reproduce any failures, either on my dev box or on any of the
>> automated XenServer tests that run on a range of different hardware
>> platforms. I find kexec to be very reliable, and an earlier version of
>> this series has been in production within XenServer for a while now and
>> has seen real use in the field.
>>
>> None of the issues reported so far have been regressions, but rather
>> failures in specific uses of the new support for pv-ops kernels.
>>
>> I really can't see what else I can do to make this series acceptable
>> for merging.
>
> I think that in general it is OK. However, we must solve the discovered
> issues or confirm that they are not a problem of the current
> implementation. That is all. I hope that we can finally do that next
> week (FYI, Monday is a public holiday in Poland).

What outstanding issues do you think are present, then?

~Andrew
Don Slutz
2013-Nov-08 14:22 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 11/08/13 09:01, Andrew Cooper wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
>> Just to clarify - unless I missed something, there was still no review
>> of this from Daniel or someone else known to be familiar with the
>> subject. If Keir gave his ack, formally this could go in, but I wouldn't
>> feel too well with that (the more so given that, apart from not having
>> reviewed it, Daniel seems to continue to have problems with it).

If I am following this correctly, Daniel is testing this by running Xen
under QEMU. All my testing has been on bare metal.

>> Jan
>
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
>
> [...]
>
> For reference, XenServer 6.2 shipped with approximately v7 of this
> series, and an appropriate kexec-tools and xen-crashdump-analyser.
> Since we put the code in, we have not had a single failure-to-kexec in
> automated testing (both in specific crash tests and from unexpected host
> crashes), whereas we were seeing reliable failures of the crash path on
> most of our test infrastructure.

Verizon is also using an older version backported to 4.2.1, and we have
yet to see a failure in getting into the crash kernel via kexec (it is a
very small sample size: ~6 Dom0 crashes so far).

I have only done 10 crashes so far with v10+ (soon to be v11).

-Don Slutz

> In stark contrast to previous versions of XenServer, we have not had a
> single customer-reported host crash where the kexec path has failed.
> There was one systematic failure where the HPSA driver was unhappy with
> the state of the hardware, resulting in no root filesystem to write logs
> to, and a repeated panic and a Xen deadlock in the queued invalidation
> codepath.
>
> ~Andrew
Jan Beulich
2013-Nov-08 14:36 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
>>> On 08.11.13 at 15:01, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
>> Just to clarify - unless I missed something, there was still no review
>> of this from Daniel or someone else known to be familiar with the
>> subject. If Keir gave his ack, formally this could go in, but I wouldn't
>> feel too well with that (the more so given that, apart from not having
>> reviewed it, Daniel seems to continue to have problems with it).
>
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
>
> A noticeable quantity of my contributions to Xen have been in the kexec
> / crash areas, and I am the author of the xen-crashdump-analyser.

I'm sorry, I didn't mean to offend you in any way. In fact David and I
briefly discussed this situation at the summit, and he sort of
understood that I consider your review valuable, but ...

> I do realise that I am certainly not impartial as far as this series is
> concerned, being a co-developer.

... possibly/likely biased. Not least because both of you work for
Citrix. I'm therefore rather after a second, really independent review.

Please forgive me for not having expressed myself correctly.

Jan
Daniel Kiper
2013-Nov-08 15:04 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Fri, Nov 08, 2013 at 01:13:59PM +0000, David Vrabel wrote:

[...]

>> (XEN) Domain 0 crashed: Executing crash image
>>
>> gdb shows:
>>
>> (gdb) bt
>> #0  0xffff82d0801a0092 in do_nmi_crash (regs=<optimized out>) at crash.c:113
>> #1  0xffff82d0802281d9 in nmi_crash () at entry.S:666
>> #2  0x0000000000000000 in ?? ()
>> (gdb)
>>
>> Especially the second bt line scares me... ;-)))
>>
>> I have not been able to identify why the NMI was activated because the
>> stack is completely cleared.
>
> All you have described here is correct and expected behavior, which,
> quite frankly, you should have been able to see with even the most
> cursory look at the code.

This was more fun stuff than a real concern. That is why I added a smile
at the end of my statement. "nmi_crash () at entry.S:666" is quite an
interesting coincidence for me... ;-)))

Anyway, it is interesting that all CPUs were stopped at this stage. One
of them should still be executing the kdump code. I will try to
reproduce this on real hardware.

Daniel
Daniel Kiper
2013-Nov-08 15:15 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
> On 08/11/13 13:19, Jan Beulich wrote:
>> [...]
>
> Can I have myself deemed to be familiar with the subject as far as this
> is concerned?
>
> A noticeable quantity of my contributions to Xen have been in the kexec
> / crash areas, and I am the author of the xen-crashdump-analyser.
>
> I do realise that I am certainly not impartial as far as this series is
> concerned, being a co-developer.
>
> David's statement that "the current implementation is so broken[1] and
> useless[2] that..." is completely accurate. It is frankly a miracle
> that the current code ever worked at all (and from XenServer's point of
> view, it failed far more often than it worked).
>
> For reference, XenServer 6.2 shipped with approximately v7 of this
> series, and an appropriate kexec-tools and xen-crashdump-analyser.
> Since we put the code in, we have not had a single failure-to-kexec in
> automated testing (both in specific crash tests and from unexpected host
> crashes), whereas we were seeing reliable failures of the crash path on
> most of our test infrastructure.
>
> In stark contrast to previous versions of XenServer, we have not had a
> single customer-reported host crash where the kexec path has failed.
> There was one systematic failure where the HPSA driver was unhappy with
> the state of the hardware, resulting in no root filesystem to write logs
> to, and a repeated panic and a Xen deadlock in the queued invalidation
> codepath.

Andrew, if it runs on all your hardware, that does not mean it runs
everywhere. I have discovered a problem (I hope the last one) and it
should be taken into consideration. Another question is what the source
of this problem is. Maybe QEMU, but it should be checked and not
ignored.

Daniel
Konrad Rzeszutek Wilk
2013-Nov-08 15:42 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Fri, Nov 08, 2013 at 07:15:00AM -0800, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
>> [...]
>
> Andrew, if it runs on all your hardware, that does not mean it runs
> everywhere. I have discovered a problem (I hope the last one) and it
> should be taken into consideration. Another question is what the source
> of this problem is. Maybe QEMU, but it should be checked and not
> ignored.

I think the question is, given that the feature freeze is on the 18th,
whether this single bug should halt the integration of this whole
patchset.

Or whether it is OK to put the patchset in and deal with the bugs rather
than stall this initial patchset.
Andrew Cooper
2013-Nov-08 15:48 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 08/11/13 15:15, Daniel Kiper wrote:
> On Fri, Nov 08, 2013 at 02:01:28PM +0000, Andrew Cooper wrote:
>> [...]
>>
>> For reference, XenServer 6.2 shipped with approximately v7 of this
>> series, and an appropriate kexec-tools and xen-crashdump-analyser.
>> Since we put the code in, we have not had a single failure-to-kexec in
>> automated testing (both in specific crash tests and from unexpected host
>> crashes), whereas we were seeing reliable failures of the crash path on
>> most of our test infrastructure.
>>
>> In stark contrast to previous versions of XenServer, we have not had a
>> single customer-reported host crash where the kexec path has failed.
>> There was one systematic failure where the HPSA driver was unhappy with
>> the state of the hardware, resulting in no root filesystem to write logs
>> to, and a repeated panic and a Xen deadlock in the queued invalidation
>> codepath.
>
> Andrew, if it runs on all your hardware, that does not mean it runs
> everywhere. I have discovered a problem (I hope the last one) and it
> should be taken into consideration. Another question is what the source
> of this problem is. Maybe QEMU, but it should be checked and not
> ignored.
>
> Daniel

I am not trying to suggest that it is 100% perfect with all corner cases
covered. However, I feel that a QEMU failure in the NMI shootdown logic
(which has not been touched by this series, and which has been present
in Xen since the 4.3 development cycle) should not be counted against
the series. Or do you mean that the QEMU failure is a regression caused
by the series?

For interest, our nightly tests consist of:

* xl debug-keys C
* echo c > /proc/sysrq-trigger
  ** This is further repeated several times with a 1-vcpu dom0 pinned to
     pcpu 0, 1, -1 and 2 further randomly-chosen pcpus.
* echo c > /proc/sysrq-trigger with the server running VM workloads.

These are chained back-to-back with our crashdump environment, which
gathers logs and automatically reboots. For each individual crash, the
crashdump-analyser logs are checked for correctness.

There is a separate test on supporting hardware which uses an IPMI
controller to inject an IOCK NMI.

The above tests get run on a random server every single night.
During development, when the lab was idle, we repeatedly ran the tests
against every unique machine we had available (about 100 types:
different brands, different generations of technology).

~Andrew
Daniel Kiper
2013-Nov-08 16:28 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Fri, Nov 08, 2013 at 10:42:51AM -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 08, 2013 at 07:15:00AM -0800, Daniel Kiper wrote:
>> [...]
>>
>> Andrew, if it runs on all your hardware, that does not mean it runs
>> everywhere. I have discovered a problem (I hope the last one) and it
>> should be taken into consideration. Another question is what the source
>> of this problem is. Maybe QEMU, but it should be checked and not
>> ignored.
>
> I think the question is, given that the feature freeze is on the 18th,
> whether this single bug should halt the integration of this whole
> patchset.
>
> Or whether it is OK to put the patchset in and deal with the bugs rather
> than stall this initial patchset.

I have never stated that I would like to block this patch series
indefinitely because of this one bug (I am still not sure that it is a
bug; currently, I feel that I am the only person trying to verify that).
We have more than one week, and I think that we are able to discover
what is going on. If not, I think that we can work out a reasonable
solution for this issue (as we did in other cases). Last but not least,
I would like to underline that I too wish this patch series to be
included in Xen 4.4. However, it must be done in a sensible way.

Daniel
Daniel Kiper
2013-Nov-09 19:18 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On Thu, Nov 07, 2013 at 10:16:51PM +0100, Daniel Kiper wrote:
> On Wed, Nov 06, 2013 at 02:49:37PM +0000, David Vrabel wrote:
>> [...]
>
> In general it works. However, quite often I am not able to execute the
> panic kernel. The machine hangs with the following message:
>
> (XEN) Domain 0 crashed: Executing crash image
>
> [...]
>
> I am almost always able to reproduce this issue by doing this:
> - boot Xen,
> - load the panic kernel,
> - echo c > /proc/sysrq-trigger,
> - reboot from the command line,
> - boot Xen,
> - load the panic kernel,
> - echo c > /proc/sysrq-trigger.

I am not able to reproduce this on real hardware. Sorry for the
confusion.

Hence, for the whole Xen kexec/kdump series:

Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

Daniel
Don Slutz
2013-Nov-11 14:34 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 11/09/13 14:18, Daniel Kiper wrote:
> On Thu, Nov 07, 2013 at 10:16:51PM +0100, Daniel Kiper wrote:
>> [...]
>
> I am not able to reproduce this on real hardware. Sorry for the
> confusion.
>
> Hence, for the whole Xen kexec/kdump series:
>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
>
> Daniel

Also Tested-by: Don Slutz <dslutz@verizon.com>

-Don Slutz
David Vrabel
2013-Nov-11 15:09 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 09/11/13 19:18, Daniel Kiper wrote:
> Hence, for the whole Xen kexec/kdump series:
>
> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
> Tested-by: Daniel Kiper <daniel.kiper@oracle.com>

Thanks.

David
Keir Fraser
2013-Nov-11 17:18 UTC
Re: [PATCHv10 0/9] Xen: extend kexec hypercall for use with pv-ops kernels
On 06/11/2013 14:49, "David Vrabel" <david.vrabel@citrix.com> wrote:

> The series (for Xen 4.4) improves the kexec hypercall by making Xen
> responsible for loading and relocating the image. This allows kexec
> to be usable by pv-ops kernels and should allow kexec to be usable
> from an HVM or PVH privileged domain.

Acked-by: Keir Fraser <keir@xen.org>

> [...]