Hi Jan,

I have found an issue where the system crashes right when I start another HVM guest inside an HVM guest. I have traced it back to the patch with which the issue started:

commit f1bde87fc08ce8c818a1640a8fe4765d48923091
Author: Jan Beulich <jbeulich@suse.com>
Date:   Fri Feb 8 11:06:04 2013 +0100

    x86: debugging code for testing 16Tb support on smaller memory systems

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

The issue doesn't reproduce when starting a PV (L2) guest inside an HVM (L1) guest.

Suravee

PS: The L2 Xen is running Xen-4.3, but I think the issue is at the L1 Xen since it crashes the system.
>>> On 27.06.13 at 02:24, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
> I have found an issue where the system crashes right when I start
> another HVM guest inside an HVM guest. I have traced it back to the patch
> with which the issue started.
>
> commit f1bde87fc08ce8c818a1640a8fe4765d48923091
> [...]

We had issues exposed by this patch before, but any such issue would just have been masked before that patch (and would surface on a system with more than 5Tb of memory anyway). So it is very unlikely for the patch itself to be at fault.

Furthermore, the crash then supposedly is because of the code added conditional upon NDEBUG, and hence would (on a smaller memory system) otherwise not surface at all for a production (debug=n) build.

> The issue doesn't reproduce when starting a PV (L2) guest inside an HVM
> (L1) guest.

"Does not" or "does"? In the former case - what is this supposed to tell me?

In any case - without you sharing technical details (register/stack dump of the crash at the very least) I don't think I have anything at hand to look for possible problems.

Jan
On 6/27/2013 3:22 AM, Jan Beulich wrote:
>>>> On 27.06.13 at 02:24, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>> I have found an issue where the system crashes right when I start
>> another HVM guest inside an HVM guest. I have traced it back to the
>> patch with which the issue started.
>>
>> commit f1bde87fc08ce8c818a1640a8fe4765d48923091
>> [...]
> We had issues exposed by this patch before, but any such issue
> would just have been masked before that patch (and would
> surface on a system with more than 5Tb of memory anyway).

The system on which I am having the issue has 48GB of memory.

> So it is very unlikely for the patch itself to be at fault.

I have traced the issue and found that the system crashing starts from this commit id onward (i.e. the system does not crash with commit id ed759d20249197cf87b338ff0ed328052ca3b8e7). So, I still believe that this patch has somehow triggered the issue.

> Furthermore, the crash then supposedly is because of the code
> added conditional upon NDEBUG, and hence would (on a smaller
> memory system) otherwise not surface at all for a production
> (debug=n) build.
>
>> The issue doesn't reproduce when starting a PV (L2) guest inside an HVM
>> (L1) guest.
> "Does not" or "does"? In the former case - what is this supposed to
> tell me?

What I am trying to say here is that the system _does not_ crash when starting the PV guest as a level 2 guest. This is meant to be another data point to help in analyzing the issue.

> In any case - without you sharing technical details (register/stack
> dump of the crash at the very least) I don't think I have anything
> at hand to look for possible problems.

At this point, I am just reporting the issue. I have not been able to get the crash dump because the system immediately reboots. I'll try to boot Xen with "noreboot" and inspect the log for more clues. Any suggestions are welcome.

Thank you,
Suravee
Running a PV guest as L2 guest makes no difference to how the p2m code is used in L0 Xen (L0 == OS that runs on bare metal hardware).

I assume you use the default settings, which means you use NPT-on-NPT. Try shadow-on-npt. You can do this with

    cpuid = "host,svm_npt=0"

in the guest config file. Then in the L1 guest you should see the NPT svm feature bit as not available. Then launch an L2 guest and check if it still crashes.

I agree with Jan: Please provide the crash logs he requested.

Christoph

On 27.06.13 11:20, Suravee Suthikulpanit wrote:
> On 6/27/2013 3:22 AM, Jan Beulich wrote:
>> [...]
>> We had issues exposed by this patch before, but any such issue
>> would just have been masked before that patch (and would
>> surface on a system with more than 5Tb of memory anyway).
>
> The system I am having the issue has 48GB of memory.
>
>> So it is very unlikely for the patch itself to be at fault.
>
> I have traced the issue and found that the system crashing starts from
> this commit id and onward.
> (i.e. The system does not crash with commit id
> ed759d20249197cf87b338ff0ed328052ca3b8e7)
> So, I still believe that this patch has somehow triggered the issue.
>
> [...]
>
> What I am trying to say here is that the system _does not_ crash when
> starting the PV guest as level 2 guest.
> This is meant to be another data point to help analyzing the issue.
>
> [...]
>
> At this point, I am just reporting the issue. I have not been able to
> get the crash dump because the system immediately reboots.
> I'll try to boot Xen with "noreboot" and inspect the log for more clues.
> Any suggestions are welcome.
>
> Thank you,
>
> Suravee
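[Editor's note: Christoph's suggestion amounts to a guest config fragment along these lines. The `cpuid = "host,svm_npt=0"` line is quoted from his mail; the `nestedhvm` line and the file name are assumptions about a typical nested-SVM guest config of that era, shown only as a sketch.]

```
# L1 guest config (e.g. /etc/xen/l1-hvm.cfg -- hypothetical path):
# expose SVM to the L1 guest, but hide the NPT feature bit so the
# L1 Xen falls back to shadow paging (shadow-on-npt in L0 terms)
nestedhvm = 1
cpuid = "host,svm_npt=0"
```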
>>> On 27.06.13 at 11:20, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> wrote:
> On 6/27/2013 3:22 AM, Jan Beulich wrote:
>>>>> On 27.06.13 at 02:24, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>>> I have found an issue where the system crashes right when I start
>>> another HVM guest inside an HVM guest. I have traced it back to the patch
>>> with which the issue started.
>>>
>>> commit f1bde87fc08ce8c818a1640a8fe4765d48923091
>>> [...]
>> We had issues exposed by this patch before, but any such issue
>> would just have been masked before that patch (and would
>> surface on a system with more than 5Tb of memory anyway).
>
> The system I am having the issue has 48GB of memory.

Which is why you're seeing the problem only with the debugging code enabled. (And of course I didn't really expect you to have tried this on a huge memory system - they're just too rare still for this to be likely.)

>> So it is very unlikely for the patch itself to be at fault.
>
> I have traced the issue and found that the system crashing starts from this
> commit id and onward.
> (i.e. The system does not crash with commit id
> ed759d20249197cf87b338ff0ed328052ca3b8e7)
> So, I still believe that this patch has somehow triggered the issue.

As said - I'm pretty certain this merely unmasked an already lurking issue. And that's what the purpose of that patch is.

Jan
On 6/27/2013 5:08 AM, Jan Beulich wrote:
>>>> On 27.06.13 at 11:20, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> wrote:
>> [...]
>> The system I am having the issue has 48GB of memory.
> Which is why you're seeing the problem only with the debugging
> code enabled.

Is the "debugging" enabled by default? I didn't specify any debug when building. How can I check and disable debugging?

> (And of course I didn't really expect you to have
> tried this on a huge memory system - they're just too rare still
> for this to be likely.)
>
>>> So it is very unlikely for the patch itself to be at fault.
>> I have traced the issue and found that the system crashing starts from this
>> commit id and onward.
>> (i.e. The system does not crash with commit id
>> ed759d20249197cf87b338ff0ed328052ca3b8e7)
>> So, I still believe that this patch has somehow triggered the issue.
> As said - I'm pretty certain this merely unmasked an already
> lurking issue.

I'm not quite sure what you meant here. Are you saying that this "crashing" is a known issue?

> And that's what the purpose of that patch is.

This patch is crashing the system. What do you mean by "And that's what the purpose of that patch is"?

Suravee
On 27/06/13 11:24, Suravee Suthikulpanit wrote:
> On 6/27/2013 5:08 AM, Jan Beulich wrote:
>> [...]
>> As said - I'm pretty certain this merely unmasked an already
>> lurking issue.
> I'm not quite sure what you meant here. Are you saying that this
> "crashing" is a known issue?
>
>> And that's what the purpose of that patch is.
> This patch is crashing the system. What do you mean by "And that's
> what the purpose of that patch is"?
>
> Suravee

It means that this patch is exposing a latent bug: the nested HVM code is already wrong. It will be something in the nested HVM code which is not using map_domain_page() when it really should be.

Without posting a stack trace, there is nothing we can do to help narrow down the issue.

~Andrew
On 27.06.13 12:24, Suravee Suthikulpanit wrote:
> On 6/27/2013 5:08 AM, Jan Beulich wrote:
>> [...]
>> Which is why you're seeing the problem only with the debugging
>> code enabled.
> Is the "debugging" enabled by default? I didn't specify any debug when
> building.

"Debugging" is enabled by default in the development tree.

> How can I check and disable debugging?

In the toplevel source directory look into Config.mk and set the line

    debug ?= y

accordingly.

> [...]
>> As said - I'm pretty certain this merely unmasked an already
>> lurking issue.
> I'm not quite sure what you meant here. Are you saying that this
> "crashing" is a known issue?

He means nestedhvm reveals an existing bug that his patch merely exposes. If he is right then you do not see nestedhvm crashing with a non-debug xen-kernel (unless something else broke it).

>> And that's what the purpose of that patch is.
> This patch is crashing the system. What do you mean by "And that's what
> the purpose of that patch is"?

The purpose is "People, please test".

Christoph
On 6/27/2013 5:33 AM, Egger, Christoph wrote:
> On 27.06.13 12:24, Suravee Suthikulpanit wrote:
>> [...]
>> Is the "debugging" enabled by default? I didn't specify any debug when
>> building.
> "Debugging" is enabled by default in the development tree.
>
>> How can I check and disable debugging?
> In the toplevel source directory look into Config.mk
> and set the line
>
> debug ?= y
>
> accordingly.

Thank you for the clarification.

> [...]
> He means nestedhvm reveals an existing bug that his patch merely exposes.
> If he is right then you do not see nestedhvm crashing with a non-debug
> xen-kernel (unless something else broke it).

After I rebuilt the Xen kernel with debug=n, the system no longer crashes when starting npt-on-npt and shadow-on-npt guests. I was not able to get the crash dump previously. I will try again tomorrow at work and will post it.

Thank you,

Suravee
On Thu, Jun 27, 2013 at 11:24 AM, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> wrote:
> On 6/27/2013 5:08 AM, Jan Beulich wrote:
>> [...]
>> As said - I'm pretty certain this merely unmasked an already
>> lurking issue.
>
> I'm not quite sure what you meant here. Are you saying that this "crashing"
> is a known issue?
>
>> And that's what the purpose of that patch is.
>
> This patch is crashing the system. What do you mean by "And that's what the
> purpose of that patch is"?

*If* you had had >5TiB, then you would have crashed even without this patch. The purpose of the patch is to make it so that if there is a bug that will crash for >5TiB, then it will *also* crash for <5TiB. Since the vast majority of people have <5TiB of RAM, this results in better testing coverage for those with >5TiB of RAM.

On production systems, we want it to work as often as possible, so this test is disabled when debug=n, which is the default for released versions of Xen. But in the development branch we very much want to find bugs, so during development we set debug=y by default.

 -George
>>> On 27.06.13 at 12:24, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> wrote:
> On 6/27/2013 5:08 AM, Jan Beulich wrote:
>>>>> On 27.06.13 at 11:20, Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> wrote:
>>> On 6/27/2013 3:22 AM, Jan Beulich wrote:
>>>> We had issues exposed by this patch before, but any such issue
>>>> would just have been masked before that patch (and would
>>>> surface on a system with more than 5Tb of memory anyway).
>>> The system I am having the issue has 48GB of memory.
>> Which is why you're seeing the problem only with the debugging
>> code enabled.
> Is the "debugging" enabled by default? I didn't specify any debug when
> building.
> How can I check and disable debugging?

Set

    debug := n

close to the top of ./Config.mk.

>> [...]
>> As said - I'm pretty certain this merely unmasked an already
>> lurking issue.
> I'm not quite sure what you meant here. Are you saying that this
> "crashing" is a known issue?

No, I'm unaware of any issue similar to what you describe.

>> And that's what the purpose of that patch is.
> This patch is crashing the system. What do you mean by "And that's what
> the purpose of that patch is"?

The finding of bugs that otherwise would surface only when indeed running on a huge memory system. If on such a system this would result in crashing the host, so be it with this debugging code even on "normal" systems (as long as they are not running in production mode).

The alternative would be to keep the bug masked until someone really tried to run Xen on such a huge system, and the debugging of it then would be quite a bit more expensive (if nothing else then in the amount of electrical power needed).

Jan
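[Editor's note: the Config.mk edit Jan and Christoph describe can be scripted. This is only a sketch, demonstrated on a throwaway copy of the file rather than a real xen.git checkout; the temp path is arbitrary.]

```shell
# Make a stand-in Config.mk carrying the development-tree default.
printf 'debug ?= y\n' > /tmp/Config.mk.demo

# Flip it to a production (non-debug) build, as in released Xen versions.
sed -i 's/^debug ?= y$/debug ?= n/' /tmp/Config.mk.demo

grep '^debug' /tmp/Config.mk.demo
# -> debug ?= n
# (in a real tree you would now rebuild the hypervisor)
```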
On 6/27/2013 6:14 AM, Suravee Suthikulpanit wrote:
> On 6/27/2013 5:33 AM, Egger, Christoph wrote:
>> [...]
>> He means nestedhvm reveals an existing bug that his patch merely exposes.
>> If he is right then you do not see nestedhvm crashing with a non-debug
>> xen-kernel (unless something else broke it).
>
> After I rebuilt the Xen kernel with debug=n, the system no longer crashes
> when starting npt-on-npt and shadow-on-npt guests.
> I was not able to get the crash dump previously. I will try again
> tomorrow at work and will post it.
>
> Thank you,
>
> Suravee

So, I have finally been able to get the crash dump (see below). The crash is due to an assert:

(XEN) Assertion 'va >= XEN_VIRT_START' failed at /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:86

* Debugging shows va=ffff82c40002d000, XEN_VIRT_START=ffff82c4c0000000, DIRECTMAP_VIRT_END=ffffff8000000000.
* The backtrace symbols show the crash is in "svm_vmexit_handler()", inlined from "svm_vmexit_do_vmsave()" and "svm_vmsave()".

CRASH DUMP
=========

(XEN) Assertion 'va >= XEN_VIRT_START' failed at /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:86
(XEN) Debugging connection not set up.
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    17
(XEN) RIP:    e008:[<ffff82c4c01cfbfc>] svm_vmexit_handler+0x1574/0x1a2a
(XEN) RFLAGS: 0000000000010293   CONTEXT: hypervisor
(XEN) rax: ffff82c4bfffffff   rbx: ffff830852ec1000   rcx: 0000000000000000
(XEN) rdx: ffff830434757020   rsi: 000000000000000a   rdi: ffff82c4c0283740
(XEN) rbp: ffff83043474ff08   rsp: ffff83043474fd28   r8:  0000000000000004
(XEN) r9:  0000000000000010   r10: ffffff8000000000   r11: 0000000000000010
(XEN) r12: ffff83000e010000   r13: 0000000000000003   r14: 0000000000000000
(XEN) r15: ffff82c40002d000   cr0: 000000008005003b   cr4: 00000000000406f0
(XEN) cr3: 000000086d9dd000   cr2: 00007fe7f8e99120
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff83043474fd28:
(XEN)    ffff83000e010000 ffff83043474fd70 ffff82c4c01bb001 0000000000000000
(XEN)    ffff83000e010000 ffff830852ec1000 0000000000000000 0000000000000000
(XEN)    ffff830434757080 ffff83043474fda0 ffff82c4c01cca33 0000000000000000
(XEN)    ffff8300c7ea6000 00000000000fee00 ffff83043474ff18 ffff830400000000
(XEN)    ffff82c4c015fe19 ffff83043474fe10 ffff82c4c0185827 00000000000000fc
(XEN)    0000003b5c327b44 0000000a0000000d 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff83043474fe20 ffff8300c7ea6000 00000049a0b0dcf5
(XEN)    0000000000000286 ffff83043474fe28 ffff82c4c0125e9e ffff83000e010000
(XEN)    ffff83043474fe98 ffff82c4c01c8048 ffff82c4c0125e9e ffff83000e010488
(XEN)    ffff83043474fe98 ffffffffffffffff ffff83043474fe78 ffff82c4c01c5e56
(XEN)    ffff83000e010000 ffff830853ea1000 ffff83043474fe98 ffff82c4c01be614
(XEN)    ffff83000e010000 0000000000000007 ffff83043474ff08 ffff82c4c01c8e66
(XEN)    ffff830434757080 000000fc3474fee0 ffff82c4c0125a52 ffff830434748000
(XEN)    ffff830434748000 00000000ffffffff ffff830852ec1000 ffff83000e010000
(XEN)    ffff830209c87000 0000000000000007 0000000000000003 ffff830203ddff18
(XEN)    ffff830203ddfd70 ffff82c4c01d1c45 ffff830203ddff18 0000000000000003
(XEN)    0000000000000007 ffff830209c87000 ffff830203ddfd70 ffff8300d4b46000
(XEN)    0000000000000246 00000000deadbeef 00000013eabcc169 0000000000000003
(XEN)    0000000203de2000 0000000000000000 0000000203de2000 ffff830203de4000
(XEN)    ffff830209c87000 0000beef0000beef ffff82c4c01ce158 0000beef0000beef
(XEN) Xen call trace:
(XEN)    [<ffff82c4c01cfbfc>] svm_vmexit_handler+0x1574/0x1a2a
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 17:
(XEN) Assertion 'va >= XEN_VIRT_START' failed at /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:92
(XEN) ****************************************
(XEN)
(XEN) Manual reset required ('noreboot' specified)
(XEN) Debugging connection not set up.

Suravee
>>> On 28.06.13 at 02:44, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
> So, I have finally been able to get the crash dump (see below). The crash
> is due to an assert:
>
> (XEN) Assertion 'va >= XEN_VIRT_START' failed at
> /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:86
>
> * Debugging shows va=ffff82c40002d000, XEN_VIRT_START=ffff82c4c0000000,
> DIRECTMAP_VIRT_END=ffffff8000000000.
> * The backtrace symbols show the crash is in "svm_vmexit_handler()", inlined
> from "svm_vmexit_do_vmsave()" and "svm_vmsave()".

Which helps in no way identifying where the problem is - svm_vmexit_handler() is just too large to spot this without either the matching xen-syms at hand, or you adding further instrumentation.

Jan
On 6/28/2013 2:58 AM, Jan Beulich wrote:
>>>> On 28.06.13 at 02:44, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>> [...]
> Which helps in no way identifying where the problem is -
> svm_vmexit_handler() is just too large to spot this without either
> the matching xen-syms at hand, or you adding further
> instrumentation.
>
> Jan

What I am trying to say is that the assertion is in __virt_to_maddr, which is called from svm_vmexit_do_vmsave(). However, this is a bit complicated due to macros and inlining. Here is what the call chain is supposed to look like:

    ASSERT(va >= XEN_VIRT_START)
    __virt_to_maddr           <---- inlined
    virt_to_mfn()             <---- macro
    __pa()                    <---- macro
    svm_vmsave()              <---- inlined
    svm_vmexit_do_vmsave()    <---- inlined
    svm_vmexit_handler()      <---- symbol

Suravee
On 28/06/13 15:20, Suravee Suthikulanit wrote:
> On 6/28/2013 2:58 AM, Jan Beulich wrote:
>> [...]
>> Which helps in no way identifying where the problem is -
>> svm_vmexit_handler() is just too large to spot this without either
>> the matching xen-syms at hand, or you adding further
>> instrumentation.
>
> What I am trying to say is that the assertion is in __virt_to_maddr,
> which is called from svm_vmexit_do_vmsave(). However, this is a bit
> complicated due to macros and inlining. Here is what the call chain is
> supposed to look like:
>
> ASSERT(va >= XEN_VIRT_START)
> __virt_to_maddr           <---- inlined
> virt_to_mfn()             <---- macro
> __pa()                    <---- macro
> svm_vmsave()              <---- inlined
> svm_vmexit_do_vmsave()    <---- inlined
> svm_vmexit_handler()      <---- symbol
>
> Suravee

The code is assuming that the virtual address is mapped into the Xen pagetables when in fact it is not. The code needs to be corrected to use map_domain_page() to correctly access a domheap page.

~Andrew
>>> On 28.06.13 at 16:20, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
> On 6/28/2013 2:58 AM, Jan Beulich wrote:
>>>>> On 28.06.13 at 02:44, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>>> So, I have finally been able to get the crash dump (see below). The crash is due
>>> to an assert:
>>>
>>> (XEN) Assertion 'va >= XEN_VIRT_START' failed at
>>> /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:86
>>>
>>> * Debugging shows va=ffff82c40002d000, XEN_VIRT_START=ffff82c4c0000000,
>>> DIRECTMAP_VIRT_END=ffffff8000000000.
>>> * Backtrace symbols show the crash is in "svm_vmexit_handler()", which is
>>> inlined from "svm_vmexit_do_vmsave()" and "svm_vmsave()".
>> Which helps in no way identifying where the problem is -
>> svm_vmexit_handler() is just too large to spot this without either
>> the matching xen-syms at hand, or you adding further
>> instrumentation.
>>
>> Jan
>
> What I am trying to say is that the assertion is in __virt_to_maddr, which is
> called from svm_vmexit_do_vmsave(). However, this is a bit complicated due to
> macros and inlining. Here is what the call chain is supposed to look like:
>
> ASSERT(va >= XEN_VIRT_START)
> __virt_to_maddr          <---- inlined
> virt_to_mfn()            <---- macro
> __pa()                   <---- macro
> svm_vmsave()             <---- inlined
> svm_vmexit_do_vmsave()   <---- inlined
> svm_vmexit_handler()     <---- symbol

So the problem is the inverse of the usual one (and that's part of
why I didn't spot it when searching the tree for code that needs
fixing; the other part is that while running into these functions I
knew that VMCBs get allocated from the Xen heap, but didn't
notice that the same functions also get used for dealing with
guest VMCBs):

nestedsvm_vmcb_map() properly does the necessary mapping,
but svm_vmsave() (just like svm_vmload()) blindly uses __pa() on
something that's not an address in the direct mapping region.

Which means that on 4.2.0, where we still had a 32-bit hypervisor,
nested SVM was completely broken (and presumably never tested)
in that 32-bit case. Luckily we have meanwhile disabled the use of
nested HVM in 4.2.x's 32-bit builds.

Jan
On 28.06.13 16:52, Jan Beulich wrote:
>>>> On 28.06.13 at 16:20, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>> On 6/28/2013 2:58 AM, Jan Beulich wrote:
>>>>>> On 28.06.13 at 02:44, Suravee Suthikulanit <suravee.suthikulpanit@amd.com> wrote:
>>>> So, I have finally been able to get the crash dump (see below). The crash is due
>>>> to an assert:
>>>>
>>>> (XEN) Assertion 'va >= XEN_VIRT_START' failed at
>>>> /sandbox/xen/xen.git/xen/include/asm/x86_64/page.h:86
>>>>
>>>> * Debugging shows va=ffff82c40002d000, XEN_VIRT_START=ffff82c4c0000000,
>>>> DIRECTMAP_VIRT_END=ffffff8000000000.
>>>> * Backtrace symbols show the crash is in "svm_vmexit_handler()", which is
>>>> inlined from "svm_vmexit_do_vmsave()" and "svm_vmsave()".
>>> Which helps in no way identifying where the problem is -
>>> svm_vmexit_handler() is just too large to spot this without either
>>> the matching xen-syms at hand, or you adding further
>>> instrumentation.
>>>
>>> Jan
>>
>> What I am trying to say is that the assertion is in __virt_to_maddr, which is
>> called from svm_vmexit_do_vmsave(). However, this is a bit complicated due to
>> macros and inlining. Here is what the call chain is supposed to look like:
>>
>> ASSERT(va >= XEN_VIRT_START)
>> __virt_to_maddr          <---- inlined
>> virt_to_mfn()            <---- macro
>> __pa()                   <---- macro
>> svm_vmsave()             <---- inlined
>> svm_vmexit_do_vmsave()   <---- inlined
>> svm_vmexit_handler()     <---- symbol
>
> So the problem is the inverse of the usual one (and that's part of
> why I didn't spot it when searching the tree for code that needs
> fixing; the other part is that while running into these functions I
> knew that VMCBs get allocated from the Xen heap, but didn't
> notice that the same functions also get used for dealing with
> guest VMCBs):
>
> nestedsvm_vmcb_map() properly does the necessary mapping,
> but svm_vmsave() (just like svm_vmload()) blindly uses __pa() on
> something that's not an address in the direct mapping region.
>
> Which means that on 4.2.0, where we still had a 32-bit hypervisor,
> nested SVM was completely broken (and presumably never tested)
> in that 32-bit case. Luckily we have meanwhile disabled the use of
> nested HVM in 4.2.x's 32-bit builds.

I never tested nested SVM with a 32-bit host. I did test 32-bit
hypervisors as guests.

Christoph