Santos, Jose Renato G
2005-Sep-08 19:33 UTC
[Xen-devel] NMI with SMP domain causing machine to reboot
I have spent most of the last few weeks trying to nail down a nasty bug that is preventing me from releasing xenoprof for SMP domains. The bug is non-deterministic, and when it happens the machine just reboots with no message or warning on the serial console. This has made the debugging process painful and slow.

I started removing specific components of the xenoprof code to find which one is causing the problem. After removing almost all of the code, it seems the bug is associated with NMIs. Right now the only code left is the code that programs a hardware performance counter to count "non-halted" clock cycles (hard-coded) and the code that handles NMIs. All other logic was removed, and I am still seeing the machine reboot spontaneously at some non-deterministic time. I am starting to suspect this might be a Xen bug, and I will probably need some help from the Xen core team to nail it down.

I have attached a patch that enables Xen to program the perf counter and handle the NMIs it generates. I have also attached a patch for a new user-level test tool for starting the performance counter. I hope these patches enable others to reproduce the behavior I am observing.

I only see this bug when running SMP domains (either dom0 or domU) with NMIs being generated. My machine has two CPUs with hyperthreading disabled. When I boot an SMP domain0 (with 2 VCPUs) I only see the bug when NMIs are generated for CPU 1. Surprisingly, I have never seen the spontaneous reboot when NMIs are generated on CPU 0 only. Since the bug is non-deterministic, it is possible that the bug is still there but for some reason not triggered for NMIs on CPU 0.

Here is the sequence of steps that I use to trigger the bug (on an SMP dom0 with 2 VCPUs):

1) initialize the performance counter
   > xenpmc -i
2) start the counter
   > xenpmc -g
3) verify that NMIs are being generated
   > xenpmc -s
   This causes a count of NMIs for [CPU0,CPU1] to be printed. This command was originally intended to stop the counters (and NMI generation), but it was modified to just return without stopping them. As a side effect, the number of NMIs is printed on the Xen console and can be used to verify that NMIs are being generated.

To trigger the bug I execute the command "xm dmesg" in a loop, and eventually the machine reboots (usually after a few minutes). I use the following shell script to execute "xm dmesg" in a loop:

    #!/bin/bash
    while true; do xm dmesg; sleep 1; done

Does anybody have an idea of what could be causing this behavior and how we could nail it down?

Thanks

Renato

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
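The reproduction steps in the message above could be wrapped in a small script along the following lines. This is only a sketch: the xenpmc flags and the "xm dmesg" loop come from the message, while the function names, the XENPMC/XM/LOG/DELAY variables and the log path are assumptions made for this example. Each iteration is logged and followed by a sync, so that the last iteration reached is more likely to still be on disk after a spontaneous reboot.

```shell
#!/bin/bash
# Sketch of the reproduction steps described in the message.
# XENPMC and XM default to the tool names used there; the log path
# and the function names are assumptions made for this example.

XENPMC=${XENPMC:-xenpmc}
XM=${XM:-xm}
LOG=${LOG:-/var/tmp/nmi-repro.log}

# Steps 1-3: initialize the counter, start it, print the NMI counts.
start_counters() {
    "$XENPMC" -i && "$XENPMC" -g && "$XENPMC" -s
}

# Run "xm dmesg" up to $1 times (0 or no argument means forever).
# Each iteration is appended to $LOG and synced before the command runs.
stress_loop() {
    local max=${1:-0} i=0
    while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        echo "$(date -u +%FT%TZ) iteration $i" >> "$LOG"
        sync
        "$XM" dmesg > /dev/null || return 1
        sleep "${DELAY:-1}"
    done
}
```

To use it, source the file in a dom0 shell and run `start_counters && stress_loop`; after a reboot, the last line of the log file shows how many iterations completed.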
Keir Fraser
2005-Sep-09 08:57 UTC
Re: [Xen-devel] NMI with SMP domain causing machine to reboot
On 8 Sep 2005, at 20:33, Santos, Jose Renato G wrote:

> I have spent most of the last few weeks trying to nail down a nasty bug
> that is preventing me from releasing xenoprof for SMP domains.
> The bug is non-deterministic, and when it happens the machine just
> reboots with no message or warning on the serial console.
> This has made the debugging process painful and slow.

Hard to say from the code, but maybe it's something to do with hyperthreading? The performance counter MSRs are shared in a weird way between hyperthreads. Maybe you're not properly resetting CPU1's perf counter and ending up with an NMI storm?

 -- Keir
Santos, Jose Renato G
2005-Sep-09 17:44 UTC
RE: [Xen-devel] NMI with SMP domain causing machine to reboot
Keir,

Thanks for your reply.

I don't think the problem is caused by not properly resetting CPU1's perf counter. I can see that the number of NMIs generated is similar for CPU0 and CPU1, and both CPUs' perf counters are programmed in exactly the same way. (The command "xenpmc -s" lets me see the number of NMIs generated.) Moreover, when we have multiple non-SMP domains running on both CPUs, this problem does not happen.

Sharing of MSRs between hyperthreads should not be the problem either, since my machine has 2 physical CPUs and hyperthreading is disabled in the BIOS (i.e., CPU0 and CPU1 are distinct physical CPUs).

It seems that there is something wrong, or some race condition, introduced by SMP domains. Any idea of what is different in Xen (maybe interrupt handling) when you have SMP domains?

Any chance you could try reproducing this behavior in one of your machines? Can you think of any situation that would cause the machine to reboot without printing any error message on the serial console?

Any help is deeply appreciated, since I am losing hope that I will be able to nail this down by myself. It is always possible that I am doing something wrong, but at this point the code left is not doing much and I am starting to suspect the problem lies somewhere else in Xen. In that case I would desperately need someone else's help.

Thanks

Renato

>> -----Original Message-----
>> From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>> Sent: Friday, September 09, 2005 1:57 AM
>> To: Santos, Jose Renato G
>> Cc: Turner, Yoshio; xen-devel@lists.xensource.com; G John Janakiraman
>> Subject: Re: [Xen-devel] NMI with SMP domain causing machine to reboot
>>
>> On 8 Sep 2005, at 20:33, Santos, Jose Renato G wrote:
>>
>> > I have spent most of the last few weeks trying to nail down a nasty bug
>> > that is preventing me from releasing xenoprof for SMP domains.
>> > The bug is non-deterministic, and when it happens the machine just
>> > reboots with no message or warning on the serial console.
>> > This has made the debugging process painful and slow.
>>
>> Hard to say from the code, but maybe it's something to do with
>> hyperthreading? The performance counter MSRs are shared in a weird way
>> between hyperthreads. Maybe you're not properly resetting CPU1's perf
>> counter and ending up with an NMI storm?
>>
>> -- Keir
Keir Fraser
2005-Sep-09 18:13 UTC
Re: [Xen-devel] NMI with SMP domain causing machine to reboot
On 9 Sep 2005, at 18:44, Santos, Jose Renato G wrote:

> Any chance you could try reproducing this behavior in one of
> your machines?
> Can you think of any situation that would cause the machine to
> reboot without printing any error message on the serial console?

Only a triple fault, which can only occur if our double fault handler fails to work correctly. If you can cause the spontaneous reboot even with a debug build of Xen, it probably means a bogus value in %cr3 (garbage got written there, or the pagetables got corrupted or freed before a CPU had finished using them).

I guess the lazy context switching could be going wrong -- might be worth calling __context_switch() unconditionally from context_switch() in arch/x86/domain.c (i.e., just comment out the if statement).

 -- Keir
Santos, Jose Renato G
2005-Sep-10 01:48 UTC
RE: [Xen-devel] NMI with SMP domain causing machine to reboot
>> -----Original Message-----
>> From: Keir Fraser [mailto:Keir.Fraser@cl.cam.ac.uk]
>> Sent: Friday, September 09, 2005 11:13 AM
>> To: Santos, Jose Renato G
>> Cc: Turner, Yoshio; xen-devel@lists.xensource.com; G John Janakiraman
>> Subject: Re: [Xen-devel] NMI with SMP domain causing machine to reboot
>>
>> On 9 Sep 2005, at 18:44, Santos, Jose Renato G wrote:
>>
>> > Any chance you could try reproducing this behavior in one of
>> > your machines?
>> > Can you think of any situation that would cause the machine to
>> > reboot without printing any error message on the serial console?
>>
>> Only a triple fault, which can only occur if our double fault handler
>> fails to work correctly. If you can cause the spontaneous reboot even
>> with a debug build of Xen, it probably means a bogus value in %cr3
>> (garbage got written there, or the pagetables got corrupted or freed
>> before a CPU had finished using them).
>>
>> I guess the lazy context switching could be going wrong -- might be
>> worth calling __context_switch() unconditionally from context_switch()
>> in arch/x86/domain.c (i.e., just comment out the if statement).

Keir,

Thanks for your suggestions. I really appreciate your help.

I tried your suggestion of changing context_switch() to call __context_switch() unconditionally, but that did not help. I still see the same behavior.

I was using a debug version of Xen from August 16 for my experiments until Wednesday, when I downloaded the latest source tarball (from Sep 7). In this new version I have not set the debug flag. I will try this on Monday, just to confirm that the spontaneous reboot still happens in the new version with debug on.

Thanks

Renato

>> -- Keir