Hello,

I am trying to debug an issue which appears on the surface as "run shutdown -h +0 in dom0 and the machine reboots".

As far as we can tell, the issue reproduces only on a Supermicro X8DT6 motherboard (http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm); we can't reproduce it on any other hardware. It occurs on both Xen 3.4 and Xen 4.1. The debugging described below is specifically against 3.4. It reproduces irrespective of the number of CPUs and irrespective of IOMMU utilization. For all tests, the server is being run with maxcpus=1 on the Xen command line and no domUs at all.

Tracing the path of execution, Xen is getting the XENPF_enter_acpi_sleep platform op and acting on it correctly, going down the ACPI S5 codepath. My assumption is that the reboot is caused by a triple fault, as the server reboots before it actually writes to the PM1A register (except for the case where it actually works, at which point it writes correctly and properly shuts down). There is no indication on the serial console of a fault or double fault.

My method of tracing is

    #define SERIAL_CHAR(ch) __asm__ __volatile__ ("mov %0, %%al\n\t"     \
                                                  "mov $0x3f8, %%dx\n\t" \
                                                  "out %%al,%%dx\n\t"    \
                                                  :: "g"(ch) : "%ax", "%dx");

scattered over the codebase.

The fault itself is time dependent: it occasionally works when the shutdown code spends very little time in get_cmos_time. Waiting at certain points, and in particular inserting

    for ( i = 0; i < 10; ++i )
    {
        SERIAL_CHAR('*');
        mdelay(1000);
    }

in the XENPF_enter_acpi_sleep case statement, shows that the triple fault occurs reliably 5 seconds after the hypercall, and in otherwise safe code. I SERIAL_CHAR'd the entry and exit of the NMI handler, which shows that the triple fault is not caused by the NMI watchdog, which I thought might be having an effect. While waiting to print '*' every second, the serial console buffer continues to be written to the UART, showing that other tasks are going on while XENPF_enter_acpi_sleep is being serviced.

The server itself is otherwise totally stable, running PV, HVM (and some bodged pv-on-hvm container for FreeBSD), along with performing SR-IOV from 8 NICs with 40 VFs each.

I have a workaround: removing the call to time_suspend(), after which the write to the PM1A register happens reliably before whatever causes the triple fault. However, this is not a suitable solution for the S3 codepath, which suffers the same problem but really does need to run time_suspend().

My questions to the Xen community are:

what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is in action, and more generally, how can I go about debugging which tasks are being run?

Thanks in advance for any advice/tips

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 28/07/2011 20:53, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> My questions to the Xen community are:
>
> what (if any) new tasks get scheduled when a XENPF_enter_acpi_sleep is
> in action, and more generally, how can I go about debugging which tasks
> are being run.

By the time you get to time_suspend(), you are running on CPU0, all other CPUs are offline, all domUs are suspended, and IRQs are disabled. There's not much scope for unexpected interruptions unless it's an NMI or SMI. By that point the serial subsystem is in synchronous mode, rather than interrupt-driven, so it's no wonder it continues to work.

 -- Keir
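As a rough illustration of what "synchronous mode" means for a serial console, here is a minimal sketch of a polled 16550 transmit. It is not Xen's serial driver. The struct port_io hooks and the fake_* simulated UART are hypothetical names added purely so the sketch can run in userspace; only the register offsets and the THRE bit are standard 16550 facts, and 0x3f8 is the COM1 base used by the tracing macro earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* 16550 register offsets and status bit (standard for PC COM ports). */
#define UART_THR       0      /* transmit holding register (write) */
#define UART_LSR       5      /* line status register (read) */
#define UART_LSR_THRE  0x20   /* THR empty: safe to write the next byte */

/* Hypothetical port-I/O hooks so the sketch can run outside ring 0;
 * in a hypervisor these would be the real inb()/outb(). */
struct port_io {
    uint8_t (*in8)(uint16_t port);
    void    (*out8)(uint16_t port, uint8_t val);
};

/* Polled ("synchronous") transmit: spin until THR is empty, then write.
 * No interrupt is involved, so this keeps working with IRQs disabled. */
void uart_putc_polled(const struct port_io *io, uint16_t base, char c)
{
    assert(io && io->in8 && io->out8);
    while (!(io->in8((uint16_t)(base + UART_LSR)) & UART_LSR_THRE))
        ;   /* busy-wait */
    io->out8((uint16_t)(base + UART_THR), (uint8_t)c);
}

/* Simulated UART for trying the sketch in userspace: reports "busy" for
 * fake_busy_polls reads of the LSR, then reports "THR empty". */
unsigned fake_busy_polls;
uint16_t fake_last_port;
uint8_t  fake_last_byte;

uint8_t fake_in8(uint16_t port)
{
    (void)port;
    if (fake_busy_polls > 0) {
        fake_busy_polls--;
        return 0;
    }
    return UART_LSR_THRE;
}

void fake_out8(uint16_t port, uint8_t val)
{
    fake_last_port = port;
    fake_last_byte = val;
}
```

Because the loop only reads a status register and writes a data register, output keeps draining regardless of interrupt state, which is consistent with the console continuing to work during the delay loop.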
________________________________________
From: Keir Fraser [keir.xen@gmail.com]
Sent: 28 July 2011 21:42
To: Andrew Cooper; xen-devel@lists.xensource.com
Subject: Re: [Xen-devel] Debugging a weird hardware fault.

> By the time you get to time_suspend(), you are running on CPU0, all other
> CPUs are offline, all domUs are suspended, and IRQs are disabled. There's
> not much scope for unexpected interruptions unless it's an NMI or SMI. By
> that point the serial subsystem is in synchronous mode, rather than
> interrupt-driven, so it's no wonder it continues to work.
>
>  -- Keir

Initially, an SMI was what I was thinking, but the triple fault occurs whether you start bringing down CPUs or not. While waiting 10 seconds in the platform_op switch statement, the fault still occurs when all CPUs are still up, all IRQs are still enabled and domUs are potentially still up. (Also, from studying the Xen 3.4 code, I believe that interrupts are actually still enabled during time_suspend(), and are only brought down by lapic_suspend() later in device_power_down().)

Conversely, in the hacked-up case where I ditched most of the shared S3/S5 codepath and just hit the PM1A, the server correctly shut down and stayed shut down, implying that the fault was caused by software (be it BIOS or OS) rather than hardware. From what I understand of the ACPI spec (and I claim very little knowledge), there are a multitude of hardware events which could bring the server out of S5, appearing as a triple fault, which would not be affected by whether you had hit the PM1A register.

In this specific example, dom0's regular shutdown code had already brought down the domUs (of which there were none because we never started any), and we were running with 1 CPU only so no others were up. This opens up a whole host of other possibilities which could be having an effect between the XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

~Andrew
On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> [...]
>
> In this specific example, dom0's regular shutdown code had already brought
> down the domUs (of which there were none because we never started any), and
> we were running with 1 CPU only so no others were up. This opens up a whole
> host of other possibilities which could be having an effect between the
> XENPF_enter_acpi_sleep hypercall and Xen actually shutting itself down.

Well I expect dom0 has done some going-to-sleep work that has left the platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control register and actually finalising the shutdown.

For example, it will have executed the _GTS ACPI method if there is one. That is supposed to happen immediately before writing PM1.SLP_EN, with no intervening interrupt activity or I/O. Obviously things don't work out quite like that when running on Xen!

This is an architectural limitation of how ACPI sleep is currently implemented for Xen. It may need some rethinking to do it really properly according to the spec, e.g., do a hypercall just to prepare Xen for shutdown, but return to dom0 in some limited environment to actually have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code block that Xen should simply jump to in order to make the sleep happen (where that code block would basically be dom0's acpi_enter_sleep() function). There are a few, somewhat distasteful, options that are more respectful of the ACPI spec than we are right now.

 -- Keir
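For readers unfamiliar with the register being discussed, here is a minimal sketch of how the value written to the PM1 control register is composed. The field positions (SLP_TYP in bits 10-12, SLP_EN in bit 13) follow the ACPI specification; the function name pm1_sleep_value() is hypothetical, and the SLP_TYP value for a given sleep state is platform-specific (it comes from the firmware's \_Sx package), so the value 5 used in the note below is only an example.

```c
#include <assert.h>
#include <stdint.h>

/* PM1 control register fields, per the ACPI specification. */
#define PM1_SLP_TYP_SHIFT 10
#define PM1_SLP_TYP_MASK  (0x7u << PM1_SLP_TYP_SHIFT)   /* bits 10-12 */
#define PM1_SLP_EN        (1u << 13)                    /* bit 13 */

/* Compute the value to write to PM1x_CNT to enter a sleep state.
 *   cur:     current register contents (unrelated bits are preserved)
 *   slp_typ: 3-bit SLP_TYP value taken from the firmware's \_Sx package
 * Hypothetical helper; real code would also perform the port write. */
uint16_t pm1_sleep_value(uint16_t cur, unsigned slp_typ)
{
    assert(slp_typ <= 7);
    uint16_t v = (uint16_t)(cur & ~PM1_SLP_TYP_MASK & ~PM1_SLP_EN);
    v |= (uint16_t)(slp_typ << PM1_SLP_TYP_SHIFT);
    return (uint16_t)(v | PM1_SLP_EN);
}
```

With cur = 0x0001 and an example SLP_TYP of 5, the helper yields 0x3401. The point of the discussion above is less the value itself than the timing: the firmware expects this write to follow the sleep-preparation methods promptly.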
Cc'ing some of the Xen ACPI/PM maintainers to see if they have an opinion on this issue...

On 29/07/2011 08:10, "Keir Fraser" <keir.xen@gmail.com> wrote:

> On 28/07/2011 23:45, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> [...]
>
> Well I expect dom0 has done some going-to-sleep work that has left the
> platform on borrowed time w.r.t. bashing SLP_EN into the PM1 control
> register and actually finalising the shutdown.
>
> For example, it will have executed the _GTS ACPI method if there is one.
> That is supposed to happen immediately before writing PM1.SLP_EN, with no
> intervening interrupt activity or I/O. Obviously things don't work out quite
> like that when running on Xen!
>
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. e.g., do a hypercall just to prepare Xen for
> shutdown, but return back to dom0 in some limited environment to actually
> have it do the final ACPI sleep work. Or have dom0 pass a pointer to a code
> block that Xen should simply jump at to get the sleep to happen (where that
> code block would basically be dom0's acpi_enter_sleep() function). There are
> a few, somewhat distasteful, options that are more respectful of the ACPI
> spec than we are right now.
>
>  -- Keir
On 29/07/11 08:10, Keir Fraser wrote:

> [...]
>
> This is an architectural limitation of how ACPI sleep is currently
> implemented for Xen. It may need some rethinking to do it really properly
> according to the spec. [...] There are a few, somewhat distasteful, options
> that are more respectful of the ACPI spec than we are right now.
>
>  -- Keir

Just for information, this turned out to be a BIOS bug. It was setting a 6 second timer when executing _PTS, which hit the system reset if PM1{a,b} had not been hit when the timer expired. As Xen does all of its shutdown after the call to _PTS and before PM1{a,b}, there is a significant time gap, which was falling foul of the timer in most cases.

In this case, it seems likely that a BIOS fix can be done, as Supermicro do provide a custom BIOS for the NetScaler box in question.

However, if anyone else comes across this issue, we did make a software solution. You can replace /etc/init.d/halt (or the equivalent for your chosen dom0 distro) so that it kexec-reboots into a native kernel which listens for a special command line parameter and calls pm_power_off_prepare() and pm_power_off() after the ACPI module has initialized[1].

This issue does however show that Xen itself is in breach of the ACPI spec, which is a dangerous situation to be in given the fragility of ACPI at the best of times. In due course, I will put my mind to solving the dom0-Xen ACPI interaction problems if the question is still open.

~Andrew Cooper

[1] Yes, this is a hack. Sorry. It's the easiest solution without rewriting Xen.
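The failure mode described in this message reduces to a race between a firmware timer and the PM1 write. A minimal sketch of that race, under the assumptions stated in the message (the timer is armed when _PTS runs and is satisfied by the write to PM1{a,b}), might look like the following. The names shutdown_outcome(), POWERED_OFF and RESET_BY_FIRMWARE are hypothetical, and the millisecond timestamps are illustrative inputs rather than measurements.

```c
#include <assert.h>

/* Possible results of a shutdown attempt under the timer behaviour
 * described in this thread. */
enum outcome { POWERED_OFF, RESET_BY_FIRMWARE };

/* Model of the race:
 *   pts_ms:      time at which _PTS ran (firmware timer armed)
 *   slp_en_ms:   time at which SLP_EN is written to PM1{a,b}
 *   watchdog_ms: firmware timeout (6000 on the board in this thread)
 * The machine powers off only if the write lands before the timer fires. */
enum outcome shutdown_outcome(long pts_ms, long slp_en_ms, long watchdog_ms)
{
    assert(slp_en_ms >= pts_ms && watchdog_ms > 0);
    return (slp_en_ms - pts_ms < watchdog_ms) ? POWERED_OFF
                                              : RESET_BY_FIRMWARE;
}
```

For example, shutdown_outcome(0, 500, 6000) gives POWERED_OFF, while a run with a ten-second debugging delay inserted, shutdown_outcome(0, 11000, 6000), gives RESET_BY_FIRMWARE. This matches the observation that the fault is time dependent rather than tied to any particular piece of code.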
On 02/08/2011 07:14, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> Just for information, this turned out to be a BIOS bug. It was setting
> a 6 second timer when executing _PTS, which hit the system reset if
> PM1{a,b} had not been hit when the timer expired. As Xen does all of
> its shutdown after the call to _PTS and before PM1{a,b}, there is a
> significant time gap, which was falling foul of the timer in most cases.

Six seconds though, that's quite a long time! Is it a big box?

> [...]
>
> This issue does however show that Xen itself is in breach of the ACPI
> spec, which is a dangerous situation to be in given the fragility of
> ACPI at the best of times. In due course, I will put my mind to solving
> the dom0-Xen ACPI interaction problems if the question is still open.

Yes, this is ultimately the issue. It's going to be a pain to fix properly, unfortunately.

 -- Keir
On 02/08/11 15:26, Keir Fraser wrote:

> Six seconds though, that's quite a long time! Is it a big box?

It is a NetScaler SDX box, designed to have 24 logical pcpus, 96GB RAM and 320 PCI-passed-through ixgbe virtual functions (claiming 3 IRQs per VF).

It seems that Xen spends a fair amount of time doing freeze_domains (even though dom0 has already shut down all domUs, albeit forcibly if they haven't shut down nicely within 15 seconds), and bringing down the other CPUs (in particular, it spends ages fiddling around with IRQ affinities).

Overall, there is probably quite a bit of optimization which could be done, but that still doesn't excuse a BIOS deciding that "a long time" as per the ACPI spec is "less than 6 seconds".

~Andrew
>>> Andrew Cooper 08/02/11 5:01 PM >>>
> It seems that Xen spends a fair amount of time doing freeze_domains
> (even though dom0 has already shut down all domUs, albeit forcibly if
> they haven't shut down nicely within 15 seconds), and bringing down the
> other CPUs (in particular, it spends ages fiddling around with irq
> affinities).

Is that independent of using a serial console? That is, are the delays perhaps incurred just by that code being overly verbose?

One of the odd things I had noticed now and then is that during shutdown, various IRQs get fixed up more than once (up to once per CPU brought down). There surely are ways to have them moved to CPU0 directly in the shutdown case.

Jan
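To put a number on the "fixed up more than once" observation, here is a toy model comparing two ways of draining IRQs from CPUs that are going offline. It does not reflect Xen's actual IRQ migration code: the stepwise policy (re-target to the next lower CPU) is an assumption chosen to show the worst case, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stepwise model: CPUs are offlined from the highest down to CPU1, and each
 * IRQ on a dying CPU is re-targeted to the next lower CPU (an assumed
 * policy, chosen to show the worst case). Returns the number of fixups. */
unsigned long fixups_stepwise(int *irq_cpu, size_t nirqs, int ncpus)
{
    unsigned long moves = 0;
    assert(ncpus >= 1);
    for (int dying = ncpus - 1; dying >= 1; dying--)
        for (size_t i = 0; i < nirqs; i++)
            if (irq_cpu[i] == dying) {
                irq_cpu[i] = dying - 1;
                moves++;
            }
    return moves;
}

/* Shutdown-aware model: every IRQ not already on CPU0 is moved there once,
 * before any CPU is taken offline. Returns the number of fixups. */
unsigned long fixups_direct(int *irq_cpu, size_t nirqs)
{
    unsigned long moves = 0;
    for (size_t i = 0; i < nirqs; i++)
        if (irq_cpu[i] != 0) {
            irq_cpu[i] = 0;
            moves++;
        }
    return moves;
}
```

With three IRQs initially on CPUs 3, 2 and 0 of a four-CPU machine, the stepwise model performs 5 fixups and the direct model 2. On a box with hundreds of VF interrupts and 24 CPUs the gap in this model grows accordingly, though whether it accounts for the measured delay is a separate question.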