Hi fellow Xen developers,

I continue to get system hangs where the watchdog NMI in Xen is not doing its job. I am completely blind as to what is getting jammed. I have tried multiple experiments to force the hang, and in each one the watchdog kicked in, so I know the mechanism works 99% of the time, except in my one hang.

In the old days of the PCI bus, I used to be able to generate a HW NMI by asserting the SERR signal in the connector. With the advent of PCIe, I believe that signal is no longer present, so I am looking for any other way to cause a system error. I have examined the PCI Express mini-card specification looking for a signal I can use in the internal WiFi connector, but alas, none of the signals I read about seem like they would do what I need. I am not sure if there is anything I can short in the PCIe signals that would have a similar effect to the SERR signal. The platform is a Lenovo T500 laptop, so the number of connectors to play with is limited.

I also thought of causing a parity/ECC error, but the GM45 chipset used in this laptop does not support ECC memory.

So I'm basically looking for any other ideas on how to cause a fault by probing somewhere on the motherboard. This MB has a docking station connector, but I have not been able to find the pinout list, so I don't know what is brought out there. At this point, I have no problem cracking open the case and soldering something onto the motherboard. I just need to know what chips and signals to tap.

Thanks in advance.

Roger R. Cruz

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On Thu, Sep 30, 2010 at 12:59:25PM -0500, Roger Cruz wrote:
> Hi fellow Xen developers,
>
> I continue to get system hangs where the watchdog NMI in Xen is not
> doing its job. I am completely blind as to what is getting jammed.
> Tried multiple experiments to force the hang and in each, the watchdog
> has kicked in, so I know the mechanism works 99% of the time except in
> my one hang.
>
> So in the old days of PCI bus, I used to be able to generate a HW NMI by
> asserting the SERR signal in the connector. With the advent of PCIe, I

Nice.

> believe that signal is no longer present, so I am looking for any other
> way to cause a system error. I have examined the PCI express

What about the Mini PCI-e to PCI-e adapter:
http://www.hwtools.net/adapter/PM2C.html

And then plug in a PCI to PCI-e adapter:
http://www.newegg.com/Product/Product.aspx?Item=N82E16815158165&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Add-On+Cards-_-STARTECH-_-15158165

And then assert the SERR#?

> mini-card specification looking for a signal I can use in the internal
> WiFi connector, but alas, none of the signals I read about seem like
> they would do what I need. I am not sure if there is anything I can
> short in the PCIe signals that could have a similar effect as the SERR

Per this slide deck:
http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=cdf593816ee20b90d8603d4aeb081a726ddc3091
it looks as if you can program the PCIe bridge to fall back to "legacy" mode. And per a post from some folks:
http://forums.gentoo.org/viewtopic-t-752165.html
it looks as if the SERR# signal is asserted on the SMBus controller? Maybe there is a way to do it via that?

> signal. The platform is a Lenovo T500 laptop so the number of
> connectors to play with is limited.

IBM on the server side used to have NMI buttons - it could be that Lenovo hasn't completely gotten rid of them. Since you are open to looking at the motherboard, maybe there is a spot marked #NMI?

> I also thought of causing a parity/ECC error but the GM45 chipset used
> in this laptop does not support ECC memory.
>
> So I'm basically looking for any other ideas on how to cause a fault by
> probing somewhere in the motherboard. This MB has a docking station
> connector but I have not been able to find the pinout list so I don't
> know what is brought out there. At this point, I have no problem

How about just shorting the pins randomly :-)

> cracking up the case and soldering something on to the motherboard.. I
> just need to know what chips and signals to tap.
>
> Thanks in advance.
>
> Roger R. Cruz
Great ideas, Konrad. I have ordered these parts. It will probably take a few days before they get here.

The goal of using the HW NMI is to rule out any incorrect SW settings of the performance monitoring counters that Xen uses to trigger the NMI.

Someone else mentioned another possible reason why an NMI may not be triggered: the system could be stuck handling an SMI. I haven't studied the Xen code with respect to SMIs yet, but I assume that Xen doesn't do much in that area, right? I was under the impression that the BIOS usually sets this up and that OSes cannot even modify the handlers, as they live in protected RAM.

R.
On Fri, Oct 01, 2010 at 02:33:20PM -0500, Roger Cruz wrote:
> Great ideas Konrad. I have ordered these parts. It will probably take
> a few days before they get here.
> The goal of using the HW NMI is to rule out any incorrect SW settings of
> the Performance Monitoring counters used in Xen to trigger the NMI.

Right.

> Someone else mentioned that another possibility as to why an NMI may not
> be triggered is that the system is stuck handling an SMI interrupt. I
> haven't studied Xen code with respect to SMIs yet, but I assume that Xen
> doesn't do much in that area right? I was under the impression that the
> BIOS usually set this up and the OSs could not even modify the handlers
> as they were in protected RAM.

Ugh. That is true - we have no notion of when the SMIs run. Not that the SMIs are actually working 100% all the time.

Another thought, and this might be a complete shot in the dark: look in the upstream (2.6.36-rc6) blacklist.c file. There is an entry for that specific ThinkPad which activates the ACPI _OSI; maybe that needs to be done?
Hi Konrad,

I found that pciback doesn't accept CardBus devices. It only handles header type 0 and type 1. Is there any specific reason to skip them? This caused some trouble for FireWire passthrough on my laptop. I want to know the reason before submitting a patch.

Thanks,
-Wei
Konrad Rzeszutek Wilk
2010-Oct-01 20:45 UTC
[Xen-devel] Re: pciback doesn't take CardBus device
On Fri, Oct 01, 2010 at 03:36:45PM -0500, Huang2, Wei wrote:
> Hi Konrad,
>
> I found that pciback doesn't accept CardBus device. It only handles
> type-0 and type-1. Any specific reason to skip it? That caused some
> trouble for firewire passthru on my laptop. I want to know the reason
> before submitting a patch.

No reason at all. Was this working in the past (2.6.18?). I will gladly accept any patch.
I haven't tested 2.6.18 yet, but will do. The issue I found is with the following configuration. These devices are all behind the same bridge, but because 46:06.5 is a CardBus bridge and can't be assigned, it blocks the other devices from being assigned to a guest VM. I will create a patch for it.

Thanks,
-Wei

==========
46:06.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 06)
46:06.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 25)
46:06.2 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 14)
46:06.3 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 14)
46:06.4 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev 14)
46:06.5 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev bb)
==========
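For readers unfamiliar with the "type-0 and type-1" distinction in this sub-thread: the PCI header type is a single byte at offset 0x0E of a device's configuration space, where the low seven bits select the header layout (0 = normal device, 1 = PCI-to-PCI bridge, 2 = CardBus bridge) and bit 7 flags a multi-function device. The following is only a minimal sketch of that decoding in Python. The function names are made up for illustration, and it does not reproduce pciback's actual code; it merely mimics a backend that accepts header types 0 and 1 only.

```python
# Sketch only: decodes the standard PCI header type byte.
# Function names are hypothetical and not taken from pciback.

PCI_HEADER_TYPE_OFFSET = 0x0E   # header type byte in config space
HEADER_LAYOUT_MASK = 0x7F       # low 7 bits: header layout
MULTIFUNCTION_BIT = 0x80        # bit 7: multi-function device

HEADER_TYPE_NAMES = {
    0x00: "normal device (type 0)",
    0x01: "PCI-to-PCI bridge (type 1)",
    0x02: "CardBus bridge (type 2)",
}


def decode_header_type(config_space):
    """Return (layout, is_multifunction) from raw config space bytes."""
    raw = config_space[PCI_HEADER_TYPE_OFFSET]
    return raw & HEADER_LAYOUT_MASK, bool(raw & MULTIFUNCTION_BIT)


def accepted_by_type01_only_backend(config_space):
    """Mimic a backend that handles header types 0 and 1 and nothing else."""
    layout, _ = decode_header_type(config_space)
    return layout in (0x00, 0x01)


if __name__ == "__main__":
    # Synthetic 64-byte header standing in for a multi-function CardBus bridge.
    cfg = bytearray(64)
    cfg[PCI_HEADER_TYPE_OFFSET] = 0x82
    layout, multi = decode_header_type(cfg)
    print(HEADER_TYPE_NAMES.get(layout, "unknown"), "multi-function:", multi)
    print("accepted:", accepted_by_type01_only_backend(cfg))
```

On a Linux host the same bytes could be read from the device's sysfs `config` file instead of the synthetic buffer used here; the exact path depends on the PCI domain and is left out on purpose.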
Jan,

I will try your suggestion of turning off SMIs. I would also like you to conduct an experiment for me. If you can, please tell your kernel not to use any CPU power-saving modes. In Xen I use max_cstate=0 on the boot line. I have found that when I do this, the hangs appear to go away (we have had one customer report a hang since using this workaround, so it is not 100% effective).

Thanks
Roger

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com]
Sent: Mon 10/4/2010 6:27 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

On 01.10.2010 21:33, Roger Cruz wrote:
> Someone else mentioned that another possibility as to why an NMI may not
> be triggered is that the system is stuck handling an SMI interrupt. I
> haven't studied Xen code with respect to SMIs yet, but I assume that Xen
> doesn't do much in that area right? I was under the impression that the
> BIOS usually set this up and the OSs could not even modify the handlers
> as they were in protected RAM.

We happen to face strange freezes of KVM right now as well (the CPU is apparently stuck in guest mode), and turning off SMIs cures them here [1]. However, it's too early to draw final conclusions; we are still collecting test results and data on the systems. It would therefore be interesting to see if your case is similar to ours.

If you feel brave enough to turn off your SMIs (there are rumors that CPUs /could/ get fried as some thermal management /might/ be done via SMIs), please check out [2], build it (requires libpci and a kernel source tree), and run "smictrl -s 0" on your box. It should give something like this:

SMI-enabled chipset found: PCI_VENDOR_ID_INTEL:PCI_DEVICE_ID_INTEL_PCH_LPC_MIN+7 (8086:3b07)
SMI_EN register: 0006403b
new value: 00000002

If the chipset is not detected, add the PCI device ID of your ISA bridge to the list in smictrl.c. If the new value still has bit 0 set, you are unlucky, as your BIOS has locked some SMIs against disabling. Otherwise, SMIs are off now, and your lock-up /may/ disappear.

Looking forward to your results!

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/60326
[2] http://git.kiszka.org/?p=smictrl.git

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
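The bit-0 check Jan describes for the smictrl output can be sketched in a few lines of Python. This is only an illustration of interpreting the printed hex values, not part of smictrl. The constant name GBL_SMI_EN is borrowed from Intel's chipset datasheets and is an assumption here (the thread itself only says "bit 0"), and the helper names are hypothetical.

```python
# Sketch only: interprets the hex values printed by "smictrl -s 0".
# GBL_SMI_EN is an assumed name for bit 0; the thread just calls it "bit 0".

GBL_SMI_EN = 1 << 0


def parse_smi_en(hex_text):
    """Convert a printed register value such as '0006403b' to an int."""
    return int(hex_text, 16)


def smis_globally_disabled(smi_en):
    """True if bit 0 is clear, i.e. the write was not blocked by the BIOS."""
    return (smi_en & GBL_SMI_EN) == 0


def changed_bits(old, new):
    """List the bit positions that differ between two register values."""
    diff = old ^ new
    return [bit for bit in range(32) if diff & (1 << bit)]


if __name__ == "__main__":
    before = parse_smi_en("0006403b")   # example value quoted above
    after = parse_smi_en("00000002")
    print("disabled before:", smis_globally_disabled(before))
    print("disabled after: ", smis_globally_disabled(after))
    print("bits that changed:", changed_bits(before, after))
```

If the "new value" printed by the tool still has bit 0 set, `smis_globally_disabled` returns False, which corresponds to the "BIOS has locked some SMIs" case in Jan's description.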
Until Friday, all hard hangs that we and our customers had experienced were on Lenovo T500 and X200 machines, even with their latest BIOSes. The Lenovo T400 has never hung for me, and I don't have any reports of T400 hangs from the field. On Friday, I had an HP i5 hard hang with a footprint similar to that of the Lenovos. When this hard hang happens, the Xen watchdog (which is driven by the NMI handler) does not do its job of causing a crash/stack trace. This is why we have started to suspect the BIOS and SMIs, as SMIs are the only thing that can block an NMI. I am pretty certain that this is somehow related to entering C3 power states, possibly with an SMI coming in at the same time. The time it takes to hang varies from 30 minutes to 24 hours.

Roger

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com]
Sent: Monday, October 04, 2010 10:13 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

On 04.10.2010 15:56, Roger Cruz wrote:
> Jan,
>
> I will try your suggestion of turning off SMIs. I am also interested in you
> conducting an experiment for me. If you can, please tell your kernel not to use
> any CPU power saving modes. In Xen I use max_cstate=0 in the bootline. I have
> found that when I do this, the hangs appear to go away (we had one customer
> report one since using this work-around, so it is not 100% working).

Will do. My customer reported that he was able to easily crash his i7 notebook by pulling and re-plugging the power cable. I bet all of these events are trapped by the BIOS via power management SMIs...

BTW, do you see any correlation between crashable boxes and BIOS vendors? We have no representative numbers yet, just one confirmed unstable notebook that is Phoenix-based, and one AMI-based i7 server that is rock-stable.

Jan
This is a long shot, but since my thoughts jumped to it after reading this, I thought I'd post anyway.

Some systems support a special "C1E" power state that can be enabled or disabled in the BIOS. My Dell Core2Duo laptop has this feature. I remember running into some weirdness that went away when I turned it off. Perhaps the power management code is somehow entering the BIOS to see if this is enabled, and max_cstate isn't controlling it because the check is done in the BIOS, bypassing Xen? Google for C1E to find lots of information about this weird power state.
> BTW, "rmmod processor thermal" (should be equivalent to your Xen

I am not familiar with the thermal module, but my guess is that it does not control the same thing as the C3 states, which can be entered when the kernel becomes idle. I believe the thermal module plays with another type of state (P?), where it alters the voltage and frequency of the CPU to keep the CPU running, but at a particular percentage of its top speed. The C3 state causes the CPU clocks to shut down entirely, and the CPU is then awakened by an external event.

R.

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@siemens.com]
Sent: Monday, October 04, 2010 11:23 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@lists.xensource.com
Subject: Re: How to generate a HW NMI

On 04.10.2010 16:19, Roger Cruz wrote:
> Until Friday, all hard hangs that we and our customers had experienced
> were on Lenovo T500 and X200, even with their latest BIOSes.

Yeah, the T500 was reported as problematic here as well. My Fujitsu Celsius H700 also crashes. In contrast, we have positive results from a Dell server with an Asus P6T Deluxe V2 board and a Core i7 920.

> The Lenovo
> T400 has never hung for me and I don't have any reports on them from the
> field. On Friday, I had an HP i5 hard hang with similar footprint as

i5? Mmh, we only have reports from i7 so far. Which BIOS vendor?

> the Lenovos. When this hard hang happens, the Xen watchdog (which is
> driven by the NMI handler) will not do its job and cause a crash/stack
> trace.
> This is why we have started to suspect something with the BIOS
> and SMIs as they are the only thing that can block an NMI. I am pretty
> certain that this is somehow related to entering C3 power states and
> possibly at the same time an SMI comes in.

I tried various stuff under Linux as well: nmi_watchdog=1, tracing to the VGA buffer right before/after the guest-host switch (it always hangs after entry here), verifying guest interruptibility before entry (though hypervisors usually do not play with the critical bits), and read-out of host RAM (including the kernel log buffer) via FireWire. It all points to a crash outside the scope of the host OS.

> The time it takes to hang
> varies from 30mins to 24 hrs.

We are a bit luckier, maybe due to our special guest (an old RTOS in 16-bit mode): I can reproduce the hang after a few minutes.

BTW, "rmmod processor thermal" (should be equivalent to your Xen parameter) did not make a difference here.

Jan
Here is some additional info from my experiments over the weekend.

I took the Lenovo T500 and removed its internal WiFi miniPCIe card. In its place, I put a miniPCIe to PCIe converter card with a PCIe socket. Into that socket, I placed a PCIe dump card. This card has a switch that creates an SERR error when you press it. Using the utility provided by the vendor, I enabled all the bridges between the card and the CPU to carry the SERR signal, so that the CPU sees it as an NMI. I tested the setup several times. Every single time I pressed the switch, I got an NMI, followed by a kdump core. So I was sure the HW setup was working correctly.

I left two Lenovo T500s running over the weekend, and when I returned this morning, both had hung. Completely frozen. I pressed the NMI switch on both systems and got nothing. No crashes, no core dumps. It looks as if the SERR/NMI is getting ignored/blocked, or the CPU is completely shut down (STPCLK).

This experiment helps me prove that the software watchdog code in Xen was not the problem and that the NMIs are indeed getting blocked somehow. This is what I now need to investigate. The areas I want to learn more about are the SMI handler and the external chip's use of the STPCLK signal to the CPU.

As an additional bit of info, the only response we get when the systems are hung is a beep when the power cord is unplugged from or plugged into the laptop. I don't know if the beep is done via a HW module or whether ACPI/BIOS is involved.

Still looking for additional ideas.

Regards,
Roger R. Cruz
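As background to the step in Roger's experiment where every bridge on the path has to forward SERR: in conventional PCI configuration space, a device reports SERR# only if bit 8 of its Command register (offset 0x04) is set, and a bridge forwards SERR# from its secondary side only if bit 1 of its Bridge Control register (offset 0x3E) is set. The Python sketch below models just those two standard bits on plain integers. It is not the vendor utility mentioned in the thread, whose behavior is unknown; the helper names are hypothetical, and PCIe-specific registers (such as the Device Control and Root Control registers in the PCI Express capability) are deliberately not modeled.

```python
# Sketch only: models the standard PCI SERR# enable bits on plain integers.
# It does not touch hardware and is not the vendor tool from the thread.

PCI_COMMAND = 0x04              # Command register offset
PCI_COMMAND_SERR = 0x0100       # bit 8: SERR# Enable
PCI_BRIDGE_CONTROL = 0x3E       # Bridge Control register offset (type 1)
PCI_BRIDGE_CTL_SERR = 0x0002    # bit 1: SERR# Enable (forward from secondary)


def enable_serr(command):
    """Return the Command register value with SERR# reporting enabled."""
    return command | PCI_COMMAND_SERR


def enable_bridge_serr(bridge_control):
    """Return the Bridge Control value with SERR# forwarding enabled."""
    return bridge_control | PCI_BRIDGE_CTL_SERR


def path_forwards_serr(bridges):
    """Check every (command, bridge_control) pair on the path to the root.

    A single bridge with either bit clear is enough to swallow the error.
    """
    return all(
        (cmd & PCI_COMMAND_SERR) and (ctl & PCI_BRIDGE_CTL_SERR)
        for cmd, ctl in bridges
    )


if __name__ == "__main__":
    # Two hypothetical bridges; the second one has SERR# disabled.
    path = [(0x0107, 0x0002), (0x0007, 0x0000)]
    print("forwards before fix:", path_forwards_serr(path))
    fixed = [(enable_serr(c), enable_bridge_serr(b)) for c, b in path]
    print("forwards after fix: ", path_forwards_serr(fixed))
```

The point of the all-bridges check is the same one the experiment relied on: the error signal only reaches the CPU if no hop along the way has reporting or forwarding disabled.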
Disabling SMIs is part of the experiments to be conducted today or tomorrow. I will keep you posted.

On Oct 12, 2010, at 4:48 AM, "Jan Kiszka" <jan.kiszka@siemens.com> wrote:
> On 11.10.2010 23:20, Roger Cruz wrote:
>> Still looking for additional ideas.
>
> Already tried to disable SMIs?
>
> Jan
Hi Jan,

Just letting you know that I am grateful for the help you have been providing. I finally got around to doing the SMI test as you described. It takes a day or two to know for sure that the problem is not going to happen, so I will let the system run undisturbed for a while. This is the output of your tool. Bit 0 was cleared, so SMIs should be disabled at this point.

root@hedley-t500:~# ./smictrl -s 0
SMI-enabled chipset found: PCI_VENDOR_ID_INTEL:PCI_DEVICE_ID_INTEL_ICH9_1 (8086:2917)
SMI_EN register: 00062033
new value: 00000002
On Tue, Oct 12, 2010 at 08:42:13AM -0400, Roger Cruz wrote:
> Disabling SMIs is part of the experiments to be conducted today or
> tomorrow. I will keep u posted.

Soo, what happened? Machine melted down? It caught on fire?