I'm starting a new thread on this, to avoid confusing this issue with
the other S3 issue reported by Marek Marczykowski against 4.1.
If you prefer I continue that thread instead, please let me know, and
I will be happy to do so.

Some background:
I am attempting to chase down yet another S3 issue in the Xen-4.2 /
unstable tree, seen on some (but not all) platforms.
The particular machine on which I am able to reproduce it 100% of the
time is a Lenovo T430 (Ivy Bridge laptop).

The symptoms of the failure are that it suspends just fine, but does
not resume.
When attempting to resume, by pressing the power button - the disk LED
flashes, and the CDROM activity LED flashes, but then the system seems
to put itself back to sleep, as the power LED goes back to pulsing.
Note that the soft pulsing LED is distinctly different from the crash
LED blink rate.

I have tried a number of the tricks Jan suggested to me the last time
we were down this path - so far with no success.
The failure seems to be happening so early in the resume process that
there is not yet a console available.

I have resorted to putting BUG() in the code directly in the resume
path, in an attempt to understand what is going on - since there seems
to be something in this path that I don't fully understand.

In xen/arch/x86/acpi/power.c - acpi_enter_sleep_state() seems to be
the code that actually puts the processor into S3.
If I put a BUG() directly before the return of this function - I never
seem to reach it. The power LED continues to pulse, as described
above.
I would have expected a hypervisor crash upon attempting to wake up
the system.

Could this behavior be caused by a bad resume vector?
If so - how would I know it was bad?

Any other ideas are welcome.

Ben

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Hi Ben,

On 02/01/13 13:08, Ben Guthro wrote:
> I'm starting a new thread on this, to attempt to not confuse this
> issue, with the other S3 issue reported by Marek Marczykowski against 4.1
> If you prefer I continue that thread instead, please let me know, and
> I will be happy to do so.
>
> Some background:
> I am attempting to chase down yet another S3 issue in the Xen-4.2 /
> unstable tree, seen on some (but not all) platforms.
> The particular machine I am able to reproduce it 100% of the time is a
> Lenovo T430 (Ivy bridge laptop)
>
To help reproduce the issue it would be good to know what Linux kernel
you were using.

Also, does Xen-4.1 work on this particular machine? If Xen-4.1 does
not work, have you confirmed that bare-metal suspend/resume works?
(I'm just covering the bases here.)

I also notice that the laptop can have NVIDIA Optimus technology. Does
this particular T430 have an NVIDIA GPU? Is there a way to force-disable
the NVIDIA GPU in the BIOS? This may help with displaying resume
progress via the display.

> The symptoms of the failure are that it suspends just fine, but does
> not resume.
> When attempting to resume, by pressing the power button - the disk LED
> flashes, and the CDROM activity LED flashes, but then the system seems
> to put itself back to sleep, as the power LED goes back to pulsing
> Note that the soft pulsing LED is distinctly different from the crash
> LED blink rate.
Is there any sign of the video POSTing (flickering screen etc.)?
>
> I have tried a number of the tricks Jan suggested to me the last time
> we were down this path - so far to no success.
> The failure seems to be happening so soon in the resume process, that
> there is not yet a console available.
>
> I have resorted to putting BUG() in the code directly in the resume
> path, in an attempt to understand what is going on - since there seems
> to be something in this path that I don't fully understand.
>
> In xen/arch/x86/acpi/power.c - acpi_enter_sleep_state() seems to be
> the code that actually puts the processor into S3.
> If I put a BUG() directly before the return of this function - I never
> seem to reach this. It continues to pulse the power LED, as described
> above.
> I would have expected a hypervisor crash upon attempting to wake up
> the system.
>
The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S

Malcolm
On Wed, Jan 2, 2013 at 10:15 AM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> Hi Ben,
>
> On 02/01/13 13:08, Ben Guthro wrote:
>> I'm starting a new thread on this, to attempt to not confuse this issue,
>> with the other S3 issue reported by Marek Marczykowski against 4.1
>> If you prefer I continue that thread instead, please let me know, and I
>> will be happy to do so.
>>
>> Some background:
>> I am attempting to chase down yet another S3 issue in the Xen-4.2 /
>> unstable tree, seen on some (but not all) platforms.
>> The particular machine I am able to reproduce it 100% of the time is a
>> Lenovo T430 (Ivy bridge laptop)
>>
> To help reproduce the issue it would be good to know what Linux kernel
> you were using.
>
Currently, XenClient Enterprise is using a kernel based off of the
Ubuntu "precise" 3.2 kernel -
http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-precise.git;a=summary

However, we have a number of patches on top of this - one specific to
S3 is Konrad's older patch series (attached acpi-s3.v9.patch)
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/acpi-s3.v9

That said, I have also tested with the latest kernel.org kernel, as we
maintain a set of patches against the tip of development as well.
This same failure has been seen with the latest kernel, as well as the
latest xen-unstable code.

To minimize variables, I've stuck with the known failure case of
Xen-4.2.y and linux-3.2.

> Also, Does Xen-4.1 work on this particular machine? If Xen-4.1 does not
> work, have you confirmed that baremetal suspend resume works? (I'm just
> covering the bases here)
>
I have not tested 4.1 - but Xen 4.0.3 works with this same kernel.
I have not tested bare metal, as I am reasonably convinced it is the
hypervisor, since Xen-4.0.3 works.

> I also notice that the laptop can have NVIDIA optimus technology. Does
> this particular T430 have an Nvidia GPU? Is there a way to force disable
> the NVIDIA GPU in the BIOS, this may help with displaying resume progress
> via the display.
Optimus is not a factor in this case - this machine is Intel GPU only.

>> The symptoms of the failure are that it suspends just fine, but does not
>> resume.
>> When attempting to resume, by pressing the power button - the disk LED
>> flashes, and the CDROM activity LED flashes, but then the system seems to
>> put itself back to sleep, as the power LED goes back to pulsing
>> Note that the soft pulsing LED is distinctly different from the crash LED
>> blink rate.
>>
> Is there any sign of the video POSTING (flickering screen etc) ?
>
No screen flicker, only the LED activity mentioned above.

>> I have tried a number of the tricks Jan suggested to me the last time we
>> were down this path - so far to no success.
>> The failure seems to be happening so soon in the resume process, that
>> there is not yet a console available.
>>
>> I have resorted to putting BUG() in the code directly in the resume path,
>> in an attempt to understand what is going on - since there seems to be
>> something in this path that I don't fully understand.
>>
>> In xen/arch/x86/acpi/power.c - acpi_enter_sleep_state() seems to be the
>> code that actually puts the processor into S3.
>> If I put a BUG() directly before the return of this function - I never
>> seem to reach this. It continues to pulse the power LED, as described above.
>> I would have expected a hypervisor crash upon attempting to wake up the
>> system.
>>
> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>
I'll take a look at this, thanks for the pointer.

Ben
On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>
> I'll take a look at this, thanks for the pointer.
>
I've tried putting a "ud2" instruction at the start of wakeup_start -
and the machine doesn't seem to crash.
I also tried a divide by zero in the same place, just for good measure.

It would appear that wakeup_start is not getting executed on resume.
Presumably, the BIOS is causing the disk and CDROM LEDs to flash
while enumerating the bus.

A difference between Xen 4.0.y and 4.2.y seems to be the removal of the
fixed boot trampoline address, from which much of this is calculated as
an offset.
Could an error in this path cause such behavior?

/btg
On Wed, Jan 02, 2013 at 10:31:18AM -0500, Ben Guthro wrote:
>
> > I also notice that the laptop can have NVIDIA optimus technology. Does
> > this particular T430 have an Nvidia GPU? Is there a way to force disable
> > the NVIDIA GPU in the BIOS, this may help with displaying resume
> > progress via the display.
>
> Optimus is not a factor in this case - this machine is Intel GPU only.
>
Hmm.. do you have a T430s then? I thought all T430 (without s) models
have both the IGD + Nvidia GPUs.

The T430 BIOS does have an option to disable the Nvidia GPU though.

-- Pasi
On Wed, Jan 2, 2013 at 12:14 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Wed, Jan 02, 2013 at 10:31:18AM -0500, Ben Guthro wrote:
> >
> > > I also notice that the laptop can have NVIDIA optimus technology. Does
> > > this particular T430 have an Nvidia GPU? Is there a way to force disable
> > > the NVIDIA GPU in the BIOS, this may help with displaying resume
> > > progress via the display.
> >
> > Optimus is not a factor in this case - this machine is Intel GPU only.
> >
>
> Hmm.. do you have T430s then? I thought all T430 (without s) models have
> both the IGD + Nvidia GPUs.
>
> T430 BIOS does have an option to disable the Nvidia GPU though.
>
No, just a T430 - no "s" - (BTW, those are completely different
machines, from what I've seen)

root@cobrakai:~# cat /sys/class/dmi/id/product_name
23445LU

http://www.provantage.com/lenovo-23445lu~7LENO3A9.htm

root@cobrakai:~# cat /sys/class/dmi/id/product_version
ThinkPad T430

root@cobrakai:~# lspci -v
00:00.0 Host bridge: Intel Corporation Ivy Bridge DRAM Controller (rev 09)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0
    Capabilities: [e0] Vendor Specific Information: Len=0c <?>
    Kernel driver in use: agpgart-intel
    Kernel modules: intel-agp

00:02.0 VGA compatible controller: Intel Corporation Ivy Bridge Graphics Controller (rev 09) (prog-if 00 [VGA controller])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 303
    Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    I/O ports at 5000 [size=64]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [d0] Power Management version 2
    Capabilities: [a4] PCI Advanced Features
    Kernel driver in use: i915
    Kernel modules: i915

00:14.0 USB controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 04) (prog-if 30 [XHCI])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, medium devsel, latency 0, IRQ 299
    Memory at f2520000 (64-bit, non-prefetchable) [size=64K]
    Capabilities: [70] Power Management version 2
    Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci-hcd

00:16.0 Communication controller: Intel Corporation Panther Point MEI Controller #1 (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 11
    Memory at f2535000 (64-bit, non-prefetchable) [size=16]
    Capabilities: [50] Power Management version 3
    Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+

00:16.3 Serial controller: Intel Corporation Panther Point KT Controller (rev 04) (prog-if 02 [16550])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 19
    I/O ports at 50b0 [size=8]
    Memory at f253c000 (32-bit, non-prefetchable) [size=4K]
    Capabilities: [c8] Power Management version 3
    Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
    Kernel driver in use: serial

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 300
    Memory at f2500000 (32-bit, non-prefetchable) [size=128K]
    Memory at f253b000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at 5080 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e
    Kernel modules: e1000e

00:1a.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #2 (rev 04) (prog-if 20 [EHCI])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, medium devsel, latency 0, IRQ 16
    Memory at f253a000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
    Kernel driver in use: ehci_hcd
    Kernel modules: ehci-hcd

00:1b.0 Audio device: Intel Corporation Panther Point High Definition Audio Controller (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 301
    Memory at f2530000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [50] Power Management version 2
    Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [130] Root Complex Link
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd-hda-intel

00:1c.0 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 1 (rev c4) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 00004000-00004fff
    Memory behind bridge: f1d00000-f24fffff
    Prefetchable memory behind bridge: 00000000f0400000-00000000f0bfffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Lenovo Device 21f3
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1c.1 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 2 (rev c4) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
    Memory behind bridge: f1c00000-f1cfffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Lenovo Device 21f3
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1c.2 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 3 (rev c4) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=04, subordinate=0b, sec-latency=0
    I/O behind bridge: 00003000-00003fff
    Memory behind bridge: f1400000-f1bfffff
    Prefetchable memory behind bridge: 00000000f0c00000-00000000f13fffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Lenovo Device 21f3
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1d.0 USB controller: Intel Corporation Panther Point USB Enhanced Host Controller #1 (rev 04) (prog-if 20 [EHCI])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, medium devsel, latency 0, IRQ 23
    Memory at f2539000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
    Kernel driver in use: ehci_hcd
    Kernel modules: ehci-hcd

00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, medium devsel, latency 0
    Capabilities: [e0] Vendor Specific Information: Len=0c <?>

00:1f.2 SATA controller: Intel Corporation Panther Point 6 port SATA Controller [AHCI mode] (rev 04) (prog-if 01 [AHCI 1.0])
    Subsystem: Lenovo Device 21f3
    Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 298
    I/O ports at 50a8 [size=8]
    I/O ports at 50bc [size=4]
    I/O ports at 50a0 [size=8]
    I/O ports at 50b8 [size=4]
    I/O ports at 5060 [size=32]
    Memory at f2538000 (32-bit, non-prefetchable) [size=2K]
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [70] Power Management version 3
    Capabilities: [a8] SATA HBA v1.0
    Capabilities: [b0] PCI Advanced Features
    Kernel driver in use: ahci

00:1f.3 SMBus: Intel Corporation Panther Point SMBus Controller (rev 04)
    Subsystem: Lenovo Device 21f3
    Flags: medium devsel, IRQ 7
    Memory at f2534000 (64-bit, non-prefetchable) [size=256]
    I/O ports at efa0 [size=32]
    Kernel modules: i2c-i801

02:00.0 System peripheral: Ricoh Co Ltd Device e823 (rev 07) (prog-if 01)
    Subsystem: Lenovo Device 21f3
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at f1d00000 (32-bit, non-prefetchable) [size=256]
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Power Management version 3
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [800] Advanced Error Reporting
    Kernel driver in use: sdhci-pci
    Kernel modules: sdhci-pci

03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 (rev 34)
    Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN
    Flags: bus master, fast devsel, latency 0, IRQ 302
    Memory at f1c00000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [c8] Power Management version 3
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number 8c-70-5a-ff-ff-ae-b5-00
    Kernel driver in use: iwlwifi
    Kernel modules: iwlwifi

04:00.0 Serial controller: NetMos Technology PCIe 9901 Multi-I/O Controller (prog-if 02 [16550])
    Subsystem: Device a000:1000
    Flags: bus master, fast devsel, latency 0, IRQ 18
    I/O ports at 3000 [size=8]
    Memory at f1401000 (32-bit, non-prefetchable) [size=4K]
    Memory at f1400000 (32-bit, non-prefetchable) [size=4K]
    Capabilities: [80] Power Management version 3
    Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
    Capabilities: [c0] Express Legacy Endpoint, MSI 00
    Capabilities: [100] Power Budgeting <?>
    Capabilities: [200] Device Serial Number 88-99-ff-ee-dd-cc-bb-aa
    Kernel driver in use: serial
On 02/01/13 16:46, Ben Guthro wrote:
> On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>
>>> The actual wakeup vector is wakeup_start in
>>> xen/arch/x86/boot/wakeup.S
>>
>> I'll take a look at this, thanks for the pointer.
>
> I've tried putting a "ud2" instruction at the start of wakeup_start -
> and the machine doesn't seem to crash.
> I also tried a divide by zero in the same place, just for good measure.
>
> It would appear that this wakeup_start is not getting executed on resume.
> Presumably, the BIOS is causing the disk, and CDROM LEDs to flash,
> while enumerating the bus.
>
> A difference between Xen 4.0.y and 4.2.y seems to be the removal of
> the boot trampoline fixed address, that much of this is calculated as
> an offset of.
> Could an error in this path cause such a behavior?
It seems the trampoline is allocated at a different location in Xen 4.2
(EBDA - 64k instead of 0x7c000). I have attached a quick patch to move
the location back to 0x7c000 to see if that helps your system. I have
compile- and boot-tested the patch but have not had time to do an S3
test on it. Can you try it on your system?

Can you also run the following command as root in dom0:

hexdump -s 0x400 -n 32 /dev/mem

>
> /btg

Malcolm
On Wed, Jan 2, 2013 at 3:35 PM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> On 02/01/13 16:46, Ben Guthro wrote:
>> On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>>>> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>>>
>>> I'll take a look at this, thanks for the pointer.
>>
>> I've tried putting a "ud2" instruction at the start of wakeup_start - and
>> the machine doesn't seem to crash.
>> I also tried a divide by zero in the same place, just for good measure.
>>
>> It would appear that this wakeup_start is not getting executed on resume.
>> Presumably, the BIOS is causing the disk, and CDROM LEDs to flash, while
>> enumerating the bus.
>>
>> A difference between Xen 4.0.y and 4.2.y seems to be the removal of the
>> boot trampoline fixed address, that much of this is calculated as an offset
>> of.
>> Could an error in this path cause such a behavior?
>
> It seems the trampoline is allocated at a different location in Xen 4.2
> (EBDA - 64k instead of 0x7c000). I have attached a quick patch to move the
> location back to 0x7c000 to see if that helps your system. I have compile
> and boot tested the patch but not had time to do a S3 test on it. Can you
> try it on your system?
>
That patch hard codes it to 0x8c00, I think.
In any case, I tried this, as well as 0x7c00, but neither helped.

I also tried reverting the changeset that introduced this:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=46fce9fd2b3557c97e6ce9beec9ed17ad87d6f94

None of this seems to have any effect that I can see.
I never seem to be reaching wakeup_start upon resume.

I've been trying to trace through the ACPI FACS parsing, to see if the
math is wrong somewhere... but so far, it all looks correct.

> Can you also run the following command as root in dom0:
>
> hexdump -s 0x400 -n 32 /dev/mem
>
hexdump -s 0x400 -n 32 /dev/mem
0000400 0000 0000 0000 0000 0000 0000 0000 9d80
0000410 0026 7600 0002 0000 0000 001e 001e 0000
0000420

>> /btg
>
> Malcolm
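For readers following along, the hexdump above can be decoded by hand. The sketch below is not Xen code; it assumes the conventional PC BIOS Data Area layout (EBDA real-mode segment as a 16-bit word at 0x40E, conventional memory size in KiB as a 16-bit word at 0x413) and that hexdump printed little-endian 16-bit words. The names bda_read16, bda_ebda_phys, bda_base_mem_bytes and t430_bda are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Conventional BIOS Data Area offsets (physical addresses). */
#define BDA_BASE          0x400u
#define BDA_EBDA_SEGMENT  0x40Eu  /* 16-bit real-mode segment of the EBDA */
#define BDA_BASE_MEM_KIB  0x413u  /* 16-bit conventional memory size, KiB */

/* Read a little-endian 16-bit word from a buffer that starts at BDA_BASE. */
uint16_t bda_read16(const uint8_t *bda, uint32_t phys)
{
    size_t off = phys - BDA_BASE;
    return (uint16_t)(bda[off] | (bda[off + 1] << 8));
}

/* Physical address of the EBDA: real-mode segment shifted left by 4. */
uint32_t bda_ebda_phys(const uint8_t *bda)
{
    return (uint32_t)bda_read16(bda, BDA_EBDA_SEGMENT) << 4;
}

/* Top of conventional memory in bytes, as reported by the BIOS. */
uint32_t bda_base_mem_bytes(const uint8_t *bda)
{
    return (uint32_t)bda_read16(bda, BDA_BASE_MEM_KIB) * 1024u;
}

/*
 * The 32 bytes from the dump in this thread. hexdump shows 16-bit words,
 * so "9d80" at offset 0x40E is the byte pair 0x80, 0x9d in memory.
 */
const uint8_t t430_bda[32] = {
    /* 0x400 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    /* 0x408 */ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x9d,
    /* 0x410 */ 0x26, 0x00, 0x00, 0x76, 0x02, 0x00, 0x00, 0x00,
    /* 0x418 */ 0x00, 0x00, 0x1e, 0x00, 0x1e, 0x00, 0x00, 0x00,
};
```

Under those assumptions, the dump gives an EBDA segment of 0x9d80, i.e. physical address 0x9d800, and a base memory size of 0x0276 = 630 KiB, which is also 0x9d800 bytes. The two BIOS fields agree with each other. If the trampoline really is placed 64 KiB below the EBDA, as suggested earlier in the thread, that would put it around 0x8d800 on this machine, though the exact alignment Xen applies is not something this sketch tries to model.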
>>> On 02.01.13 at 17:46, Ben Guthro <ben@guthro.net> wrote:
> A difference between Xen 4.0.y and 4.2.y seems to be the removal of the
> boot trampoline fixed address, that much of this is calculated as an offset
> of.
> Could an error in this path cause such a behavior?

The general expectation would be for the system to reboot in such
an event, but of course BIOS/chipset behavior matters here (i.e.
the triple fault causing the reboot might make the BIOS put the
system to sleep again because not all state was cleared properly
by that time). If anything like that happens, putting in #UD or other
exception raising things of course would appear to make no
difference.

Having already tried putting the trampoline back at the prior fixed
location (which didn't make a difference you said), there's not much
else I can suggest other than bisection starting from the 4.0.x
baseline you know works on that laptop.

This being a laptop, I suppose it doesn't have a reset switch? Since
if it did, rather than causing an exception (which may not have
any visible effect, as described above) you could store stuff into
certain I/O ports whose contents survive reboot. Or - that would
work even without reset button - store some indicator into an
unused CMOS slot (provided you can find one that doesn't require
you to update the checksum - if nothing else, one of the date
fields may be suitable).

Jan
On Thu, Jan 3, 2013 at 5:19 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
> Having already tried putting the trampoline back at the prior fixed
> location (which didn't make a difference you said), there's not much
> else I can suggest other than bisection starting from the 4.0.x
> baseline you know works on that laptop.
>
OK, I've bisected this failure to the following changeset, and CC'ed
the original author here.
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c

I'll try reverting this at the tip of the stable-4.2 tree, and see if
it makes a difference.

Any thoughts on this would be appreciated.

/btg
>>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
> OK, I've bisected this failure to the following changeset, and CC'ed the
> original author here.
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
>
> I'll try reverting this at the tip of the stable-4.2 tree, and see if it
> makes a difference

Another thing for double checking this is really the one would be to
try booting with "no-mce".

> Any thoughts on this would be appreciated.

You aren't running Xen with (almost) no memory left to it, are you?

Jan
On Thu, Jan 3, 2013 at 12:08 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
> > OK, I've bisected this failure to the following changeset, and CC'ed the
> > original author here.
> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
> >
> > I'll try reverting this at the tip of the stable-4.2 tree, and see if it
> > makes a difference
>
> Another thing for double checking this is really the one would be to
> try booting with "no-mce".
>
Booting this changeset with no-mce makes the failure go away.

Unfortunately, doing the same at the stable-4.2 tip does not.
So - there must be some unintended side effect here, or there are
multiple problems causing the same behavior.

> > Any thoughts on this would be appreciated.
>
> You aren't running Xen with (almost) no memory left to it, are you?
>
No, Xen has plenty of memory.
On Thu, Jan 3, 2013 at 12:28 PM, Ben Guthro <ben@guthro.net> wrote:
> On Thu, Jan 3, 2013 at 12:08 PM, Jan Beulich <JBeulich@suse.com> wrote:
>
>> >>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
>> > OK, I've bisected this failure to the following changeset, and CC'ed the
>> > original author here.
>> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
>> >
>> > I'll try reverting this at the tip of the stable-4.2 tree, and see if it
>> > makes a difference
>>
>> Another thing for double checking this is really the one would be to
>> try booting with "no-mce".
>>
>
> Booting this changeset with no-mce makes the failure go away.
>
> Unfortunately, doing the same at the stable-4.2 tip does not.
> So - there must be some unintended side effect here, or there are multiple
> problems causing the same behavior.
>
I've spent the day bisecting, and questioning my results.

This seems to be timing related, as I get different results if I
suspend multiple times, or reboot and re-attempt with the same
changeset.

Without a reliable way to determine "good" or "bad" for a given
changeset, bisecting across such a large number of changesets is
close to useless.

You have suggested storing some info in CMOS data... how would I even
go about doing that?
...and what would you suggest storing there?

/btg
>>> On 03.01.13 at 22:26, Ben Guthro <ben@guthro.net> wrote:
> You have suggested storing some info in CMOS data... how would I even go
> about doing that?

The usual port 0x70 and 0x71 accesses, just coded directly in
assembly inside the trampoline code.

> ...and what would you suggest storing there?

Initially, just some indicator that you got to a certain point. You
could, for example, simply increment the year:

	movb $RTC_YEAR, %al
	outb %al, $0x70
	inb $0x71, %al
	incb %al
	outb %al, $0x71

(if necessary in the place you put this, saving/restoring %eax
and/or eflags may need to be added).

Jan
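The same select/read/increment/write sequence can be modelled in C to see what it does to the stored value. This is only an illustration: the real marker has to stay in assembly inside the trampoline, and struct port_io, cmos_bump_marker, bcd_increment and the fake_* helpers are hypothetical names that exist nowhere in Xen. It also assumes the RTC keeps its fields in packed BCD, which is common but depends on the DM bit in status register B, so check the machine before relying on it. Under that assumption a plain binary increment of 0x19 yields 0x1a, which is not a valid BCD year, so the sketch increments digit by digit instead.

```c
#include <stdint.h>

#define CMOS_INDEX_PORT 0x70
#define CMOS_DATA_PORT  0x71
#define RTC_YEAR        9      /* year register index, as in mc146818rtc.h */

/* Hypothetical port I/O hooks so the logic can run without hardware. */
struct port_io {
    void    (*outb)(uint8_t value, uint16_t port);
    uint8_t (*inb)(uint16_t port);
};

/* Increment a two-digit packed-BCD value, wrapping 0x99 to 0x00. */
uint8_t bcd_increment(uint8_t bcd)
{
    uint8_t lo = bcd & 0x0f;
    uint8_t hi = bcd >> 4;

    if (++lo > 9) {
        lo = 0;
        if (++hi > 9)
            hi = 0;
    }
    return (uint8_t)((hi << 4) | lo);
}

/* Select a CMOS register, bump it as BCD, re-select it, write it back. */
uint8_t cmos_bump_marker(const struct port_io *io, uint8_t index)
{
    uint8_t value;

    io->outb(index, CMOS_INDEX_PORT);
    value = bcd_increment(io->inb(CMOS_DATA_PORT));
    io->outb(index, CMOS_INDEX_PORT);
    io->outb(value, CMOS_DATA_PORT);
    return value;
}

/* In-memory stand-in for the CMOS, for experimenting off-target only. */
static uint8_t fake_cmos[128];
static uint8_t fake_index;

static void fake_outb(uint8_t value, uint16_t port)
{
    if (port == CMOS_INDEX_PORT)
        fake_index = value & 0x7f;
    else if (port == CMOS_DATA_PORT)
        fake_cmos[fake_index] = value;
}

static uint8_t fake_inb(uint16_t port)
{
    return port == CMOS_DATA_PORT ? fake_cmos[fake_index] : 0xff;
}

const struct port_io fake_io = { fake_outb, fake_inb };
```

Each checkpoint reached then shows up as one extra year in the BIOS setup screen after the next reboot. The counter is destructive, so the date needs to be restored afterwards.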
On Fri, Jan 4, 2013 at 3:34 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 03.01.13 at 22:26, Ben Guthro <ben@guthro.net> wrote:
>> You have suggested storing some info in CMOS data... how would I even go
>> about doing that?
>
> The usual port 0x70 and 0x71 accesses, just coded directly in
> assembly inside the trampoline code.
>
>> ...and what would you suggest storing there?
>
> Initially, just some indicator that you got to a certain point. You
> could, for example, simply increment the year:
>
> 	movb $RTC_YEAR, %al
> 	outb %al, $0x70
> 	inb $0x71, %al
> 	incb %al
> 	outb %al, $0x71
>
> (if necessary in the place you put this, saving/restoring %eax
> and/or eflags may need to be added).
>
> Jan
>
I think this is a serious contender in the contest for "most tedious
way possible to debug software."

The resume process seems to be getting much further than I expected.
I've traced it as far as the following stack:

process_pending_softirqs()
__cpu_up()
cpu_up()
enable_nonboot_cpus()
enter_state()

At the point of incrementing the date in process_pending_softirqs() -
I rebooted, and got a BIOS error about the TSC being invalid... so it
would seem I wrapped that integer... oops.

Why these machines are getting stuck with pending softirqs is still a
mystery to me.

I can also change the behavior, such that it resumes once, if I put an
mdelay(1000) in the process after disable_nonboot_cpus();
However, subsequent resumes seem to hang.

As far as I can tell, this seems to happen on Lenovo laptops, but not
others (or, at least, not the ones I've checked).
It seems to happen on at least the following Sandy Bridge or Ivy
Bridge class ThinkPads:
T420
T430
T530

If you have any suggestions as to how to debug the softirq problem,
I'd welcome any pointers.

Ben
I've managed to reproduce this failure on some hardware that gives me
some hope of debugging it: a mobile Intel SDP machine.

With this machine, I have a little BIOS POST display, giving me a whole
byte of debugging information that I can use as a "got here" status,
instead of the CMOS poking.
Liberal sprinkling of writes to I/O port 0x80 once again points to
IRQ-related problems.

Removing SMP from the equation, I seem to be going through the
following code stack:

rcu_check_callbacks()
process_pending_softirqs()
rcu_barrier_action()
rcu_barrier()
enter_state()

I'm kind of wandering around in the dark at this point - do you have
any pointers as to what I should be looking for?

Ben
>>> On 14.01.13 at 23:00, Ben Guthro <ben@guthro.net> wrote:
> I've managed to reproduce this failure on some hardware that gives me
> some hope of debugging it: A mobile Intel SDP machine.
>
> With this machine, I have a little BIOS POST display, giving me whole
> byte of debugging information that I can use as a "got here" status,
> instead of the CMOS poking.
> Liberal sprinkling of writes to ioport 80 once again points to IRQ
> related problems.
>
> Removing SMP from the equation, I seem to going through the following
> code stack:
>
> rcu_check_callbacks()
> process_pending_softirqs()
> rcu_barrier_action()
> rcu_barrier()
> enter_state()
>
> I'm kind of wandering around in the dark, at this point - do you have
> any pointers, as to what I should be looking for?

Not immediately, i.e. without looking at what might be involved
there. But this is different from the call stack you posted
yesterday...

And just to recap - are you not getting out of there,
or is the system dying in some way? In the former case, try adding
another rcu_barrier() right before the call to acpi_sleep_prepare(),
and check that num_online_cpus() is really 1 at the already present
rcu_barrier(). In the latter case, tracing it to the point where it
hangs/shuts down/crashes is probably the only way.

Jan
On Tue, Jan 15, 2013 at 3:33 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
> Not immediately, i.e. without looking at what might be involved
> there. But this is different from the call stack you posted
> yesterday...

They both take different paths to get there, but they both seem to be
stuck in the for loop in __do_softirq().

I didn't verify the SMP case, but at least in the case of booting with
"nosmp" - the rcu_pending() call is always true - so we seem to be
stuck in an infinite loop.

> And just to recap - are you not getting out of there,
> or is the system dying in some way?

It seems to never get out of __do_softirq().

On the Lenovo systems, this seemed to exhibit itself differently than
on the Intel SDP, going back to the pulsing power LED.
So, it is not crashed, but it is certainly not proceeding the way it
should.

> In the former case, try adding
> another rcu_barrier() right before the call to acpi_sleep_prepare(),
> and check that num_online_cpus() is really 1 at the already present
> rcu_barrier().

I'll give this a try if the "nosmp" tack leads nowhere, thanks.

> In the latter case, tracing it to the point where it
> hangs/shuts down/crashes is probably the only way.
>

I'll continue down the rcu_check_callbacks() path, I guess.
On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
> I'll continue down the rcu_check_callbacks() path, I guess.

I believe I've found the culprit of the issue, but am unsure of what
the proper solution is.

It looks like after resume, on these newer machines, the ns16550
registers contain all FF's - and so, the timer code was getting stuck
in __ns16550_poll in the following stack:

__ns16550_poll()
execute_timer()
timer_softirq_action()
__do_softirq()
process_pending_softirqs()
rcu_barrier_action()
rcu_barrier()
enter_state()

The while loop in this function was spinning, calling
serial_rx_interrupt() over and over again, since the LSR register was
0xFF.

A workaround seems to be to check some of the named registers at
resume time, and bail out if they contain 0xFF's:

diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index d77042e..b370581 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -342,6 +342,15 @@ static void ns16550_resume(struct serial_port *port)
                          PCI_COMMAND, uart->cr);
     }
 
+    if ( (((unsigned char)ns_read_reg(uart, LSR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, MCR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, IER)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, IIR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, LCR)) == 0xff) ) {
+        printk(KERN_ERR "ns16550 resume has bad register data!\n");
+        return;
+    }
+
     ns16550_setup_preirq(port->uart);
     ns16550_setup_postirq(port->uart);
 }

This, of course, means that you don't get any serial data after resume,
which is not ideal.

I'm going to try to figure out if there is any chipset (Panther Point)
specific initialization that should be getting done.
If you (or anyone else) has any other thoughts on this serial
initialization, please let me know.

Ben
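To make the failure mode concrete, here is a self-contained sketch of why an all-0xFF UART keeps a receive poll loop spinning. It assumes the loop is keyed on the LSR data-ready bit (bit 0), which is always set when the register reads 0xFF. The accessor type, the two fake UARTs, and the iteration budget are hypothetical and are not taken from the Xen driver.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 16550 register offsets and the LSR data-ready bit. */
enum { REG_IER = 1, REG_IIR = 2, REG_LCR = 3, REG_MCR = 4, REG_LSR = 5 };
#define LSR_DR 0x01

/* Hypothetical accessor type standing in for ns_read_reg(). */
typedef uint8_t (*read_reg_fn)(unsigned int reg);

/* Nothing decodes the port range: every read floats to 0xff. */
uint8_t absent_uart_read(unsigned int reg)
{
    (void)reg;
    return 0xff;
}

/* Idle but present UART: transmitter empty (0x60), no receive data. */
uint8_t idle_uart_read(unsigned int reg)
{
    return (reg == REG_LSR) ? 0x60 : 0x00;
}

/*
 * Drain receive data while LSR reports some, giving up after 'budget'
 * iterations. Returns the number of bytes handled, or -1 if the budget ran
 * out, which is where an unbounded loop would spin forever.
 */
int poll_rx_bounded(read_reg_fn rd, int budget)
{
    int handled = 0;

    assert(budget > 0);
    while (rd(REG_LSR) & LSR_DR) {
        if (handled == budget)
            return -1;
        handled++;               /* stands in for serial_rx_interrupt() */
    }
    return handled;
}

/* Same five-register test as the workaround patch. */
int uart_looks_absent(read_reg_fn rd)
{
    return rd(REG_LSR) == 0xff && rd(REG_MCR) == 0xff &&
           rd(REG_IER) == 0xff && rd(REG_IIR) == 0xff &&
           rd(REG_LCR) == 0xff;
}
```

Checking several registers rather than LSR alone reduces the chance of mistaking a busy but present UART for a missing one, since a working device is unlikely to read 0xFF in all five at once.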
On 15/01/13 18:10, Ben Guthro wrote:> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote: >> I''ll continue down the rcu_check_callbacks() path, I guess. > I believe I''ve found the culprit of the issue, but am unsure of what > the proper solution is. > > It looks like after resume, on these newer machines, the ns16550 > registers contain all FF''s - and so, the timer code was getting stuck > in > __ns16550_poll in the following stack: > > __ns16550_poll() > execute_timer() > timer_softirq_action() > __do_softirq() > process_pending_softirqs() > rcu_barrier_action() > rcu_barrier() > enter_state() > > The while loop in this function was spinning, calling > serial_rx_interrupt() over, and over again, since the LSR register was > 0xFF > > A workaround seems to be to check some of the named registers at > resume time, and bail out if they contain 0xFF''s: > > diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c > index d77042e..b370581 100644 > --- a/xen/drivers/char/ns16550.c > +++ b/xen/drivers/char/ns16550.c > @@ -342,6 +342,15 @@ static void ns16550_resume(struct serial_port *port) > PCI_COMMAND, uart->cr); > } > > + if ( (((unsigned char)ns_read_reg(uart, LSR)) == 0xff) && > + (((unsigned char)ns_read_reg(uart, MCR)) == 0xff) && > + (((unsigned char)ns_read_reg(uart, IER)) == 0xff) && > + (((unsigned char)ns_read_reg(uart, IIR)) == 0xff) && > + (((unsigned char)ns_read_reg(uart, LCR)) == 0xff) ) { > + printk(KERN_ERR "ns16550 resume has bad register data!\n"); > + return; > + } > + > ns16550_setup_preirq(port->uart); > ns16550_setup_postirq(port->uart); > } > > > This, of course means that you don''t get any serial data after resume, > which is not ideal. > > > I''m going to try to figure out if there is any chipset (Panther Point) > specific initialization that should be getting done. > If you (or anyone else) has any other thoughts on this serial > initialization, please let me know. 
>
>
> Ben

You get 0xFF when there is nothing responding to the ioport. If the
16550 is on a PCI card then it could be the PCI connection has not been
set up again after the resume and you can't get to that ioport range.

Malcolm
On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
> You get 0xFF when there is nothing responding to the ioport. If the 16550 is
> on a PCI card then it could be the PCI connection has not been setup again
> after the resume and you can't get to that ioport range.

This is not a PCI card, it is an onboard card (io base 0x3f8)

Ben
On 15/01/13 18:22, Ben Guthro wrote:
> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
> <malcolm.crossley@citrix.com> wrote:
>> You get 0xFF when there is nothing responding to the ioport. If the 16550 is
>> on a PCI card then it could be the PCI connection has not been setup again
>> after the resume and you can't get to that ioport range.
> This is not a PCI card, it is on onboard card (io base 0x3f8)
>
> Ben

Interesting, it may be the serial device requires some ACPI method to be
called to initialise/enable it correctly.

A serial port on a HP Elitebook 8570p we have seems to not initialise the
serial port after the BIOS has started. The serial only starts working when
the Linux kernel runs the ACPI enable method (halfway through the kernel
boot). I've tried to decompile the ACPI AML and it looks like it's enabling
the serial via a microcontroller.

It could be you have a similar microcontroller based serial port on your
system which can only be initialised via ACPI.

It might be worth checking that the io decode windows are enabled on the
panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are 0
at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
config space.

Malcolm
On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:> On 15/01/13 18:22, Ben Guthro wrote: >> >> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley >> <malcolm.crossley@citrix.com> wrote: >>> >>> You get 0xFF when there is nothing responding to the ioport. If the 16550 >>> is >>> on a PCI card then it could be the PCI connection has not been setup >>> again >>> after the resume and you can''t get to that ioport range. >> >> This is not a PCI card, it is on onboard card (io base 0x3f8) >> >> Ben > > Interesting, it may be the serial device requires some ACPI method to be > called to initialise/enable it correctly. > > A serial port on a HP Elitebook 8570p we have seems to not initialise the > serial port after the BIOS has started. The serial only starts working when > the Linux kernel runs the ACPI enable method (halfway through the kernel > boot) . I''ve tried to decompile the ACPI AML and it looks like it''s enabling > the serial via a microcontroller. > > It could be you have a similar microcontroller based serial port on your > system which can only be initialised via ACPI. > > It might be worth checking that the io decode windows are enabled on the > panther point chipset for the 0x3f8 port ranges. 
> Check that bits 0-2 are 0
> at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
> config space.

It looks like bit 0 is 1 at 0x82 (if I'm reading this correctly):

00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
        Subsystem: Intel Corporation Device 7270
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Capabilities: [e0] Vendor Specific Information: Len=0c <?>
00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00
50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00
70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0
80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00
90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00
a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02
b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00
e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00
f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00

Is that something that needs to be re-enabled at resume time?
On 15/01/13 18:38, Ben Guthro wrote:> On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley > <malcolm.crossley@citrix.com> wrote: >> On 15/01/13 18:22, Ben Guthro wrote: >>> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley >>> <malcolm.crossley@citrix.com> wrote: >>>> You get 0xFF when there is nothing responding to the ioport. If the 16550 >>>> is >>>> on a PCI card then it could be the PCI connection has not been setup >>>> again >>>> after the resume and you can''t get to that ioport range. >>> This is not a PCI card, it is on onboard card (io base 0x3f8) >>> >>> Ben >> Interesting, it may be the serial device requires some ACPI method to be >> called to initialise/enable it correctly. >> >> A serial port on a HP Elitebook 8570p we have seems to not initialise the >> serial port after the BIOS has started. The serial only starts working when >> the Linux kernel runs the ACPI enable method (halfway through the kernel >> boot) . I''ve tried to decompile the ACPI AML and it looks like it''s enabling >> the serial via a microcontroller. >> >> It could be you have a similar microcontroller based serial port on your >> system which can only be initialised via ACPI. >> >> It might be worth checking that the io decode windows are enabled on the >> panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are 0 >> at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f >> config space. 
> It looks like bit 0 is 1 at 0x82 (if I'm reading this correctly):
>
> 00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
>         Subsystem: Intel Corporation Device 7270
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Capabilities: [e0] Vendor Specific Information: Len=0c <?>
> 00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
> 30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
> 40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00
> 50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00
> 70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0
> 80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00
> 90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00
> a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02
> b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00
> e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00
> f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00
>
>
> Is that something that needs to be re-enabled at resume time?

Sorry, I made a mistake: bit 0 should be 1 at address 0x82.
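The check being discussed can be worked through against the "80:" row of the lspci dump above. The sketch below is an illustration only: the helper names are made up, and the register meanings (bits 2:0 at 0x80 selecting the COM A range, with 0 meaning 0x3f8, and bit 0 at 0x82 enabling COM A decode) are assumed from the description in this thread plus the usual Intel PCH layout, so they should be confirmed against the chipset datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Config-space bytes 0x80..0x83 of 00:1f.0, copied from the dump's "80:" row. */
static const uint8_t lpc_cfg_80[4] = { 0x70, 0x00, 0x0f, 0x3c };

/* Little-endian 16-bit read; 'off' is the config-space offset (0x80 or 0x82). */
uint16_t cfg_read16(const uint8_t *row, unsigned int off)
{
    assert(off == 0x80 || off == 0x82);
    return (uint16_t)(row[off - 0x80] | (row[off - 0x80 + 1] << 8));
}

/* Bits 2:0 at 0x80: COM A decode range (assumed: 0 selects 0x3f8-0x3ff). */
unsigned int coma_decode_range(const uint8_t *row)
{
    return cfg_read16(row, 0x80) & 0x7u;
}

/* Bit 0 at 0x82: COM A decode enable (assumed: 1 means enabled). */
int coma_decode_enabled(const uint8_t *row)
{
    return cfg_read16(row, 0x82) & 0x1;
}
```

With these bytes the range field is 0 and the enable bit is 1, so on this reading the LPC bridge does forward 0x3f8 in the dumped state. The dump does not say whether it was taken before or after a suspend cycle.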
On 02/01/13 14:08, Ben Guthro wrote:
> I'm starting a new thread on this, to attempt to not confuse this
> issue, with the other S3 issue reported by Marek Marczykowski against 4.1
> If you prefer I continue that thread instead, please let me know, and
> I will be happy to do so.
>
> Some background:
> I am attempting to chase down yet another S3 issue in the Xen-4.2 /
> unstable tree, seen on some (but not all) platforms.
> The particular machine I am able to reproduce it 100% of the time is a
> Lenovo T430 (Ivy bridge laptop)
>

I've been debugging this same issue alongside Ben, happening on a
Lenovo T520 laptop.

Found out that this is fixable by putting mdelay(500) in
arch/x86/acpi/power.c : enter_state(). It pretty much doesn't matter
where in this function it is placed; it fixes the issue.

Further debugging of what happens during just the period of this
mdelay() revealed that there is a hardware APIC timer interrupt firing
during that period which, if not serviced before the S3 suspend, will
cause a failure after resume. This interrupt is serviced in apic.c
apic_timer_interrupt(). It mainly just asserts TIMER_SOFTIRQ. Indeed,
replacing the mdelay() with raise_softirq(TIMER_SOFTIRQ) anywhere in
enter_state() fixes the problem as well.

So my theory is that the local APIC timer state is lost during the S3
suspend, causing a failure to fire off the timer interrupt and a
subsequent failure to assert the TIMER_SOFTIRQ. Given that some timers
in sched_credit.c and schedule.c seem to be hanging on this timer
softirq in order to keep the scheduler going, I suspect the scheduler
stops working properly after the resume. Does that sound plausible?

Asserting the TIMER_SOFTIRQ on the resume path seems to be one way of
fixing this; is there any better way? For example, some tweaks to
lapic_suspend() / lapic_resume() to do extra preservation of the lapic
timer/interrupt state?
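The dependency chain in that theory can be shown with a toy model. This is not Xen code: the toy_* names and the single pending bitmask are invented for the example, and the model only captures the idea that timer work runs when the timer softirq bit has been raised, which normally happens from the timer interrupt.

```c
#include <assert.h>

enum { TOY_TIMER_SOFTIRQ = 0 };      /* toy numbering, not Xen's */

static unsigned long toy_pending;    /* one bit per pending softirq */
static int toy_timer_runs;           /* times the timer handler ran */

/* Mark a softirq pending; in the theory above the timer IRQ does this. */
void toy_raise_softirq(unsigned int nr)
{
    toy_pending |= 1UL << nr;
}

/* Run the timer handler only if its bit is pending, then clear the bit. */
void toy_do_softirq(void)
{
    if (toy_pending & (1UL << TOY_TIMER_SOFTIRQ)) {
        toy_pending &= ~(1UL << TOY_TIMER_SOFTIRQ);
        toy_timer_runs++;            /* stands in for running expired timers */
    }
}
```

If the interrupt that would call toy_raise_softirq() never arrives, toy_do_softirq() does nothing no matter how often it is called, which matches the observation that raising the softirq by hand in enter_state() gets things moving again.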
>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
>> I'll continue down the rcu_check_callbacks() path, I guess.
>
> I believe I've found the culprit of the issue, but am unsure of what
> the proper solution is.
>
> It looks like after resume, on these newer machines, the ns16550
> registers contain all FF's - and so, the timer code was getting stuck
> in __ns16550_poll in the following stack:

Interesting. This isn't a plug in PCI device, is it? Which would
mean this is a BIOS bug (not bringing the device back online,
perhaps by keeping it disabled in some LPC register).

> A workaround seems to be to check some of the named registers at
> resume time, and bail out if they contain 0xFF's:
> ...
> This, of course means that you don't get any serial data after resume,
> which is not ideal.

Yeah, but better than not resuming. I.e. if we can really nail this
down to a platform issue, applying a workaround like what you
suggested would seem worth considering.

But I suppose this isn't helping on the laptop then? And to me this
would also imply that if you run without serial console, there
wouldn't be an issue.

Jan
(re-adding Cc list)>>> On 16.01.13 at 11:43, Ben Guthro <ben@guthro.net> wrote: > On Wed, Jan 16, 2013 at 4:35 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote: >>> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote: >>>> I''ll continue down the rcu_check_callbacks() path, I guess. >>> >>> I believe I''ve found the culprit of the issue, but am unsure of what >>> the proper solution is. >>> >>> It looks like after resume, on these newer machines, the ns16550 >>> registers contain all FF''s - and so, the timer code was getting stuck >>> in __ns16550_poll in the following stack: >> >> Interesting. This isn''t a plug in PCI device, is it? Which would >> mean this is a BIOS bug (not bringing the device back online, >> perhaps by keeping it disabled in some LPC register). > > No, it appears to be the legacy COM1 0x3f8 device. > >> >>> A workaround seems to be to check some of the named registers at >>> resume time, and bail out if they contain 0xFF''s: >>> ... >>> This, of course means that you don''t get any serial data after resume, >>> which is not ideal. >> >> Yeah, but better than not resuming. I.e. if we can really nail this >> down to a platform issue, applying a workaround like what you >> suggested would seem worth considering. >> >> But I suppose this isn''t helping on the laptop then? > > It seemed to resolve the hang on both the Ivy Bridge Intel Mobile SDP > (which is effectively laptop hardware in a desktop case) - as well as > the Lenovo T430 machines. > > Unfortunately, it did not resolve it for Tomasz''s machine, or another > Sandy Bridge laptop I tried (Lenovo X230T) - so there may be more than > one issue here. > >> And to me this >> would also imply that if you run without serial console, there >> wouldn''t be an issue. 
>
> I thought this as well - but if I read the code correctly, it seems
> that the ns16550 is set up for the legacy devices in
> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
> configured on the command line (if the hardware exists):
>
>     /* We initialise the serial devices very early so we can get debugging. */
>     ns16550.io_base = 0x3f8;
>     ns16550.irq = 4;
>     ns16550_init(0, &ns16550);
>     ns16550.io_base = 0x2f8;
>     ns16550.irq = 3;
>     ns16550_init(1, &ns16550);

Yeah, but serial_resume() doesn't call their resume handlers
unless their state is serial_initialized, which it can get to only
through serial_parse_handle() seeing the right handle.

Jan
On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:> (re-adding Cc list) >apologies.>>>> On 16.01.13 at 11:43, Ben Guthro <ben@guthro.net> wrote: >> On Wed, Jan 16, 2013 at 4:35 AM, Jan Beulich <JBeulich@suse.com> wrote: >>>>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote: >>>> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote: >>>>> I''ll continue down the rcu_check_callbacks() path, I guess. >>>> >>>> I believe I''ve found the culprit of the issue, but am unsure of what >>>> the proper solution is. >>>> >>>> It looks like after resume, on these newer machines, the ns16550 >>>> registers contain all FF''s - and so, the timer code was getting stuck >>>> in __ns16550_poll in the following stack: >>> >>> Interesting. This isn''t a plug in PCI device, is it? Which would >>> mean this is a BIOS bug (not bringing the device back online, >>> perhaps by keeping it disabled in some LPC register). >> >> No, it appears to be the legacy COM1 0x3f8 device. >> >>> >>>> A workaround seems to be to check some of the named registers at >>>> resume time, and bail out if they contain 0xFF''s: >>>> ... >>>> This, of course means that you don''t get any serial data after resume, >>>> which is not ideal. >>> >>> Yeah, but better than not resuming. I.e. if we can really nail this >>> down to a platform issue, applying a workaround like what you >>> suggested would seem worth considering. >>> >>> But I suppose this isn''t helping on the laptop then? >> >> It seemed to resolve the hang on both the Ivy Bridge Intel Mobile SDP >> (which is effectively laptop hardware in a desktop case) - as well as >> the Lenovo T430 machines. >> >> Unfortunately, it did not resolve it for Tomasz''s machine, or another >> Sandy Bridge laptop I tried (Lenovo X230T) - so there may be more than >> one issue here. >> >>> And to me this >>> would also imply that if you run without serial console, there >>> wouldn''t be an issue. 
>>
>> I thought this as well - but if I read the code correctly, it seems
>> that the ns16550 is set up for the legacy devices in
>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>> configured on the command line (if the hardware exists):
>>
>>     /* We initialise the serial devices very early so we can get debugging. */
>>     ns16550.io_base = 0x3f8;
>>     ns16550.irq = 4;
>>     ns16550_init(0, &ns16550);
>>     ns16550.io_base = 0x2f8;
>>     ns16550.irq = 3;
>>     ns16550_init(1, &ns16550);
>
> Yeah, but serial_resume() doesn't call their resume handlers
> unless their state is serial_initialized, which it can get to only
> through serial_parse_handle() seeing the right handle.

Hmm, OK, I guess I missed that part.

I'll look closer today, to see if there is something in the config
space of this device that isn't getting preserved.
>>> On 16.01.13 at 12:05, Ben Guthro <ben@guthro.net> wrote:
> On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> And to me this
>>>> would also imply that if you run without serial console, there
>>>> wouldn't be an issue.
>>>
>>> I thought this as well - but if I read the code correctly, it seems
>>> that the ns16550 is set up for the legacy devices in
>>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>>> configured on the command line (if the hardware exists):
>>>
>>>     /* We initialise the serial devices very early so we can get debugging. */
>>>     ns16550.io_base = 0x3f8;
>>>     ns16550.irq = 4;
>>>     ns16550_init(0, &ns16550);
>>>     ns16550.io_base = 0x2f8;
>>>     ns16550.irq = 3;
>>>     ns16550_init(1, &ns16550);
>>
>> Yeah, but serial_resume() doesn't call their resume handlers
>> unless their state is serial_initialized, which it can get to only
>> through serial_parse_handle() seeing the right handle.
>
> hmm, OK, I guess I missed that part.
>
> I'll look closer today, to see if there is something in the config
> space of this device that isn't getting preserved.

Config space of a non-PCI device?

Jan
On Wed, Jan 16, 2013 at 6:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.01.13 at 12:05, Ben Guthro <ben@guthro.net> wrote:
>> On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> And to me this
>>>>> would also imply that if you run without serial console, there
>>>>> wouldn't be an issue.
>>>>
>>>> I thought this as well - but if I read the code correctly, it seems
>>>> that the ns16550 is set up for the legacy devices in
>>>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>>>> configured on the command line (if the hardware exists):
>>>>
>>>>     /* We initialise the serial devices very early so we can get debugging. */
>>>>     ns16550.io_base = 0x3f8;
>>>>     ns16550.irq = 4;
>>>>     ns16550_init(0, &ns16550);
>>>>     ns16550.io_base = 0x2f8;
>>>>     ns16550.irq = 3;
>>>>     ns16550_init(1, &ns16550);
>>>
>>> Yeah, but serial_resume() doesn't call their resume handlers
>>> unless their state is serial_initialized, which it can get to only
>>> through serial_parse_handle() seeing the right handle.
>>
>> hmm, OK, I guess I missed that part.
>>
>> I'll look closer today, to see if there is something in the config
>> space of this device that isn't getting preserved.
>
> Config space of a non-PCI device?

Your reply made me second-guess my assumption about the device - I
thought you meant to imply that it must be going through the PCI path
(perhaps I read too much into this).

I have been working under the assumption that the device at 0x3f8 is
not a PCI device, because of the io base... and never really verified
that it wasn't going through the PCI path.
I suppose a PCI device could provide that device, as well.

I'll have to look around for the Panther Point chipset docs, to see if
they mention anything about this.

Ben
On Tue, Jan 15, 2013 at 1:39 PM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:> On 15/01/13 18:38, Ben Guthro wrote: >> >> On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley >> <malcolm.crossley@citrix.com> wrote: >>> >>> On 15/01/13 18:22, Ben Guthro wrote: >>>> >>>> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley >>>> <malcolm.crossley@citrix.com> wrote: >>>>> >>>>> You get 0xFF when there is nothing responding to the ioport. If the >>>>> 16550 >>>>> is >>>>> on a PCI card then it could be the PCI connection has not been setup >>>>> again >>>>> after the resume and you can''t get to that ioport range. >>>> >>>> This is not a PCI card, it is on onboard card (io base 0x3f8) >>>> >>>> Ben >>> >>> Interesting, it may be the serial device requires some ACPI method to be >>> called to initialise/enable it correctly. >>> >>> A serial port on a HP Elitebook 8570p we have seems to not initialise the >>> serial port after the BIOS has started. The serial only starts working >>> when >>> the Linux kernel runs the ACPI enable method (halfway through the kernel >>> boot) . I''ve tried to decompile the ACPI AML and it looks like it''s >>> enabling >>> the serial via a microcontroller. >>> >>> It could be you have a similar microcontroller based serial port on your >>> system which can only be initialised via ACPI. >>> >>> It might be worth checking that the io decode windows are enabled on the >>> panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are >>> 0 >>> at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f >>> config space. 
>> >> It looks like bit 0 is 1 at 0x82 (if I''m reading this correctly): >> >> 00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev >> 04) >> Subsystem: Intel Corporation Device 7270 >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >> ParErr- >> Stepping- SERR- FastB2B- DisINTx- >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- >> <TAbort- <MAbort- >SERR- <PERR- INTx- >> Latency: 0 >> Capabilities: [e0] Vendor Specific Information: Len=0c <?> >> 00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00 >> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72 >> 30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00 >> 40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00 >> 50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> 60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00 >> 70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 >> 80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00 >> 90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00 >> a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02 >> b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00 >> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00 >> e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00 >> f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00 >> >> >> Is that something that needs to be re-enabled at resume time? > > Sorry I made a mistake, bit 0 should 1 at address 0x82. > >It appears that Malcolm is correct, in this regard. On the mobile SDP (and other newer laptops) - it looks like the serial device is not part of the PCH, but a SuperIO card hanging off of the LPC bus. Disassembling the DSDT, and looking at the output of "lspnp -b -vv" shows this device providing the legacy port io base addresses. 
Presumably, the BIOS executes the AML at boot time to set this device
up, but we don't seem to do anything of the sort in Xen, so accesses to
the ioport return all F's.

I'm still investigating how this device might properly be re-enabled in
Xen.