thr3ads.net - Xen devel - S3 resume issues [Jan 2013]

Ben Guthro

2013-Jan-02 13:08 UTC

S3 resume issues

I'm starting a new thread on this so as not to confuse this issue with
the other S3 issue reported by Marek Marczykowski against 4.1.
If you prefer that I continue that thread instead, please let me know, and I
will be happy to do so.

Some background:
I am attempting to chase down yet another S3 issue in the Xen-4.2 /
unstable tree, seen on some (but not all) platforms.
The machine on which I can reproduce it 100% of the time is a
Lenovo T430 (Ivy Bridge laptop).

The symptoms of the failure are that it suspends just fine, but does not
resume.
When I attempt to resume by pressing the power button, the disk LED
flashes and the CDROM activity LED flashes, but then the system seems to
put itself back to sleep, as the power LED goes back to pulsing.
Note that the soft pulsing LED is distinctly different from the crash LED
blink rate.

I have tried a number of the tricks Jan suggested to me the last time we
were down this path, so far without success.
The failure seems to happen so early in the resume process that there
is not yet a console available.

I have resorted to putting BUG() directly in the resume path
in an attempt to understand what is going on, since there seems to be
something in this path that I don't fully understand.

In xen/arch/x86/acpi/power.c, acpi_enter_sleep_state() seems to be the
code that actually puts the processor into S3.
If I put a BUG() directly before the return of this function, I never seem
to reach it. The power LED continues to pulse, as described above.
I would have expected a hypervisor crash upon attempting to wake up the
system.


Could this be a behavior caused by a bad resume vector?
If so - how would I know it was bad?

Any other ideas are welcome.

Ben


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Malcolm Crossley

2013-Jan-02 15:15 UTC

Re: S3 resume issues

Hi Ben,

On 02/01/13 13:08, Ben Guthro wrote:
> I'm starting a new thread on this, to attempt to not confuse this
> issue, with the other S3 issue reported by Marek Marczykowski against 4.1
> If you prefer I continue that thread instead, please let me know, and
> I will be happy to do so.
>
> Some background:
> I am attempting to chase down yet another S3 issue in the Xen-4.2 / 
> unstable tree, seen on some (but not all) platforms.
> The particular machine I am able to reproduce it 100% of the time is a 
> Lenovo T430 (Ivy bridge laptop)
>

To help reproduce the issue it would be good to know which Linux kernel
you were using.

Also, does Xen-4.1 work on this particular machine? If Xen-4.1 does not
work, have you confirmed that bare-metal suspend/resume works? (I'm just
covering the bases here.)

I also notice that the laptop can have NVIDIA Optimus technology. Does
this particular T430 have an NVIDIA GPU? Is there a way to force-disable
the NVIDIA GPU in the BIOS? This may help with displaying resume
progress via the display.

> The symptoms of the failure are that it suspends just fine, but does
> not resume.
> When attempting to resume, by pressing the power button - the disk LED
> flashes, and the CDROM activity LED flashes, but then the system seems
> to put itself back to sleep, as the power LED goes back to pulsing
> Note that the soft pulsing LED is distinctly different from the crash
> LED blink rate.

Is there any sign of the video POSTing (flickering screen, etc.)?

>
> I have tried a number of the tricks Jan suggested to me the last time 
> we were down this path - so far to no success.
> The failure seems to be happening so soon in the resume process, that 
> there is not yet a console available.
>
> I have resorted to putting BUG() in the code directly in the resume 
> path, in an attempt to understand what is going on - since there seems 
> to be something in this path that I don't fully understand.
>
> In xen/arch/x86/acpi/power.c - acpi_enter_sleep_state() seems to be 
> the code that actually puts the processor into S3.
> If I put a BUG() directly before the return of this function - I never 
> seem to reach this. It continues to pulse the power LED, as described 
> above.
> I would have expected a hypervisor crash upon attempting to wake up 
> the system.
>

The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S.

Malcolm

Ben Guthro

2013-Jan-02 15:31 UTC

Re: S3 resume issues

On Wed, Jan 2, 2013 at 10:15 AM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> Hi Ben,
>
>
> On 02/01/13 13:08, Ben Guthro wrote:
>
>> I'm starting a new thread on this, to attempt to not confuse this issue,
>> with the other S3 issue reported by Marek Marczykowski against 4.1
>> If you prefer I continue that thread instead, please let me know, and I
>> will be happy to do so.
>>
>> Some background:
>> I am attempting to chase down yet another S3 issue in the Xen-4.2 /
>> unstable tree, seen on some (but not all) platforms.
>> The particular machine I am able to reproduce it 100% of the time is a
>> Lenovo T430 (Ivy bridge laptop)
>>
> To help reproduce the issue it would be good to know what Linux kernel
> you were using.
>
Currently, XenClient Enterprise is using a kernel based on the Ubuntu
"precise" 3.2 kernel:
http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-precise.git;a=summary

However, we have a number of patches on top of this. One specific to S3
is Konrad's older patch set (attached acpi-s3.v9.patch):
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=shortlog;h=refs/heads/devel/acpi-s3.v9


That said, I have also tested with the latest kernel.org kernel, as we
also maintain a set of patches against the tip of development.
The same failure has been seen with the latest kernel, as well as the
latest xen-unstable code.

To minimize variables, I've stuck with the known failure case of Xen-4.2.y
and linux-3.2.

>
> Also, Does Xen-4.1 work on this particular machine? If Xen-4.1 does not
> work, have you confirmed that baremetal suspend resume works? (I'm just
> covering the base's here)
>

I have not tested 4.1, but Xen 4.0.3 works with this same kernel.
I have not tested bare metal, as I am reasonably convinced it is the
hypervisor, since Xen-4.0.3 works.


> I also notice that the laptop can have NVIDIA optimus technology. Does
> this particular T430 have an Nvidia GPU? Is there a way to force disable
> the NVIDIA GPU in the BIOS, this may help with displaying resume progress
> via the display.

Optimus is not a factor in this case - this machine is Intel GPU only.

>
>> The symptoms of the failure are that it suspends just fine, but does not
>> resume.
>> When attempting to resume, by pressing the power button - the disk LED
>> flashes, and the CDROM activity LED flashes, but then the system seems to
>> put itself back to sleep, as the power LED goes back to pulsing
>> Note that the soft pulsing LED is distinctly different from the crash LED
>> blink rate.
>>
> Is there any sign of the video POSTING (flickering screen etc) ?
>

No screen flicker, only the LED activity mentioned above.

>
>> I have tried a number of the tricks Jan suggested to me the last time we
>> were down this path - so far to no success.
>> The failure seems to be happening so soon in the resume process, that
>> there is not yet a console available.
>>
>> I have resorted to putting BUG() in the code directly in the resume path,
>> in an attempt to understand what is going on - since there seems to be
>> something in this path that I don't fully understand.
>>
>> In xen/arch/x86/acpi/power.c - acpi_enter_sleep_state() seems to be the
>> code that actually puts the processor into S3.
>> If I put a BUG() directly before the return of this function - I never
>> seem to reach this. It continues to pulse the power LED, as described above.
>> I would have expected a hypervisor crash upon attempting to wake up the
>> system.
>>
> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>

I'll take a look at this, thanks for the pointer.


Ben



Ben Guthro

2013-Jan-02 16:46 UTC

Re: S3 resume issues

On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>
> I'll take a look at this, thanks for the pointer.
>

I've tried putting a "ud2" instruction at the start of wakeup_start, and
the machine doesn't seem to crash.
I also tried a divide by zero in the same place, just for good measure.

It would appear that wakeup_start is not getting executed on resume.
Presumably, the BIOS is causing the disk and CDROM LEDs to flash while
enumerating the bus.

One difference between Xen 4.0.y and 4.2.y seems to be the removal of the
fixed boot trampoline address, from which much of this is calculated as an
offset.
Could an error in this path cause such behavior?

/btg

Pasi Kärkkäinen

2013-Jan-02 17:14 UTC

Re: S3 resume issues

On Wed, Jan 02, 2013 at 10:31:18AM -0500, Ben Guthro wrote:
>
>      I also notice that the laptop can have NVIDIA optimus technology. Does
>      this particular T430 have an Nvidia GPU? Is there a way to force disable
>      the NVIDIA GPU in the BIOS, this may help with displaying resume
>      progress via the display.
>
>    Optimus is not a factor in this case - this machine is Intel GPU only.
>

Hmm.. do you have a T430s then? I thought all T430 (without s) models have both
the IGD + Nvidia GPUs.

The T430 BIOS does have an option to disable the Nvidia GPU though.

-- Pasi

Ben Guthro

2013-Jan-02 17:20 UTC

Re: S3 resume issues

On Wed, Jan 2, 2013 at 12:14 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Wed, Jan 02, 2013 at 10:31:18AM -0500, Ben Guthro wrote:
> >
> >      I also notice that the laptop can have NVIDIA optimus technology. Does
> >      this particular T430 have an Nvidia GPU? Is there a way to force disable
> >      the NVIDIA GPU in the BIOS, this may help with displaying resume
> >      progress via the display.
> >
> >    Optimus is not a factor in this case - this machine is Intel GPU only.
> >
>
> Hmm.. do you have T430s then? I thought all T430 (without s) models have
> both the IGD + Nvidia GPUs.
>
> T430 BIOS does have an option to disable the Nvidia GPU though.
>

No, just a T430, no "s" (BTW, those are completely different machines,
from what I've seen).

root@cobrakai:~# cat /sys/class/dmi/id/product_name
23445LU

http://www.provantage.com/lenovo-23445lu~7LENO3A9.htm

root@cobrakai:~# cat /sys/class/dmi/id/product_version
ThinkPad T430

root@cobrakai:~# lspci -v
00:00.0 Host bridge: Intel Corporation Ivy Bridge DRAM Controller (rev 09)
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0
Capabilities: [e0] Vendor Specific Information: Len=0c <?>
Kernel driver in use: agpgart-intel
Kernel modules: intel-agp

00:02.0 VGA compatible controller: Intel Corporation Ivy Bridge Graphics
Controller (rev 09) (prog-if 00 [VGA controller])
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0, IRQ 303
Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
I/O ports at 5000 [size=64]
Expansion ROM at <unassigned> [disabled]
Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [a4] PCI Advanced Features
Kernel driver in use: i915
Kernel modules: i915

00:14.0 USB controller: Intel Corporation Panther Point USB xHCI Host
Controller (rev 04) (prog-if 30 [XHCI])
Subsystem: Lenovo Device 21f3
Flags: bus master, medium devsel, latency 0, IRQ 299
Memory at f2520000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [70] Power Management version 2
Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
Kernel driver in use: xhci_hcd
Kernel modules: xhci-hcd

00:16.0 Communication controller: Intel Corporation Panther Point MEI
Controller #1 (rev 04)
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0, IRQ 11
Memory at f2535000 (64-bit, non-prefetchable) [size=16]
Capabilities: [50] Power Management version 3
Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+

00:16.3 Serial controller: Intel Corporation Panther Point KT Controller
(rev 04) (prog-if 02 [16550])
Subsystem: Lenovo Device 21f3
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 19
I/O ports at 50b0 [size=8]
Memory at f253c000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [c8] Power Management version 3
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Kernel driver in use: serial

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network
Connection (rev 04)
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0, IRQ 300
Memory at f2500000 (32-bit, non-prefetchable) [size=128K]
Memory at f253b000 (32-bit, non-prefetchable) [size=4K]
I/O ports at 5080 [size=32]
Capabilities: [c8] Power Management version 2
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [e0] PCI Advanced Features
Kernel driver in use: e1000e
Kernel modules: e1000e

00:1a.0 USB controller: Intel Corporation Panther Point USB Enhanced Host
Controller #2 (rev 04) (prog-if 20 [EHCI])
Subsystem: Lenovo Device 21f3
Flags: bus master, medium devsel, latency 0, IRQ 16
Memory at f253a000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Capabilities: [58] Debug port: BAR=1 offset=00a0
Capabilities: [98] PCI Advanced Features
Kernel driver in use: ehci_hcd
Kernel modules: ehci-hcd

00:1b.0 Audio device: Intel Corporation Panther Point High Definition Audio
Controller (rev 04)
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0, IRQ 301
Memory at f2530000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [50] Power Management version 2
Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [130] Root Complex Link
Kernel driver in use: snd_hda_intel
Kernel modules: snd-hda-intel

00:1c.0 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 1
(rev c4) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 00004000-00004fff
Memory behind bridge: f1d00000-f24fffff
Prefetchable memory behind bridge: 00000000f0400000-00000000f0bfffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Lenovo Device 21f3
Capabilities: [a0] Power Management version 2
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1c.1 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 2
(rev c4) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
Memory behind bridge: f1c00000-f1cfffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Lenovo Device 21f3
Capabilities: [a0] Power Management version 2
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1c.2 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 3
(rev c4) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=04, subordinate=0b, sec-latency=0
I/O behind bridge: 00003000-00003fff
Memory behind bridge: f1400000-f1bfffff
Prefetchable memory behind bridge: 00000000f0c00000-00000000f13fffff
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Lenovo Device 21f3
Capabilities: [a0] Power Management version 2
Kernel driver in use: pcieport
Kernel modules: shpchp

00:1d.0 USB controller: Intel Corporation Panther Point USB Enhanced Host
Controller #1 (rev 04) (prog-if 20 [EHCI])
Subsystem: Lenovo Device 21f3
Flags: bus master, medium devsel, latency 0, IRQ 23
Memory at f2539000 (32-bit, non-prefetchable) [size=1K]
Capabilities: [50] Power Management version 2
Capabilities: [58] Debug port: BAR=1 offset=00a0
Capabilities: [98] PCI Advanced Features
Kernel driver in use: ehci_hcd
Kernel modules: ehci-hcd

00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
Subsystem: Lenovo Device 21f3
Flags: bus master, medium devsel, latency 0
Capabilities: [e0] Vendor Specific Information: Len=0c <?>

00:1f.2 SATA controller: Intel Corporation Panther Point 6 port SATA
Controller [AHCI mode] (rev 04) (prog-if 01 [AHCI 1.0])
Subsystem: Lenovo Device 21f3
Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 298
I/O ports at 50a8 [size=8]
I/O ports at 50bc [size=4]
I/O ports at 50a0 [size=8]
I/O ports at 50b8 [size=4]
I/O ports at 5060 [size=32]
Memory at f2538000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [70] Power Management version 3
Capabilities: [a8] SATA HBA v1.0
Capabilities: [b0] PCI Advanced Features
Kernel driver in use: ahci

00:1f.3 SMBus: Intel Corporation Panther Point SMBus Controller (rev 04)
Subsystem: Lenovo Device 21f3
Flags: medium devsel, IRQ 7
Memory at f2534000 (64-bit, non-prefetchable) [size=256]
I/O ports at efa0 [size=32]
Kernel modules: i2c-i801

02:00.0 System peripheral: Ricoh Co Ltd Device e823 (rev 07) (prog-if 01)
Subsystem: Lenovo Device 21f3
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f1d00000 (32-bit, non-prefetchable) [size=256]
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Power Management version 3
Capabilities: [80] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [800] Advanced Error Reporting
Kernel driver in use: sdhci-pci
Kernel modules: sdhci-pci

03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 (rev
34)
Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN
Flags: bus master, fast devsel, latency 0, IRQ 302
Memory at f1c00000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [c8] Power Management version 3
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [e0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 8c-70-5a-ff-ff-ae-b5-00
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi

04:00.0 Serial controller: NetMos Technology PCIe 9901 Multi-I/O Controller
(prog-if 02 [16550])
Subsystem: Device a000:1000
Flags: bus master, fast devsel, latency 0, IRQ 18
I/O ports at 3000 [size=8]
Memory at f1401000 (32-bit, non-prefetchable) [size=4K]
Memory at f1400000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [80] Power Management version 3
Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [c0] Express Legacy Endpoint, MSI 00
Capabilities: [100] Power Budgeting <?>
Capabilities: [200] Device Serial Number 88-99-ff-ee-dd-cc-bb-aa
Kernel driver in use: serial


Malcolm Crossley

2013-Jan-02 20:35 UTC

Re: S3 resume issues

On 02/01/13 16:46, Ben Guthro wrote:
> On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>
>         The actual wakeup vector is wakeup_start in
>         xen/arch/x86/boot/wakeup.S
>
>
>     I'll take a look at this, thanks for the pointer.
>
>
> I've tried putting a "ud2" instruction at the start of wakeup_start -
> and the machine doesn't seem to crash.
> I also tried a divide by zero in the same place, just for good measure.
>
> It would appear that this wakeup_start is not getting executed on resume.
> Presumably, the BIOS is causing the disk, and CDROM LEDs to flash, 
> while enumerating the bus.
>
> A difference between Xen 4.0.y and 4.2.y seems to be the removal of 
> the boot trampoline fixed address, that much of this is calculated as 
> an offset of.
> Could an error in this path cause such a behavior?

It seems the trampoline is allocated at a different location in Xen 4.2
(EBDA - 64k instead of 0x7c000). I have attached a quick patch to move
the location back to 0x7c000 to see if that helps your system. I have
compile- and boot-tested the patch but have not had time to do an S3 test
on it. Can you try it on your system?

Can you also run the following command as root in dom0:

hexdump -s 0x400 -n 32 /dev/mem

> /btg

Malcolm



Ben Guthro

2013-Jan-02 20:50 UTC

Re: S3 resume issues

On Wed, Jan 2, 2013 at 3:35 PM, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
>  On 02/01/13 16:46, Ben Guthro wrote:
>
> On Wed, Jan 2, 2013 at 10:31 AM, Ben Guthro <ben@guthro.net> wrote:
>
>>> The actual wakeup vector is wakeup_start in xen/arch/x86/boot/wakeup.S
>>
>> I'll take a look at this, thanks for the pointer.
>
> I've tried putting a "ud2" instruction at the start of wakeup_start - and
> the machine doesn't seem to crash.
> I also tried a divide by zero in the same place, just for good measure.
>
>  It would appear that this wakeup_start is not getting executed on resume.
> Presumably, the BIOS is causing the disk, and CDROM LEDs to flash, while
> enumerating the bus.
>
>  A difference between Xen 4.0.y and 4.2.y seems to be the removal of the
> boot trampoline fixed address, that much of this is calculated as an offset
> of.
> Could an error in this path cause such a behavior?
>
>
> It seems the trampoline is allocated at a different location in Xen 4.2
> (EBDA - 64k instead of 0x7c000). I have attached a quick patch to move the
> location back to 0x7c000 to see if that helps your system. I have compile
> and boot tested the patch but not had time to do a S3 test on it. Can you
> try it on your system?
>
>

That patch hard-codes it to 0x8c00, I think.
In any case, I tried this, as well as 0x7c00, but neither helped.

I also tried reverting the changeset that introduced this:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=46fce9fd2b3557c97e6ce9beec9ed17ad87d6f94

None of this seems to have any effect that I can see.

I never seem to reach wakeup_start upon resume.

I've been trying to trace through the ACPI FACS parsing to see if the math
is wrong somewhere, but so far it all looks correct.

> Can you also run the following command as root in dom0:
>
> hexdump -s 0x400 -n 32 /dev/mem
>
hexdump -s 0x400 -n 32 /dev/mem
0000400 0000 0000 0000 0000 0000 0000 0000 9d80
0000410 0026 7600 0002 0000 0000 001e 001e 0000
0000420
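
For readers following along, the dump above covers the start of the BIOS Data Area. The Python below is only a rough sketch of how it could be decoded (it is not from the thread). It assumes the conventional BDA layout, in which the 16-bit word at 0x40E holds the EBDA segment and the word at 0x413 holds the base memory size in KiB. hexdump's default output prints little-endian 16-bit words, which the helper undoes. The names parse_hexdump and word_at are illustrative.

```python
import struct

# Output copied from the hexdump run above (default format: hex offsets,
# followed by 16-bit little-endian words).
DUMP = """\
0000400 0000 0000 0000 0000 0000 0000 0000 9d80
0000410 0026 7600 0002 0000 0000 001e 001e 0000
0000420
"""

def parse_hexdump(text):
    """Return (start_offset, raw_bytes) for default-format hexdump output."""
    start, data = None, bytearray()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # the final line carries only the end offset
        if start is None:
            start = int(fields[0], 16)
        for word in fields[1:]:
            # Each printed word is a little-endian 16-bit value.
            data += struct.pack("<H", int(word, 16))
    return start, bytes(data)

start, bda = parse_hexdump(DUMP)

def word_at(addr):
    """Read the 16-bit little-endian word at physical address addr."""
    return struct.unpack_from("<H", bda, addr - start)[0]

ebda_segment = word_at(0x40E)    # assumed: EBDA segment pointer
ebda_phys = ebda_segment << 4    # real-mode segment to physical address
base_mem_kib = word_at(0x413)    # assumed: base memory size in KiB

print(f"EBDA segment   : {ebda_segment:#06x} (physical {ebda_phys:#x})")
print(f"Base memory    : {base_mem_kib} KiB (ends at {base_mem_kib * 1024:#x})")
print(f"EBDA minus 64k : {ebda_phys - 0x10000:#x}")
```

Under those assumptions the two fields agree for this dump: the EBDA starts at 0x9d800, which is also where the 630 KiB of base memory ends, so "EBDA - 64k" would work out to 0x8d800 on this machine.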



>
>  /btg
>
>
> Malcolm
>

Jan Beulich

2013-Jan-03 10:19 UTC

Re: S3 resume issues

>>> On 02.01.13 at 17:46, Ben Guthro <ben@guthro.net> wrote:
> A difference between Xen 4.0.y and 4.2.y seems to be the removal of the
> boot trampoline fixed address, that much of this is calculated as an offset
> of.
> Could an error in this path cause such a behavior?

The general expectation would be for the system to reboot in such
an event, but of course BIOS/chipset behavior matters here (i.e.
the triple fault causing the reboot might make the BIOS put the
system to sleep again because not all state was cleared properly
by that time). If anything like that happens, putting in #UD or
other exception-raising things would of course appear to make no
difference.

Since you have already tried putting the trampoline back at the prior fixed
location (which didn't make a difference, you said), there's not much
else I can suggest other than bisection starting from the 4.0.x
baseline you know works on that laptop.

This being a laptop, I suppose it doesn't have a reset switch? If
it did, rather than causing an exception (which may not have any
visible effect, as described above) you could store stuff into certain
I/O ports whose contents survive a reboot. Or - and this would work
even without a reset button - store some indicator into an unused
CMOS slot (provided you can find one that doesn't require you to
update the checksum - if nothing else, one of the date fields may
be suitable).

Jan

Ben Guthro

2013-Jan-03 16:33 UTC

Re: S3 resume issues

On Thu, Jan 3, 2013 at 5:19 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
> Having already tried putting the trampoline back at the prior fixed
> location (which didn't make a difference you said), there's not much
> else I can suggest other than bisection starting from the 4.0.x
> baseline you know works on that laptop.
>

OK, I've bisected this failure to the following changeset, and CC'ed the
original author here.
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c

I'll try reverting this at the tip of the stable-4.2 tree, and see if it
makes a difference.

Any thoughts on this would be appreciated.

/btg


Jan Beulich

2013-Jan-03 17:08 UTC

Re: S3 resume issues

>>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
> OK, I've bisected this failure to the following changeset, and CC'ed the
> original author here.
>
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
>
> I'll try reverting this at the tip of the stable-4.2 tree, and see if it
> makes a difference

Another way to double-check that this really is the one would be to
try booting with "no-mce".

> Any thoughts on this would be appreciated.

You aren't running Xen with (almost) no memory left to it, are you?

Jan

Ben Guthro

2013-Jan-03 17:28 UTC

Re: S3 resume issues

On Thu, Jan 3, 2013 at 12:08 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
> > OK, I've bisected this failure to the following changeset, and CC'ed the
> > original author here.
> >
> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
> >
> > I'll try reverting this at the tip of the stable-4.2 tree, and see if it
> > makes a difference
>
> Another thing for double checking this is really the one would be to
> try booting with "no-mce".
>

Booting this changeset with no-mce makes the failure go away.

Unfortunately, doing the same at the stable-4.2 tip does not.
So - there must be some unintended side effect here, or there are multiple
problems causing the same behavior.


>
> > Any thoughts on this would be appreciated.
>
> You aren't running Xen with (almost) no memory left to it, are you?
>

No, Xen has plenty of memory.


Ben Guthro

2013-Jan-03 21:26 UTC

Re: S3 resume issues

On Thu, Jan 3, 2013 at 12:28 PM, Ben Guthro <ben@guthro.net> wrote:
> On Thu, Jan 3, 2013 at 12:08 PM, Jan Beulich <JBeulich@suse.com> wrote:
>
>> >>> On 03.01.13 at 17:33, Ben Guthro <ben@guthro.net> wrote:
>> > OK, I've bisected this failure to the following changeset, and CC'ed the
>> > original author here.
>> >
>> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=67a93c8da5b99374ec02dbbc14a70e01ffdab20c
>> >
>> > I'll try reverting this at the tip of the stable-4.2 tree, and see if it
>> > makes a difference
>>
>> Another thing for double checking this is really the one would be to
>> try booting with "no-mce".
>>
>
> Booting this changeset with no-mce makes the failure go away.
>
> Unfortunately, doing the same at the stable-4.2 tip does not.
> So - there must be some unintended side effect here, or there are multiple
> problems causing the same behavior.
>
I've spent the day bisecting, and questioning my results.

This seems to be timing-related, as I get different results if I
suspend multiple times, or reboot and re-attempt with the same changeset.
Without a reliable way to determine "good" or "bad" for a given changeset,
bisecting across such a large number of changesets is reasonably
useless.
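
One common way to cope with an intermittent failure during bisection is to repeat the suspend/resume cycle several times before marking a changeset "good". The Python below is a back-of-the-envelope sketch of how many clean cycles that takes. It is not something proposed in this thread, and it assumes each cycle fails independently with a fixed probability p_fail on a bad changeset, which a timing-dependent bug may well violate. runs_needed is a hypothetical helper name.

```python
import math

def runs_needed(p_fail, alpha=0.05):
    """Clean suspend/resume cycles needed before calling a changeset "good".

    p_fail: assumed per-cycle failure probability on a bad changeset.
    alpha:  accepted chance of mislabeling a bad changeset as good.
    Solves (1 - p_fail) ** n <= alpha for the smallest integer n.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - p_fail))

for p in (0.5, 0.3, 0.1):
    print(f"failure rate {p:.0%}: {runs_needed(p)} clean cycles")
```

With a 50% per-cycle failure rate, five clean cycles are enough for a 5% error budget; at 10% it takes 29, which gives a feel for how quickly a flaky reproduction inflates the cost of each bisection step.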

You have suggested storing some info in  CMOS data... how would I even go
about doing that?
...and what would you suggest storing there?


/btg


Jan Beulich

2013-Jan-04 08:34 UTC

Re: S3 resume issues

>>> On 03.01.13 at 22:26, Ben Guthro <ben@guthro.net> wrote:
> You have suggested storing some info in  CMOS data... how would I even go
> about doing that?

The usual port 0x70 and 0x71 accesses, just coded directly in
assembly inside the trampoline code.

> ...and what would you suggest storing there?

Initially, just some indicator that you got to a certain point. You
could, for example, simply increment the year:

	movb	$RTC_YEAR, %al
	outb	%al, $0x70
	inb	$0x71, %al
	incb	%al
	outb	%al, $0x71

(if necessary in the place you put this, saving/restoring %eax
and/or eflags may need to be added).
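
To make the port sequence concrete, here is a small Python model of it (a sketch for illustration, not code from this thread). FakeCmos, mark_checkpoint and is_valid_bcd are hypothetical names. The only facts assumed are the usual index/data port pair at 0x70/0x71 and that RTC_YEAR is register index 9 in the MC146818 register map. It also illustrates a side effect worth keeping in mind: on RTCs that keep the year in BCD, a plain binary increment eventually yields a value that is no longer valid BCD.

```python
RTC_YEAR = 0x09  # year register index in the MC146818 register map

class FakeCmos:
    """In-memory stand-in for the index (0x70) / data (0x71) port pair."""

    def __init__(self, year):
        self.regs = {RTC_YEAR: year & 0xFF}
        self.index = 0

    def outb(self, value, port):
        if port == 0x70:
            self.index = value & 0x7F      # select a register
        elif port == 0x71:
            self.regs[self.index] = value & 0xFF

    def inb(self, port):
        return self.regs.get(self.index, 0) if port == 0x71 else 0

def mark_checkpoint(cmos):
    """Mirror the five-instruction sequence: select, read, incb, write."""
    cmos.outb(RTC_YEAR, 0x70)   # movb $RTC_YEAR, %al ; outb %al, $0x70
    al = cmos.inb(0x71)         # inb $0x71, %al
    al = (al + 1) & 0xFF        # incb %al (wraps at 8 bits)
    cmos.outb(al, 0x71)         # outb %al, $0x71
    return al

def is_valid_bcd(byte):
    """True if both nibbles are decimal digits."""
    return (byte & 0x0F) <= 9 and (byte >> 4) <= 9

cmos = FakeCmos(0x13)  # BCD encoding of year "13"
for hit in range(1, 8):
    value = mark_checkpoint(cmos)
    print(f"checkpoint hit {hit}: year byte {value:#04x}, "
          f"valid BCD: {is_valid_bcd(value)}")
```

In this model the seventh increment takes the byte from 0x19 to 0x1a, so a checkpoint placed on a frequently executed path can quickly leave the date field in a state the firmware may complain about.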

Jan

Ben Guthro

2013-Jan-11 20:32 UTC

Re: S3 resume issues

On Fri, Jan 4, 2013 at 3:34 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 03.01.13 at 22:26, Ben Guthro <ben@guthro.net> wrote:
>> You have suggested storing some info in  CMOS data... how would I even go
>> about doing that?
>
> The usual port 0x70 and 0x71 accesses, just coded directly in
> assembly inside the trampoline code.
>
>> ...and what would you suggest storing there?
>
> Initially, just some indicator that you got to a certain point. You
> could, for example, simply increment the year:
>
>         movb    $RTC_YEAR, %al
>         outb    %al, $0x70
>         inb     $0x71, %al
>         incb    %al
>         outb    %al, $0x71
>
> (if necessary in the place you put this, saving/restoring %eax
> and/or eflags may need to be added).
>
> Jan
>
I think this is a serious contender in the contest for "most tedious
way possible to debug software."

The resume process seems to be getting much further than I expected.
I've traced it as far as the following stack:

process_pending_softirqs()
__cpu_up()
cpu_up()
enable_nonboot_cpus()
enter_state()

At the point of incrementing the date in process_pending_softirqs() -
I rebooted, and got a BIOS error about the TSC being invalid...so it
would seem I wrapped that integer...oops.
Why these machines are getting stuck with pending softirqs is still a
mystery to me.

I can also change the behavior, such that it resumes once, if I put an
mdelay(1000) in the process after disable_nonboot_cpus();
However, subsequent resumes seem to hang.

As far as I can tell, this seems to happen on Lenovo laptops, but not
others (or, at least, not the ones I've checked).
It seems to happen on at least these Sandy Bridge or Ivy Bridge class ThinkPads:
T420
T430
T530

If you have any suggestions as to how to debug the softirq problem,
I'd welcome any pointers.

Ben

Ben Guthro

2013-Jan-14 22:00 UTC

Re: S3 resume issues

I''ve managed to reproduce this failure on some hardware that gives me
some hope of debugging it: A mobile Intel SDP machine.

With this machine, I have a little BIOS POST display, giving me a whole
byte of debugging information that I can use as a "got here" status,
instead of the CMOS poking.
Liberal sprinkling of writes to I/O port 0x80 once again points to
IRQ-related problems.
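
A checkpoint scheme of this kind might look roughly like the following C sketch. The port write is simulated so the example runs anywhere; `CHECKPOINT`, `port_out8`, `step_ok`, `step_hang`, `resume_path` and the code values are hypothetical, and only the use of I/O port 0x80 comes from the description above.

```c
#include <stdint.h>

#define POST_PORT 0x80                /* POST code port shown on the debug display */

static uint8_t post_display;          /* stands in for the two-digit POST display */

/* Simulated port write; on real hardware this would be an outb to 0x80. */
static void port_out8(uint16_t port, uint8_t val)
{
    if (port == POST_PORT)
        post_display = val;
}

#define CHECKPOINT(code) port_out8(POST_PORT, (uint8_t)(code))

static int step_ok(void)   { return 1; }
static int step_hang(void) { return 0; }   /* models a step that never completes */

/* Each checkpoint overwrites the previous one, so after a hang the
 * display shows the last point that was definitely reached. */
static void resume_path(void)
{
    CHECKPOINT(0x10);
    if (!step_ok())
        return;
    CHECKPOINT(0x20);
    if (!step_hang())
        return;
    CHECKPOINT(0x30);                 /* never reached in this model */
}
```

In this model the display is left at 0x20, which narrows the failure to the code between the second and third checkpoints; bisecting further means adding checkpoints inside that region.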

Removing SMP from the equation, I seem to be going through the following
code stack:

rcu_check_callbacks()
process_pending_softirqs()
rcu_barrier_action()
rcu_barrier()
enter_state()

I'm kind of wandering around in the dark, at this point - do you have
any pointers, as to what I should be looking for?

Ben

Jan Beulich

2013-Jan-15 08:33 UTC

Re: S3 resume issues

>>> On 14.01.13 at 23:00, Ben Guthro <ben@guthro.net> wrote:
> I've managed to reproduce this failure on some hardware that gives me
> some hope of debugging it: A mobile Intel SDP machine.
> 
> With this machine, I have a little BIOS POST display, giving me whole
> byte of debugging information that I can use as a "got here" status,
> instead of the CMOS poking.
> Liberal sprinkling of writes to ioport 80 once again points to IRQ
> related problems.
> 
> Removing SMP from the equation, I seem to going through the following
> code stack:
> 
> rcu_check_callbacks()
> process_pending_softirqs()
> rcu_barrier_action()
> rcu_barrier()
> enter_state()
> 
> I'm kind of wandering around in the dark, at this point - do you have
> any pointers, as to what I should be looking for?

Not immediately, i.e. without looking at what might be involved
there. But this is different from the call stack you posted
yesterday... And just to recap - are you not getting out of there,
or is the system dying in some way? In the former case, try adding
another rcu_barrier() right before the call to acpi_sleep_prepare(),
and check that num_online_cpus() is really 1 at the already present
rcu_barrier(). In the latter case, tracing it to the point where it
hangs/shuts down/crashes is probably the only way.

Jan

Ben Guthro

2013-Jan-15 12:55 UTC

Re: S3 resume issues

On Tue, Jan 15, 2013 at 3:33 AM, Jan Beulich <JBeulich@suse.com> wrote:
> Not immediately, i.e. without looking at what might be involved
> there. But this is different from the call stack you posted
> yesterday...
They both take different paths to get there, but they both seem to be
stuck in the for loop in __do_softirq()
I didn't verify the SMP case, but at least in the case of booting with
"nosmp" - the rcu_pending() call is always true - so we seem to be
stuck in an infinite loop.
> And just to recap - are you not getting out of there,
> or is the system dying in some way?
It seems to never get out of __do_softirq()
On the Lenovo systems, this seemed to exhibit itself differently than
on the Intel SDP, going back to the pulsing power LED.

So, it is not crashed, but it is certainly not proceeding the way it should.
> In the former case, try adding
> another rcu_barrier() right before the call to acpi_sleep_prepare(),
> and check that num_online_cpus() is really 1 at the already present
> rcu_barrier().
I'll give this a try if the "nosmp" tack leads nowhere, thanks.
> In the latter case, tracing it to the point where it
> hangs/shuts down/crashes is probably the only way.
>
I'll continue down the rcu_check_callbacks() path, I guess.

Ben Guthro

2013-Jan-15 18:10 UTC

Re: S3 resume issues

On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
> I'll continue down the rcu_check_callbacks() path, I guess.

I believe I've found the culprit of the issue, but am unsure of what
the proper solution is.

It looks like after resume, on these newer machines, the ns16550
registers contain all FFs, and so the timer code was getting stuck in
__ns16550_poll in the following stack:

__ns16550_poll()
execute_timer()
timer_softirq_action()
__do_softirq()
process_pending_softirqs()
rcu_barrier_action()
rcu_barrier()
enter_state()

The while loop in this function was spinning, calling
serial_rx_interrupt() over, and over again, since the LSR register was
0xFF
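
Why an LSR of 0xFF keeps the loop spinning can be shown with a small model. This is a sketch, not the Xen driver: `fake_uart`, `read_lsr`, `rx_one` and `poll_bounded` are invented names, and the only hardware fact relied on is that bit 0 of the 16550 Line Status Register means "receive data ready". An undecoded port reads back as all ones, so that bit never clears no matter how many bytes are "received". The bounded variant shows one possible guard and is not the workaround proposed in this thread.

```c
#include <stdint.h>

#define LSR_DR 0x01   /* Line Status Register bit 0: receive data ready */

/* Hypothetical stand-in for the UART. */
struct fake_uart {
    int absent;       /* nonzero: nothing answers on the bus, reads give 0xff */
    int pending;      /* bytes waiting in the receive FIFO */
};

static uint8_t read_lsr(struct fake_uart *u)
{
    if (u->absent)
        return 0xff;                  /* floating bus: every bit set, including DR */
    return u->pending > 0 ? LSR_DR : 0x00;
}

static void rx_one(struct fake_uart *u)
{
    if (!u->absent && u->pending > 0)
        u->pending--;                 /* consuming a byte eventually clears DR */
}

/* Drains the receiver like a polling loop would, but gives up after
 * `budget` iterations. Returns the number of bytes drained, or -1 if
 * the data-ready bit never cleared within the budget. */
static int poll_bounded(struct fake_uart *u, int budget)
{
    int n = 0;

    while (read_lsr(u) & LSR_DR) {
        if (n == budget)
            return -1;
        rx_one(u);
        n++;
    }
    return n;
}
```

With a responsive UART the loop ends as soon as the FIFO is empty; with an absent one, only the budget stops it, which is the unbounded spin observed in the stack above.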

A workaround seems to be to check some of the named registers at
resume time, and bail out if they contain 0xFFs:

diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index d77042e..b370581 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -342,6 +342,15 @@ static void ns16550_resume(struct serial_port *port)
                         PCI_COMMAND, uart->cr);
     }

+    if ( (((unsigned char)ns_read_reg(uart, LSR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, MCR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, IER)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, IIR)) == 0xff) &&
+         (((unsigned char)ns_read_reg(uart, LCR)) == 0xff) ) {
+        printk(KERN_ERR "ns16550 resume has bad register data!\n");
+        return;
+    }
+
     ns16550_setup_preirq(port->uart);
     ns16550_setup_postirq(port->uart);
 }


This, of course, means that you don't get any serial data after resume,
which is not ideal.


I'm going to try to figure out if there is any chipset (Panther Point)
specific initialization that should be getting done.
If you (or anyone else) have any other thoughts on this serial
initialization, please let me know.


Ben

Malcolm Crossley

2013-Jan-15 18:17 UTC

Re: S3 resume issues

On 15/01/13 18:10, Ben Guthro wrote:
> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
>> I'll continue down the rcu_check_callbacks() path, I guess.
> I believe I've found the culprit of the issue, but am unsure of what
> the proper solution is.
>
> It looks like after resume, on these newer machines, the ns16550
> registers contain all FF's - and so, the timer code was getting stuck
> in __ns16550_poll in the following stack:
>
> __ns16550_poll()
> execute_timer()
> timer_softirq_action()
> __do_softirq()
> process_pending_softirqs()
> rcu_barrier_action()
> rcu_barrier()
> enter_state()
>
> The while loop in this function was spinning, calling
> serial_rx_interrupt() over, and over again, since the LSR register was
> 0xFF
>
> A workaround seems to be to check some of the named registers at
> resume time, and bail out if they contain 0xFF's:
>
> diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
> index d77042e..b370581 100644
> --- a/xen/drivers/char/ns16550.c
> +++ b/xen/drivers/char/ns16550.c
> @@ -342,6 +342,15 @@ static void ns16550_resume(struct serial_port *port)
>                           PCI_COMMAND, uart->cr);
>       }
>
> +    if ( (((unsigned char)ns_read_reg(uart, LSR)) == 0xff) &&
> +         (((unsigned char)ns_read_reg(uart, MCR)) == 0xff) &&
> +         (((unsigned char)ns_read_reg(uart, IER)) == 0xff) &&
> +         (((unsigned char)ns_read_reg(uart, IIR)) == 0xff) &&
> +         (((unsigned char)ns_read_reg(uart, LCR)) == 0xff) ) {
> +        printk(KERN_ERR "ns16550 resume has bad register data!\n");
> +       return;
> +    }
> +
>       ns16550_setup_preirq(port->uart);
>       ns16550_setup_postirq(port->uart);
>   }
>
>
> This, of course means that you don't get any serial data after resume,
> which is not ideal.
>
>
> I'm going to try to figure out if there is any chipset (Panther Point)
> specific initialization that should be getting done.
> If you (or anyone else) has any other thoughts on this serial
> initialization, please let me know.
>
>
> Ben

You get 0xFF when there is nothing responding to the ioport. If the
16550 is on a PCI card then it could be the PCI connection has not been
set up again after the resume and you can't get to that ioport range.

Malcolm

Ben Guthro

2013-Jan-15 18:22 UTC

Re: S3 resume issues

On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
> You get 0xFF when there is nothing responding to the ioport. If the 16550 is
> on a PCI card then it could be the PCI connection has not been setup again
> after the resume and you can't get to that ioport range.

This is not a PCI card, it is an onboard card (io base 0x3f8)

Ben

Malcolm Crossley

2013-Jan-15 18:32 UTC

Re: S3 resume issues

On 15/01/13 18:22, Ben Guthro wrote:
> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
> <malcolm.crossley@citrix.com> wrote:
>> You get 0xFF when there is nothing responding to the ioport. If the 16550 is
>> on a PCI card then it could be the PCI connection has not been setup again
>> after the resume and you can't get to that ioport range.
> This is not a PCI card, it is on onboard card (io base 0x3f8)
>
> Ben

Interesting, it may be that the serial device requires some ACPI method
to be called to initialise/enable it correctly.

A serial port on an HP EliteBook 8570p we have does not seem to be
initialised after the BIOS has started. The serial port only starts
working when the Linux kernel runs the ACPI enable method (halfway
through the kernel boot). I've tried to decompile the ACPI AML and it
looks like it's enabling the serial port via a microcontroller.

It could be you have a similar microcontroller based serial port on your 
system which can only be initialised via ACPI.

It might be worth checking that the I/O decode windows are enabled on the
Panther Point chipset for the 0x3f8 port ranges. Check that bits 0-2 are
0 at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
config space.

Malcolm

Ben Guthro

2013-Jan-15 18:38 UTC

Re: S3 resume issues

On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
> On 15/01/13 18:22, Ben Guthro wrote:
>>
>> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
>> <malcolm.crossley@citrix.com> wrote:
>>>
>>> You get 0xFF when there is nothing responding to the ioport. If the 16550
>>> is on a PCI card then it could be the PCI connection has not been setup
>>> again after the resume and you can't get to that ioport range.
>>
>> This is not a PCI card, it is on onboard card (io base 0x3f8)
>>
>> Ben
>
> Interesting, it may be the serial device requires some ACPI method to be
> called to initialise/enable it correctly.
>
> A serial port on a HP Elitebook 8570p we have seems to not initialise the
> serial port after the BIOS has started. The serial only starts working when
> the Linux kernel runs the ACPI enable method (halfway through the kernel
> boot) . I've tried to decompile the ACPI AML and it looks like it's enabling
> the serial via a microcontroller.
>
> It could be you have a similar microcontroller based serial port on your
> system which can only be initialised via ACPI.
>
> It might be worth checking that the io decode windows are enabled on the
> panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are 0
> at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
> config space.

It looks like bit 0 is 1 at 0x82 (if I'm reading this correctly):

00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
	Subsystem: Intel Corporation Device 7270
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Capabilities: [e0] Vendor Specific Information: Len=0c <?>
00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00
50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00
70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0
80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00
90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00
a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02
b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00
e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00
f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00


Is that something that needs to be re-enabled at resume time?

Malcolm Crossley

2013-Jan-15 18:39 UTC

Re: S3 resume issues

On 15/01/13 18:38, Ben Guthro wrote:
> On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley
> <malcolm.crossley@citrix.com> wrote:
>> On 15/01/13 18:22, Ben Guthro wrote:
>>> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
>>> <malcolm.crossley@citrix.com> wrote:
>>>> You get 0xFF when there is nothing responding to the ioport. If the 16550
>>>> is on a PCI card then it could be the PCI connection has not been setup
>>>> again after the resume and you can't get to that ioport range.
>>> This is not a PCI card, it is on onboard card (io base 0x3f8)
>>>
>>> Ben
>> Interesting, it may be the serial device requires some ACPI method to be
>> called to initialise/enable it correctly.
>>
>> A serial port on a HP Elitebook 8570p we have seems to not initialise the
>> serial port after the BIOS has started. The serial only starts working when
>> the Linux kernel runs the ACPI enable method (halfway through the kernel
>> boot) . I've tried to decompile the ACPI AML and it looks like it's enabling
>> the serial via a microcontroller.
>>
>> It could be you have a similar microcontroller based serial port on your
>> system which can only be initialised via ACPI.
>>
>> It might be worth checking that the io decode windows are enabled on the
>> panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are 0
>> at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
>> config space.
> It looks like bit 0 is 1 at 0x82 (if I'm reading this correctly):
>
> 00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
> 	Subsystem: Intel Corporation Device 7270
> 	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Latency: 0
> 	Capabilities: [e0] Vendor Specific Information: Len=0c <?>
> 00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
> 30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
> 40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00
> 50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00
> 70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0
> 80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00
> 90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00
> a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02
> b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00
> e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00
> f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00
>
>
> Is that something that needs to be re-enabled at resume time?

Sorry, I made a mistake: bit 0 should be 1 at address 0x82.
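
Reading the two registers out of the dump can be done mechanically. The sketch below hard-codes the bytes found at offsets 0x80 (0x70) and 0x82 (0x0f) in the lspci output and decodes them. The bit layout and the base-address table are assumptions based on how the I/O decode ranges (offset 0x80) and LPC enables (offset 0x82) are commonly documented for Intel ICH/PCH LPC bridges, and should be checked against the Panther Point datasheet; the function and macro names are invented for the example.

```c
#include <stdint.h>

/* Bytes copied from the config-space dump of device 00:1f.0. */
#define DUMP_BYTE_0x80 0x70   /* low byte of the I/O decode ranges register */
#define DUMP_BYTE_0x82 0x0f   /* low byte of the LPC enables register */

/* Assumed encoding of the COMA decode range field (bits 2:0 at 0x80). */
static const uint16_t coma_base[8] = {
    0x3f8, 0x2f8, 0x220, 0x228, 0x238, 0x2e8, 0x338, 0x3e8
};

/* I/O base that the COMA window is programmed to decode. */
static uint16_t coma_decode_base(uint8_t io_dec)
{
    return coma_base[io_dec & 0x07];
}

/* Assumed: bit 0 at 0x82 enables forwarding of the COMA window to LPC. */
static int coma_enabled(uint8_t lpc_en)
{
    return lpc_en & 0x01;
}
```

Under these assumptions the dump decodes to a COMA window at 0x3f8 that is enabled, which matches the corrected expectation (bits 0-2 clear at 0x80, bit 0 set at 0x82). The dump reflects the state at the time it was taken, so it says nothing by itself about what the registers hold right after resume.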

Tomasz Wroblewski

2013-Jan-16 02:18 UTC

Re: S3 resume issues

On 02/01/13 14:08, Ben Guthro wrote:
> I'm starting a new thread on this, to attempt to not confuse this
> issue, with the other S3 issue reported by Marek Marczykowski against 4.1
> If you prefer I continue that thread instead, please let me know, and
> I will be happy to do so.
>
> Some background:
> I am attempting to chase down yet another S3 issue in the Xen-4.2 /
> unstable tree, seen on some (but not all) platforms.
> The particular machine I am able to reproduce it 100% of the time is a
> Lenovo T430 (Ivy bridge laptop)
>

I've been debugging this same issue alongside Ben, happening on a Lenovo
T520 laptop. I found that it is fixable by putting

mdelay(500)

in arch/x86/acpi/power.c : enter_state()

It pretty much doesn't matter where in this function it is placed; it
fixes the issue. Further debugging of what happens during just the period
of this mdelay() revealed that a hardware APIC timer interrupt fires
during that period which, if not serviced before the S3 suspend, will
cause a failure after resume.

So this interrupt is serviced in apic.c apic_timer_interrupt(). It 
mainly just asserts TIMER_SOFTIRQ. Indeed, replacing the mdelay() with 
raise_softirq(TIMER_SOFTIRQ) anywhere in enter_state() fixes the problem 
as well.

So my theory is that the local apic timer state is lost during the S3 
suspend, causing a failure to fire off the timer interrupt and 
subsequent failure to assert the TIMER_SOFTIRQ. Given that some timers 
in sched_credit.c and schedule.c seem to be hanging on this timer 
softirq in order to keep the scheduler going, I suspect the scheduler 
stops working properly after the resume.

Does that sound plausible? Asserting the TIMER_SOFTIRQ on the resume path
seems to be one way of fixing this; is there any better way? For
example, some tweaks to lapic_suspend() / lapic_resume() to do extra
preservation of the lapic timer/interrupt state?
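
The theory can be illustrated with a toy model of a pending-softirq bitmask. Everything here is hypothetical (`raise_softirq_model`, `process_pending`, `timer_action`, the numbering) and it is far simpler than Xen's real softirq code. It only shows the dependency being described: the timer work runs when its bit is set, so if the interrupt that normally sets the bit is lost across S3, nothing runs until something else raises it.

```c
#include <stdint.h>

enum { TIMER_SOFTIRQ_NR = 0 };   /* hypothetical numbering */

static uint32_t pending;         /* one bit per softirq */
static int timer_runs;           /* how many times the timer work ran */

/* Marks a softirq as pending; normally done from the interrupt handler. */
static void raise_softirq_model(unsigned int nr)
{
    pending |= 1u << nr;
}

static void timer_action(void)
{
    timer_runs++;                /* stands in for running expired timers */
}

/* Runs and clears every pending softirq, lowest number first. */
static void process_pending(void)
{
    while (pending) {
        unsigned int nr = 0;

        while (!(pending & (1u << nr)))
            nr++;
        pending &= ~(1u << nr);
        if (nr == TIMER_SOFTIRQ_NR)
            timer_action();
    }
}
```

In this model, calling `process_pending()` with no bit set does nothing, which corresponds to the lost interrupt; raising the timer softirq by hand on the resume path gets the timer work going again.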

Jan Beulich

2013-Jan-16 09:35 UTC

Re: S3 resume issues

>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote:
> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
>> I'll continue down the rcu_check_callbacks() path, I guess.
> 
> I believe I've found the culprit of the issue, but am unsure of what
> the proper solution is.
> 
> It looks like after resume, on these newer machines, the ns16550
> registers contain all FF's - and so, the timer code was getting stuck
> in __ns16550_poll in the following stack:

Interesting. This isn't a plug-in PCI device, is it? Which would
mean this is a BIOS bug (not bringing the device back online,
perhaps by keeping it disabled in some LPC register).
> A workaround seems to be to check some of the named registers at
> resume time, and bail out if they contain 0xFF's:
> ...
> This, of course means that you don't get any serial data after resume,
> which is not ideal.

Yeah, but better than not resuming. I.e. if we can really nail this
down to a platform issue, applying a workaround like what you
suggested would seem worth considering.

But I suppose this isn't helping on the laptop then? And to me this
would also imply that if you run without serial console, there
wouldn't be an issue.

Jan

Jan Beulich

2013-Jan-16 10:57 UTC

Re: S3 resume issues

(re-adding Cc list)

>>> On 16.01.13 at 11:43, Ben Guthro <ben@guthro.net> wrote:
> On Wed, Jan 16, 2013 at 4:35 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote:
>>> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
>>>> I'll continue down the rcu_check_callbacks() path, I guess.
>>>
>>> I believe I've found the culprit of the issue, but am unsure of what
>>> the proper solution is.
>>>
>>> It looks like after resume, on these newer machines, the ns16550
>>> registers contain all FF's - and so, the timer code was getting stuck
>>> in __ns16550_poll in the following stack:
>>
>> Interesting. This isn't a plug in PCI device, is it? Which would
>> mean this is a BIOS bug (not bringing the device back online,
>> perhaps by keeping it disabled in some LPC register).
> 
> No, it appears to be the legacy COM1 0x3f8 device.
> 
>>
>>> A workaround seems to be to check some of the named registers at
>>> resume time, and bail out if they contain 0xFF's:
>>> ...
>>> This, of course means that you don't get any serial data after resume,
>>> which is not ideal.
>>
>> Yeah, but better than not resuming. I.e. if we can really nail this
>> down to a platform issue, applying a workaround like what you
>> suggested would seem worth considering.
>>
>> But I suppose this isn't helping on the laptop then?
> 
> It seemed to resolve the hang on both the Ivy Bridge Intel Mobile SDP
> (which is effectively laptop hardware in a desktop case) - as well as
> the Lenovo T430 machines.
> 
> Unfortunately, it did not resolve it for Tomasz's machine, or another
> Sandy Bridge laptop I tried (Lenovo X230T) - so there may be more than
> one issue here.
> 
>> And to me this
>> would also imply that if you run without serial console, there
>> wouldn't be an issue.
> 
> I thought this as well - but if I read the code correctly, it seems
> that the ns16550 is set up for the legacy devices in
> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
> configured on the command line (if the hardware exists):
> 
>     /* We initialise the serial devices very early so we can get debugging. */
>     ns16550.io_base = 0x3f8;
>     ns16550.irq     = 4;
>     ns16550_init(0, &ns16550);
>     ns16550.io_base = 0x2f8;
>     ns16550.irq     = 3;
>     ns16550_init(1, &ns16550);

Yeah, but serial_resume() doesn't call their resume handlers
unless their state is serial_initialized, which it can get to only
through serial_parse_handle() seeing the right handle.

Jan
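
The gating Jan describes can be modelled in a few lines of C. This is a simplified sketch with made-up names (`port_state`, `struct port`, `driver_resume`, `resume_all`), not the actual Xen serial layer: it only shows that a port set up at boot but never selected on the command line stays out of the initialized state, so its resume handler is never invoked.

```c
#include <stddef.h>

/* Hypothetical lifecycle states for a serial port. */
enum port_state { PORT_UNUSED, PORT_PARSED, PORT_INITIALIZED };

struct port {
    enum port_state state;
    int resumed;                  /* counts calls to the resume handler */
};

/* Stands in for the driver's resume hook (the place where the
 * register reads would hit the hardware). */
static void driver_resume(struct port *p)
{
    p->resumed++;
}

/* Resume only ports that were fully initialized; ports that were
 * merely registered at boot are skipped. */
static void resume_all(struct port *ports, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ports[i].state == PORT_INITIALIZED)
            driver_resume(&ports[i]);
}
```

If this model matches the real code path, a system booted without a serial console would never touch the UART registers on resume, which is why running without the console is a useful experiment.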

Ben Guthro

2013-Jan-16 11:05 UTC

Re: S3 resume issues

On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
> (re-adding Cc list)

Apologies.

>>>> On 16.01.13 at 11:43, Ben Guthro <ben@guthro.net> wrote:
>> On Wed, Jan 16, 2013 at 4:35 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 15.01.13 at 19:10, Ben Guthro <ben@guthro.net> wrote:
>>>> On Tue, Jan 15, 2013 at 7:55 AM, Ben Guthro <ben@guthro.net> wrote:
>>>>> I'll continue down the rcu_check_callbacks() path, I guess.
>>>>
>>>> I believe I've found the culprit of the issue, but am unsure of what
>>>> the proper solution is.
>>>>
>>>> It looks like after resume, on these newer machines, the ns16550
>>>> registers contain all FF's - and so, the timer code was getting stuck
>>>> in __ns16550_poll in the following stack:
>>>
>>> Interesting. This isn't a plug in PCI device, is it? Which would
>>> mean this is a BIOS bug (not bringing the device back online,
>>> perhaps by keeping it disabled in some LPC register).
>>
>> No, it appears to be the legacy COM1 0x3f8 device.
>>
>>>
>>>> A workaround seems to be to check some of the named registers at
>>>> resume time, and bail out if they contain 0xFF's:
>>>> ...
>>>> This, of course means that you don't get any serial data after resume,
>>>> which is not ideal.
>>>
>>> Yeah, but better than not resuming. I.e. if we can really nail this
>>> down to a platform issue, applying a workaround like what you
>>> suggested would seem worth considering.
>>>
>>> But I suppose this isn't helping on the laptop then?
>>
>> It seemed to resolve the hang on both the Ivy Bridge Intel Mobile SDP
>> (which is effectively laptop hardware in a desktop case) - as well as
>> the Lenovo T430 machines.
>>
>> Unfortunately, it did not resolve it for Tomasz's machine, or another
>> Sandy Bridge laptop I tried (Lenovo X230T) - so there may be more than
>> one issue here.
>>
>>> And to me this
>>> would also imply that if you run without serial console, there
>>> wouldn't be an issue.
>>
>> I thought this as well - but if I read the code correctly, it seems
>> that the ns16550 is set up for the legacy devices in
>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>> configured on the command line (if the hardware exists):
>>
>>     /* We initialise the serial devices very early so we can get debugging. */
>>     ns16550.io_base = 0x3f8;
>>     ns16550.irq     = 4;
>>     ns16550_init(0, &ns16550);
>>     ns16550.io_base = 0x2f8;
>>     ns16550.irq     = 3;
>>     ns16550_init(1, &ns16550);
>
> Yeah, but serial_resume() doesn't call their resume handlers
> unless their state is serial_initialized, which it can get to only
> through serial_parse_handle() seeing the right handle.

Hmm, OK, I guess I missed that part.

I'll look closer today, to see if there is something in the config
space of this device that isn't getting preserved.

Jan Beulich

2013-Jan-16 11:09 UTC

Re: S3 resume issues

>>> On 16.01.13 at 12:05, Ben Guthro <ben@guthro.net> wrote:
> On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> And to me this
>>>> would also imply that if you run without serial console, there
>>>> wouldn't be an issue.
>>>
>>> I thought this as well - but if I read the code correctly, it seems
>>> that the ns16550 is set up for the legacy devices in
>>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>>> configured on the command line (if the hardware exists):
>>>
>>>     /* We initialise the serial devices very early so we can get debugging. */
>>>     ns16550.io_base = 0x3f8;
>>>     ns16550.irq     = 4;
>>>     ns16550_init(0, &ns16550);
>>>     ns16550.io_base = 0x2f8;
>>>     ns16550.irq     = 3;
>>>     ns16550_init(1, &ns16550);
>>
>> Yeah, but serial_resume() doesn't call their resume handlers
>> unless their state is serial_initialized, which it can get to only
>> through serial_parse_handle() seeing the right handle.
> 
> hmm, OK, I guess I missed that part.
> 
> I'll look closer today, to see if there is something in the config
> space of this device that isn't getting preserved.

Config space of a non-PCI device?

Jan

Ben Guthro

2013-Jan-16 11:17 UTC

Re: S3 resume issues

On Wed, Jan 16, 2013 at 6:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.01.13 at 12:05, Ben Guthro <ben@guthro.net> wrote:
>> On Wed, Jan 16, 2013 at 5:57 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> And to me this
>>>>> would also imply that if you run without serial console, there
>>>>> wouldn't be an issue.
>>>>
>>>> I thought this as well - but if I read the code correctly, it seems
>>>> that the ns16550 is set up for the legacy devices in
>>>> xen/arch/x86/setup.c _start_xen(), regardless of whether serial is
>>>> configured on the command line (if the hardware exists):
>>>>
>>>>     /* We initialise the serial devices very early so we can get debugging. */
>>>>     ns16550.io_base = 0x3f8;
>>>>     ns16550.irq     = 4;
>>>>     ns16550_init(0, &ns16550);
>>>>     ns16550.io_base = 0x2f8;
>>>>     ns16550.irq     = 3;
>>>>     ns16550_init(1, &ns16550);
>>>
>>> Yeah, but serial_resume() doesn't call their resume handlers
>>> unless their state is serial_initialized, which it can get to only
>>> through serial_parse_handle() seeing the right handle.
>>
>> hmm, OK, I guess I missed that part.
>>
>> I'll look closer today, to see if there is something in the config
>> space of this device that isn't getting preserved.
>
> Config space of a non-PCI device?

Your reply made me second-guess my assumption about the device - I
thought you meant to imply that it must be going through the PCI path
(perhaps I read too much into this)

I have been working under the assumption that the device at 0x3f8 is
not a PCI device, because of the io base... and never really verified
that it wasn't going through the PCI path. I suppose a PCI device
could provide that device, as well.

I'll have to look around for the Panther Point chipset docs, to see if
they mention anything about this.

Ben

Ben Guthro

2013-Jan-16 16:16 UTC

Re: S3 resume issues

On Tue, Jan 15, 2013 at 1:39 PM, Malcolm Crossley
<malcolm.crossley@citrix.com> wrote:
> On 15/01/13 18:38, Ben Guthro wrote:
>>
>> On Tue, Jan 15, 2013 at 1:32 PM, Malcolm Crossley
>> <malcolm.crossley@citrix.com> wrote:
>>>
>>> On 15/01/13 18:22, Ben Guthro wrote:
>>>>
>>>> On Tue, Jan 15, 2013 at 1:17 PM, Malcolm Crossley
>>>> <malcolm.crossley@citrix.com> wrote:
>>>>>
>>>>> You get 0xFF when there is nothing responding to the ioport. If the
>>>>> 16550 is on a PCI card then it could be the PCI connection has not
>>>>> been setup again after the resume and you can't get to that ioport
>>>>> range.
>>>>
>>>> This is not a PCI card, it is on onboard card (io base 0x3f8)
>>>>
>>>> Ben
>>>
>>> Interesting, it may be the serial device requires some ACPI method to be
>>> called to initialise/enable it correctly.
>>>
>>> A serial port on a HP Elitebook 8570p we have seems to not initialise the
>>> serial port after the BIOS has started. The serial only starts working
>>> when the Linux kernel runs the ACPI enable method (halfway through the
>>> kernel boot) . I've tried to decompile the ACPI AML and it looks like
>>> it's enabling the serial via a microcontroller.
>>>
>>> It could be you have a similar microcontroller based serial port on your
>>> system which can only be initialised via ACPI.
>>>
>>> It might be worth checking that the io decode windows are enabled on the
>>> panther point chipset for the 0x3f8 port ranges. Check that bits 0-2 are
>>> 0 at address 0x80 and that bit 0 is 0 at address 0x82 in PCI device 0:1f
>>> config space.
>>
>> It looks like bit 0 is 1 at 0x82 (if I'm reading this correctly):
>>
>> 00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 04)
>>         Subsystem: Intel Corporation Device 7270
>>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>         Latency: 0
>>         Capabilities: [e0] Vendor Specific Information: Len=0c <?>
>> 00: 86 80 55 1e 07 00 10 02 04 00 01 06 00 00 80 00
>> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 70 72
>> 30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
>> 40: 01 04 00 00 80 00 00 00 01 05 00 00 10 00 00 00
>> 50: f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 60: 8b 80 8a 8a 90 00 00 00 85 80 8b 85 f8 f0 00 00
>> 70: 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0 78 f0
>> 80: 70 00 0f 3c 81 06 7c 00 41 16 0c 00 c1 07 3c 00
>> 90: e1 02 1c 00 00 0f 00 00 00 00 00 00 00 00 00 00
>> a0: 14 0e a0 00 48 39 06 00 00 47 00 00 00 00 00 02
>> b0: 00 00 00 00 00 00 00 00 04 80 00 20 00 00 00 00
>> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> d0: 33 22 11 00 67 45 00 00 c0 fc 00 00 08 00 00 00
>> e0: 09 00 0c 10 00 00 00 00 a3 02 e4 02 00 00 00 00
>> f0: 01 c0 d1 fe 81 30 1a 00 87 0f 04 08 00 00 00 00
>>
>>
>> Is that something that needs to be re-enabled at resume time?
>
> Sorry I made a mistake, bit 0 should 1 at address 0x82.
>
>
It appears that Malcolm is correct, in this regard.

On the mobile SDP (and other newer laptops) - it looks like the serial
device is not part of the PCH, but a SuperIO card hanging off of the
LPC bus.
Disassembling the DSDT, and looking at the output of "lspnp -b -vv"
shows this device providing the legacy port io base addresses.

Presumably, the BIOS executes the AML at boot time to set this device
up, but we don't seem to do anything of the sort in Xen, which gives
F's when accessing the ioport.

I'm still investigating how this device might properly be re-enabled in
Xen.
