thr3ads.net - freebsd stable - suspend/resume regression [Jul 2015]

Kevin Oberman

2015-Jul-25 22:54 UTC

suspend/resume regression

John,

I'm concerned that two issues may be getting conflated.

The issue I thought we were looking at was the failure of some systems
(T520, X220, T430) to resume after a number of PCI enhancements were MFCed.
This is completely unrelated to the USB issue I was experiencing when
trying to test the problem on HEAD. The more I think about it, the more I
think that the USB "issue" is just how things need to work.

Specifically, if you are booting a full, multi-user system from a USB
connected drive, suspending and resuming will leave the system in an
untenable condition that will force a panic. At least I don't see how the
OS can determine that the disk present on resume is unchanged from that
present when the system was suspended. Modern disk IDs greatly improve the
situation, but I am unaware of any way to be sure that a removable drive
(such as a USB) has not been removed and plugged into some other system
that might have written to it. My knowledge of such things is very dated,
going back to my days doing kernel programming about 25-30 years ago on VMS,
so someone may have resolved the issue, but I don't understand exactly how.
I guess that the risk might be low enough to just go ahead and pray that
nobody did something really, really stupid like unplugging the drive,
plugging it in elsewhere, and writing to it.

The real issue is just resuming the system after r281874 was MFCed as a
part of 284034. No USB connected file systems are involved. I'm happy to
see that it has been reverted for 10.2, but clearly, these changes are
needed down the line and I hope the issue can be resolved well before 11.0.
(This assumes a 10.3 before 11.0 happens next year.)

Thanks for the time you have spent on this and I'll be happy to help out
with testing in the future. Things will be easier now that I have a disk
with HEAD on it. I wasted way too much time trying to get HEAD to work on a
USB drive and with related issues.

Kevin Oberman, Network Engineer, Retired
E-mail: rkoberman at gmail.com
PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683

On Wed, Jul 22, 2015 at 10:46 PM, Kevin Oberman <rkoberman at gmail.com> wrote:
> On Tue, Jul 21, 2015 at 3:56 PM, John Baldwin <jhb at freebsd.org> wrote:
>
>> On Saturday, July 18, 2015 10:22:33 PM Kevin Oberman wrote:
>> > I just confirmed that my system resumes on HEAD of July 16 but fails on
>> > 10.2-BETA2. So the problem is limited to 10. I'm guessing that some other
>> > change made to pci that has not been MFCed is the cause, but it is only
>> > causing a problem on some hardware. I have seen no reports about systems
>> > other than Lenovo systems.
>>
>> So my x220 does fail with a USB disk on 10, but I also get a weird behavior
>> where it seems to wake up (disk lights up) and then goes back to sleep and
>> never resumes again.  I'm not sure if this is due to using a USB disk or
>> not.  I get the same result when I disable power management during suspend
>> which was reported to fix other laptops IIRC.
>>
>> Please try this:
>>
>> Index: sys/dev/acpica/acpi.c
>> ===================================================================
>> --- sys/dev/acpica/acpi.c       (revision 285761)
>> +++ sys/dev/acpica/acpi.c       (working copy)
>> @@ -691,7 +691,7 @@
>>  static void
>>  acpi_set_power_children(device_t dev, int state)
>>  {
>> -       device_t child, parent;
>> +       device_t child;
>>         device_t *devlist;
>>         struct pci_devinfo *dinfo;
>>         int dstate, i, numdevs;
>> @@ -703,13 +703,12 @@
>>          * Retrieve and set D-state for the sleep state if _SxD is present.
>>          * Skip children who aren't attached since they are handled separately.
>>          */
>> -       parent = device_get_parent(dev);
>>         for (i = 0; i < numdevs; i++) {
>>                 child = devlist[i];
>>                 dinfo = device_get_ivars(child);
>>                 dstate = state;
>>                 if (device_is_attached(child) &&
>> -                   acpi_device_pwr_for_sleep(parent, dev, &dstate) == 0)
>> +                   acpi_device_pwr_for_sleep(dev, child, &dstate) == 0)
>>                         acpi_set_powerstate(child, dstate);
>>         }
>>         free(devlist, M_TEMP);
>> Index: sys/dev/pci/pci.c
>> ===================================================================
>> --- sys/dev/pci/pci.c   (revision 285761)
>> +++ sys/dev/pci/pci.c   (working copy)
>> @@ -3671,7 +3671,7 @@
>>                 child = devlist[i];
>>                 dstate = state;
>>                 if (device_is_attached(child) &&
>> -                   PCIB_POWER_FOR_SLEEP(pcib, dev, &dstate) == 0)
>> +                   PCIB_POWER_FOR_SLEEP(pcib, child, &dstate) == 0)
>>                         pci_set_powerstate(child, dstate);
>>         }
>>  }
>> Index: .
>> ===================================================================
>> --- .   (revision 285761)
>> +++ .   (working copy)
>>
>> Property changes on: .
>> ___________________________________________________________________
>> Modified: svn:mergeinfo
>>    Merged /head:r274386,274397
>>
>>
>> --
>> John Baldwin
>>
>
> Tried both sysctls and the patch. Nothing worked. Ticket updated with the
> information.
> --
> Kevin Oberman, Network Engineer, Retired
> E-mail: rkoberman at gmail.com
> PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683
>
>
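
The patch quoted above changes which device is handed to the "power for
sleep" query inside the loop over children: from (parent, dev) to
(dev, child) in acpi.c, and from dev to child in pci.c. A minimal,
self-contained sketch of why that argument matters follows. It is a toy
model, not kernel code: struct toy_dev, toy_power_for_sleep() and both
set_power_children_*() functions are hypothetical names made up for
illustration, and the toy query simply returns a per-device value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device node (hypothetical); stands in for the kernel's device_t. */
struct toy_dev {
	struct toy_dev *parent;	/* bus this device sits on, NULL for the root */
	int sleep_dstate;	/* D-state this device should use while asleep */
	int cur_dstate;		/* D-state that was actually applied */
};

/*
 * Hypothetical stand-in for a "power state for sleep" query.
 * The answer depends on the device being asked about (d), not on the bus.
 * Returns 0 on success, mirroring the "== 0" checks in the quoted diff.
 */
static int
toy_power_for_sleep(const struct toy_dev *bus, const struct toy_dev *d,
    int *dstate)
{
	(void)bus;
	assert(d != NULL && dstate != NULL);
	*dstate = d->sleep_dstate;
	return (0);
}

/* Pre-patch shape: every iteration asks about dev (the bus) itself. */
static void
set_power_children_buggy(struct toy_dev *dev, struct toy_dev **kids,
    size_t n, int state)
{
	for (size_t i = 0; i < n; i++) {
		int dstate = state;

		if (toy_power_for_sleep(dev->parent, dev, &dstate) == 0)
			kids[i]->cur_dstate = dstate;
	}
}

/* Post-patch shape: each iteration asks about the child being set. */
static void
set_power_children_fixed(struct toy_dev *dev, struct toy_dev **kids,
    size_t n, int state)
{
	for (size_t i = 0; i < n; i++) {
		int dstate = state;

		if (toy_power_for_sleep(dev, kids[i], &dstate) == 0)
			kids[i]->cur_dstate = dstate;
	}
}
```

In this toy model the pre-patch shape gives every child whatever D-state
the bus reports for itself, while the post-patch shape gives each child
its own. Whether that is the whole story for the real drivers is not
something this sketch can show.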

Adrian Chadd

2015-Jul-26 00:45 UTC

suspend/resume regression

Hi,

Yes, the USB device suspend/resume thing is a more generic
suspend/resume problem. Warner has some ideas - eg, registering a "is
this a new device?" method; the device driver will check if the device
has changed upon resume and optionally go through a detach/reattach
process. So for USB it could be something about the serial or FS
label; for wifi drivers it could be the MAC / serial number of the
NIC, etc.
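
A rough sketch of what such an "is this a new device?" check could look
like, assuming the driver saves an identity record at suspend and reads
one back at resume. Everything here is hypothetical: struct dev_identity,
dev_identity_same() and dev_resume_check() are illustrative names, not
existing FreeBSD interfaces, and this is not Warner's actual design.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical identity record; not a FreeBSD kernel structure. */
struct dev_identity {
	char serial[32];	/* e.g. USB serial number or NIC MAC address */
	char fs_label[32];	/* e.g. file system label, empty if not a disk */
};

enum resume_action {
	RESUME_KEEP,		/* looks like the same device: keep it attached */
	RESUME_REATTACH		/* looks different: detach and reattach */
};

/* Compare the identity saved at suspend with the one read at resume. */
static bool
dev_identity_same(const struct dev_identity *saved,
    const struct dev_identity *current)
{
	assert(saved != NULL && current != NULL);
	return (strcmp(saved->serial, current->serial) == 0 &&
	    strcmp(saved->fs_label, current->fs_label) == 0);
}

/* Decide what the driver should do with the device on resume. */
static enum resume_action
dev_resume_check(const struct dev_identity *saved,
    const struct dev_identity *current)
{
	return (dev_identity_same(saved, current) ?
	    RESUME_KEEP : RESUME_REATTACH);
}
```

Note that a matching serial and label only says the same device is
present. It does not show that the contents are unchanged, which is the
concern Kevin raised about a drive being written to on another system
while this one was suspended.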


-adrian

Claude Buisson

2015-Jul-29 12:58 UTC

suspend/resume regression

On 07/26/2015 00:54, Kevin Oberman wrote:
> John,
>
> I'm concerned that two issues may be getting conflated.
>
> The issue I thought we were looking at was the failure of some systems
> (T520, X220, T430) to resume after a number of PCI enhancements were MFCed.
> This is completely unrelated to the USB issue I was experiencing when
> trying to test the problem on HEAD. The more I think about it, the more I
> think that the USB "issue" is just how things need to work.
>
> Specifically, if you are booting a full, multi-user system from a USB
> connected drive, suspending and resuming will leave the system in an
> untenable condition that will force a panic. At least I don't see how the
> OS can determine that the disk present on resume is unchanged from that
> present when the system was suspended. Modern disk IDs greatly improve the
> situation, but I am unaware of any way to be sure that a removable drive
> (such as a USB) has not been removed and plugged into some other system
> that might have written to it. My knowledge of such things is very dated,
> going back to my days doing kernel programming about 25-30 years ago on VMS,
> so someone may have resolved the issue, but I don't understand exactly how.
> I guess that the risk might be low enough to just go ahead and pray that
> nobody did something really, really stupid like unplugging the drive,
> plugging it in elsewhere, and writing to it.
>
> The real issue is just resuming the system after r281874 was MFCed as a
> part of 284034. No USB connected file systems are involved. I'm happy to
> see that it has been reverted for 10.2, but clearly, these changes are
> needed down the line and I hope the issue can be resolved well before 11.0.
> (This assumes a 10.3 before 11.0 happens next year.)
>
I have done some tests on my T530 at r285668 and had some (good and bad)
surprises:

0) historically i915kms+drm2 could not be loaded by loader.conf without
locking the machine, but needed to be loaded by rc.conf (kld_list). Now
these modules can be loaded by loader.conf.

1) resume does not work with a non patched kernel, but works when the
MFC of r281874 is reverted (i.e. r285863 applied) - in console mode (vt)
and X.org.

2) and now my bad surprise: when i915kms+drm2+iic*+kbdmux are not
loaded at all, suspend works (in console mode of course), but resume
does not, with either the nonpatched or the patched kernel. After resume
the screen stays black; I can log in to the system with ssh, but it
cannot be powered off or rebooted from another system. Furthermore,
the log shows unidentified _USB_ devices at resume (which never appeared
in any log before):

Jul 29 12:28:12 watson devd: Executing '/etc/rc.suspend acpi 0x03'
Jul 29 12:28:12 watson acpi: suspend at 20150729 12:28:12
Jul 29 12:28:37 watson kernel: uhub0: at usbus0, port 1, addr 1 (disconnected)
Jul 29 12:28:37 watson kernel: uhub1: at usbus1, port 1, addr 1 (disconnected)
Jul 29 12:28:37 watson kernel: ugen1.2: <vendor 0x8087> at usbus1 (disconnected)
Jul 29 12:28:37 watson kernel: uhub4: at uhub1, port 1, addr 2 (disconnected)
Jul 29 12:28:37 watson kernel: ugen1.3: <Chicony Electronics Co., Ltd.> at usbus1 (disconnected)
Jul 29 12:28:37 watson kernel: uhub2: at usbus2, port 1, addr 1 (disconnected)
Jul 29 12:28:37 watson kernel: ugen2.2: <vendor 0x8087> at usbus2 (disconnected)
Jul 29 12:28:37 watson kernel: uhub3: at uhub2, port 1, addr 2 (disconnected)
Jul 29 12:28:37 watson kernel: ugen2.3: <Logitech> at usbus2 (disconnected)
Jul 29 12:28:37 watson kernel: ums0: at uhub3, port 5, addr 3 (disconnected)
Jul 29 12:28:37 watson kernel: acpi0: cleared fixed power button status
Jul 29 12:28:37 watson kernel: em0: link state changed to DOWN
Jul 29 12:28:37 watson kernel: xhci0: Port routing mask set to 0xffffffff
Jul 29 12:28:37 watson kernel: uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
Jul 29 12:28:37 watson kernel: uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2
Jul 29 12:28:37 watson kernel: uhub2: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
Jul 29 12:28:38 watson kernel: uhub0: 8 ports with 8 removable, self powered
Jul 29 12:28:37 watson devd: Executing '/etc/rc.resume acpi 0x03'
Jul 29 12:28:38 watson acpi: resumed at 20150729 12:28:38
Jul 29 12:28:38 watson kernel: uhub2: 3 ports with 3 removable, self powered
Jul 29 12:28:38 watson kernel: uhub1: 3 ports with 3 removable, self powered
Jul 29 12:28:38 watson kernel: em0: link state changed to UP
Jul 29 12:28:38 watson devd: Executing '/etc/rc.d/dhclient quietstart em0'
Jul 29 12:28:39 watson kernel: ugen2.2: <vendor 0x8087> at usbus2
Jul 29 12:28:39 watson kernel: uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus2
Jul 29 12:28:39 watson kernel: ugen1.2: <vendor 0x8087> at usbus1
Jul 29 12:28:39 watson kernel: uhub4: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
Jul 29 12:28:40 watson kernel: uhub4: 6 ports with 6 removable, self powered
Jul 29 12:28:41 watson kernel: uhub3: 8 ports with 8 removable, self powered
Jul 29 12:28:41 watson kernel: ugen1.3: <Chicony Electronics Co., Ltd.> at usbus1
Jul 29 12:28:41 watson devd: Executing 'logger Unknown USB device: vendor 0x04f2 product 0xb2ea bus uhub4'
Jul 29 12:28:41 watson root: Unknown USB device: vendor 0x04f2 product 0xb2ea bus uhub4
Jul 29 12:28:41 watson devd: Executing 'logger Unknown USB device: vendor 0x04f2 product 0xb2ea bus uhub4'
Jul 29 12:28:41 watson root: Unknown USB device: vendor 0x04f2 product 0xb2ea bus uhub4
Jul 29 12:28:41 watson kernel: ugen2.3: <Logitech> at usbus2
Jul 29 12:28:41 watson devd: Executing 'logger Unknown USB device: vendor 0x046d product 0xc52b bus uhub3'
Jul 29 12:28:41 watson root: Unknown USB device: vendor 0x046d product 0xc52b bus uhub3
Jul 29 12:28:41 watson kernel: ums0: <Logitech USB Receiver, class 0/0, rev 2.00/24.00, addr 3> on usbus2
Jul 29 12:28:41 watson kernel: ums0: 16 buttons and [XYZT] coordinates ID=2
Jul 29 12:28:41 watson devd: Executing 'logger Unknown USB device: vendor 0x046d product 0xc52b bus uhub3'
Jul 29 12:28:41 watson root: Unknown USB device: vendor 0x046d product 0xc52b bus uhub3

I dare say that there is some mess somewhere..

4) last minute tests: I get the same resume problem as 3) supra when
booting from a USB stick with a 11-CURRENT snapshot, both
20150330-r28086 and 20150722-r285794 (and cannot obtain anything useful
from /var/log/messages)

Claude Buisson

John Baldwin

2015-Jul-30 01:25 UTC

suspend/resume regression

On Saturday, July 25, 2015 03:54:40 PM Kevin Oberman wrote:
> John,
> 
> I'm concerned that two issues may be getting conflated.
> 
> The issue I thought we were looking at was the failure of some systems
> (T520, X220, T430) to resume after a number of PCI enhancements were MFCed.
> This is completely unrelated to the USB issue I was experiencing when
> trying to test the problem on HEAD. The more I think about it, the more I
> think that the USB "issue" is just how things need to work.
Well, the USB thing could be smarter, but it's a bit of a PITA.  What if
you take the USB stick out, mess with it on another system, then plug
it back in before resume?  All the cached file data in the RAM of the
resumed system would need to be invalidated, etc.

However, I ended up copying a HEAD kernel onto my USB stick and seeing 
that I at least got the console back before it panic'd.  This was sufficient
to let me test the reversion patch via the USB stick (and would be sufficient
for seeing if we can merge it again for 10.3).
> The real issue is just resuming the system after r281874 was MFCed as a
> part of 284034. No USB connected file systems are involved. I'm happy to
> see that it has been reverted for 10.2, but clearly, these changes are
> needed down the line and I hope the issue can be resolved well before 11.0.
> (This assumes a 10.3 before 11.0 happens next year.)
So it works fine in 11.0 on my x220, as other folks also reported in the PR,
so 11.0 is fine.  It is also needed for PCI-e hotplug to work after resume
(using out-of-tree patches for PCI-e hotplug that jmg@ has).  If I merge it
to 10.3, it won't be until I've verified that whatever I merge works on my
x220 as well as the T440.

-- 
John Baldwin
