Michael S. Tsirkin
2018-Nov-27 03:53 UTC
[Nouveau] 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
So a new thinkpad: 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1) Hangs whenever I try to poke at the card. It starts happily enough with [ 3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) [ 3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) [ 3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported [ 3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle [ 3.971727] nouveau: detected PR support, will not use DSM [ 3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007) [ 3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1) [ 4.009875] PM: Image not found (code -22) [ 4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB [ 4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB [ 4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found [ 4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found [ 4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0 [ 4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1 [ 4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020 [ 4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010 [ 4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010 [ 4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020 [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047 [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161 [ 4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246 [ 4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346 [ 4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). [ 4.508355] [drm] Driver supports precise vblank timestamp query. [ 4.509812] [drm] Cannot find any crtc or sizes [ 4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2 Although that type mismatch is a bit worrying. And I'm not sure what prints PM: Image not found. But after a short while it gets pretty busy: [ 52.917009] No Local Variables are initialized for Method [NVPO] [ 52.917011] No Arguments are initialized for method [NVPO] [ 52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) [ 52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) [ 52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) [ 52.917108] acpi device:00: Failed to change power state to D0 [ 52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) [ 52.969289] pci_raw_set_power_state: 2 callbacks suppressed [ 52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3 [ 53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) [ 53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3 [ 53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) [ 53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3 And then kernel proceeds to throw up errors at random places, e.g. [ 67.021892] cfg80211: failed to load regulatory.db [ 67.021895] cfg80211: failed to load regulatory.db [ 67.021897] cfg80211: failed to load regulatory.db [ 67.021900] cfg80211: failed to load regulatory.db [ 67.021927] cfg80211: failed to load regulatory.db [ 67.021928] cfg80211: failed to load regulatory.db [ 67.021932] cfg80211: failed to load regulatory.db [ 67.021934] cfg80211: failed to load regulatory.db [ 67.024463] cfg80211: failed to load regulatory.db [ 99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms. followed by soft lockups and sometimes hard lockups in places like attempts to walk skb lists. Adding runpm=0 does away with this issue. The specific test was with noaccel=1 - it does not seem to change things for me. I poked at the ACPI method NVPO and yes it does actually seem to execute a while loop waiting for some register to become 0. Which I guess never happens? Because card is in a low power state and so reads return ffffffff maybe? X isn't happy even with runpm=0 but that might be a different issue - I thought runpm=0 might be an easier place to start debugging things given there are logs of the failure. Using 4.20.0-rc3 right now. Userspace bits are from fedora 29: xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64 firmware is pretty recent: linux-firmware-20181008-88.gitc6b6265d.fc29.noarch More hints for debugging would be appreciated. If anyone wants me to play with a different kernel tree, let me know. -- MST
Mika Westerberg
2018-Nov-27 09:36 UTC
[Nouveau] 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
+linux-acpi Hi Michael, On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:> So a new thinkpad: > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1) > > Hangs whenever I try to poke at the card. It starts happily enough with > > [ 3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) > [ 3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) > [ 3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > [ 3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > [ 3.971727] nouveau: detected PR support, will not use DSM > [ 3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007) > [ 3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1) > [ 4.009875] PM: Image not found (code -22) > [ 4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB > [ 4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB > [ 4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found > [ 4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found > [ 4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0 > [ 4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1 > [ 4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020 > [ 4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010 > [ 4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010 > [ 4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020 > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047 > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161 > [ 4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246 > [ 4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346 > [ 4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). > [ 4.508355] [drm] Driver supports precise vblank timestamp query. > [ 4.509812] [drm] Cannot find any crtc or sizes > [ 4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2 > > > Although that type mismatch is a bit worrying. And I'm not sure what > prints PM: Image not found.That is fine, it just says it cannot find a hibernation image from swap device. I guess you have resume=... in the kernel command line.> But after a short while it gets pretty busy: > > > [ 52.917009] No Local Variables are initialized for Method [NVPO] > [ 52.917011] No Arguments are initialized for method [NVPO] > [ 52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > [ 52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > [ 52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > [ 52.917108] acpi device:00: Failed to change power state to D0Here it seems to fail to turn on the power resource for the device.> [ 52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > [ 52.969289] pci_raw_set_power_state: 2 callbacks suppressed > [ 52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > [ 53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > [ 53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > [ 53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > [ 53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > > > And then kernel proceeds to throw up errors at random places, e.g. > > [ 67.021892] cfg80211: failed to load regulatory.db > [ 67.021895] cfg80211: failed to load regulatory.db > [ 67.021897] cfg80211: failed to load regulatory.db > [ 67.021900] cfg80211: failed to load regulatory.db > [ 67.021927] cfg80211: failed to load regulatory.db > [ 67.021928] cfg80211: failed to load regulatory.db > [ 67.021932] cfg80211: failed to load regulatory.db > [ 67.021934] cfg80211: failed to load regulatory.db > [ 67.024463] cfg80211: failed to load regulatory.db > [ 99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms. > > followed by soft lockups and sometimes hard lockups in places > like attempts to walk skb lists. > > Adding runpm=0 does away with this issue. > > The specific test was with noaccel=1 - it does not seem to change > things for me. > > I poked at the ACPI method NVPO and yes it does actually > seem to execute a while loop waiting for some register > to become 0. Which I guess never happens? Because card > is in a low power state and so reads return ffffffff maybe?Yes, it could be the case.> X isn't happy even with runpm=0 but that might be a different > issue - I thought runpm=0 might be an easier place to start debugging > things given there are logs of the failure. > > Using 4.20.0-rc3 right now. > > Userspace bits are from fedora 29: > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64 > > firmware is pretty recent: > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch > > More hints for debugging would be appreciated. > If anyone wants me to play with a different kernel tree, > let me know.Can you share full dmesg and acpidump of the system? I would like to check the power resource methods.
Michael S. Tsirkin
2018-Nov-28 02:49 UTC
[Nouveau] 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
On Tue, Nov 27, 2018 at 11:36:50AM +0200, Mika Westerberg wrote:> +linux-acpi > > Hi Michael, > > On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote: > > So a new thinkpad: > > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1) > > > > Hangs whenever I try to poke at the card. It starts happily enough with > > > > [ 3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) > > [ 3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66) > > [ 3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > > [ 3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > > [ 3.971727] nouveau: detected PR support, will not use DSMBTW this is also a bit strange. It says it will not use DSM - but did it maybe use DSM previously? The ACPI Warning seems to indicate that it did ... And just to complete the picture here's the _DSM from ACPI: Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method { If ((Arg0 == ToUUID ("3e5b41c6-eb1d-4260-9d15-c71fbadae414"))) { Switch (ToInteger (Arg2)) { Case (Zero) { If ((Arg1 == One)) { Return (0x0001E7FF) } } Case (One) { If ((Arg1 == One)) { If (((S0ID == One) && (OSYS < 0x07DF))) { If (((DerefOf (Arg3 [Zero]) & 0xFF) == One)) { \GUAM (One) } Local0 = (DerefOf (Arg3 [One]) & 0xFF) If ((Local0 == Zero)) { \GUAM (Zero) } } If ((DerefOf (Arg3 [Zero]) == Zero)) { Local0 = CLID /* \_SB_.PCI0.GFX0.CLID */ If ((0x80000000 & Local0)) { CLID &= 0x0F GLID (CLID) } } Return (One) } } Case (0x02) { If ((Arg1 == One)) { Return (One) } } Case (0x03) { If ((Arg1 == One)) { Return (One) } } Case (0x04) { If ((Arg1 == One)) { Return (One) } } Case (0x05) { If ((Arg1 == One)) { Return (One) } } Case (0x06) { If ((Arg1 == One)) { Return (One) } } Case (0x07) { If ((Arg1 == One)) { IBTT = (DerefOf (Arg3 [Zero]) & 0xFF) Return (One) } } Case (0x08) { If ((Arg1 == One)) { IPSC = (DerefOf (Arg3 [Zero]) & 0xFF) If ((DerefOf (Arg3 [One]) & 0xFF)) { IPAT = (DerefOf (Arg3 [One]) & 0xFF) IPAT-- } IBIA = ((DerefOf (Arg3 [0x02]) >> 0x04) & 0x07) Return (One) } } Case (0x09) { If ((Arg1 == One)) { Return (One) } } Case (0x0A) { If ((Arg1 == One)) { Local0 = (LIDS << 0x08) Local0 += 0x0100 Return (Local0) } } Case (0x0D) { If ((Arg1 == One)) { Local0 = ((DerefOf (Arg3 [0x03]) << 0x18) | (DerefOf ( Arg3 [0x02]) << 0x10)) Local0 &= 0xEFFF0000 Local0 &= (DerefOf (DBTB [IBTT]) << 0x10) Local0 |= IBTT Return (Local0) } } Case (0x0E) { If ((Arg1 == One)) { Local0 = IPSC /* \IPSC */ Local0 |= (IPAT << 0x08) Local0 += 0x0100 Local0 |= (LIDS << 0x10) Local0 += 0x00010000 Local0 |= (IBIA << 0x14) Return (Local0) } } Case (0x0F) { If ((Arg1 == One)) { Local0 = GIVD /* \_SB_.PCI0.GFX0.GIVD */ Local0 ^= One Local0 |= (GMFN << One) Local0 |= 0x1800 Local0 |= (IDMS << 0x11) Local0 |= (DerefOf (DerefOf (CDCT [HVCO]) [CDVL]) << 0x15) Return (Local0) } } Case (0x10) { If ((Arg1 == One)) { Name (KSVP, Package (0x02) { 0x80000000, 0x8000 }) KSVP [Zero] = KSV0 /* \KSV0 */ KSVP [One] = KSV1 /* \KSV1 */ Return (KSVP) /* \_SB_.PCI0.GFX0._DSM.KSVP */ } } } } Return (Buffer (One) { 0x00 // . }) } I am not sure what makes it think that Arg3 is a buffer and not a package: IIRC Index (decoded C-style here as []) can apply to either. Am I maybe misunderstanding the warning?> > [ 3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007) > > [ 3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1) > > [ 4.009875] PM: Image not found (code -22) > > [ 4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB > > [ 4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB > > [ 4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found > > [ 4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found > > [ 4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0 > > [ 4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1 > > [ 4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020 > > [ 4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010 > > [ 4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010 > > [ 4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020 > > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047 > > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161 > > [ 4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246 > > [ 4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346 > > [ 4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). > > [ 4.508355] [drm] Driver supports precise vblank timestamp query. > > [ 4.509812] [drm] Cannot find any crtc or sizes > > [ 4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2 > > > > > > Although that type mismatch is a bit worrying. And I'm not sure what > > prints PM: Image not found. > > That is fine, it just says it cannot find a hibernation image from swap > device. I guess you have resume=... in the kernel command line. > > > But after a short while it gets pretty busy: > > > > > > [ 52.917009] No Local Variables are initialized for Method [NVPO] > > [ 52.917011] No Arguments are initialized for method [NVPO] > > [ 52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > > [ 52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > > [ 52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516) > > [ 52.917108] acpi device:00: Failed to change power state to D0 > > Here it seems to fail to turn on the power resource for the device. > > > [ 52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > > [ 52.969289] pci_raw_set_power_state: 2 callbacks suppressed > > [ 52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > > [ 53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > > [ 53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > > [ 53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) > > [ 53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3 > > > > > > And then kernel proceeds to throw up errors at random places, e.g. > > > > [ 67.021892] cfg80211: failed to load regulatory.db > > [ 67.021895] cfg80211: failed to load regulatory.db > > [ 67.021897] cfg80211: failed to load regulatory.db > > [ 67.021900] cfg80211: failed to load regulatory.db > > [ 67.021927] cfg80211: failed to load regulatory.db > > [ 67.021928] cfg80211: failed to load regulatory.db > > [ 67.021932] cfg80211: failed to load regulatory.db > > [ 67.021934] cfg80211: failed to load regulatory.db > > [ 67.024463] cfg80211: failed to load regulatory.db > > [ 99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.BTW this one might indicate that somehow iwlwifi got powered down too somehow. It's at 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10) so really shouldn't be affected, but go figure. If driver really is getting all-ones from the device, it just might try to poke at a wrong b:d.f by mistake maybe ...> > followed by soft lockups and sometimes hard lockups in places > > like attempts to walk skb lists. > > > > Adding runpm=0 does away with this issue. > > > > The specific test was with noaccel=1 - it does not seem to change > > things for me. > > > > I poked at the ACPI method NVPO and yes it does actually > > seem to execute a while loop waiting for some register > > to become 0. Which I guess never happens? Because card > > is in a low power state and so reads return ffffffff maybe? > > Yes, it could be the case. > > > X isn't happy even with runpm=0 but that might be a different > > issue - I thought runpm=0 might be an easier place to start debugging > > things given there are logs of the failure. > > > > Using 4.20.0-rc3 right now. > > > > Userspace bits are from fedora 29: > > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64 > > > > firmware is pretty recent: > > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch > > > > More hints for debugging would be appreciated. > > If anyone wants me to play with a different kernel tree, > > let me know. > > Can you share full dmesg and acpidump of the system? I would like to > check the power resource methods.Pls find it as an attachment here: https://bugs.freedesktop.org/show_bug.cgi?id=108873 If you like I can send it to you directly too - spamming the list with it isn't helpful I guess? -- MST
Possibly Parallel Threads
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- [Bug 108873] New: nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups