Rafael J. Wysocki
2019-Nov-21 16:39 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Thu, Nov 21, 2019 at 5:06 PM Karol Herbst <kherbst at redhat.com> wrote:
>
> On Thu, Nov 21, 2019 at 4:47 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
> >
> > On Thu, Nov 21, 2019 at 1:53 PM Karol Herbst <kherbst at redhat.com> wrote:
> > >
> > > On Thu, Nov 21, 2019 at 12:46 PM Mika Westerberg
> > > <mika.westerberg at intel.com> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > > > > <mika.westerberg at intel.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > > > Express Root Port" (name from lspci) and on those systems all of that
> > > > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > > > > > > which also explains Mikas case that Thunderbolt stuff works as devices
> > > > > > > > never get populated under this particular bridge controller, but under
> > > > > > > > those "Root Port"s
> > > > > > >
> > > > > > > It always is a PCIe port, but its location within the SoC may matter.
> > > > > >
> > > > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> > > > > > still the same.
> > > > > >
> > >
> > > yeah, I meant the bridge controller with the ID 0x1901 is on the CPU
> > > side. And if the Nvidia GPU is on a port on the PCH side it all seems
> > > to work just fine.
> >
> > But that may involve different AML too, may it not?
> >
> > > > > > > Also some custom AML-based power management is involved and that may
> > > > > > > be making specific assumptions on the configuration of the SoC and the
> > > > > > > GPU at the time of its invocation which unfortunately are not known to
> > > > > > > us.
> > > > > > >
> > > > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > > > that point, so it looks like that AML tries to access device memory on
> > > > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > > > accessible in PCI power states below D0.
> > > > > >
> > > > > > Or the PCI config space of the GPU when the parent root port is in D3hot
> > > > > > (as it is the case here). Also then the GPU config space is not
> > > > > > accessible.
> > > > >
> > > > > Why would the parent port be in D3hot at that point? Wouldn't that be
> > > > > a suspend ordering violation?
> > > >
> > > > No. We put the GPU into D3hot first, then the root port and then turn
> > > > off the power resource (which is attached to the root port) resulting
> > > > the topology entering D3cold.
> > > >
> > >
> > > If the kernel does a D0 -> D3hot -> D0 cycle this works as well, but
> > > the power savings are way lower, so I kind of prefer skipping D3hot
> > > instead of D3cold. Skipping D3hot doesn't seem to make any difference
> > > in power savings in my testing.
> >
> > OK
> >
> > What exactly did you do to skip D3cold in your testing?
> >
>
> For that I poked into the PCI registers directly and skipped doing the
> ACPI calls and simply checked for the idle power consumption on my
> laptop.

That doesn't involve the PCIe port PM, however.

> But I guess I should retest with calling pci_d3cold_disable
> from nouveau instead? Or is there a different preferable way of
> testing this?

There is a sysfs attribute called "d3cold_allowed" which can be used
for "blocking" D3cold, so can you please retest using that?
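
For anyone repeating the retest suggested above, here is a minimal sketch of blocking D3cold through the "d3cold_allowed" attribute. It assumes the attribute sits under /sys/bus/pci/devices/<address>/ and that the GPU is at 0000:01:00.0, the address that appears in the logs later in this thread. The helper name and the sysfs_root parameter are hypothetical and exist only so the sketch can be tried against a scratch directory; on a real system the write needs root.

```python
from pathlib import Path


def set_d3cold_allowed(bdf, allowed, sysfs_root="/sys"):
    """Write 1 or 0 to the d3cold_allowed attribute of one PCI device.

    bdf is the PCI address, e.g. "0000:01:00.0" (assumed GPU address).
    sysfs_root is a hypothetical override used for dry runs and tests.
    """
    attr = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "d3cold_allowed"
    if not attr.exists():
        # Do not create the file: a missing attribute means the device
        # address is wrong or the attribute is not exposed here.
        raise FileNotFoundError(str(attr))
    attr.write_text("1" if allowed else "0")
    return attr


if __name__ == "__main__":
    # Block D3cold for the GPU before re-running the runtime PM test.
    print(set_d3cold_allowed("0000:01:00.0", allowed=False))
```

Writing 1 back to the same attribute should lift the block again once the test is done.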
Lyude Paul
2019-Nov-26 23:10 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
[big snip]
> There is a sysfs attribute called "d3cold_allowed" which can be used
> for "blocking" D3cold, so can you please retest using that?
>

Hey-this is almost certainly not the right place in this thread to respond,
but this thread has gotten so deep evolution can't push the subject further to
the right, heh. So I'll just respond here.

I've been following this and helping out Karol with testing here and there.
They had me test Bjorn's PCI branch on the X1 Extreme 2nd generation, which
has a turing GPU and 8086:1901 PCI bridge.

I was about to say "the patch fixed things, hooray!" but it seems that after
trying runtime suspend/resume a couple times things fall apart again:

[ 27.680433] nouveau 0000:01:00.0: enabling device (0000 -> 0003)
[ 27.680578] nouveau 0000:01:00.0: NVIDIA TU117 (167000a1)
[ 27.763967] nouveau 0000:01:00.0: bios: version 90.17.20.00.16
[ 27.764664] nouveau 0000:01:00.0: fb: 4096 MiB GDDR5
[ 27.806115] vga_switcheroo: enabled
[ 27.806221] [TTM] Zone kernel: Available graphics memory: 16244510 KiB
[ 27.806222] [TTM] Zone dma32: Available graphics memory: 2097152 KiB
[ 27.806222] [TTM] Initializing pool allocator
[ 27.806224] [TTM] Initializing DMA pool allocator
[ 27.806249] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
[ 27.806249] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
[ 27.806250] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
[ 27.806251] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
[ 27.806251] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
[ 27.806252] nouveau 0000:01:00.0: DRM: DCB version 4.1
[ 27.806253] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f66 04600020
[ 27.806253] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f52 00020010
[ 27.806254] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f36 04600010
[ 27.806254] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f46 04600020
[ 27.806255] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
[ 27.806255] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
[ 27.806256] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001248
[ 27.806256] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002348
[ 27.806257] nouveau 0000:01:00.0: DRM: Pointer to flat panel table invalid
[ 27.806415] nouveau 0000:01:00.0: DRM: failed to create kernel channel, -22
[ 27.806530] nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
[ 28.114808] nouveau 0000:01:00.0: DRM: unknown connector type 48
[ 28.114943] nouveau 0000:01:00.0: DRM: unknown connector type 48
[ 28.115026] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 28.115027] [drm] Driver supports precise vblank timestamp query.
[ 28.116611] [drm] Cannot find any crtc or sizes
[ 28.117452] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
[ 28.118074] [drm] Cannot find any crtc or sizes
[ 28.119523] [drm] Cannot find any crtc or sizes
[ 34.081503] nouveau 0000:01:00.0: DRM: suspending console...
[ 34.081508] nouveau 0000:01:00.0: DRM: suspending display...
[ 34.081528] nouveau 0000:01:00.0: DRM: evicting buffers...
[ 34.081531] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[ 34.081551] nouveau 0000:01:00.0: DRM: suspending fence...
[ 34.091173] nouveau 0000:01:00.0: DRM: suspending object tree...
[ 37.729746] nouveau 0000:01:00.0: DRM: resuming object tree...
[ 38.042076] nouveau 0000:01:00.0: DRM: resuming fence...
[ 38.042167] nouveau 0000:01:00.0: DRM: resuming display...
[ 38.042175] nouveau 0000:01:00.0: DRM: resuming console...
[ 44.309325] nouveau 0000:01:00.0: DRM: suspending console...
[ 44.309331] nouveau 0000:01:00.0: DRM: suspending display...
[ 44.309349] nouveau 0000:01:00.0: DRM: evicting buffers...
[ 44.309352] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[ 44.309371] nouveau 0000:01:00.0: DRM: suspending fence...
[ 44.318938] nouveau 0000:01:00.0: DRM: suspending object tree...
[ 76.577644] nouveau 0000:01:00.0: DRM: resuming object tree...
[ 76.890266] nouveau 0000:01:00.0: DRM: resuming fence...
[ 76.890362] nouveau 0000:01:00.0: DRM: resuming display...
[ 76.890379] nouveau 0000:01:00.0: DRM: resuming console...
[ 82.721356] nouveau 0000:01:00.0: DRM: suspending console...
[ 82.721361] nouveau 0000:01:00.0: DRM: suspending display...
[ 82.721380] nouveau 0000:01:00.0: DRM: evicting buffers...
[ 82.721383] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[ 82.721403] nouveau 0000:01:00.0: DRM: suspending fence...
[ 82.730483] nouveau 0000:01:00.0: DRM: suspending object tree...
[ 681.369950] nouveau 0000:01:00.0: DRM: resuming object tree...
[ 681.690464] nouveau 0000:01:00.0: DRM: resuming fence...
[ 681.690555] nouveau 0000:01:00.0: DRM: resuming display...
[ 681.690568] nouveau 0000:01:00.0: DRM: resuming console...
[ 686.873629] nouveau 0000:01:00.0: DRM: suspending console...
[ 686.873634] nouveau 0000:01:00.0: DRM: suspending display...
[ 686.873653] nouveau 0000:01:00.0: DRM: evicting buffers...
[ 686.873656] nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
[ 686.873676] nouveau 0000:01:00.0: DRM: suspending fence...
[ 686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
[ 752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[ 752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[ 752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[ 752.866542] acpi device:00: Failed to change power state to D0
[ 754.209030] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
[ 755.848894] nouveau 0000:01:00.0: not ready 1023ms after Switch to D0; waiting
[ 756.936876] nouveau 0000:01:00.0: not ready 2047ms after Switch to D0; waiting
[ 759.048849] nouveau 0000:01:00.0: not ready 4095ms after Switch to D0; waiting
[ 763.208807] nouveau 0000:01:00.0: not ready 8191ms after Switch to D0; waiting
[ 771.912692] nouveau 0000:01:00.0: not ready 16383ms after Switch to D0; waiting
[ 788.808505] nouveau 0000:01:00.0: not ready 32767ms after Switch to D0; waiting

752.866542 is where I start trying to resume the GPU.

lspci -nn:

00:00.0 Host bridge [0600]: Intel Corporation Device [8086:3e20] (rev 0d)
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 0d)
00:02.0 VGA compatible controller [0300]: Intel Corporation UHD Graphics 630 (Mobile) [8086:3e9b] (rev 02)
00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 0d)
00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
00:12.0 Signal processing controller [1180]: Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379] (rev 10)
00:14.0 USB controller [0c03]: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d] (rev 10)
00:14.2 RAM memory [0500]: Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f] (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368] (rev 10)
00:16.0 Communication controller [0780]: Intel Corporation Cannon Lake PCH HECI Controller [8086:a360] (rev 10)
00:1b.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 [8086:a340] (rev f0)
00:1b.4 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 [8086:a32c] (rev f0)
00:1c.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 [8086:a338] (rev f0)
00:1d.0 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 [8086:a330] (rev f0)
00:1d.6 PCI bridge [0604]: Intel Corporation Cannon Lake PCH PCI Express Root Port #15 [8086:a336] (rev f0)
00:1e.0 Communication controller [0780]: Intel Corporation Device [8086:a328] (rev 10)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a30e] (rev 10)
00:1f.3 Audio device [0403]: Intel Corporation Cannon Lake PCH cAVS [8086:a348] (rev 10)
00:1f.4 SMBus [0c05]: Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323] (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller [8086:a324] (rev 10)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-LM [8086:15bb] (rev 10)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1f91] (rev a1)
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10fa] (rev a1)
02:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
04:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:00.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:01.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:02.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
05:04.0 PCI bridge [0604]: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] [8086:15ea] (rev 06)
06:00.0 System peripheral [0880]: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] [8086:15eb] (rev 06)
2c:00.0 USB controller [0c03]: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] [8086:15ec] (rev 06)
52:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
53:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a] (rev 01)

--
Cheers,
Lyude Paul
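
The "not ready ...ms after Switch to D0" values in the log above are each one less than a power of two, which is consistent with a wait that doubles on every retry. A minimal sketch that reproduces the sequence, assuming the delay starts at 1 ms, doubles each round, and is only reported once it exceeds one second. These parameters are assumptions chosen to match the log; the function name is hypothetical and nothing here is copied from the kernel source.

```python
def not_ready_reports(report_threshold_ms=1000, count=6):
    """Return the first `count` reported wait times of a doubling backoff.

    Assumed model: delay starts at 1 ms and doubles every round; a round
    is reported as (delay - 1) ms only once delay exceeds the threshold.
    """
    delay = 1
    reports = []
    while len(reports) < count:
        if delay > report_threshold_ms:
            reports.append(delay - 1)
        delay *= 2
    return reports


if __name__ == "__main__":
    print(not_ready_reports())
```

Under this model the reported values are 1023, 2047, 4095, 8191, 16383 and 32767 ms, matching the six messages in the log.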
Mika Westerberg
2019-Nov-27 11:49 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Tue, Nov 26, 2019 at 06:10:36PM -0500, Lyude Paul wrote:
> Hey-this is almost certainly not the right place in this thread to respond,
> but this thread has gotten so deep evolution can't push the subject further to
> the right, heh. So I'll just respond here.

:)

> I've been following this and helping out Karol with testing here and there.
> They had me test Bjorn's PCI branch on the X1 Extreme 2nd generation, which
> has a turing GPU and 8086:1901 PCI bridge.
>
> I was about to say "the patch fixed things, hooray!" but it seems that after
> trying runtime suspend/resume a couple times things fall apart again:

You mean $subject patch, no?

> [ 686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
> [ 752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> [ 752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> [ 752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)

This is probably the culprit. The same AML code fails to properly turn
on the device.

Is acpidump from this system available somewhere?
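
When comparing logs from several machines, it can help to pull the aborted AML method names out of the dmesg text automatically. A minimal sketch, assuming the "Aborting method ... due to previous error (...)" wording seen in the log quoted above; the regular expression and the helper name are illustrative only, not part of any existing tool.

```python
import re

# Matches the ACPI abort lines as they appear in the quoted log.
ABORT_RE = re.compile(r"Aborting method (\S+) due to previous error \((\w+)\)")


def aborted_methods(log_text):
    """Return (method_path, error_code) pairs in the order they were logged."""
    return [(m.group(1), m.group(2)) for m in ABORT_RE.finditer(log_text)]


# The three lines from the quoted log, kept in a raw string so the
# leading backslash of each ACPI path is preserved.
LOG = r"""
[ 752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[ 752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
[ 752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
"""

if __name__ == "__main__":
    for path, error in aborted_methods(LOG):
        print(path, error)
```

Read from the last entry to the first, the three paths look like a call chain from the power resource's _ON method through PGON down to NVPO, though only the acpidump requested above could confirm that.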