Karol Herbst
2019-Sep-30 16:05 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
still happens with your patch applied. The machine simply gets shut down.

dmesg can be found here:
https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

If there are no other things to try out, I will post the updated patch shortly.

On Mon, Sep 30, 2019 at 11:29 AM Mika Westerberg
<mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 11:15:48AM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg
> > <mika.westerberg at linux.intel.com> wrote:
> > >
> > > Hi Karol,
> > >
> > > On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > > > > What exactly is the serious issue? I guess it's that the rescan
> > > > > doesn't detect the GPU, which means it's not responding to config
> > > > > accesses? Is there any timing component here, e.g., maybe we're
> > > > > missing some delay like the ones Mika is adding to the reset paths?
> > > >
> > > > When I was checking up on some of the PCI registers of the bridge
> > > > controller, the slot detection told me that there is no device
> > > > recognized anymore. I don't know which register it was anymore, though
> > > > I guess one could read it up in the SoC spec document by Intel.
> > > >
> > > > My guess is, that the bridge controller fails to detect the GPU being
> > > > here or actively threw it of the bus or something. But a normal system
> > > > suspend/resume cycle brings the GPU back online (doing a rescan via
> > > > sysfs gets the device detected again)
> > >
> > > Can you elaborate a bit what kind of scenario the issue happens (e.g
> > > steps how it reproduces)? It was not 100% clear from the changelog. Also
> > > what the result when the failure happens?
> > >
> >
> > yeah, I already have an updated patch in the works which also does the
> > rework Bjorn suggested. Had no time yet to test if I didn't mess it
> > up.
> >
> > I am also thinking of adding a kernel parameter to enable this
> > workaround on demand, but not quite sure on that one yet.
>
> Right, I think it would be good to figure out the root cause before
> adding any workarounds ;-) It might very well be that we are just
> missing something the PCIe spec requires but not implemented in Linux.
>
> > > I see there is a script that does something but unfortunately I'm not
> > > fluent in Python so can't extract the steps how the issue can be
> > > reproduced ;-)
> > >
> > > One thing that I'm working on is that Linux PCI subsystem misses certain
> > > delays that are needed after D3cold -> D0 transition, otherwise the
> > > device and/or link may not be ready before we access it. What you are
> > > experiencing sounds similar. I wonder if you could try the following
> > > patch and see if it makes any difference?
> > >
> > > https://patchwork.kernel.org/patch/11106611/
> >
> > I think I already tried this path. The problem isn't that the device
> > isn't accessible too late, but that it seems that the device
> > completely falls off the bus. But I can retest again just to be sure.
>
> Yes, please try it and share full dmesg if/when the failure still happens.
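
The quoted text above mentions that a rescan via sysfs gets the device detected again. As a rough illustration only, that check could be sketched as below. It is not the reproduction script referred to in the thread. The helper names and the `sysfs_root` parameter are hypothetical and exist only so the logic can be tried against a scratch directory. The paths `bus/pci/devices/<address>` and `bus/pci/rescan` are the standard Linux sysfs PCI entries, and writing to `rescan` on a real system requires root.

```python
from pathlib import Path


def device_present(bdf, sysfs_root="/sys"):
    """Return True if the kernel currently lists the PCI function `bdf`."""
    return (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf).exists()


def rescan_pci_bus(sysfs_root="/sys"):
    """Ask the kernel to re-enumerate all PCI buses by writing 1 to the rescan file."""
    (Path(sysfs_root) / "bus" / "pci" / "rescan").write_text("1")


def check_and_rescan(bdf, sysfs_root="/sys"):
    """Report whether `bdf` is visible, triggering one rescan if it is not."""
    if device_present(bdf, sysfs_root):
        return "present"
    rescan_pci_bus(sysfs_root)
    if device_present(bdf, sysfs_root):
        return "present after rescan"
    return "missing"
```

For the GPU discussed in this thread the call would be `check_and_rescan("0000:01:00.0")`, using the address that appears in the nouveau log lines.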
Mika Westerberg
2019-Sep-30 16:30 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> still happens with your patch applied. The machine simply gets shut down.
>
> dmesg can be found here:
> https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

Looking your dmesg:

Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1

I would assume it runtime suspends here. Then it wakes up because of PCI
access from userspace:

Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed

and for some reason it does not get resumed properly. There are also few
warnings from ACPI that might be relevant:

Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)

This seems to be Dell XPS 9560 which I think has been around some time
already so I wonder why we only see issues now. Has it ever worked for
you or maybe there is a regression that causes it to happen now?
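
The reading of the log above rests on the time between the nouveau initialization message and the first `pci_raw_set_power_state` message. A minimal sketch of measuring that gap from the two quoted lines follows. The function names are hypothetical, and the year 2019 is an assumption taken from the date of this thread, because the log prefix itself carries no year.

```python
from datetime import datetime

# Assumption: the log lines carry no year, so the thread's year is used.
ASSUMED_YEAR = "2019"


def parse_stamp(line):
    """Parse the 'Sep 30 17:24:27' prefix of a journal-style kernel log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(ASSUMED_YEAR + " " + stamp, "%Y %b %d %H:%M:%S")


def gap_seconds(lines, first_marker, second_marker):
    """Seconds between the first line containing each marker."""
    first = next(l for l in lines if first_marker in l)
    second = next(l for l in lines if second_marker in l)
    return (parse_stamp(second) - parse_stamp(first)).total_seconds()


# The two lines quoted in the message above.
LOG = [
    "Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1",
    "Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed",
]

print(gap_seconds(LOG, "Initialized nouveau", "pci_raw_set_power_state"))  # 15.0
```

The 15 seconds between the two messages leave room for the runtime suspend assumed in the message, though the log lines alone do not show the suspend itself.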
Karol Herbst
2019-Sep-30 16:36 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
<mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > still happens with your patch applied. The machine simply gets shut down.
> >
> > dmesg can be found here:
> > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
>
> Looking your dmesg:
>
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
>
> I would assume it runtime suspends here. Then it wakes up because of PCI
> access from userspace:
>
> Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
>
> and for some reason it does not get resumed properly. There are also few
> warnings from ACPI that might be relevant:
>
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
>

afaik this is the case for essentially every laptop out there.

> This seems to be Dell XPS 9560 which I think has been around some time
> already so I wonder why we only see issues now. Has it ever worked for
> you or maybe there is a regression that causes it to happen now?

oh, it's broken since forever, we just tried to get more information
from Nvidia if they know what this is all about, but we got nothing
useful.

We were also hoping to find a reliable fix or workaround we could have
inside nouveau to fix that as I think nouveau is the only driver
actually hit by this issue, but nothing turned out to be reliable
enough.