Karol Herbst
2019-Sep-30 16:36 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg <mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > still happens with your patch applied. The machine simply gets shut down.
> >
> > dmesg can be found here:
> > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
>
> Looking your dmesg:
>
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
>
> I would assume it runtime suspends here. Then it wakes up because of PCI
> access from userspace:
>
> Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
>
> and for some reason it does not get resumed properly. There are also few
> warnings from ACPI that might be relevant:
>
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
>

afaik this is the case for essentially every laptop out there.

> This seems to be Dell XPS 9560 which I think has been around some time
> already so I wonder why we only see issues now. Has it ever worked for
> you or maybe there is a regression that causes it to happen now?

oh, it's broken since forever, we just tried to get more information
from Nvidia if they know what this is all about, but we got nothing
useful.

We were also hoping to find a reliable fix or workaround we could have
inside nouveau to fix that as I think nouveau is the only driver
actually hit by this issue, but nothing turned out to be reliable
enough.
Mika Westerberg
2019-Oct-01 08:46 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> <mika.westerberg at linux.intel.com> wrote:
> >
> > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > still happens with your patch applied. The machine simply gets shut down.
> > >
> > > dmesg can be found here:
> > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> >
> > Looking your dmesg:
> >
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> >
> > I would assume it runtime suspends here. Then it wakes up because of PCI
> > access from userspace:
> >
> > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> >
> > and for some reason it does not get resumed properly. There are also few
> > warnings from ACPI that might be relevant:
> >
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> >
>
> afaik this is the case for essentially every laptop out there.

OK, so they are harmless?

> > This seems to be Dell XPS 9560 which I think has been around some time
> > already so I wonder why we only see issues now. Has it ever worked for
> > you or maybe there is a regression that causes it to happen now?
>
> oh, it's broken since forever, we just tried to get more information
> from Nvidia if they know what this is all about, but we got nothing
> useful.
>
> We were also hoping to find a reliable fix or workaround we could have
> inside nouveau to fix that as I think nouveau is the only driver
> actually hit by this issue, but nothing turned out to be reliable
> enough.

Can't you just block runtime PM from the nouveau driver until this is
understood better? That can be done by calling pm_runtime_forbid() (or
not calling pm_runtime_allow() in the driver). Or in case of PCI driver
you just don't decrease the reference count when probe() ends.

I think that would be much better than blocking any devices behind
Kabylake PCIe root ports from entering D3 (I don't really think the
problem is in the root ports itself but there is something we are
missing when the NVIDIA GPU is put into D3cold or back from there).
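For anyone who wants to try the "block runtime PM" suggestion without patching the driver, the same knob is exposed to userspace: writing `on` to a device's `power/control` sysfs attribute keeps it from runtime suspending, and `auto` hands control back to the driver. The following is only a minimal sketch of that idea, not something posted in this thread. The helper names `set_runtime_pm` and `runtime_pm_state` are made up for illustration, the device address `0000:01:00.0` is taken from the dmesg above, and writing to the real sysfs tree needs root.

```python
from pathlib import Path

# Standard sysfs location of PCI devices on Linux.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def set_runtime_pm(bdf, allow, root=SYSFS_PCI_DEVICES):
    """Allow or forbid runtime PM for one PCI device (hypothetical helper).

    "on"   keeps the device powered (userspace counterpart of pm_runtime_forbid()).
    "auto" lets the driver runtime suspend it again (pm_runtime_allow()).
    """
    control = Path(root) / bdf / "power" / "control"
    control.write_text("auto" if allow else "on")
    return control


def runtime_pm_state(bdf, root=SYSFS_PCI_DEVICES):
    """Read back the control setting and the current runtime PM status."""
    power = Path(root) / bdf / "power"
    return {
        "control": (power / "control").read_text().strip(),
        "runtime_status": (power / "runtime_status").read_text().strip(),
    }
```

Calling `set_runtime_pm("0000:01:00.0", allow=False)` as root and then checking `runtime_pm_state("0000:01:00.0")` would show whether the GPU stays active. The `root` parameter exists only so the helpers can be pointed at a scratch directory for testing.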
Karol Herbst
2019-Oct-01 08:56 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg <mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > <mika.westerberg at linux.intel.com> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > still happens with your patch applied. The machine simply gets shut down.
> > > >
> > > > dmesg can be found here:
> > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > >
> > > Looking your dmesg:
> > >
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > >
> > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > access from userspace:
> > >
> > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > >
> > > and for some reason it does not get resumed properly. There are also few
> > > warnings from ACPI that might be relevant:
> > >
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > >
> >
> > afaik this is the case for essentially every laptop out there.
>
> OK, so they are harmless?
>

yes

> > > This seems to be Dell XPS 9560 which I think has been around some time
> > > already so I wonder why we only see issues now. Has it ever worked for
> > > you or maybe there is a regression that causes it to happen now?
> >
> > oh, it's broken since forever, we just tried to get more information
> > from Nvidia if they know what this is all about, but we got nothing
> > useful.
> >
> > We were also hoping to find a reliable fix or workaround we could have
> > inside nouveau to fix that as I think nouveau is the only driver
> > actually hit by this issue, but nothing turned out to be reliable
> > enough.
>
> Can't you just block runtime PM from the nouveau driver until this is
> understood better? That can be done by calling pm_runtime_forbid() (or
> not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> you just don't decrease the reference count when probe() ends.
>

the thing is, it does work for a lot of laptops. We could only observe
this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
work just fine.

> I think that would be much better than blocking any devices behind
> Kabylake PCIe root ports from entering D3 (I don't really think the
> problem is in the root ports itself but there is something we are
> missing when the NVIDIA GPU is put into D3cold or back from there).

I highly doubt there is anything wrong with the GPU alone as we have
too many indications which tell us otherwise.

Anyway, at this point I don't know where to look further for what's
actually wrong. And apparently it works on Windows, but I don't know
why and I have no idea what Windows does on such systems to make it
work reliably.
Bjorn Helgaas
2019-Oct-01 13:27 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> <mika.westerberg at linux.intel.com> wrote:
> >
> > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > still happens with your patch applied. The machine simply gets shut down.
> > >
> > > dmesg can be found here:
> > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> >
> > Looking your dmesg:
> >
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> >
> > I would assume it runtime suspends here. Then it wakes up because of PCI
> > access from userspace:
> >
> > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> >
> > and for some reason it does not get resumed properly. There are also few
> > warnings from ACPI that might be relevant:
> >
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> >
> afaik this is the case for essentially every laptop out there.

I think we should look into this a little bit.
acpi_ns_check_argument_types() checks the argument type and prints
this message, but AFAICT it doesn't actually fix anything or prevent
execution of the method, so I have no idea what happens when we
actually execute the _DSM.

If we execute this _DSM as part of power management, and the _DSM
doesn't work right, it would be no surprise that we have problems.

Maybe we could learn something by turning on ACPI_DB_PARSE output (see
Documentation/firmware-guide/acpi/debug.rst).

You must have an acpidump already from all your investigation. Can
you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?
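As a small aid for comparing these `_DSM` warnings across machines, the lines can be pulled out of a saved dmesg mechanically. The sketch below is purely illustrative and not part of the thread. It assumes the warning text has exactly the shape quoted above, and the names `ACPI_ARG_MISMATCH` and `find_arg_mismatches` are invented for this example.

```python
import re

# Matches the "Argument #N type mismatch" warning format quoted in this thread.
ACPI_ARG_MISMATCH = re.compile(
    r"ACPI Warning: (?P<method>\\\S+): Argument #(?P<arg>\d+) type mismatch - "
    r"Found \[(?P<found>\w+)\], ACPI requires \[(?P<required>\w+)\]"
)


def find_arg_mismatches(log_text):
    """Return one dict per ACPI argument type-mismatch warning in log_text."""
    results = []
    for match in ACPI_ARG_MISMATCH.finditer(log_text):
        info = match.groupdict()
        info["arg"] = int(info["arg"])  # argument index as an integer
        results.append(info)
    return results


# The two warning lines from the dmesg discussed above.
SAMPLE_LOG = "\n".join([
    r"Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)",
    r"Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)",
])

for warning in find_arg_mismatches(SAMPLE_LOG):
    print(warning["method"], warning["arg"], warning["found"], warning["required"])
```

This only lists which methods were called with a mismatched argument type. It says nothing about what the firmware does with that argument, which is the open question raised above.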
Karol Herbst
2019-Oct-01 16:21 UTC
[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges
On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas <helgaas at kernel.org> wrote:
>
> On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > <mika.westerberg at linux.intel.com> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > still happens with your patch applied. The machine simply gets shut down.
> > > >
> > > > dmesg can be found here:
> > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > >
> > > Looking your dmesg:
> > >
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > >
> > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > access from userspace:
> > >
> > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > >
> > > and for some reason it does not get resumed properly. There are also few
> > > warnings from ACPI that might be relevant:
> > >
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> >
> > afaik this is the case for essentially every laptop out there.
>
> I think we should look into this a little bit.
> acpi_ns_check_argument_types() checks the argument type and prints
> this message, but AFAICT it doesn't actually fix anything or prevent
> execution of the method, so I have no idea what happens when we
> actually execute the _DSM.
>

I can assure you that this warning happens on every single laptop out
there with dual Nvidia graphics and it's essentially just a firmware
bug. And it never caused any issues on any of the older laptops (or
newest one for that matter).

> If we execute this _DSM as part of power management, and the _DSM
> doesn't work right, it would be no surprise that we have problems.
>
> Maybe we could learn something by turning on ACPI_DB_PARSE output (see
> Documentation/firmware-guide/acpi/debug.rst).
>
> You must have an acpidump already from all your investigation. Can
> you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?

Will do so later, right now I am traveling to XDC and will have more
time for that then.