thr3ads.net - Nouveau - [Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges [Sep 2019]

Karol Herbst

2019-Sep-30 16:05 UTC

[Nouveau] [RFC PATCH] pci: prevent putting pcie devices into lower device states on certain intel bridges

still happens with your patch applied. The machine simply gets shut down.

dmesg can be found here:
https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

If there are no other things to try out, I will post the updated patch shortly.

On Mon, Sep 30, 2019 at 11:29 AM Mika Westerberg
<mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 11:15:48AM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg
> > <mika.westerberg at linux.intel.com> wrote:
> > >
> > > Hi Karol,
> > >
> > > On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > > > > What exactly is the serious issue?  I guess it's that the rescan
> > > > > doesn't detect the GPU, which means it's not responding to config
> > > > > accesses?  Is there any timing component here, e.g., maybe we're
> > > > > missing some delay like the ones Mika is adding to the reset paths?
> > > >
> > > > When I was checking up on some of the PCI registers of the bridge
> > > > controller, the slot detection told me that there is no device
> > > > recognized anymore. I don't know which register it was anymore, though
> > > > I guess one could read it up in the SoC spec document by Intel.
> > > >
> > > > My guess is that the bridge controller fails to detect the GPU being
> > > > here or actively threw it off the bus or something. But a normal system
> > > > suspend/resume cycle brings the GPU back online (doing a rescan via
> > > > sysfs gets the device detected again)
> > >
> > > Can you elaborate a bit on what kind of scenario the issue happens in (e.g.
> > > steps to reproduce it)? It was not 100% clear from the changelog. Also,
> > > what is the result when the failure happens?
> > >
> >
> > yeah, I already have an updated patch in the works which also does the
> > rework Bjorn suggested. Had no time yet to test if I didn't mess it up.
> >
> > I am also thinking of adding a kernel parameter to enable this
> > workaround on demand, but not quite sure on that one yet.
>
> Right, I think it would be good to figure out the root cause before
> adding any workarounds ;-) It might very well be that we are just
> missing something the PCIe spec requires but that is not implemented in Linux.
>
> > > I see there is a script that does something but unfortunately I'm not
> > > fluent in Python so can't extract the steps how the issue can be
> > > reproduced ;-)
> > >
> > > One thing that I'm working on is that Linux PCI subsystem misses certain
> > > delays that are needed after D3cold -> D0 transition, otherwise the
> > > device and/or link may not be ready before we access it. What you are
> > > experiencing sounds similar. I wonder if you could try the following
> > > patch and see if it makes any difference?
> > >
> > > https://patchwork.kernel.org/patch/11106611/
> >
> > I think I already tried this path. The problem isn't that the device
> > isn't accessible too late, but that it seems that the device
> > completely falls off the bus. But I can retest again just to be sure.
>
> Yes, please try it and share full dmesg if/when the failure still happens.
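
The sysfs rescan mentioned in the quoted discussion can be sketched roughly as below. This is only an illustration and not the script referred to in the thread: the helper names and the `sysfs_root` parameter are hypothetical (the parameter exists so the logic can be exercised against a scratch directory), the address `0000:01:00.0` is the one that appears in the dmesg lines quoted later in this thread, and the only kernel interface assumed is the standard `/sys/bus/pci/rescan` attribute.

```python
from pathlib import Path

# Address of the GPU as it appears in the thread's dmesg; adjust for other systems.
GPU_ADDRESS = "0000:01:00.0"


def device_present(address: str, sysfs_root: Path = Path("/sys")) -> bool:
    """Return True if the PCI device currently has an entry in sysfs."""
    return (sysfs_root / "bus" / "pci" / "devices" / address).exists()


def rescan_bus(sysfs_root: Path = Path("/sys")) -> None:
    """Ask the kernel to rescan all PCI buses (needs root on a real system)."""
    (sysfs_root / "bus" / "pci" / "rescan").write_text("1")


def rescan_and_check(address: str = GPU_ADDRESS,
                     sysfs_root: Path = Path("/sys")) -> bool:
    """Trigger a rescan, then report whether the device is visible afterwards."""
    rescan_bus(sysfs_root)
    return device_present(address, sysfs_root)
```

On a real machine, `rescan_and_check()` returning False after a runtime resume would match the "falls off the bus" symptom described above, while True after a system suspend/resume cycle would match the recovery that was reported.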

Mika Westerberg

2019-Sep-30 16:30 UTC

On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> still happens with your patch applied. The machine simply gets shut down.
>
> dmesg can be found here:
> https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

Looking at your dmesg:

Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1

I would assume it runtime suspends here. Then it wakes up because of PCI
access from userspace:

Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed

and for some reason it does not get resumed properly. There are also a few
warnings from ACPI that might be relevant:

Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)

This seems to be a Dell XPS 9560, which I think has been around for some time
already, so I wonder why we only see issues now. Has it ever worked for
you or maybe there is a regression that causes it to happen now?
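
One way to check the assumption that the GPU runtime suspends after driver initialization is to read its runtime PM attributes from sysfs. The sketch below is only illustrative: the function names and the `sysfs_root` parameter are hypothetical, the address is the one from the log lines above, and it relies solely on the standard `power/runtime_status` and `power/control` attributes. Reading these attributes should not itself wake the device, unlike reading its config space, but treat that as an assumption to verify on the system in question.

```python
from pathlib import Path

GPU_ADDRESS = "0000:01:00.0"  # address taken from the dmesg lines above


def read_runtime_pm(address: str = GPU_ADDRESS,
                    sysfs_root: Path = Path("/sys")) -> dict:
    """Read the runtime PM attributes of a PCI device; None if an attribute is absent."""
    power_dir = sysfs_root / "bus" / "pci" / "devices" / address / "power"
    state = {}
    for name in ("runtime_status", "control"):
        attr = power_dir / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    return state


def is_runtime_suspended(address: str = GPU_ADDRESS,
                         sysfs_root: Path = Path("/sys")) -> bool:
    """Return True if the kernel reports the device as runtime suspended."""
    return read_runtime_pm(address, sysfs_root)["runtime_status"] == "suspended"
```

Polling `is_runtime_suspended()` in the seconds after the "Initialized nouveau" line would show whether the device had already suspended before the userspace PCI access.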

Karol Herbst

2019-Sep-30 16:36 UTC

On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
<mika.westerberg at linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > still happens with your patch applied. The machine simply gets shut down.
> >
> > dmesg can be found here:
> > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
>
> Looking at your dmesg:
>
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
>
> I would assume it runtime suspends here. Then it wakes up because of PCI
> access from userspace:
>
> Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
>
> and for some reason it does not get resumed properly. There are also a few
> warnings from ACPI that might be relevant:
>
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)

afaik this is the case for essentially every laptop out there.

> This seems to be a Dell XPS 9560, which I think has been around for some time
> already, so I wonder why we only see issues now. Has it ever worked for
> you or maybe there is a regression that causes it to happen now?

oh, it's broken since forever, we just tried to get more information
from Nvidia if they know what this is all about, but we got nothing
useful.

We were also hoping to find a reliable fix or workaround we could have
inside nouveau to fix that as I think nouveau is the only driver
actually hit by this issue, but nothing turned out to be reliable
enough.
