thr3ads.net - Nouveau - [Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges [Dec 2019]

Rafael J. Wysocki

2019-Dec-09 11:38 UTC

[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst <kherbst at redhat.com> wrote:
>
> anybody any other ideas?

Not yet, but I'm trying to collect some more information.

> It seems that both patches don't really fix
> the issue and I have no idea left on my side to try out. The only
> thing left I could do to further investigate would be to reverse
> engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> but I've heard users having similar issues to the one Lyude told us
> about... and I couldn't verify that the patches help there either in a
> reliable way.

It looks like the newer (8+) versions of Windows expect the GPU driver
to prepare the GPU for power removal in some specific way and the
latter fails if the GPU has not been prepared as expected.

Because testing indicates that the Windows 7 path in the platform
firmware works, it may be worth trying to do what it does to the PCIe
link before invoking the _OFF method for the power resource
controlling the GPU power.

If Mika's theory that the Win7 path simply turns the PCIe link off
is correct, then whatever the _OFF method tries to do to the link
after that should not matter.
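
A minimal sketch of that ordering, assuming "turns the PCIe link off"
means setting the Link Disable bit (bit 4 of the PCIe Link Control
register, PCI_EXP_LNKCTL_LD in the Linux headers) on the bridge. That
reading is an assumption; the thread does not say which mechanism the
Win7 path uses. struct fake_bridge and the functions below are
hypothetical user-space stand-ins, and power_resource_off() only marks
the point where the ACPI _OFF method would be evaluated:

```c
#include <stdint.h>

/* Link Disable: bit 4 of the PCIe Link Control register. */
#define LNKCTL_LINK_DISABLE 0x0010u

/* Hypothetical model of the bridge above the GPU (not a kernel type). */
struct fake_bridge {
    uint16_t lnkctl;        /* current Link Control register value */
    uint16_t lnkctl_at_off; /* register value seen when _OFF ran */
    int off_calls;          /* how often the power resource was turned off */
};

/* Take the link down by setting Link Disable, leaving other bits alone. */
void link_disable(struct fake_bridge *b)
{
    b->lnkctl = (uint16_t)(b->lnkctl | LNKCTL_LINK_DISABLE);
}

/* Placeholder for evaluating _OFF on the GPU's power resource. */
void power_resource_off(struct fake_bridge *b)
{
    b->lnkctl_at_off = b->lnkctl;
    b->off_calls++;
}

/* Proposed order: disable the link first, then cut power. */
void gpu_power_down(struct fake_bridge *b)
{
    link_disable(b);
    power_resource_off(b);
}
```

With this order, whatever _OFF later does to the link happens on a link
that is already down, which is the point of the suggestion.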

> On Wed, Nov 27, 2019 at 8:55 PM Lyude Paul <lyude at redhat.com> wrote:
> >
> > On Wed, 2019-11-27 at 12:51 +0100, Karol Herbst wrote:
> > > On Wed, Nov 27, 2019 at 12:49 PM Mika Westerberg
> > > <mika.westerberg at intel.com> wrote:
> > > > On Tue, Nov 26, 2019 at 06:10:36PM -0500, Lyude Paul wrote:
> > > > > Hey-this is almost certainly not the right place in this thread to
> > > > > respond, but this thread has gotten so deep evolution can't push the
> > > > > subject further to the right, heh. So I'll just respond here.
> > > >
> > > > :)
> > > >
> > > > > I've been following this and helping out Karol with testing here and
> > > > > there. They had me test Bjorn's PCI branch on the X1 Extreme 2nd
> > > > > generation, which has a turing GPU and 8086:1901 PCI bridge.
> > > > >
> > > > > I was about to say "the patch fixed things, hooray!" but it seems that
> > > > > after trying runtime suspend/resume a couple times things fall apart
> > > > > again:
> > > >
> > > > You mean $subject patch, no?
> > > >
> > >
> > > no, I told Lyude to test the pci/pm branch as the runpm errors we saw
> > > on that machine looked different. Some BAR error the GPU reported
> > > after it got resumed, so I was wondering if the delays were helping
> > > with that. But after some cycles it still caused the same issue, that
> > > the GPU disappeared. Later testing also showed that my patch also
> > > didn't seem to help with this error sadly :/
> > >
> > > > > [  686.883247] nouveau 0000:01:00.0: DRM: suspending object tree...
> > > > > [  752.866484] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVPO due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [  752.866508] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > > > [  752.866521] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20190816/psparse-529)
> > > >
> > > > This is probably the culprit. The same AML code fails to properly turn
> > > > on the device.
> > > >
> > > > Is acpidump from this system available somewhere?
> >
> > Attached it to this email
> >
> > > >
> > --
> > Cheers,
> >         Lyude Paul
>

Karol Herbst

2019-Dec-09 12:24 UTC

On Mon, Dec 9, 2019 at 12:39 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
>
> On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst <kherbst at redhat.com> wrote:
> >
> > anybody any other ideas?
>
> Not yet, but I'm trying to collect some more information.
>
> > It seems that both patches don't really fix
> > the issue and I have no idea left on my side to try out. The only
> > thing left I could do to further investigate would be to reverse
> > engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> > but I've heard users having similar issues to the one Lyude told us
> > about... and I couldn't verify that the patches help there either in a
> > reliable way.
>
> It looks like the newer (8+) versions of Windows expect the GPU driver
> to prepare the GPU for power removal in some specific way and the
> latter fails if the GPU has not been prepared as expected.
>
> Because testing indicates that the Windows 7 path in the platform
> firmware works, it may be worth trying to do what it does to the PCIe
> link before invoking the _OFF method for the power resource
> controlling the GPU power.
>

ohh, that actually makes sense. Didn't think of that yet.

> If Mika's theory that the Win7 path simply turns the PCIe link off
> is correct, then whatever the _OFF method tries to do to the link
> after that should not matter.
>

By the way (I only thought of this after sending my last email): do
you think we should fail the runtime resume path if the device gets
stuck in a power state?

Currently the PCI core always calls into the driver regardless, but
maybe for D3cold it really makes sense to just bail and refuse to
resume? I think I tried that as an early "fix" and might even have a
patch around. This should at least prevent crashes inside drivers
trying to access invalid memory or getting stuck in loops.
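
A minimal user-space sketch of that idea, assuming the check is a
config-space read of the vendor ID (a read from a device that does not
respond completes with all bits set). struct fake_pci_dev,
core_runtime_resume() and the other names are hypothetical stand-ins,
not kernel APIs, and this is not the patch mentioned above:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device (not the kernel's struct pci_dev). */
struct fake_pci_dev {
    uint16_t vendor_id_readback; /* what a config read of the vendor ID returns */
    int driver_resume_calls;     /* how often the driver callback ran */
};

/* A device that is still powered off answers config reads with all ones. */
bool device_is_reachable(const struct fake_pci_dev *dev)
{
    return dev->vendor_id_readback != 0xFFFF;
}

/* Placeholder for the driver's runtime resume callback. */
int driver_runtime_resume(struct fake_pci_dev *dev)
{
    dev->driver_resume_calls++;
    return 0;
}

/* Core-side resume: bail out before the driver touches a dead device. */
int core_runtime_resume(struct fake_pci_dev *dev)
{
    if (!device_is_reachable(dev))
        return -EIO; /* refuse to resume, driver is never called */
    return driver_runtime_resume(dev);
}
```

The driver callback runs only for a device that answers, so it never
gets the chance to access invalid memory or spin on a register that
reads back as all ones.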

Dave Airlie

2019-Dec-10 19:58 UTC

On Mon, 9 Dec 2019 at 21:39, Rafael J. Wysocki <rafael at kernel.org> wrote:
>
> On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst <kherbst at redhat.com> wrote:
> >
> > anybody any other ideas?
>
> Not yet, but I'm trying to collect some more information.
>
> > It seems that both patches don't really fix
> > the issue and I have no idea left on my side to try out. The only
> > thing left I could do to further investigate would be to reverse
> > engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> > but I've heard users having similar issues to the one Lyude told us
> > about... and I couldn't verify that the patches help there either in a
> > reliable way.
>
> It looks like the newer (8+) versions of Windows expect the GPU driver
> to prepare the GPU for power removal in some specific way and the
> latter fails if the GPU has not been prepared as expected.
>
> Because testing indicates that the Windows 7 path in the platform
> firmware works, it may be worth trying to do what it does to the PCIe
> link before invoking the _OFF method for the power resource
> controlling the GPU power.
>

Remember that the pre-Win8 path required calling a _DSM method to
actually power the card down; I think by the time we reach these
methods in those cases, the card is already gone.

Dave.

Karol Herbst

2019-Dec-10 20:49 UTC

On Tue, Dec 10, 2019 at 8:58 PM Dave Airlie <airlied at gmail.com> wrote:
>
> On Mon, 9 Dec 2019 at 21:39, Rafael J. Wysocki <rafael at kernel.org> wrote:
> >
> > On Mon, Dec 9, 2019 at 12:17 PM Karol Herbst <kherbst at redhat.com> wrote:
> > >
> > > anybody any other ideas?
> >
> > Not yet, but I'm trying to collect some more information.
> >
> > > It seems that both patches don't really fix
> > > the issue and I have no idea left on my side to try out. The only
> > > thing left I could do to further investigate would be to reverse
> > > engineer the Nvidia driver as they support runpm on Turing+ GPUs now,
> > > but I've heard users having similar issues to the one Lyude told us
> > > about... and I couldn't verify that the patches help there either in a
> > > reliable way.
> >
> > It looks like the newer (8+) versions of Windows expect the GPU driver
> > to prepare the GPU for power removal in some specific way and the
> > latter fails if the GPU has not been prepared as expected.
> >
> > Because testing indicates that the Windows 7 path in the platform
> > firmware works, it may be worth trying to do what it does to the PCIe
> > link before invoking the _OFF method for the power resource
> > controlling the GPU power.
> >
>
> Remember the pre Win8 path required calling a DSM method to actually
> power the card down, I think by the time we reach these methods in
> those cases the card is already gone.
>
> Dave.
>

The point was that the firmware seems to do more in the legacy paths,
and maybe we just have to do those things inside the driver instead
when using the new method. Also, the _DSM call just wraps around the
interfaces on newer firmware anyway. The OS check is usually what
makes the difference. I might be wrong about the _DSM call just
wrapping, but I think I saw it in at least some firmware at some
point.
