Peter Wu
2016-May-30 17:03 UTC
[Nouveau] [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote:> Hi Peter, > > On Fri, May 27, 2016 at 03:07:33AM +0200, Peter Wu wrote: > > On Tue, May 24, 2016 at 06:03:27PM +0200, Lukas Wunner wrote: > > > nouveau_drm_load() calls pm_runtime_put() if nouveau_runtime_pm != 0, > > > but nouveau_drm_unload() calls pm_runtime_get_sync() unconditionally. > > > We therefore leak a runtime pm ref whenever nouveau is loaded with > > > runpm=0 and then unloaded. The GPU will subsequently never runtime > > > suspend even if nouveau is loaded again with runpm=1. > > > > > > Fix by taking the runtime pm ref under the same condition that it was > > > released on driver load. > > > > > > Fixes: 5addcf0a5f0f ("nouveau: add runtime PM support (v0.9)") > > > Cc: Dave Airlie <airlied at redhat.com> > > > Reported-by: Karol Herbst <nouveau at karolherbst.de> > > > Tested-by: Karol Herbst <nouveau at karolherbst.de> > > > Signed-off-by: Lukas Wunner <lukas at wunner.de> > > > > Looks good, I tested this scenario: > > > > ru(){ cat /sys/bus/pci/devices/0000\:01:00.0/power/runtime_usage;} > > ru # reports 1 > > modprobe nouveau runpm=0 > > ru # reports 2 > > rmmod nouveau > > ru # reports 1 > > > > Without runpm=0 the count drops to 0 in the second step and stays 0 in > > the third step. After applying patch 2/9, this correctly reports 1 as > > expected (this is the same as manually setting power/control to on). > > How exactly did you reach the situation where the root port didn't wake > up when you tried to load nouveau again? (IRC conversation this week.)Ensure that the pci/pm patches are applied, then: 0. Unload nouveau (I have blacklisted it for testing). 1. Enable rpm for the root port and children (control = auto). 2. Verify in the kernel logs that the devices are sleeping: pcieport 0000:00:01.0: power state changed by ACPI to D3cold 3. (Optional, to rule out issues with delays:) Disable rpm for the Nvidia device (control = on). 4. modprobe nouveau. The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: 50.245795 MXM: GUID detected in BIOS 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D 50.492285 pcieport 0000:00:01.0: PME# disabled 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12 50.501990 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._S0W] at AML address ffffc90000010a8e length 2 50.502403 pcieport 0000:00:01.0: PME# enabled 50.502601 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D 50.513005 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._OFF] at AML address ffffc90000010994 length 6D 50.533258 pcieport 0000:00:01.0: power state changed by ACPI to D3cold (Note that this patch is not included.) When nouveau is operating normally, I see that _PS0 is also called (which does not happen above). If you think that mixing power resources with DSM causes this issue, I also tried to apply my power resources work for nouveau but it gives the same problem: 20.183306 MXM: GUID detected in BIOS 20.183606 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.184158 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.184547 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) 20.185152 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported 20.185351 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle 20.185384 nouveau: detected PR support, will not use DSM 20.185552 nouveau 0000:01:00.0: enabling device (0000 -> 0003) 20.185873 nouveau 0000:01:00.0: unknown chipset (ffffffff) 20.185946 nouveau: probe of 0000:01:00.0 failed with error -12> What's happening is, the PCI core will keep unbound devices (i.e., > without driver) in D0 but the runtime status is allowed to change > to "suspended". So it'll appear to the kernel as if it was suspended > but in reality it stays in D0. > > Once runtime pm for PCIe ports gets merged, the root port above the > GPU will indeed go to D3 in such a situation because the check > pm_children_suspended() (called from rpm_check_suspend_allowed()) > returns true. > > I'm not sure if this is desirable or not. If we keep unbound devices > in D0, should we allow ports above them to go to D3?Maybe Rafael (linux-pm / linux-pci) can answer this question better? The comments in local_pci_probe, pci_pm_runtime_suspend and pci_pm_runtime_resume suggest that unbound devices are assumed in D0 which is apparently not the case when runtime PM is enabled.> In any case, when nouveau is loaded again, local_pci_probe() will > call pm_runtime_get_sync(), which will implicitly set the runtime > status to "active" and which should also wake parents. So how did > you ever reach a point where you loaded nouveau and the root port > stayed asleep? Clearly we have a bug there, question is where. > This shouldn't work only if pm_runtime_forbid() was called on > driver unload. > > Thanks for the extensive testing! > LukasBoth devices (root port and Nvidia) were resumed, but somehow the Nvidia card was not fully initialized/ready (as you can see in the above logs). Peter> > > > Peter > > > > > --- > > > drivers/gpu/drm/nouveau/nouveau_drm.c | 5 ++++- > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c > > > index 11f8dd9..faf7438 100644 > > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c > > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c > > > @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev) > > > { > > > struct nouveau_drm *drm = nouveau_drm(dev); > > > > > > - pm_runtime_get_sync(dev->dev); > > > + if (nouveau_runtime_pm != 0) { > > > + pm_runtime_get_sync(dev->dev); > > > + } > > > + > > > nouveau_fbcon_fini(dev); > > > nouveau_accel_fini(drm); > > > nouveau_hwmon_fini(dev); > > > -- > > > 2.8.1 > > > > > > _______________________________________________ > > > Nouveau mailing list > > > Nouveau at lists.freedesktop.org > > > https://lists.freedesktop.org/mailman/listinfo/nouveau
Lukas Wunner
2016-May-31 11:34 UTC
[Nouveau] [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
On Mon, May 30, 2016 at 07:03:46PM +0200, Peter Wu wrote:> On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote: > > How exactly did you reach the situation where the root port didn't wake > > up when you tried to load nouveau again? (IRC conversation this week.) > > Ensure that the pci/pm patches are applied, then: > > 0. Unload nouveau (I have blacklisted it for testing). > 1. Enable rpm for the root port and children (control = auto). > 2. Verify in the kernel logs that the devices are sleeping: > pcieport 0000:00:01.0: power state changed by ACPI to D3cold > 3. (Optional, to rule out issues with delays:) Disable rpm for the > Nvidia device (control = on). > 4. modprobe nouveau. > > The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: > > 50.245795 MXM: GUID detected in BIOS > 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D > 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 > 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D > 50.492285 pcieport 0000:00:01.0: PME# disabled > 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) > 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12I've tested this on a MacBook Pro, which does not have ACPI _PR3 methods for the root port to which the discrete GPU is attached. The port can thus only suspend to D3hot, not D3cold. Even without patch [2/9], when unloading nouveau and letting the root port go to D3hot, the port is subsequently correctly resumed to D0 when reloading nouveau. So the issue that you're seeing without patch [2/9] seems to be specific to Optimus/_PR3 machines. If possible you should try to get it working without patch [2/9] because that patch is really optional (as I've written in the commit message). I'm totally unfamiliar with Optimus but maybe lspci could help to debug this? Lukas
Peter Wu
2016-May-31 11:41 UTC
[Nouveau] [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
On Tue, May 31, 2016 at 01:34:43PM +0200, Lukas Wunner wrote:> On Mon, May 30, 2016 at 07:03:46PM +0200, Peter Wu wrote: > > On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote: > > > How exactly did you reach the situation where the root port didn't wake > > > up when you tried to load nouveau again? (IRC conversation this week.) > > > > Ensure that the pci/pm patches are applied, then: > > > > 0. Unload nouveau (I have blacklisted it for testing). > > 1. Enable rpm for the root port and children (control = auto). > > 2. Verify in the kernel logs that the devices are sleeping: > > pcieport 0000:00:01.0: power state changed by ACPI to D3cold > > 3. (Optional, to rule out issues with delays:) Disable rpm for the > > Nvidia device (control = on). > > 4. modprobe nouveau. > > > > The above test with v4.6 + 4 pci/pm patches (8b71f565) gives: > > > > 50.245795 MXM: GUID detected in BIOS > > 50.245948 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > > 50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246044 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492 > > 50.246110 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246289 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95) > > 50.246457 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F > > 50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported > > 50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle > > 50.247084 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D > > 50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0 > > 50.491893 nseval-0227 ns_evaluate : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D > > 50.492285 pcieport 0000:00:01.0: PME# disabled > > 50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff) > > 50.492687 nouveau: probe of 0000:01:00.0 failed with error -12 > > I've tested this on a MacBook Pro, which does not have ACPI _PR3 > methods for the root port to which the discrete GPU is attached. > The port can thus only suspend to D3hot, not D3cold. > > Even without patch [2/9], when unloading nouveau and letting the > root port go to D3hot, the port is subsequently correctly resumed > to D0 when reloading nouveau. > > So the issue that you're seeing without patch [2/9] seems to be > specific to Optimus/_PR3 machines. If possible you should try to > get it working without patch [2/9] because that patch is really > optional (as I've written in the commit message). I'm totally > unfamiliar with Optimus but maybe lspci could help to debug this?Without 2/9 I can prevent the issue by writing "on" to /sys/bus/pci/devices/0000:00:01.0/power/control (the PCIe port), but that effectively gives the same result as applying 2/9. The problem occurs when the power is lost (by putting the PCIe port in D3cold). Maybe it is a bug in the PCI core that does not re-initialize devices under the port, but since a workaround is available (2/9), I will focus on other issues first. Maybe it is worth to mention this issue in the commit message for 2/9 though. -- Kind regards, Peter Wu https://lekensteyn.nl
Seemingly Similar Threads
- [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
- [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
- 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups