thr3ads.net - Nouveau - [Nouveau] [PATCH 0/3] drm/nouveau: Fixup module probe to add ->shutdown() [Aug 2018]

If this information is useful, please help other people find it:
Share via:

Lyude Paul

2018-Aug-23 01:40 UTC

[Nouveau] [PATCH 0/3] drm/nouveau: Fixup module probe to add ->shutdown()

This series is intended to add support for shutting down the GPU on
kernel shutdown/reboot using the ->shutdown() hook, similar to what
amdgpu does. This is mainly intended to workaround a bios issue on the
P50 that was preventing nouveau from initializing the dedicated GM107
GPU on that system properly. You can find more details on this issue in
the patch labeled "Shut down GPU on kernel shutdown".

Doing this so that it wouldn't race with userspace took a
little bit of work (and uncovered one small bug in the process). Mainly,
I had to convert nouveau to use the new drm_dev_alloc() and
drm_dev_register() helpers to ensure that there isn't any chance that we
can race with userspace during shutdown. This also involved removing
->load() and ->unload() and moving their work into nouveau_drm_probe().

============================== IMPORTANT ==============================This
patch series WILL NOT WORK PROPERLY unless you also apply the
following patches which fix some leftover device removal issues:

https://patchwork.freedesktop.org/series/48596/
This fixes kernel panics on removal on systems with nvidia controlled
backlights

https://patchwork.freedesktop.org/series/47843/
This fixes issues with DRM connectors getting leaked on device removal,
due to hpd_work still being active when unloading nouveau.

============================== IMPORTANT ==============================
This also uncovered one small bug in nouveau_drm_load().

Lyude Paul (3):
  drm/nouveau: Fix potential memory leak in nouveau_drm_load()
  drm/nouveau: Start using new drm_dev initialization helpers
  drm/nouveau: Shut down GPU on kernel shutdown

 drivers/gpu/drm/nouveau/nouveau_drm.c | 180 +++++++++++++++-----------
 1 file changed, 106 insertions(+), 74 deletions(-)

-- 
2.17.1

Lyude Paul

2018-Aug-23 01:40 UTC

head link

[Nouveau] [PATCH 1/3] drm/nouveau: Fix potential memory leak in nouveau_drm_load()

We forget to free drm in all instances of failure, and additionally also
forget to destroy the master client if the other client fails
initialization.

Signed-off-by: Lyude Paul <lyude at redhat.com>
Cc: Karol Herbst <kherbst at redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c
b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 5fdc1fbe2ee5..6c88ff776ff6 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -538,11 +538,11 @@ nouveau_drm_load(struct drm_device *dev, unsigned long
flags)
 
 	ret = nouveau_cli_init(drm, "DRM-master", &drm->master);
 	if (ret)
-		return ret;
+		goto fail_alloc;
 
 	ret = nouveau_cli_init(drm, "DRM", &drm->client);
 	if (ret)
-		return ret;
+		goto fail_master;
 
 	dev->irq_enabled = true;
 
@@ -607,7 +607,9 @@ nouveau_drm_load(struct drm_device *dev, unsigned long
flags)
 fail_ttm:
 	nouveau_vga_fini(drm);
 	nouveau_cli_fini(&drm->client);
+fail_master:
 	nouveau_cli_fini(&drm->master);
+fail_alloc:
 	kfree(drm);
 	return ret;
 }
-- 
2.17.1

Lyude Paul

2018-Aug-23 01:40 UTC

head link

[Nouveau] [PATCH 2/3] drm/nouveau: Start using new drm_dev initialization helpers

Per the documentation in drm_get_pci_dev(), this function is deprecated
and shouldn't be used anymore. As it turns out, we're going to need to
stop using drm_get_pci_dev() anyway in order to allow us to turn off the
card before full system shutdowns, otherwise we'll hit race conditions
with userspace while trying to tear down the card on shutdown.

So, start using drm_dev_get() and drm_dev_put(), and just turn our
load/unload callbacks into open coded init/fini() functions.

Signed-off-by: Lyude Paul <lyude at redhat.com>
Cc: Karol Herbst <kherbst at redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 173 +++++++++++++++-----------
 1 file changed, 101 insertions(+), 72 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c
b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 6c88ff776ff6..b88b338dc79c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -458,75 +458,8 @@ nouveau_accel_init(struct nouveau_drm *drm)
 	nouveau_bo_move_init(drm);
 }
 
-static int nouveau_drm_probe(struct pci_dev *pdev,
-			     const struct pci_device_id *pent)
-{
-	struct nvkm_device *device;
-	struct apertures_struct *aper;
-	bool boot = false;
-	int ret;
-
-	if (vga_switcheroo_client_probe_defer(pdev))
-		return -EPROBE_DEFER;
-
-	/* We need to check that the chipset is supported before booting
-	 * fbdev off the hardware, as there's no way to put it back.
-	 */
-	ret = nvkm_device_pci_new(pdev, NULL, "error", true, false, 0,
&device);
-	if (ret)
-		return ret;
-
-	nvkm_device_del(&device);
-
-	/* Remove conflicting drivers (vesafb, efifb etc). */
-	aper = alloc_apertures(3);
-	if (!aper)
-		return -ENOMEM;
-
-	aper->ranges[0].base = pci_resource_start(pdev, 1);
-	aper->ranges[0].size = pci_resource_len(pdev, 1);
-	aper->count = 1;
-
-	if (pci_resource_len(pdev, 2)) {
-		aper->ranges[aper->count].base = pci_resource_start(pdev, 2);
-		aper->ranges[aper->count].size = pci_resource_len(pdev, 2);
-		aper->count++;
-	}
-
-	if (pci_resource_len(pdev, 3)) {
-		aper->ranges[aper->count].base = pci_resource_start(pdev, 3);
-		aper->ranges[aper->count].size = pci_resource_len(pdev, 3);
-		aper->count++;
-	}
-
-#ifdef CONFIG_X86
-	boot = pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW;
-#endif
-	if (nouveau_modeset != 2)
-		drm_fb_helper_remove_conflicting_framebuffers(aper, "nouveaufb",
boot);
-	kfree(aper);
-
-	ret = nvkm_device_pci_new(pdev, nouveau_config, nouveau_debug,
-				  true, true, ~0ULL, &device);
-	if (ret)
-		return ret;
-
-	pci_set_master(pdev);
-
-	if (nouveau_atomic)
-		driver_pci.driver_features |= DRIVER_ATOMIC;
-
-	ret = drm_get_pci_dev(pdev, pent, &driver_pci);
-	if (ret) {
-		nvkm_device_del(&device);
-		return ret;
-	}
-
-	return 0;
-}
-
 static int
-nouveau_drm_load(struct drm_device *dev, unsigned long flags)
+nouveau_drm_device_init(struct drm_device *dev)
 {
 	struct nouveau_drm *drm;
 	int ret;
@@ -615,7 +548,7 @@ nouveau_drm_load(struct drm_device *dev, unsigned long
flags)
 }
 
 static void
-nouveau_drm_unload(struct drm_device *dev)
+nouveau_drm_device_fini(struct drm_device *dev)
 {
 	struct nouveau_drm *drm = nouveau_drm(dev);
 
@@ -644,18 +577,116 @@ nouveau_drm_unload(struct drm_device *dev)
 	kfree(drm);
 }
 
+static int nouveau_drm_probe(struct pci_dev *pdev,
+			     const struct pci_device_id *pent)
+{
+	struct nvkm_device *device;
+	struct drm_device *drm_dev;
+	struct apertures_struct *aper;
+	bool boot = false;
+	int ret;
+
+	if (vga_switcheroo_client_probe_defer(pdev))
+		return -EPROBE_DEFER;
+
+	/* We need to check that the chipset is supported before booting
+	 * fbdev off the hardware, as there's no way to put it back.
+	 */
+	ret = nvkm_device_pci_new(pdev, NULL, "error", true, false, 0,
&device);
+	if (ret)
+		return ret;
+
+	nvkm_device_del(&device);
+
+	/* Remove conflicting drivers (vesafb, efifb etc). */
+	aper = alloc_apertures(3);
+	if (!aper)
+		return -ENOMEM;
+
+	aper->ranges[0].base = pci_resource_start(pdev, 1);
+	aper->ranges[0].size = pci_resource_len(pdev, 1);
+	aper->count = 1;
+
+	if (pci_resource_len(pdev, 2)) {
+		aper->ranges[aper->count].base = pci_resource_start(pdev, 2);
+		aper->ranges[aper->count].size = pci_resource_len(pdev, 2);
+		aper->count++;
+	}
+
+	if (pci_resource_len(pdev, 3)) {
+		aper->ranges[aper->count].base = pci_resource_start(pdev, 3);
+		aper->ranges[aper->count].size = pci_resource_len(pdev, 3);
+		aper->count++;
+	}
+
+#ifdef CONFIG_X86
+	boot = pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW;
+#endif
+	if (nouveau_modeset != 2)
+		drm_fb_helper_remove_conflicting_framebuffers(aper, "nouveaufb",
boot);
+	kfree(aper);
+
+	ret = nvkm_device_pci_new(pdev, nouveau_config, nouveau_debug,
+				  true, true, ~0ULL, &device);
+	if (ret)
+		return ret;
+
+	pci_set_master(pdev);
+
+	if (nouveau_atomic)
+		driver_pci.driver_features |= DRIVER_ATOMIC;
+
+	drm_dev = drm_dev_alloc(&driver_pci, &pdev->dev);
+	if (IS_ERR(drm_dev)) {
+		ret = PTR_ERR(drm_dev);
+		goto fail_nvkm;
+	}
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto fail_drm;
+
+	drm_dev->pdev = pdev;
+	pci_set_drvdata(pdev, drm_dev);
+
+	ret = nouveau_drm_device_init(drm_dev);
+	if (ret)
+		goto fail_pci;
+
+	ret = drm_dev_register(drm_dev, pent->driver_data);
+	if (ret)
+		goto fail_drm_dev_init;
+
+	return 0;
+
+fail_drm_dev_init:
+	nouveau_drm_device_fini(drm_dev);
+fail_pci:
+	pci_disable_device(pdev);
+fail_drm:
+	drm_dev_put(drm_dev);
+fail_nvkm:
+	nvkm_device_del(&device);
+	return ret;
+}
+
 void
 nouveau_drm_device_remove(struct drm_device *dev)
 {
+	struct pci_dev *pdev = dev->pdev;
 	struct nouveau_drm *drm = nouveau_drm(dev);
 	struct nvkm_client *client;
 	struct nvkm_device *device;
 
+	drm_dev_unregister(dev);
+
 	dev->irq_enabled = false;
 	client = nvxx_client(&drm->client.base);
 	device = nvkm_device_find(client->device);
-	drm_put_dev(dev);
 
+	nouveau_drm_device_fini(dev);
+	pci_disable_device(pdev);
+	drm_dev_put(dev);
 	nvkm_device_del(&device);
 }
 
@@ -1022,8 +1053,6 @@ driver_stub = {
 		DRIVER_GEM | DRIVER_MODESET | DRIVER_PRIME | DRIVER_RENDER |
 		DRIVER_KMS_LEGACY_CONTEXT,
 
-	.load = nouveau_drm_load,
-	.unload = nouveau_drm_unload,
 	.open = nouveau_drm_open,
 	.postclose = nouveau_drm_postclose,
 	.lastclose = nouveau_vga_lastclose,
-- 
2.17.1

Lyude Paul

2018-Aug-23 01:40 UTC

head link

[Nouveau] [PATCH 3/3] drm/nouveau: Shut down GPU on kernel shutdown

A little while ago I sent some patches to try to fix issues with
initializing the GM107 GPU with nouveau on the ThinkPad P50. The issues
I was witnessing were rather bizarre: seemingly at random, initializing
the GPU would fail with failed mthds from disp that nouveau had not
actually kicked through the evo channel yet. Example:

    [    1.603467] nouveau 0000:01:00.0: disp: outp 02:0006:0f48: aux power
-> demand
    [    1.603931] nouveau 0000:01:00.0: disp: outp 03:0002:0f48: no heads (0 3
2)
    [    1.604375] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: no heads (0 3
4)
    [    1.604858] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power
-> always
    [    1.605354] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power
-> demand
    [    1.605815] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3
2)
--->[    1.607289] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400
00001000 00000002
--->[    1.608818] nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000400
00001000 00000002
--->[    1.609500] nouveau 0000:01:00.0: disp: chid 2 mthd 0000 data 00000400
00001000 00000002
    [    1.612392] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f42:
0x00000 AUX -> (ret=  1) 12
    [    1.612774] [drm:drm_dp_dpcd_write [drm_kms_helper]] sor-0006-0f42:
0x00111 AUX <- (ret=  1) 00
    [    1.635748] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries,
giving up. First error: -6
    [    1.635752] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f48:
0x00000 AUX -> (ret= -6)
    [    1.658128] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries,
giving up. First error: -6
    [    1.658131] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f81:
0x00000 AUX -> (ret= -6)

These failures would also occur /before/ nouveau had actually pushed
anything to the evo channel. Then, later the rest of the GPU would start
failing like so:

[    3.851956] ------------[ cut here ]------------
[    3.851958] nouveau 0000:01:00.0: timeout
[    3.851995] WARNING: CPU: 0 PID: 62 at
drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1560
gf100_grctx_generate+0x89d/0x8b0 [nouveau]
[    3.851997] Modules linked in: serio_raw crc32c_intel xhci_pci i915(O+)
xhci_hcd nouveau(O) video mxm_wmi wmi i2c_algo_bit drm_kms_helper(O) syscopyarea
sysfillrect sysimgblt fb_sys_fops ttm(O) drm(O) i2c_core
[    3.852010] CPU: 0 PID: 62 Comm: kworker/0:2 Tainted: G           O     
4.18.0-rc8Lyude-Test+ #7
[    3.852011] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51
) 05/18/2018
[    3.852018] Workqueue: events output_poll_execute [drm_kms_helper]
[    3.852105] RIP: 0010:gf100_grctx_generate+0x89d/0x8b0 [nouveau]
[    3.852107] Code: ff 49 8b 7c 24 10 48 8b 5f 50 48 85 db 75 04 48 8b 5f 10 e8
25 5d 30 e1 48 89 da 48 c7 c7 4e e7 2a a0 48 89 c6 e8 65 c1 e9 e0 <0f> 0b
bb f0 ff ff ff e9 68 f9 ff ff 0f 1f 80 00 00 00 00 0f 1f 44
[    3.852127] RSP: 0018:ffffc9000027b898 EFLAGS: 00010282
[    3.852128] RAX: 0000000000000000 RBX: ffff880876c20bd0 RCX: 0000000000000006
[    3.852130] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff88089b415570
[    3.852132] RBP: ffffc9000027b958 R08: 0000000000000000 R09: 0000000000000000
[    3.852133] R10: ffff880876685f00 R11: ffffffff8140cc60 R12: ffff8808716d2000
[    3.852135] R13: ffffc9000027b8d0 R14: ffffc9000027b8c8 R15: ffff88087165c000
[    3.852137] FS:  0000000000000000(0000) GS:ffff88089b400000(0000)
knlGS:0000000000000000
[    3.852139] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.852140] CR2: 00005621d2c180b8 CR3: 000000000200a005 CR4: 00000000003606f0
[    3.852142] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    3.852144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    3.852145] Call Trace:
[    3.852168]  ? nv04_timer_read+0x48/0x60 [nouveau]
[    3.852191]  gf100_gr_init_ctxctl+0x536/0xa40 [nouveau]
[    3.852212]  gf100_gr_init+0x563/0x590 [nouveau]
[    3.852234]  gf100_gr_init_+0x5b/0x60 [nouveau]
[    3.852255]  nvkm_gr_init+0x1d/0x20 [nouveau]
[    3.852267]  nvkm_engine_init+0xb9/0x1f0 [nouveau]
[    3.852280]  nvkm_subdev_init+0xbc/0x210 [nouveau]
[    3.852292]  nvkm_engine_ref.part.0+0x4a/0x70 [nouveau]
[    3.852304]  nvkm_engine_ref+0x13/0x20 [nouveau]
[    3.852316]  nvkm_ioctl_new+0x12c/0x260 [nouveau]
[    3.852337]  ? nvkm_fifo_chan_dtor+0x100/0x100 [nouveau]
[    3.852358]  ? gf100_fermi_mthd+0x100/0x100 [nouveau]
[    3.852371]  nvkm_ioctl+0xe2/0x180 [nouveau]
[    3.852392]  nvkm_client_ioctl+0x12/0x20 [nouveau]
[    3.852403]  nvif_object_ioctl+0x47/0x50 [nouveau]
[    3.852415]  nvif_object_init+0xc8/0x120 [nouveau]
[    3.852435]  nvc0_fbcon_accel_init+0x5b/0x950 [nouveau]
[    3.852455]  nouveau_fbcon_create+0x5bb/0x5e0 [nouveau]
[    3.852460]  ? drm_setup_crtcs+0x247/0xa60 [drm_kms_helper]
[    3.852464]  __drm_fb_helper_initial_config_and_unlock+0x1c0/0x410
[drm_kms_helper]
[    3.852468]  drm_fb_helper_hotplug_event.part.33+0xa9/0xb0 [drm_kms_helper]
[    3.852472]  drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper]
[    3.852492]  nouveau_fbcon_output_poll_changed+0xb6/0x110 [nouveau]
[    3.852496]  drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper]
[    3.852500]  output_poll_execute+0x198/0x1c0 [drm_kms_helper]
[    3.852504]  process_one_work+0x1b2/0x370
[    3.852506]  worker_thread+0x37/0x3a0
[    3.852508]  kthread+0x120/0x140
[    3.852510]  ? wq_update_unbound_numa+0x10/0x10
[    3.852511]  ? kthread_create_worker_on_cpu+0x70/0x70
[    3.852514]  ret_from_fork+0x35/0x40
[    3.852516] ---[ end trace 583fe2d8feb59e4a ]---
[    3.852733] nouveau 0000:01:00.0: gr: failed to construct context
[    3.852737] nouveau 0000:01:00.0: gr: init failed, -16

Originally I had a good bit of trouble even reproducing this issue
whatsoever. I've since then managed to figure out a reproducer that
seems to work about 70% of the time:

- Boot the machine while docked, load nouveau
- Undock the machine, wait for nouveau to go into runtime suspend
- Reboot the machine. If done correctly, you should be able to see
  nouveau briefly resume itself before the shutdown finishes
- On the next boot, the following should happen (if it doesn't, go back
  to step one):
  - If nouveau isn't loaded within 10-20 seconds of booting, you will
    probably see an unclaimed interrupt warning.
  - Once nouveau is loaded, you'll see the symptoms I've described here

At first I assumed that the BIOS was probably trying to probe for
displays before loading the OS, leaving us with a GPU in a funky
sort-of-on state. I tried some solutions that involved shutting down the
various display channels that were left on, but eventually discovered
that we were starting off with more then just disp channels left on: the
entire gr was left on as well. Additionally; it was pointed out to me by
Ben Skeggs that the BIOS doesn't make any use of evo channels anyway.

After some investigation we found the real cause of the problem, and
unfortunately it's far worse than leaving a few channels on: the version
of the BIOS on this P50 appears to have a bug which makes it so that on
full system reboots, the dedicated GPU somehow does not always get power
cycled. In fact, it's even left in nearly the same state it was in
before we finished rebooting! How awful.

While it's quite clear there's one rather impressive BIOS bug going on
here that needs to get fixed, we can at least solve most of the symptoms
of this issue by making nouveau a little better about cleaning up after
itself on kernel shutdowns/reboots. This is something nouveau is going
to need to be able to do if it's ever going to be used for things like
PCI passthrough anyway, since we want to avoid passing the GPU around
from VM guest to VM host if it's still half-way initialized.

Luckily I have some contacts at Lenovo, so I will be bringing this up
and referencing this patch to make sure that this gets fixed properly in
the P50 BIOS as well, especially since not having this fixed in the BIOS
means it's possible for us to fail to reboot if we put both the card and
the kernel into a bad state and require a full reboot. But until then,
this patch should make the problem significantly less noticeable.

For reference, the BIOS version on this P50 is version 1.52.

Signed-off-by: Lyude Paul <lyude at redhat.com>
Cc: Karol Herbst <kherbst at redhat.com>
[omitting Cc to stable. I'd /love/ to get this into a stable kernel, but
unfortunately there's too many large changes this depends on to do that]

Signed-off-by: Lyude Paul <lyude at redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_drm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c
b/drivers/gpu/drm/nouveau/nouveau_drm.c
index b88b338dc79c..c37641496324 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -1151,6 +1151,7 @@ nouveau_drm_pci_driver = {
 	.id_table = nouveau_drm_pci_table,
 	.probe = nouveau_drm_probe,
 	.remove = nouveau_drm_remove,
+	.shutdown = nouveau_drm_remove,
 	.driver.pm = &nouveau_pm_ops,
 };
 
-- 
2.17.1

Seemingly Similar Threads

Search for more possibly parallel threads

Nouveau - Aug 2018 - [PATCH 0/3] drm/nouveau: Fixup module probe to add ->shutdown()

[Nouveau] [PATCH 0/3] drm/nouveau: Fixup module probe to add ->shutdown()

[Nouveau] [PATCH 1/3] drm/nouveau: Fix potential memory leak in nouveau_drm_load()

[Nouveau] [PATCH 2/3] drm/nouveau: Start using new drm_dev initialization helpers

[Nouveau] [PATCH 3/3] drm/nouveau: Shut down GPU on kernel shutdown

Seemingly Similar Threads