Lyude Paul
2018-Aug-23 01:40 UTC
[Nouveau] [PATCH 0/3] drm/nouveau: Fixup module probe to add ->shutdown()
This series is intended to add support for shutting down the GPU on kernel shutdown/reboot using the ->shutdown() hook, similar to what amdgpu does. This is mainly intended to workaround a bios issue on the P50 that was preventing nouveau from initializing the dedicated GM107 GPU on that system properly. You can find more details on this issue in the patch labeled "Shut down GPU on kernel shutdown". Doing this so that it wouldn't race with userspace took a little bit of work (and uncovered one small bug in the process). Mainly, I had to convert nouveau to use the new drm_dev_alloc() and drm_dev_register() helpers to ensure that there isn't any chance that we can race with userspace during shutdown. This also involved removing ->load() and ->unload() and moving their work into nouveau_drm_probe(). ============================== IMPORTANT ==============================This patch series WILL NOT WORK PROPERLY unless you also apply the following patches which fix some leftover device removal issues: https://patchwork.freedesktop.org/series/48596/ This fixes kernel panics on removal on systems with nvidia controlled backlights https://patchwork.freedesktop.org/series/47843/ This fixes issues with DRM connectors getting leaked on device removal, due to hpd_work still being active when unloading nouveau. ============================== IMPORTANT ============================== This also uncovered one small bug in nouveau_drm_load(). Lyude Paul (3): drm/nouveau: Fix potential memory leak in nouveau_drm_load() drm/nouveau: Start using new drm_dev initialization helpers drm/nouveau: Shut down GPU on kernel shutdown drivers/gpu/drm/nouveau/nouveau_drm.c | 180 +++++++++++++++----------- 1 file changed, 106 insertions(+), 74 deletions(-) -- 2.17.1
Lyude Paul
2018-Aug-23 01:40 UTC
[Nouveau] [PATCH 1/3] drm/nouveau: Fix potential memory leak in nouveau_drm_load()
We forget to free drm in all instances of failure, and additionally also forget to destroy the master client if the other client fails initialization. Signed-off-by: Lyude Paul <lyude at redhat.com> Cc: Karol Herbst <kherbst at redhat.com> --- drivers/gpu/drm/nouveau/nouveau_drm.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index 5fdc1fbe2ee5..6c88ff776ff6 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -538,11 +538,11 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags) ret = nouveau_cli_init(drm, "DRM-master", &drm->master); if (ret) - return ret; + goto fail_alloc; ret = nouveau_cli_init(drm, "DRM", &drm->client); if (ret) - return ret; + goto fail_master; dev->irq_enabled = true; @@ -607,7 +607,9 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags) fail_ttm: nouveau_vga_fini(drm); nouveau_cli_fini(&drm->client); +fail_master: nouveau_cli_fini(&drm->master); +fail_alloc: kfree(drm); return ret; } -- 2.17.1
Lyude Paul
2018-Aug-23 01:40 UTC
[Nouveau] [PATCH 2/3] drm/nouveau: Start using new drm_dev initialization helpers
Per the documentation in drm_get_pci_dev(), this function is deprecated and shouldn't be used anymore. As it turns out, we're going to need to stop using drm_get_pci_dev() anyway in order to allow us to turn off the card before full system shutdowns, otherwise we'll hit race conditions with userspace while trying to tear down the card on shutdown. So, start using drm_dev_get() and drm_dev_put(), and just turn our load/unload callbacks into open coded init/fini() functions. Signed-off-by: Lyude Paul <lyude at redhat.com> Cc: Karol Herbst <kherbst at redhat.com> --- drivers/gpu/drm/nouveau/nouveau_drm.c | 173 +++++++++++++++----------- 1 file changed, 101 insertions(+), 72 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index 6c88ff776ff6..b88b338dc79c 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -458,75 +458,8 @@ nouveau_accel_init(struct nouveau_drm *drm) nouveau_bo_move_init(drm); } -static int nouveau_drm_probe(struct pci_dev *pdev, - const struct pci_device_id *pent) -{ - struct nvkm_device *device; - struct apertures_struct *aper; - bool boot = false; - int ret; - - if (vga_switcheroo_client_probe_defer(pdev)) - return -EPROBE_DEFER; - - /* We need to check that the chipset is supported before booting - * fbdev off the hardware, as there's no way to put it back. - */ - ret = nvkm_device_pci_new(pdev, NULL, "error", true, false, 0, &device); - if (ret) - return ret; - - nvkm_device_del(&device); - - /* Remove conflicting drivers (vesafb, efifb etc). */ - aper = alloc_apertures(3); - if (!aper) - return -ENOMEM; - - aper->ranges[0].base = pci_resource_start(pdev, 1); - aper->ranges[0].size = pci_resource_len(pdev, 1); - aper->count = 1; - - if (pci_resource_len(pdev, 2)) { - aper->ranges[aper->count].base = pci_resource_start(pdev, 2); - aper->ranges[aper->count].size = pci_resource_len(pdev, 2); - aper->count++; - } - - if (pci_resource_len(pdev, 3)) { - aper->ranges[aper->count].base = pci_resource_start(pdev, 3); - aper->ranges[aper->count].size = pci_resource_len(pdev, 3); - aper->count++; - } - -#ifdef CONFIG_X86 - boot = pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW; -#endif - if (nouveau_modeset != 2) - drm_fb_helper_remove_conflicting_framebuffers(aper, "nouveaufb", boot); - kfree(aper); - - ret = nvkm_device_pci_new(pdev, nouveau_config, nouveau_debug, - true, true, ~0ULL, &device); - if (ret) - return ret; - - pci_set_master(pdev); - - if (nouveau_atomic) - driver_pci.driver_features |= DRIVER_ATOMIC; - - ret = drm_get_pci_dev(pdev, pent, &driver_pci); - if (ret) { - nvkm_device_del(&device); - return ret; - } - - return 0; -} - static int -nouveau_drm_load(struct drm_device *dev, unsigned long flags) +nouveau_drm_device_init(struct drm_device *dev) { struct nouveau_drm *drm; int ret; @@ -615,7 +548,7 @@ nouveau_drm_load(struct drm_device *dev, unsigned long flags) } static void -nouveau_drm_unload(struct drm_device *dev) +nouveau_drm_device_fini(struct drm_device *dev) { struct nouveau_drm *drm = nouveau_drm(dev); @@ -644,18 +577,116 @@ nouveau_drm_unload(struct drm_device *dev) kfree(drm); } +static int nouveau_drm_probe(struct pci_dev *pdev, + const struct pci_device_id *pent) +{ + struct nvkm_device *device; + struct drm_device *drm_dev; + struct apertures_struct *aper; + bool boot = false; + int ret; + + if (vga_switcheroo_client_probe_defer(pdev)) + return -EPROBE_DEFER; + + /* We need to check that the chipset is supported before booting + * fbdev off the hardware, as there's no way to put it back. + */ + ret = nvkm_device_pci_new(pdev, NULL, "error", true, false, 0, &device); + if (ret) + return ret; + + nvkm_device_del(&device); + + /* Remove conflicting drivers (vesafb, efifb etc). */ + aper = alloc_apertures(3); + if (!aper) + return -ENOMEM; + + aper->ranges[0].base = pci_resource_start(pdev, 1); + aper->ranges[0].size = pci_resource_len(pdev, 1); + aper->count = 1; + + if (pci_resource_len(pdev, 2)) { + aper->ranges[aper->count].base = pci_resource_start(pdev, 2); + aper->ranges[aper->count].size = pci_resource_len(pdev, 2); + aper->count++; + } + + if (pci_resource_len(pdev, 3)) { + aper->ranges[aper->count].base = pci_resource_start(pdev, 3); + aper->ranges[aper->count].size = pci_resource_len(pdev, 3); + aper->count++; + } + +#ifdef CONFIG_X86 + boot = pdev->resource[PCI_ROM_RESOURCE].flags & IORESOURCE_ROM_SHADOW; +#endif + if (nouveau_modeset != 2) + drm_fb_helper_remove_conflicting_framebuffers(aper, "nouveaufb", boot); + kfree(aper); + + ret = nvkm_device_pci_new(pdev, nouveau_config, nouveau_debug, + true, true, ~0ULL, &device); + if (ret) + return ret; + + pci_set_master(pdev); + + if (nouveau_atomic) + driver_pci.driver_features |= DRIVER_ATOMIC; + + drm_dev = drm_dev_alloc(&driver_pci, &pdev->dev); + if (IS_ERR(drm_dev)) { + ret = PTR_ERR(drm_dev); + goto fail_nvkm; + } + + ret = pci_enable_device(pdev); + if (ret) + goto fail_drm; + + drm_dev->pdev = pdev; + pci_set_drvdata(pdev, drm_dev); + + ret = nouveau_drm_device_init(drm_dev); + if (ret) + goto fail_pci; + + ret = drm_dev_register(drm_dev, pent->driver_data); + if (ret) + goto fail_drm_dev_init; + + return 0; + +fail_drm_dev_init: + nouveau_drm_device_fini(drm_dev); +fail_pci: + pci_disable_device(pdev); +fail_drm: + drm_dev_put(drm_dev); +fail_nvkm: + nvkm_device_del(&device); + return ret; +} + void nouveau_drm_device_remove(struct drm_device *dev) { + struct pci_dev *pdev = dev->pdev; struct nouveau_drm *drm = nouveau_drm(dev); struct nvkm_client *client; struct nvkm_device *device; + drm_dev_unregister(dev); + dev->irq_enabled = false; client = nvxx_client(&drm->client.base); device = nvkm_device_find(client->device); - drm_put_dev(dev); + nouveau_drm_device_fini(dev); + pci_disable_device(pdev); + drm_dev_put(dev); nvkm_device_del(&device); } @@ -1022,8 +1053,6 @@ driver_stub = { DRIVER_GEM | DRIVER_MODESET | DRIVER_PRIME | DRIVER_RENDER | DRIVER_KMS_LEGACY_CONTEXT, - .load = nouveau_drm_load, - .unload = nouveau_drm_unload, .open = nouveau_drm_open, .postclose = nouveau_drm_postclose, .lastclose = nouveau_vga_lastclose, -- 2.17.1
Lyude Paul
2018-Aug-23 01:40 UTC
[Nouveau] [PATCH 3/3] drm/nouveau: Shut down GPU on kernel shutdown
A little while ago I sent some patches to try to fix issues with initializing the GM107 GPU with nouveau on the ThinkPad P50. The issues I was witnessing were rather bizarre: seemingly at random, initializing the GPU would fail with failed mthds from disp that nouveau had not actually kicked through the evo channel yet. Example: [ 1.603467] nouveau 0000:01:00.0: disp: outp 02:0006:0f48: aux power -> demand [ 1.603931] nouveau 0000:01:00.0: disp: outp 03:0002:0f48: no heads (0 3 2) [ 1.604375] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: no heads (0 3 4) [ 1.604858] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> always [ 1.605354] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand [ 1.605815] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2) --->[ 1.607289] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002 --->[ 1.608818] nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000400 00001000 00000002 --->[ 1.609500] nouveau 0000:01:00.0: disp: chid 2 mthd 0000 data 00000400 00001000 00000002 [ 1.612392] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f42: 0x00000 AUX -> (ret= 1) 12 [ 1.612774] [drm:drm_dp_dpcd_write [drm_kms_helper]] sor-0006-0f42: 0x00111 AUX <- (ret= 1) 00 [ 1.635748] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries, giving up. First error: -6 [ 1.635752] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f48: 0x00000 AUX -> (ret= -6) [ 1.658128] [drm:drm_dp_dpcd_access [drm_kms_helper]] Too many retries, giving up. First error: -6 [ 1.658131] [drm:drm_dp_dpcd_read [drm_kms_helper]] sor-0006-0f81: 0x00000 AUX -> (ret= -6) These failures would also occur /before/ nouveau had actually pushed anything to the evo channel. Then, later the rest of the GPU would start failing like so: [ 3.851956] ------------[ cut here ]------------ [ 3.851958] nouveau 0000:01:00.0: timeout [ 3.851995] WARNING: CPU: 0 PID: 62 at drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxgf100.c:1560 gf100_grctx_generate+0x89d/0x8b0 [nouveau] [ 3.851997] Modules linked in: serio_raw crc32c_intel xhci_pci i915(O+) xhci_hcd nouveau(O) video mxm_wmi wmi i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) drm(O) i2c_core [ 3.852010] CPU: 0 PID: 62 Comm: kworker/0:2 Tainted: G O 4.18.0-rc8Lyude-Test+ #7 [ 3.852011] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018 [ 3.852018] Workqueue: events output_poll_execute [drm_kms_helper] [ 3.852105] RIP: 0010:gf100_grctx_generate+0x89d/0x8b0 [nouveau] [ 3.852107] Code: ff 49 8b 7c 24 10 48 8b 5f 50 48 85 db 75 04 48 8b 5f 10 e8 25 5d 30 e1 48 89 da 48 c7 c7 4e e7 2a a0 48 89 c6 e8 65 c1 e9 e0 <0f> 0b bb f0 ff ff ff e9 68 f9 ff ff 0f 1f 80 00 00 00 00 0f 1f 44 [ 3.852127] RSP: 0018:ffffc9000027b898 EFLAGS: 00010282 [ 3.852128] RAX: 0000000000000000 RBX: ffff880876c20bd0 RCX: 0000000000000006 [ 3.852130] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff88089b415570 [ 3.852132] RBP: ffffc9000027b958 R08: 0000000000000000 R09: 0000000000000000 [ 3.852133] R10: ffff880876685f00 R11: ffffffff8140cc60 R12: ffff8808716d2000 [ 3.852135] R13: ffffc9000027b8d0 R14: ffffc9000027b8c8 R15: ffff88087165c000 [ 3.852137] FS: 0000000000000000(0000) GS:ffff88089b400000(0000) knlGS:0000000000000000 [ 3.852139] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3.852140] CR2: 00005621d2c180b8 CR3: 000000000200a005 CR4: 00000000003606f0 [ 3.852142] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3.852144] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3.852145] Call Trace: [ 3.852168] ? nv04_timer_read+0x48/0x60 [nouveau] [ 3.852191] gf100_gr_init_ctxctl+0x536/0xa40 [nouveau] [ 3.852212] gf100_gr_init+0x563/0x590 [nouveau] [ 3.852234] gf100_gr_init_+0x5b/0x60 [nouveau] [ 3.852255] nvkm_gr_init+0x1d/0x20 [nouveau] [ 3.852267] nvkm_engine_init+0xb9/0x1f0 [nouveau] [ 3.852280] nvkm_subdev_init+0xbc/0x210 [nouveau] [ 3.852292] nvkm_engine_ref.part.0+0x4a/0x70 [nouveau] [ 3.852304] nvkm_engine_ref+0x13/0x20 [nouveau] [ 3.852316] nvkm_ioctl_new+0x12c/0x260 [nouveau] [ 3.852337] ? nvkm_fifo_chan_dtor+0x100/0x100 [nouveau] [ 3.852358] ? gf100_fermi_mthd+0x100/0x100 [nouveau] [ 3.852371] nvkm_ioctl+0xe2/0x180 [nouveau] [ 3.852392] nvkm_client_ioctl+0x12/0x20 [nouveau] [ 3.852403] nvif_object_ioctl+0x47/0x50 [nouveau] [ 3.852415] nvif_object_init+0xc8/0x120 [nouveau] [ 3.852435] nvc0_fbcon_accel_init+0x5b/0x950 [nouveau] [ 3.852455] nouveau_fbcon_create+0x5bb/0x5e0 [nouveau] [ 3.852460] ? drm_setup_crtcs+0x247/0xa60 [drm_kms_helper] [ 3.852464] __drm_fb_helper_initial_config_and_unlock+0x1c0/0x410 [drm_kms_helper] [ 3.852468] drm_fb_helper_hotplug_event.part.33+0xa9/0xb0 [drm_kms_helper] [ 3.852472] drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper] [ 3.852492] nouveau_fbcon_output_poll_changed+0xb6/0x110 [nouveau] [ 3.852496] drm_kms_helper_hotplug_event+0x2a/0x30 [drm_kms_helper] [ 3.852500] output_poll_execute+0x198/0x1c0 [drm_kms_helper] [ 3.852504] process_one_work+0x1b2/0x370 [ 3.852506] worker_thread+0x37/0x3a0 [ 3.852508] kthread+0x120/0x140 [ 3.852510] ? wq_update_unbound_numa+0x10/0x10 [ 3.852511] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3.852514] ret_from_fork+0x35/0x40 [ 3.852516] ---[ end trace 583fe2d8feb59e4a ]--- [ 3.852733] nouveau 0000:01:00.0: gr: failed to construct context [ 3.852737] nouveau 0000:01:00.0: gr: init failed, -16 Originally I had a good bit of trouble even reproducing this issue whatsoever. I've since then managed to figure out a reproducer that seems to work about 70% of the time: - Boot the machine while docked, load nouveau - Undock the machine, wait for nouveau to go into runtime suspend - Reboot the machine. If done correctly, you should be able to see nouveau briefly resume itself before the shutdown finishes - On the next boot, the following should happen (if it doesn't, go back to step one): - If nouveau isn't loaded within 10-20 seconds of booting, you will probably see an unclaimed interrupt warning. - Once nouveau is loaded, you'll see the symptoms I've described here At first I assumed that the BIOS was probably trying to probe for displays before loading the OS, leaving us with a GPU in a funky sort-of-on state. I tried some solutions that involved shutting down the various display channels that were left on, but eventually discovered that we were starting off with more then just disp channels left on: the entire gr was left on as well. Additionally; it was pointed out to me by Ben Skeggs that the BIOS doesn't make any use of evo channels anyway. After some investigation we found the real cause of the problem, and unfortunately it's far worse than leaving a few channels on: the version of the BIOS on this P50 appears to have a bug which makes it so that on full system reboots, the dedicated GPU somehow does not always get power cycled. In fact, it's even left in nearly the same state it was in before we finished rebooting! How awful. While it's quite clear there's one rather impressive BIOS bug going on here that needs to get fixed, we can at least solve most of the symptoms of this issue by making nouveau a little better about cleaning up after itself on kernel shutdowns/reboots. This is something nouveau is going to need to be able to do if it's ever going to be used for things like PCI passthrough anyway, since we want to avoid passing the GPU around from VM guest to VM host if it's still half-way initialized. Luckily I have some contacts at Lenovo, so I will be bringing this up and referencing this patch to make sure that this gets fixed properly in the P50 BIOS as well, especially since not having this fixed in the BIOS means it's possible for us to fail to reboot if we put both the card and the kernel into a bad state and require a full reboot. But until then, this patch should make the problem significantly less noticeable. For reference, the BIOS version on this P50 is version 1.52. Signed-off-by: Lyude Paul <lyude at redhat.com> Cc: Karol Herbst <kherbst at redhat.com> [omitting Cc to stable. I'd /love/ to get this into a stable kernel, but unfortunately there's too many large changes this depends on to do that] Signed-off-by: Lyude Paul <lyude at redhat.com> --- drivers/gpu/drm/nouveau/nouveau_drm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c index b88b338dc79c..c37641496324 100644 --- a/drivers/gpu/drm/nouveau/nouveau_drm.c +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c @@ -1151,6 +1151,7 @@ nouveau_drm_pci_driver = { .id_table = nouveau_drm_pci_table, .probe = nouveau_drm_probe, .remove = nouveau_drm_remove, + .shutdown = nouveau_drm_remove, .driver.pm = &nouveau_pm_ops, }; -- 2.17.1
Possibly Parallel Threads
- next/master boot bisection: Oops in nouveau driver on jetson-tk1
- [PATCH 2/3] drm/nouveau: manage nouveau_drm lifetime with devres
- [PATCH] drm/nouveau: support for platform devices
- next/master boot bisection: Oops in nouveau driver on jetson-tk1
- [PATCH v2] drm/nouveau: support for platform devices