Lyude Paul
2018-Aug-20 17:20 UTC
[Nouveau] [PATCH 0/2] drm/nouveau: Fix GM107 disp init failures on ThinkPad P50
This series fixes some intermittent issues with bringing up the dedicated GM107 GPU that I've been observing on my ThinkPad P50. More details within.

Lyude Paul (2):
  drm/nouveau: Fix GM107 disp core chan init on ThinkPad P50
  drm/nouveau: Fix GM107 disp dmac chan init on ThinkPad P50

 .../drm/nouveau/nvkm/engine/disp/coregf119.c | 21 +++++++++++++++++--
 .../drm/nouveau/nvkm/engine/disp/dmacgf119.c | 19 +++++++++++++++--
 2 files changed, 36 insertions(+), 4 deletions(-)

--
2.17.1
Lyude Paul
2018-Aug-20 17:20 UTC
[Nouveau] [PATCH 1/2] drm/nouveau: Fix GM107 disp core chan init on ThinkPad P50
I've been experiencing a rather strange looking bug on the P50 I've got for work. After a number of reboots, nouveau will fail to initialize the dedicated GPU on the system at boot properly. Things start off with this disp mthd failure:

...
[ 2.088505] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand
[ 2.088516] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2)
[ 2.088620] nouveau 0000:01:00.0: disp: init completed in 329us
[ 2.088957] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002
   the failure ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[ 2.151517] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 2.151517] [drm] Driver supports precise vblank timestamp query.
[ 2.151521] 0088 1 core507d_init
[ 2.151522] f0000000

After the error happens, parts of the card start timing out and eventually the GR fails to hold its golden context and starts timing out:

[ 10.163137] ------------[ cut here ]------------
[ 10.163169] nouveau 0000:01:00.0: timeout
[ 10.163218] WARNING: CPU: 4 PID: 98 at drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c:181 gf119_disp_core_fini+0xe6/0x140 [nouveau]
[ 10.163246] Modules linked in: joydev vfat fat intel_rapl iTCO_wdt x86_pkg_temp_thermal coretemp crc32_pclmul psmouse wmi_bmof i2c_i801 mei_me tpm_tis mei tpm_tis_core tpm thinkpad_acpi pcc_cpufreq ax88179_178a usbnet mii nouveau mxm_wmi i915 ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel serio_raw xhci_pci drm xhci_hcd i2c_core wmi video
[ 10.163330] CPU: 4 PID: 98 Comm: kworker/4:1 Kdump: loaded Not tainted 4.18.0-rc8Lyude-Test+ #7
[ 10.163349] Hardware name: LENOVO 20EQS64N0B/20EQS64N0B, BIOS N1EET78W (1.51 ) 05/18/2018
[ 10.163370] Workqueue: pm pm_runtime_work
[ 10.163404] RIP: 0010:gf119_disp_core_fini+0xe6/0x140 [nouveau]
[ 10.163418] Code: 5e 41 5f 5d c3 49 8b 7c 24 10 48 8b 5f 50 48 85 db 74 5f e8 1c 5b 0f e1 48 89 da 48 c7 c7 b3 b2 4e a0 48 89 c6 e8 5c bf c8 e0 <0f> 0b 41 8b 47 50 85 c0 74 c6 49 8b 7c 24 78 48 81 c7 90 04 61 00
[ 10.163476] RSP: 0018:ffffc90000a83b00 EFLAGS: 00010286
[ 10.163489] RAX: 0000000000000000 RBX: ffff8808773c6bd0 RCX: 0000000000000006
[ 10.163506] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff88089b515570
[ 10.163523] RBP: ffffc90000a83b28 R08: 0000000000000000 R09: 0000000000aaaaaa
[ 10.163539] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8808715b2c00
[ 10.163556] R13: ffff88087779d780 R14: 00000001e68f0200 R15: ffff88086f91b000
[ 10.163573] FS: 0000000000000000(0000) GS:ffff88089b500000(0000) knlGS:0000000000000000
[ 10.163591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.163605] CR2: 00007f3d7953d180 CR3: 000000000200a003 CR4: 00000000003606e0
[ 10.163622] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.163639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 10.163655] Call Trace:
[ 10.163686] nv50_disp_chan_fini+0x23/0x40 [nouveau]
[ 10.163711] nvkm_object_fini+0xbf/0x150 [nouveau]
[ 10.163735] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163759] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163783] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163807] nvkm_object_fini+0x76/0x150 [nouveau]
[ 10.163840] nvkm_client_suspend+0x13/0x20 [nouveau]
[ 10.163864] nvif_client_suspend+0x1d/0x20 [nouveau]
[ 10.163898] nouveau_do_suspend+0x113/0x310 [nouveau]
[ 10.163931] nouveau_pmops_runtime_suspend+0x57/0xe0 [nouveau]
[ 10.163947] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163960] pci_pm_runtime_suspend+0x6b/0x180
[ 10.163972] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163985] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.163997] __rpm_callback+0xcc/0x1e0
[ 10.164009] ? __switch_to_asm+0x40/0x70
[ 10.164020] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.164033] rpm_callback+0x24/0x80
[ 10.164043] ? pci_has_legacy_pm_support+0x70/0x70
[ 10.164055] rpm_suspend+0x142/0x600
[ 10.164066] ? __switch_to_asm+0x40/0x70
[ 10.164100] pm_runtime_work+0x79/0x90
[ 10.164112] process_one_work+0x1b2/0x370
[ 10.164140] worker_thread+0x37/0x3a0
[ 10.164150] kthread+0x120/0x140
[ 10.164160] ? wq_update_unbound_numa+0x10/0x10
[ 10.164172] ? kthread_create_worker_on_cpu+0x70/0x70
[ 10.164186] ret_from_fork+0x35/0x40
[ 10.164196] ---[ end trace d5c556c207f0c26b ]---

You'll notice from those traces that the very first evo kick happens /after/ the mthd failure on the display channel, not before. Additionally, there is no point at this part of the initialization process where we actually call mthd 0000 from nouveau.

Upon closer inspection, I discovered that this mysterious phantom disp failure seems to be the result of someone else (probably the VBIOS or the BIOS of the P50) leaving the disp core channel enabled by the time nouveau starts initializing it. This was confirmed by observing that the 0x610490 register holds a value of 0x490a009b when the card is in this broken state, as opposed to the usual 0x48070088 or 0x48000088 observed on most cards pre-init.

It appears we can fix this by checking for the unknown mask 0x000a0000, and simply shutting down the channel like we normally would on suspend or driver unload before we start trying to initialize it. This appears to be close to what nouveau does for older cards, as a similar workaround can be seen in nv50_disp_core_init().

Unfortunately, I'm still not entirely clear on what conditions actually cause this problem to be reproduced. Everyone else I've talked to so far with a P50 doesn't report ever having hit this issue. I also haven't managed to find a clear reproducer for this besides rebooting the machine until the bug happens, while alternating between booting while docked and while on battery every so often.

This fixes most random initialization errors on my ThinkPad P50 with a GM107 GPU.

Signed-off-by: Lyude Paul <lyude at redhat.com>
Cc: Karol Herbst <kherbst at redhat.com>
Cc: stable at vger.kernel.org
---
 .../drm/nouveau/nvkm/engine/disp/coregf119.c | 21 +++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
index d162b9cf4eac..7534b5e9246f 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/coregf119.c
@@ -166,8 +166,8 @@ gf119_disp_core_mthd = {
 	}
 };
 
-void
-gf119_disp_core_fini(struct nv50_disp_chan *chan)
+static bool
+gf119_disp_core_deactivate(struct nv50_disp_chan *chan)
 {
 	struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
 	struct nvkm_device *device = subdev->device;
@@ -181,7 +181,16 @@ gf119_disp_core_fini(struct nv50_disp_chan *chan)
 	) < 0) {
 		nvkm_error(subdev, "core fini: %08x\n",
 			   nvkm_rd32(device, 0x610490));
+		return false;
 	}
+
+	return true;
+}
+
+void
+gf119_disp_core_fini(struct nv50_disp_chan *chan)
+{
+	gf119_disp_core_deactivate(chan);
 }
 
 static int
@@ -190,6 +199,14 @@ gf119_disp_core_init(struct nv50_disp_chan *chan)
 	struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
 	struct nvkm_device *device = subdev->device;
 
+	/* attempt to unstick the channel from some unknown state */
+	if ((nvkm_rd32(device, 0x610490) & 0x000a0000) == 0x000a0000 &&
+	    WARN_ON(!gf119_disp_core_deactivate(chan))) {
+
+		nvkm_error(subdev, "core won't shut down, aborting\n");
+		return -EBUSY;
+	}
+
 	/* initialise channel for dma command submission */
 	nvkm_wr32(device, 0x610494, chan->push);
 	nvkm_wr32(device, 0x610498, 0x00010000);
--
2.17.1
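As an illustration of the check added to gf119_disp_core_init(), here is a minimal standalone C sketch. It is not nouveau code: the helper chan_looks_stuck() and the macro DISP_CHAN_STUCK_MASK are hypothetical names, and only the mask and the three sample register values come from the commit message. It suggests why the patch compares the masked value against the full mask instead of testing for any nonzero bit: the normal pre-init value 0x48070088 already has one of the two mask bits set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask taken from the patch; its hardware meaning is unknown. */
#define DISP_CHAN_STUCK_MASK 0x000a0000u

/*
 * Hypothetical helper (not part of nouveau): returns true only when
 * every bit of the mask is set in the channel control register value.
 */
static bool chan_looks_stuck(uint32_t ctrl_val)
{
	return (ctrl_val & DISP_CHAN_STUCK_MASK) == DISP_CHAN_STUCK_MASK;
}
```

With the values quoted above, chan_looks_stuck() is true for 0x490a009b and false for both 0x48070088 and 0x48000088. A looser test such as (val & mask) != 0 would also match 0x48070088, so it would shut down a channel that is in its usual pre-init state.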
Lyude Paul
2018-Aug-20 17:20 UTC
[Nouveau] [PATCH 2/2] drm/nouveau: Fix GM107 disp dmac chan init on ThinkPad P50
Just like how the P50 will occasionally leave the disp's core channel on before nouveau starts initializing, it will occasionally do the same thing with the rest of the dmac channels in addition to the core channel. Example:

[ 1.604375] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: no heads (0 3 4)
[ 1.604858] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> always
[ 1.605354] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand
[ 1.605815] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2)
[ 1.607289] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002
[ 1.608818] nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000400 00001000 00000002
[ 1.609500] nouveau 0000:01:00.0: disp: chid 2 mthd 0000 data 00000400 00001000 00000002

This, of course, later causes other parts of the card to start timing out and failing. Closer inspection shows the same thing happening as with our core channel; 0x610490 + (ctrl * 0x10) always has the same unknown 0x000a0000 mask set when the phantom mthd failures start appearing.

So, apply the same workaround we use for the core disp channel to the rest of the disp channels.

This, along with the previous patch, fixes random initialization failures observed with the ThinkPad P50.

Signed-off-by: Lyude Paul <lyude at redhat.com>
Cc: Karol Herbst <karolherbst at gmail.com>
Cc: stable at vger.kernel.org
---
 .../drm/nouveau/nvkm/engine/disp/dmacgf119.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
index edf7dd0d931d..7bc91f260e27 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
@@ -35,8 +35,8 @@ gf119_disp_dmac_bind(struct nv50_disp_chan *chan,
 				 chan->chid.user << 27 | 0x00000001);
 }
 
-void
-gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
+static bool
+gf119_disp_dmac_deactivate(struct nv50_disp_chan *chan)
 {
 	struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
 	struct nvkm_device *device = subdev->device;
@@ -52,7 +52,16 @@ gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
 	) < 0) {
 		nvkm_error(subdev, "ch %d fini: %08x\n", user,
 			   nvkm_rd32(device, 0x610490 + (ctrl * 0x10)));
+		return false;
 	}
+
+	return true;
+}
+
+void
+gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
+{
+	gf119_disp_dmac_deactivate(chan);
 }
 
 static int
@@ -63,6 +72,12 @@ gf119_disp_dmac_init(struct nv50_disp_chan *chan)
 	int ctrl = chan->chid.ctrl;
 	int user = chan->chid.user;
 
+	/* shut down the channel if it was left on, probably by the VBIOS */
+	if ((nvkm_rd32(device, 0x610490 + (ctrl * 0x10)) & 0x000a0000) == 0x000a0000 &&
+	    WARN_ON(!gf119_disp_dmac_deactivate(chan))) {
+		return -EBUSY;
+	}
+
 	/* initialise channel for dma command submission */
 	nvkm_wr32(device, 0x610494 + (ctrl * 0x0010), chan->push);
 	nvkm_wr32(device, 0x610498 + (ctrl * 0x0010), 0x00010000);
--
2.17.1
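To make the per-channel register arithmetic in this patch easier to follow, here is a small standalone C sketch of the 0x610490 + (ctrl * 0x10) computation. It is only an illustration: disp_chan_ctrl_addr() and the two macros are hypothetical names, and only the base address and the 0x10 stride are taken from the diff.

```c
#include <stdint.h>

/* Base and stride as used in the diff above. */
#define DISP_CHAN_CTRL_BASE   0x610490u
#define DISP_CHAN_CTRL_STRIDE 0x10u

/*
 * Hypothetical helper (not part of nouveau): address of the control
 * register for the channel whose control index is `ctrl`.
 */
static uint32_t disp_chan_ctrl_addr(unsigned int ctrl)
{
	return DISP_CHAN_CTRL_BASE + ctrl * DISP_CHAN_CTRL_STRIDE;
}
```

Under these assumptions, index 0 maps to 0x610490, the register that patch 1/2 reads for the core channel, while indices 1 and 2 map to 0x6104a0 and 0x6104b0.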
Lyude Paul
2018-Aug-21 16:53 UTC
[Nouveau] [PATCH 2/2] drm/nouveau: Fix GM107 disp dmac chan init on ThinkPad P50
As a note: I don't think this patch is ready /just/ yet, as I hit this problem again this morning. It looks like I'm checking the wrong mask for dmac; it appears to be slightly different from the core's. Looking into this now.

On Mon, 2018-08-20 at 13:20 -0400, Lyude Paul wrote:
> Just like how the P50 will occasionally leave the disp's core channel on
> before nouveau starts initializing, it will occasionally do the same
> thing with the rest of the dmac channels in addition to the core channel.
> Example:
>
> [ 1.604375] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: no heads (0 3 4)
> [ 1.604858] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> always
> [ 1.605354] nouveau 0000:01:00.0: disp: outp 04:0006:0f81: aux power -> demand
> [ 1.605815] nouveau 0000:01:00.0: disp: outp 05:0002:0f81: no heads (0 3 2)
> [ 1.607289] nouveau 0000:01:00.0: disp: chid 0 mthd 0000 data 00000400 00001000 00000002
> [ 1.608818] nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000400 00001000 00000002
> [ 1.609500] nouveau 0000:01:00.0: disp: chid 2 mthd 0000 data 00000400 00001000 00000002
>
> This, of course, later causes other parts of the card to start timing
> out and failing. Closer inspection shows the same thing happening as
> with our core channel; 0x610490 + (ctrl * 0x10) always has the same
> unknown 0x000a0000 mask set when the phantom mthd failures start
> appearing.
>
> So, apply the same workaround we use for the core disp channel to
> the rest of the disp channels.
>
> This, along with the previous patch, fixes random initialization failures
> observed with the ThinkPad P50.
>
> Signed-off-by: Lyude Paul <lyude at redhat.com>
> Cc: Karol Herbst <karolherbst at gmail.com>
> Cc: stable at vger.kernel.org
> ---
>  .../drm/nouveau/nvkm/engine/disp/dmacgf119.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
> index edf7dd0d931d..7bc91f260e27 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/dmacgf119.c
> @@ -35,8 +35,8 @@ gf119_disp_dmac_bind(struct nv50_disp_chan *chan,
>  				 chan->chid.user << 27 | 0x00000001);
>  }
>
> -void
> -gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
> +static bool
> +gf119_disp_dmac_deactivate(struct nv50_disp_chan *chan)
>  {
>  	struct nvkm_subdev *subdev = &chan->disp->base.engine.subdev;
>  	struct nvkm_device *device = subdev->device;
> @@ -52,7 +52,16 @@ gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
>  	) < 0) {
>  		nvkm_error(subdev, "ch %d fini: %08x\n", user,
>  			   nvkm_rd32(device, 0x610490 + (ctrl * 0x10)));
> +		return false;
>  	}
> +
> +	return true;
> +}
> +
> +void
> +gf119_disp_dmac_fini(struct nv50_disp_chan *chan)
> +{
> +	gf119_disp_dmac_deactivate(chan);
>  }
>
>  static int
> @@ -63,6 +72,12 @@ gf119_disp_dmac_init(struct nv50_disp_chan *chan)
>  	int ctrl = chan->chid.ctrl;
>  	int user = chan->chid.user;
>
> +	/* shut down the channel if it was left on, probably by the VBIOS */
> +	if ((nvkm_rd32(device, 0x610490 + (ctrl * 0x10)) & 0x000a0000) == 0x000a0000 &&
> +	    WARN_ON(!gf119_disp_dmac_deactivate(chan))) {
> +		return -EBUSY;
> +	}
> +
>  	/* initialise channel for dma command submission */
>  	nvkm_wr32(device, 0x610494 + (ctrl * 0x0010), chan->push);
>  	nvkm_wr32(device, 0x610498 + (ctrl * 0x0010), 0x00010000);