Bjorn Helgaas
2019-Mar-21 22:48 UTC
[Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50
[+cc Rafael]

On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > variant, the BIOS seems to have a very nasty habit of not
> > > > always resetting the secondary Nvidia GPU between full reboots
> > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > reason for this happening is unknown, but the following steps
> > > > and possibly a good bit of patience will reproduce the issue:
> > > >
> > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > 3. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > 4. If nouveau loads up properly, reboot the machine again and go back to
> > > >    step 2 until you reproduce the issue
> > > >
> > > > This results in some very strange behavior: the GPU will quite
> > > > literally be left in exactly the same state it was in when the
> > > > previously booted kernel started the reboot. This has all
> > > > sorts of bad side effects: for starters, this completely breaks
> > > > nouveau starting with a mysterious EVO channel failure that
> > > > happens well before we've actually used the EVO channel for
> > > > anything:

Thanks for the hybrid tutorial (snipped from this response). IIUC,
what you said was that in hybrid mode, the Intel GPU drives the
built-in display and the Nvidia GPU drives any external displays and
may be used for DRI PRIME rendering (whatever that is). But since you
say the Nvidia device gets runtime suspended, I assume there's no
external display here and you're not using DRI PRIME.

I wonder if it's related to the fact that the Nvidia GPU has been
runtime suspended before you do the reboot. Can you try turning off
runtime power management for the GPU by setting the runpm module
parameter to 0? I *think* this would be booting with
"nouveau.runpm=0".

> > > Is there a bug report for this? Bugzilla.kernel.org would be ideal,
> > > including "lspci -vvxxx" and dmidecode for the system.
> > >
> > Not yet, but there has been discussion about this between nouveau
> > developers on our IRC channel.
>
> I lied: yes there actually is a bug report for this, but it's
> currently on the Red Hat bugzilla. I can get more information from
> it if you need (with lenovo's approval of course).

Can you please make a bugzilla.kernel.org entry with as much
information (dmesg, "lspci -vvxxx", dmidecode, etc) as you can get
approval for? You can include the Red Hat bugzilla URL in the commit
log, too, but that's not quite as good because we have no control over
whether it's public.

> And additionally: I've been working with Lenovo on this issue for a
> couple of months now, and we've gone through dozens of different
> trial BIOSes with no success thus far. However, Lenovo is currently
> working on trying to add this workaround into their BIOS but I've
> been told that this change is going to take a decent amount of time
> since they need to test it across multiple operating systems. I'd be
> happy to come back and add a conditional later to turn this
> workaround off for later BIOS versions once Lenovo has released a
> proper fix.

Sounds like Lenovo is going to a lot of trouble for this. The ideal
thing from my point of view would be if they could figure out why this
works on Windows but not on Linux. I doubt Windows has a quirk like
this, so if we could figure out why it works on Windows, we could
likely do something similar in Linux.

> > > > So to do this, we add a new pci quirk using
> > > > DECLARE_PCI_FIXUP_CLASS_FINAL that will be invoked before the PCI probe
> > > > at boot finishes. From there, we check to make sure that this is indeed
> > > > the specific P50 variant of this GPU. We also make sure that the GPU PCI
> > > > device is advertising NoReset- in order to prevent us from trying to
> > > > reset the GPU when the machine is in Dedicated graphics mode (where the
> > > > GPU being initialized by the BIOS is normal and expected). Finally, we
> > > > try mapping the MMIO space for the GPU which should only work if the GPU
> > > > is actually active in D0 mode. We can then read the magic 0x2240c
> > > > register on the GPU, which will have bit 1 set if the GPU's firmware has
> > > > already been posted during a previous boot. Once we've confirmed all of
> > > > this, we reset the PCI device and re-disable it - bringing the GPU back
> > > > into a healthy state.
> > > >
> > > > Signed-off-by: Lyude Paul <lyude at redhat.com>
> > > > Cc: nouveau at lists.freedesktop.org
> > > > Cc: dri-devel at lists.freedesktop.org
> > > > Cc: Karol Herbst <kherbst at redhat.com>
> > > > Cc: Ben Skeggs <skeggsb at gmail.com>
> > > > Cc: stable at vger.kernel.org
> > > > ---
> > > >  drivers/pci/quirks.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 65 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index b0a413f3f7ca..948492fda8bf 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -5117,3 +5117,68 @@ SWITCHTEC_QUIRK(0x8573);  /* PFXI 48XG3 */
> > > >  SWITCHTEC_QUIRK(0x8574);  /* PFXI 64XG3 */
> > > >  SWITCHTEC_QUIRK(0x8575);  /* PFXI 80XG3 */
> > > >  SWITCHTEC_QUIRK(0x8576);  /* PFXI 96XG3 */
> > > > +
> > > > +/*
> > > > + * On certain Lenovo Thinkpad P50 SKUs, specifically those with a Nvidia
> > > > + * Quadro M1000M, the BIOS will occasionally make the mistake of not resetting
> > > > + * the nvidia GPU between reboots if the system is configured to use hybrid
> > > > + * graphics mode. This results in the GPU being left in whatever state it was
> > > > + * in during the previous boot which causes spurious interrupts from the GPU,
> > > > + * which in turn cause us to disable the wrong IRQs and end up breaking the
> > > > + * touchpad. Unsurprisingly, this also completely breaks nouveau.
> > > > + *
> > > > + * Luckily, it seems a simple reset of the PCI device for the nvidia GPU
> > > > + * manages to bring the GPU back into a clean state and fix all of these
> > > > + * issues. Additionally since the GPU will report NoReset+ when the machine is
> > > > + * configured in Dedicated display mode, we don't need to worry about
> > > > + * accidentally resetting the GPU when it's supposed to already be
> > > > + * initialized.
> > > > + */
> > > > +static void
> > > > +quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot(struct pci_dev *pdev)
> > > > +{
> > > > +	void __iomem *map;
> > > > +	int ret;
> > > > +
> > > > +	if (pdev->subsystem_vendor != PCI_VENDOR_ID_LENOVO ||
> > > > +	    pdev->subsystem_device != 0x222e ||
> > > > +	    !pdev->reset_fn)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * If we can't enable the device's mmio space, it's probably not even
> > > > +	 * initialized. This is fine, and means we can just skip the quirk
> > > > +	 * entirely.
> > > > +	 */
> > > > +	if (pci_enable_device_mem(pdev)) {
> > > > +		pci_dbg(pdev, "Can't enable device mem, no reset needed\n");
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	/* Taken from drivers/gpu/drm/nouveau/engine/device/base.c */
> > > > +	map = ioremap(pci_resource_start(pdev, 0), 0x102000);
> > > > +	if (!map) {
> > > > +		pci_err(pdev, "Can't map MMIO space, this is probably very bad\n");
> > > > +		goto out_disable;
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * Be extra careful, and make sure that the GPU firmware is posted
> > > > +	 * before trying a reset
> > > > +	 */
> > > > +	if (ioread32(map + 0x2240c) & 0x2) {
> > > > +		pci_info(pdev,
> > > > +			 FW_BUG "GPU left initialized by EFI, resetting\n");
> > > > +		ret = pci_reset_function(pdev);
> > > > +		if (ret < 0)
> > > > +			pci_err(pdev, "Failed to reset GPU: %d\n", ret);
> > > > +	}
> > > > +
> > > > +	iounmap(map);
> > > > +out_disable:
> > > > +	pci_disable_device(pdev);
> > > > +}
> > > > +
> > > > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, 0x13b1,
> > > > +			      PCI_CLASS_DISPLAY_VGA, 8,
> > > > +			      quirk_lenovo_thinkpad_p50_nvgpu_survives_reboot);
> > > > --
> > > > 2.20.1
> > > >
> --
> Cheers,
> Lyude Paul
>
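The order of checks in the quoted quirk can be modelled outside the kernel. The following is only a user-space sketch of that decision logic, not kernel code: `struct fake_gpu` and `p50_quirk_should_reset()` are hypothetical names invented for illustration, and the boolean fields merely stand in for `pdev->reset_fn`, a successful `pci_enable_device_mem()` and the value returned by `ioread32(map + 0x2240c)`. The subsystem device ID, register offset and bit mask are copied from the patch; 0x17aa is the value of `PCI_VENDOR_ID_LENOVO`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the quoted patch. */
#define P50_SUBSYS_VENDOR_LENOVO 0x17aa   /* PCI_VENDOR_ID_LENOVO */
#define P50_SUBSYS_DEVICE        0x222e
#define NV_POSTED_REG_OFFSET     0x2240c  /* "magic" register read by the quirk */
#define NV_POSTED_BIT            0x2      /* bit 1: firmware already posted */

/* Hypothetical stand-in for the state the quirk inspects on struct pci_dev. */
struct fake_gpu {
	uint16_t subsystem_vendor;
	uint16_t subsystem_device;
	bool     can_reset;     /* models pdev->reset_fn being set */
	bool     mmio_enabled;  /* models pci_enable_device_mem() succeeding */
	uint32_t posted_reg;    /* models ioread32(map + NV_POSTED_REG_OFFSET) */
};

/*
 * Returns true only when every condition the quirk tests is met:
 * the right P50 variant, a resettable function, reachable MMIO,
 * and the "posted" bit still set from the previous boot.
 */
bool p50_quirk_should_reset(const struct fake_gpu *gpu)
{
	if (gpu->subsystem_vendor != P50_SUBSYS_VENDOR_LENOVO ||
	    gpu->subsystem_device != P50_SUBSYS_DEVICE ||
	    !gpu->can_reset)
		return false;

	/* MMIO not reachable: the GPU is probably not initialized at all. */
	if (!gpu->mmio_enabled)
		return false;

	return (gpu->posted_reg & NV_POSTED_BIT) != 0;
}
```

Keeping the register test last mirrors the patch: the read only makes sense once the device has been identified and its memory space enabled.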
Bjorn Helgaas
2019-Mar-22 11:30 UTC
[Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50
On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > reason for this happening is unknown, but the following steps
> > > > > and possibly a good bit of patience will reproduce the issue:
> > > > >
> > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > 3. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > > 4. If nouveau loads up properly, reboot the machine again and go back to
> > > > >    step 2 until you reproduce the issue
> > > > >
> > > > > This results in some very strange behavior: the GPU will quite
> > > > > literally be left in exactly the same state it was in when the
> > > > > previously booted kernel started the reboot. This has all
> > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > happens well before we've actually used the EVO channel for
> > > > > anything:
>
> Thanks for the hybrid tutorial (snipped from this response). IIUC,
> what you said was that in hybrid mode, the Intel GPU drives the
> built-in display and the Nvidia GPU drives any external displays and
> may be used for DRI PRIME rendering (whatever that is). But since you
> say the Nvidia device gets runtime suspended, I assume there's no
> external display here and you're not using DRI PRIME.
>
> I wonder if it's related to the fact that the Nvidia GPU has been
> runtime suspended before you do the reboot. Can you try turning off
> runtime power management for the GPU by setting the runpm module
> parameter to 0? I *think* this would be booting with
> "nouveau.runpm=0".

Sorry, I wasn't really thinking here. You already *said* this is
related to runtime suspend. It only happens when the Nvidia GPU has
been suspended.

I don't know that much about suspend, but ISTR seeing comments about
resuming devices before we shutdown. If we do that, maybe there's
some kind of race between that resume and the reboot?

Bjorn
Lyude Paul
2019-Mar-22 23:50 UTC
[Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50
Note: I did read your response lower down in the thread, but I wanted to make
sure I addressed one of the comments here (see below)

On Thu, 2019-03-21 at 17:48 -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
>
> On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > reason for this happening is unknown, but the following steps
> > > > > and possibly a good bit of patience will reproduce the issue:
> > > > >
> > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > 3. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > > 4. If nouveau loads up properly, reboot the machine again and go back to
> > > > >    step 2 until you reproduce the issue
> > > > >
> > > > > This results in some very strange behavior: the GPU will quite
> > > > > literally be left in exactly the same state it was in when the
> > > > > previously booted kernel started the reboot. This has all
> > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > happens well before we've actually used the EVO channel for
> > > > > anything:
>
> Thanks for the hybrid tutorial (snipped from this response). IIUC,
> what you said was that in hybrid mode, the Intel GPU drives the
> built-in display and the Nvidia GPU drives any external displays and
> may be used for DRI PRIME rendering (whatever that is). But since you
> say the Nvidia device gets runtime suspended, I assume there's no
> external display here and you're not using DRI PRIME.
>
> I wonder if it's related to the fact that the Nvidia GPU has been
> runtime suspended before you do the reboot. Can you try turning off
> runtime power management for the GPU by setting the runpm module
> parameter to 0? I *think* this would be booting with
> "nouveau.runpm=0".
>
> > > > Is there a bug report for this? Bugzilla.kernel.org would be ideal,
> > > > including "lspci -vvxxx" and dmidecode for the system.
> > > >
> > > Not yet, but there has been discussion about this between nouveau
> > > developers on our IRC channel.
> >
> > I lied: yes there actually is a bug report for this, but it's
> > currently on the Red Hat bugzilla. I can get more information from
> > it if you need (with lenovo's approval of course).
>
> Can you please make a bugzilla.kernel.org entry with as much
> information (dmesg, "lspci -vvxxx", dmidecode, etc) as you can get
> approval for? You can include the Red Hat bugzilla URL in the commit
> log, too, but that's not quite as good because we have no control over
> whether it's public.
>
> > And additionally: I've been working with Lenovo on this issue for a
> > couple of months now, and we've gone through dozens of different
> > trial BIOSes with no success thus far. However, Lenovo is currently
> > working on trying to add this workaround into their BIOS but I've
> > been told that this change is going to take a decent amount of time
> > since they need to test it across multiple operating systems. I'd be
> > happy to come back and add a conditional later to turn this
> > workaround off for later BIOS versions once Lenovo has released a
> > proper fix.
>
> Sounds like Lenovo is going to a lot of trouble for this. The ideal
> thing from my point of view would be if they could figure out why this
> works on Windows but not on Linux. I doubt Windows has a quirk like
> this, so if we could figure out why it works on Windows, we could
> likely do something similar in Linux.

I did actually try this route after first finding this bug, but unfortunately
from what I understand there isn't really much more Lenovo can do other than
give us a patched BIOS or look at their own BIOS to see if it's the cause.

Anyway, went ahead and filed a bug with as much information as I could get my
hands on here (different email than the one I'm talking to you from):

https://bugzilla.kernel.org/show_bug.cgi?id=203003

--
Cheers,
Lyude Paul
Lyude Paul
2019-Apr-03 17:27 UTC
[Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50
Hi, any update on this/do you guys need any more information here? Would very
much like to get this upstream

On Fri, 2019-03-22 at 06:30 -0500, Bjorn Helgaas wrote:
> On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > > reason for this happening is unknown, but the following steps
> > > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > >
> > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > > 3. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > > > 4. If nouveau loads up properly, reboot the machine again and go back to
> > > > > >    step 2 until you reproduce the issue
> > > > > >
> > > > > > This results in some very strange behavior: the GPU will quite
> > > > > > literally be left in exactly the same state it was in when the
> > > > > > previously booted kernel started the reboot. This has all
> > > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > > happens well before we've actually used the EVO channel for
> > > > > > anything:
> >
> > Thanks for the hybrid tutorial (snipped from this response). IIUC,
> > what you said was that in hybrid mode, the Intel GPU drives the
> > built-in display and the Nvidia GPU drives any external displays and
> > may be used for DRI PRIME rendering (whatever that is). But since you
> > say the Nvidia device gets runtime suspended, I assume there's no
> > external display here and you're not using DRI PRIME.
> >
> > I wonder if it's related to the fact that the Nvidia GPU has been
> > runtime suspended before you do the reboot. Can you try turning off
> > runtime power management for the GPU by setting the runpm module
> > parameter to 0? I *think* this would be booting with
> > "nouveau.runpm=0".
>
> Sorry, I wasn't really thinking here. You already *said* this is
> related to runtime suspend. It only happens when the Nvidia GPU has
> been suspended.
>
> I don't know that much about suspend, but ISTR seeing comments about
> resuming devices before we shutdown. If we do that, maybe there's
> some kind of race between that resume and the reboot?
>
> Bjorn

--
Cheers,
Lyude Paul
Bjorn Helgaas
2019-Apr-04 14:17 UTC
[Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50
[+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on
(runtime) resume")]

On Fri, Mar 22, 2019 at 06:30:15AM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > > reason for this happening is unknown, but the following steps
> > > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > >
> > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > 2. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > > 3. Reboot the machine, the more sudden the better (e.g. sysrq-b may help)
> > > > > > 4. If nouveau loads up properly, reboot the machine again and go back to
> > > > > >    step 2 until you reproduce the issue
> > > > > >
> > > > > > This results in some very strange behavior: the GPU will quite
> > > > > > literally be left in exactly the same state it was in when the
> > > > > > previously booted kernel started the reboot. This has all
> > > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > > happens well before we've actually used the EVO channel for
> > > > > > anything:
> >
> > Thanks for the hybrid tutorial (snipped from this response). IIUC,
> > what you said was that in hybrid mode, the Intel GPU drives the
> > built-in display and the Nvidia GPU drives any external displays and
> > may be used for DRI PRIME rendering (whatever that is). But since you
> > say the Nvidia device gets runtime suspended, I assume there's no
> > external display here and you're not using DRI PRIME.
> >
> > I wonder if it's related to the fact that the Nvidia GPU has been
> > runtime suspended before you do the reboot. Can you try turning off
> > runtime power management for the GPU by setting the runpm module
> > parameter to 0? I *think* this would be booting with
> > "nouveau.runpm=0".
>
> Sorry, I wasn't really thinking here. You already *said* this is
> related to runtime suspend. It only happens when the Nvidia GPU has
> been suspended.
>
> I don't know that much about suspend, but ISTR seeing comments about
> resuming devices before we shutdown. If we do that, maybe there's
> some kind of race between that resume and the reboot?

I think we do in fact resume PCI devices before shutdown. Here's the
path I'm looking at:

  device_shutdown
    pm_runtime_get_noresume
    pm_runtime_barrier
    dev->bus->shutdown
      pci_device_shutdown
        pm_runtime_resume
          __pm_runtime_resume(dev, 0)
            rpm_resume(dev, 0)
              __update_runtime_status(dev, RPM_RESUMING)
              callback = RPM_GET_CALLBACK(dev, runtime_resume)
              rpm_callback(callback, dev)
                __rpm_callback
                  pci_pm_runtime_resume
                    drv->pm->runtime_resume
                      nouveau_pmops_runtime_resume
                        nouveau_do_resume
                        schedule_work(hpd_work)  # <---
                        ...
                        nouveau_display_hpd_work
                          pm_runtime_get_sync
                          drm_helper_hpd_irq_event
                          pm_runtime_mark_last_busy
                          pm_runtime_put_sync

I'm curious about that "schedule_work(hpd_work)" near the end because
no other drivers seem to use schedule_work() in the runtime_resume
path, and I don't know how that synchronizes with the shutdown
process. I don't see anything that waits for
nouveau_display_hpd_work() to complete, so it seems like something
that could be a race.

I wonder if this problem would be easier to reproduce if you added a
sleep in nouveau_display_hpd_work() as in the first hunk below, and I
wonder if the problem would then go away if you stopped scheduling
hpd_work as in the second hunk? Obviously the second hunk isn't a
solution, it's just an attempt to figure out if I'm looking in the
right area.

Bjorn


diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index 55c0fa451163..e50806012d41 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -350,6 +350,7 @@ nouveau_display_hpd_work(struct work_struct *work)

 	pm_runtime_get_sync(drm->dev->dev);

+	msleep(2000);
 	drm_helper_hpd_irq_event(drm->dev);

 	pm_runtime_mark_last_busy(drm->dev->dev);
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 5020265bfbd9..48da72caa017 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -946,9 +946,6 @@ nouveau_pmops_runtime_resume(struct device *dev)
 	nvif_mask(&device->object, 0x088488, (1 << 25), (1 << 25));
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_ON;

-	/* Monitors may have been connected / disconnected during suspend */
-	schedule_work(&nouveau_drm(drm_dev)->hpd_work);
-
 	return ret;
 }
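The concern about work scheduled from the resume path, with nothing waiting for it before shutdown continues, can be illustrated with a toy model. This is a single-threaded user-space sketch under stated assumptions, not a description of what nouveau or the kernel workqueue actually does: every `fake_*` name is hypothetical, the "workqueue" is just an array of deferred callbacks, and "power off" is a boolean flag. It shows only the general hazard: deferred work that is not flushed before shutdown is still pending when the device goes away.

```c
#include <stdbool.h>
#include <stddef.h>

#define FAKE_MAX_PENDING 8

typedef void (*fake_work_fn)(void *ctx);

/* Hypothetical deferred-work queue: items run only when flushed. */
struct fake_workqueue {
	fake_work_fn fn[FAKE_MAX_PENDING];
	void        *ctx[FAKE_MAX_PENDING];
	size_t       count;
};

/* Hypothetical device state, counting where the deferred work ran. */
struct fake_gpu_state {
	bool powered;
	int  runs_while_powered;
	int  runs_while_off;
};

/* Queue a callback for later; returns false if the queue is full. */
bool fake_schedule_work(struct fake_workqueue *wq, fake_work_fn fn, void *ctx)
{
	if (wq->count == FAKE_MAX_PENDING)
		return false;
	wq->fn[wq->count] = fn;
	wq->ctx[wq->count] = ctx;
	wq->count++;
	return true;
}

/* Run everything that is pending, in submission order. */
void fake_flush(struct fake_workqueue *wq)
{
	for (size_t i = 0; i < wq->count; i++)
		wq->fn[i](wq->ctx[i]);
	wq->count = 0;
}

/* Stand-in for the hotplug work: records whether the device was on. */
void fake_hpd_work(void *ctx)
{
	struct fake_gpu_state *gpu = ctx;

	if (gpu->powered)
		gpu->runs_while_powered++;
	else
		gpu->runs_while_off++;
}

/*
 * Power the device off, optionally flushing first. Returns how many
 * work items were still pending at the moment power was cut.
 */
size_t fake_shutdown(struct fake_workqueue *wq, struct fake_gpu_state *gpu,
		     bool flush_first)
{
	size_t pending;

	if (flush_first)
		fake_flush(wq);
	pending = wq->count;
	gpu->powered = false;
	return pending;
}
```

Calling `fake_shutdown()` with `flush_first` set to false after one `fake_schedule_work()` leaves one item pending, and a later `fake_flush()` then runs it against a powered-off device; with `flush_first` set to true the item runs while the device is still on.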