Mika Westerberg
2019-Nov-20 15:53 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> <mika.westerberg at intel.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > <mika.westerberg at intel.com> wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > overall, what I really want to know is, _why_ does it work on windows?
> > > >
> > > > So do I ;-)
> > > >
> > > > > Or what are we doing differently on Linux so that it doesn't work? If
> > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > on this level, this would probably allow us to get closer to the root
> > > > > cause? no?
> > > >
> > > > Have you tried to use the acpi_rev_override parameter in your system and
> > > > does it have any effect?
> > > >
> > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > should hopefully reveal something.
> > > >
> > >
> > > I think I did in the past and it seemed to have worked, there is just
> > > one big issue with this: it's a Dell specific workaround afaik, and
> > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > laptops as well, and I've heard about users having the same issues on
> > > Asus and MSI laptops as well.
> >
> > Maybe it is not a workaround at all but instead it simply determines
> > whether the system supports RTD3 or something like that (IIRC Windows 8
> > started supporting it). Maybe Dell added check for Linux because at that
> > time Linux did not support it.
> >
>
> the point is, it's not checking it by default, so by default you still
> run into the windows 8 codepath.

Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
path by default. There are a bunch of similar entries for Dell machines.

Of course this does not help the non-Dell users so we would still need
to figure out the root cause.

> > In case RTD3 is supported it invokes LKDS() which probably does the L2
> > or L3 entry and this is for some reason does not work the same way in
> > Linux than it does with Windows 8+.
> >
> > I don't remember if this happens only with nouveau or with the
> > proprietary driver as well but looking at the nouveau runtime PM suspend
> > hook (assuming I'm looking at the correct code):
> >
> > static int
> > nouveau_pmops_runtime_suspend(struct device *dev)
> > {
> > 	struct pci_dev *pdev = to_pci_dev(dev);
> > 	struct drm_device *drm_dev = pci_get_drvdata(pdev);
> > 	int ret;
> >
> > 	if (!nouveau_pmops_runtime()) {
> > 		pm_runtime_forbid(dev);
> > 		return -EBUSY;
> > 	}
> >
> > 	nouveau_switcheroo_optimus_dsm();
> > 	ret = nouveau_do_suspend(drm_dev, true);
> > 	pci_save_state(pdev);
> > 	pci_disable_device(pdev);
> > 	pci_ignore_hotplug(pdev);
> > 	pci_set_power_state(pdev, PCI_D3cold);
> > 	drm_dev->switch_power_state = DRM_SWITCH_POWER_DYNAMIC_OFF;
> > 	return ret;
> > }
> >
> > Normally PCI drivers leave the PCI bus PM things to PCI core but here
> > the driver does these. So I wonder if it makes any difference if we let
> > the core handle all that:
> >
> > static int
> > nouveau_pmops_runtime_suspend(struct device *dev)
> > {
> > 	struct pci_dev *pdev = to_pci_dev(dev);
> > 	struct drm_device *drm_dev = pci_get_drvdata(pdev);
> > 	int ret;
> >
> > 	if (!nouveau_pmops_runtime()) {
> > 		pm_runtime_forbid(dev);
> > 		return -EBUSY;
> > 	}
> >
> > 	nouveau_switcheroo_optimus_dsm();
> > 	ret = nouveau_do_suspend(drm_dev, true);
> > 	pci_ignore_hotplug(pdev);
> > 	drm_dev->switch_power_state = DRM_SWITCH_POWER_DYNAMIC_OFF;
> > 	return ret;
> > }
> >
> > and similar for the nouveau_pmops_runtime_resume().
> >
>
> yeah, I tried that at some point and it didn't help either. The reason
> we call those from inside Nouveau is to support systems pre _PR where
> nouveau invokes custom _DSM calls on its own. We could potentially
> check for that though.

OK.
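For readers unfamiliar with DMI quirk tables such as the acpi_rev_dmi_table[] mentioned above, the following is a minimal userspace sketch of the idea, not kernel code. The struct, the function name needs_rev_override(), and the product string "Example Model A" are all hypothetical and chosen only for illustration; the substring comparison mirrors how the kernel's DMI_MATCH() compares strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of a DMI quirk table; not the kernel API. */
struct quirk_entry {
	const char *sys_vendor;   /* compared as a substring of the firmware string */
	const char *product_name; /* compared as a substring of the firmware string */
};

/* Illustrative entry only; the product name is made up for this sketch. */
static const struct quirk_entry quirk_table[] = {
	{ "Dell Inc.", "Example Model A" },
};

/* Returns true if the machine matches any entry, i.e. the quirk applies. */
static bool needs_rev_override(const char *sys_vendor, const char *product_name)
{
	size_t i;

	assert(sys_vendor != NULL && product_name != NULL);

	for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
		if (strstr(sys_vendor, quirk_table[i].sys_vendor) &&
		    strstr(product_name, quirk_table[i].product_name))
			return true;
	}
	return false;
}
```

A table like this only covers machines that are listed explicitly, which is why it cannot help the HP, Lenovo, Asus and MSI users mentioned in the thread.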
Mika Westerberg
2019-Nov-20 16:23 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Nov 20, 2019 at 05:53:07PM +0200, Mika Westerberg wrote:
> On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> > <mika.westerberg at intel.com> wrote:
> > >
> > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > > <mika.westerberg at intel.com> wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > overall, what I really want to know is, _why_ does it work on windows?
> > > > >
> > > > > So do I ;-)
> > > > >
> > > > > > Or what are we doing differently on Linux so that it doesn't work? If
> > > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > > on this level, this would probably allow us to get closer to the root
> > > > > > cause? no?
> > > > >
> > > > > Have you tried to use the acpi_rev_override parameter in your system and
> > > > > does it have any effect?
> > > > >
> > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > > should hopefully reveal something.
> > > > >
> > > >
> > > > I think I did in the past and it seemed to have worked, there is just
> > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > laptops as well, and I've heard about users having the same issues on
> > > > Asus and MSI laptops as well.
> > >
> > > Maybe it is not a workaround at all but instead it simply determines
> > > whether the system supports RTD3 or something like that (IIRC Windows 8
> > > started supporting it). Maybe Dell added check for Linux because at that
> > > time Linux did not support it.
> > >
> >
> > the point is, it's not checking it by default, so by default you still
> > run into the windows 8 codepath.
>
> Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> path by default. There are a bunch of similar entries for Dell machines.
>
> Of course this does not help the non-Dell users so we would still need
> to figure out the root cause.

I think I asked you to test the PCIe delay patch and it did not help but
I wonder if it helps if we increase the delay. As an experiment could
you try Bjorn's pci/pm branch. The last two commits are for the delay.

If you could pull that branch and apply the following patch on top and
give it a try? Then post the dmesg somewhere so we can see whether it
did the delay at all.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 1f319b1175da..1ad6f1372ed5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4697,12 +4697,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 		return;
 	}
 
-	/* Take d3cold_delay requirements into account */
-	delay = pci_bus_max_d3cold_delay(dev->subordinate);
-	if (!delay) {
-		up_read(&pci_bus_sem);
-		return;
-	}
+	delay = 500;
 
 	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
 				 bus_list);
@@ -4715,7 +4710,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	 * management for them (see pci_bridge_d3_possible()).
 	 */
 	if (!pci_is_pcie(dev)) {
-		pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
+		pci_info(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
 		msleep(1000 + delay);
 		return;
 	}
@@ -4741,10 +4736,10 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 		return;
 
 	if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
-		pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
+		pci_info(dev, "waiting %d ms for downstream link\n", delay);
 		msleep(delay);
 	} else {
-		pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
+		pci_info(dev, "waiting %d ms for downstream link, after activation\n",
 			delay);
 		if (!pcie_wait_for_link_delay(dev, true, delay)) {
 			/* Did not train, no need to wait any further */
@@ -4753,7 +4748,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
 	}
 
 	if (!pci_device_is_present(child)) {
-		pci_dbg(child, "waiting additional %d ms to become accessible\n", delay);
+		pci_info(child, "waiting additional %d ms to become accessible\n", delay);
 		msleep(delay);
 	}
 }
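To make the effect of the experimental patch easier to reason about, here is a small userspace model of the branches visible in the diff above. It is only a sketch: struct bridge_model, its fields, and secondary_bus_wait_ms() are hypothetical names, the model covers only the code shown in the hunks (the early returns outside them are not represented), and the time spent waiting for the link to become active is not counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a bridge; field names are illustrative only. */
struct bridge_model {
	bool is_pcie;       /* false for conventional PCI / PCI-X bridges */
	bool fast_link;     /* true if the link supports more than 5 GT/s */
	bool link_trains;   /* whether the downstream link becomes active */
	bool child_present; /* whether the child responds after the first wait */
};

/*
 * Returns the total fixed sleep in milliseconds that the patched logic in
 * the diff would perform for a given delay (500 in the experiment).
 */
static int secondary_bus_wait_ms(const struct bridge_model *b, int delay_ms)
{
	int total;

	assert(b != NULL && delay_ms >= 0);

	/* Non-PCIe bridges always wait 1000 ms plus the delay. */
	if (!b->is_pcie)
		return 1000 + delay_ms;

	/* Fast links wait only after activation; no training, no further wait. */
	if (b->fast_link && !b->link_trains)
		return 0;

	total = delay_ms;

	/* One more delay if the child is still not accessible. */
	if (!b->child_present)
		total += delay_ms;

	return total;
}
```

With the hard-coded delay of 500 ms, a PCIe port whose child is not immediately accessible would sleep 1000 ms in this model, and each step should show up in dmesg because the patch turns the pci_dbg() calls into pci_info().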
Karol Herbst
2019-Nov-20 21:36 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
with the branch and patch applied:
https://gist.githubusercontent.com/karolherbst/03c4c8141b0fa292d781badfa186479e/raw/5c62640afbc57d6e69ea924c338bd2836e770d02/gistfile1.txt

On Wed, Nov 20, 2019 at 5:23 PM Mika Westerberg
<mika.westerberg at intel.com> wrote:
>
> On Wed, Nov 20, 2019 at 05:53:07PM +0200, Mika Westerberg wrote:
> > On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> > > <mika.westerberg at intel.com> wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > > > <mika.westerberg at intel.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > > overall, what I really want to know is, _why_ does it work on windows?
> > > > > >
> > > > > > So do I ;-)
> > > > > >
> > > > > > > Or what are we doing differently on Linux so that it doesn't work? If
> > > > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > > > on this level, this would probably allow us to get closer to the root
> > > > > > > cause? no?
> > > > > >
> > > > > > Have you tried to use the acpi_rev_override parameter in your system and
> > > > > > does it have any effect?
> > > > > >
> > > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > > > should hopefully reveal something.
> > > > > >
> > > > >
> > > > > I think I did in the past and it seemed to have worked, there is just
> > > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > > laptops as well, and I've heard about users having the same issues on
> > > > > Asus and MSI laptops as well.
> > > >
> > > > Maybe it is not a workaround at all but instead it simply determines
> > > > whether the system supports RTD3 or something like that (IIRC Windows 8
> > > > started supporting it). Maybe Dell added check for Linux because at that
> > > > time Linux did not support it.
> > > >
> > >
> > > the point is, it's not checking it by default, so by default you still
> > > run into the windows 8 codepath.
> >
> > Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> > path by default. There are a bunch of similar entries for Dell machines.
> >
> > Of course this does not help the non-Dell users so we would still need
> > to figure out the root cause.
>
> I think I asked you to test the PCIe delay patch and it did not help but
> I wonder if it helps if we increase the delay. As an experiment could
> you try Bjorn's pci/pm branch. The last two commits are for the delay.
>
> If you could pull that branch and apply the following patch on top and
> give it a try? Then post the dmesg somewhere so we can see whether it
> did the delay at all.
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 1f319b1175da..1ad6f1372ed5 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4697,12 +4697,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  		return;
>  	}
>
> -	/* Take d3cold_delay requirements into account */
> -	delay = pci_bus_max_d3cold_delay(dev->subordinate);
> -	if (!delay) {
> -		up_read(&pci_bus_sem);
> -		return;
> -	}
> +	delay = 500;
>
>  	child = list_first_entry(&dev->subordinate->devices, struct pci_dev,
>  				 bus_list);
> @@ -4715,7 +4710,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	 * management for them (see pci_bridge_d3_possible()).
>  	 */
>  	if (!pci_is_pcie(dev)) {
> -		pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
> +		pci_info(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
>  		msleep(1000 + delay);
>  		return;
>  	}
> @@ -4741,10 +4736,10 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  		return;
>
>  	if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
> -		pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
> +		pci_info(dev, "waiting %d ms for downstream link\n", delay);
>  		msleep(delay);
>  	} else {
> -		pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
> +		pci_info(dev, "waiting %d ms for downstream link, after activation\n",
>  			delay);
>  		if (!pcie_wait_for_link_delay(dev, true, delay)) {
>  			/* Did not train, no need to wait any further */
> @@ -4753,7 +4748,7 @@ void pci_bridge_wait_for_secondary_bus(struct pci_dev *dev)
>  	}
>
>  	if (!pci_device_is_present(child)) {
> -		pci_dbg(child, "waiting additional %d ms to become accessible\n", delay);
> +		pci_info(child, "waiting additional %d ms to become accessible\n", delay);
>  		msleep(delay);
>  	}
>  }
>
Rafael J. Wysocki
2019-Nov-20 21:37 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Nov 20, 2019 at 4:53 PM Mika Westerberg
<mika.westerberg at intel.com> wrote:
>
> On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> > <mika.westerberg at intel.com> wrote:
> > >
> > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > > <mika.westerberg at intel.com> wrote:
> > > > >
> > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > overall, what I really want to know is, _why_ does it work on windows?
> > > > >
> > > > > So do I ;-)
> > > > >
> > > > > > Or what are we doing differently on Linux so that it doesn't work? If
> > > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > > on this level, this would probably allow us to get closer to the root
> > > > > > cause? no?
> > > > >
> > > > > Have you tried to use the acpi_rev_override parameter in your system and
> > > > > does it have any effect?
> > > > >
> > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > > should hopefully reveal something.
> > > > >
> > > >
> > > > I think I did in the past and it seemed to have worked, there is just
> > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > laptops as well, and I've heard about users having the same issues on
> > > > Asus and MSI laptops as well.
> > >
> > > Maybe it is not a workaround at all but instead it simply determines
> > > whether the system supports RTD3 or something like that (IIRC Windows 8
> > > started supporting it). Maybe Dell added check for Linux because at that
> > > time Linux did not support it.
> > >
> >
> > the point is, it's not checking it by default, so by default you still
> > run into the windows 8 codepath.
>
> Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> path by default. There are a bunch of similar entries for Dell machines.

OK, so the "Linux path" works and the other doesn't.

I thought that this was the other way around, sorry for the confusion.

> Of course this does not help the non-Dell users so we would still need
> to figure out the root cause.

Right.

Whatever it is, though, AML appears to be involved in it and AFAICS
there's no evidence that it affects any root ports that are not
populated with NVidia GPUs.

Now, one thing is still not clear to me from the discussion so far: is
the _PR3 method you mentioned defined under the GPU device object or
under the port device object?
Karol Herbst
2019-Nov-20 21:40 UTC
[Nouveau] [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Wed, Nov 20, 2019 at 10:37 PM Rafael J. Wysocki <rafael at kernel.org> wrote:
>
> On Wed, Nov 20, 2019 at 4:53 PM Mika Westerberg
> <mika.westerberg at intel.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 04:37:14PM +0100, Karol Herbst wrote:
> > > On Wed, Nov 20, 2019 at 4:15 PM Mika Westerberg
> > > <mika.westerberg at intel.com> wrote:
> > > >
> > > > On Wed, Nov 20, 2019 at 01:11:52PM +0100, Karol Herbst wrote:
> > > > > On Wed, Nov 20, 2019 at 1:09 PM Mika Westerberg
> > > > > <mika.westerberg at intel.com> wrote:
> > > > > >
> > > > > > On Wed, Nov 20, 2019 at 12:58:00PM +0100, Karol Herbst wrote:
> > > > > > > overall, what I really want to know is, _why_ does it work on windows?
> > > > > >
> > > > > > So do I ;-)
> > > > > >
> > > > > > > Or what are we doing differently on Linux so that it doesn't work? If
> > > > > > > anybody has any idea on how we could dig into this and figure it out
> > > > > > > on this level, this would probably allow us to get closer to the root
> > > > > > > cause? no?
> > > > > >
> > > > > > Have you tried to use the acpi_rev_override parameter in your system and
> > > > > > does it have any effect?
> > > > > >
> > > > > > Also did you try to trace the ACPI _ON/_OFF() methods? I think that
> > > > > > should hopefully reveal something.
> > > > > >
> > > > >
> > > > > I think I did in the past and it seemed to have worked, there is just
> > > > > one big issue with this: it's a Dell specific workaround afaik, and
> > > > > this issue plagues not just Dell, but we've seen it on HP and Lenovo
> > > > > laptops as well, and I've heard about users having the same issues on
> > > > > Asus and MSI laptops as well.
> > > >
> > > > Maybe it is not a workaround at all but instead it simply determines
> > > > whether the system supports RTD3 or something like that (IIRC Windows 8
> > > > started supporting it). Maybe Dell added check for Linux because at that
> > > > time Linux did not support it.
> > > >
> > >
> > > the point is, it's not checking it by default, so by default you still
> > > run into the windows 8 codepath.
> >
> > Well you can add the quirk to acpi_rev_dmi_table[] so it goes to that
> > path by default. There are a bunch of similar entries for Dell machines.
>
> OK, so the "Linux path" works and the other doesn't.
>
> I thought that this was the other way around, sorry for the confusion.
>
> > Of course this does not help the non-Dell users so we would still need
> > to figure out the root cause.
>
> Right.
>
> Whatever it is, though, AML appears to be involved in it and AFAICS
> there's no evidence that it affects any root ports that are not
> populated with NVidia GPUs.
>

last week or so I found systems where the GPU was under the "PCI
Express Root Port" (name from lspci) and on those systems all of that
seems to work. So I am wondering if it's indeed just the 0x1901 one,
which also explains Mika's case that Thunderbolt stuff works as devices
never get populated under this particular bridge controller, but under
those "Root Port"s

> Now, one thing is still not clear to me from the discussion so far: is
> the _PR3 method you mentioned defined under the GPU device object or
> under the port device object?
>
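The hypothesis above, that only an NVIDIA GPU sitting behind the Intel bridge with device ID 0x1901 is affected, can be written down as a simple predicate. This is a userspace sketch for illustration, not the patch under discussion: struct pci_id and suspect_combination() are hypothetical names, and only the 0x1901 device ID comes from the thread (0x8086 and 0x10de are the standard PCI vendor IDs for Intel and NVIDIA).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_VENDOR_INTEL        0x8086u /* standard PCI vendor ID for Intel */
#define PCI_VENDOR_NVIDIA       0x10deu /* standard PCI vendor ID for NVIDIA */
#define SUSPECT_BRIDGE_DEVICE   0x1901u /* bridge device ID named in the thread */

/* Hypothetical vendor/device pair, as shown by "lspci -nn". */
struct pci_id {
	uint16_t vendor;
	uint16_t device;
};

/* True only for an NVIDIA device directly below the suspect Intel bridge. */
static bool suspect_combination(const struct pci_id *bridge,
				const struct pci_id *gpu)
{
	assert(bridge != NULL && gpu != NULL);

	return bridge->vendor == PCI_VENDOR_INTEL &&
	       bridge->device == SUSPECT_BRIDGE_DEVICE &&
	       gpu->vendor == PCI_VENDOR_NVIDIA;
}
```

A predicate this narrow would leave other root ports, including the ones Thunderbolt devices appear under, untouched, which matches the observation that those work.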