Michael S. Tsirkin
2018-Nov-28 20:30 UTC
[Nouveau] 4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
On Wed, Nov 28, 2018 at 05:55:44PM +0200, Mika Westerberg wrote:
> On Wed, Nov 28, 2018 at 10:09:22AM -0500, Michael S. Tsirkin wrote:
> > Yea all this is weird, in particular I wonder why does everyone
> > using dsm insists on saying Arg4
> > when they actually mean Arg3. ACPI numbers arguments from 0.
> >
> > So it's a bit ugly, and maybe worth fixing but unlikely to be
> > an actual issue simply because we end up not using DSM in the end.
>
> I agree.
>
> > Poking at the probing code in nouveau_pr3_present, I started to wonder:
> > should I try to hack it to disable d3cold and pr3 and see what
> > happens?
>
> I guess it is worth a try. You can do it from sysfs for the graphics
> PCI device there is an attribute d3cold_allowed that controls this.
>
> [snip]

But probably too late by the time nouveau is up at boot?

> > > > 00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
> > > >
> > > > so really shouldn't be affected, but go figure. If driver really is getting
> > > > all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
> > > > maybe ...
> > >
> > > Or it the power resource is shared by wifi as well.
> >
> > Is there a way to find out through e.g. sysfs?
>
> It is not shared, I checked from the acpidump you provided. Possibly the
> infinite loop in AML when executing NVPO method have some effect on
> this.
>
> [snip]
>
> > > No need to send, I can read it from the bugzilla just fine. Can you attach
> > > acpidump there as well?
> >
> > Done. lspci -x too just in case.
>
> Looking at the dmesg:
>
> [ 52.917009] No Local Variables are initialized for Method [NVPO]
> [ 52.917011] No Arguments are initialized for method [NVPO]
> [ 52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [ 52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [ 52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
>
> So what happens here is that Linux turns off power resource
> \_SB.PCI0.PEG0.PG00 by calling its _OFF method (happens when the root
> port is runtime suspended). This ends up calling \_SB.PCI0.PGON which
> calls \_SB.PCI0.PEG0.PEGP.NVPO.
>
> The last method looks like this:
>
>     Method (NVPO, 0, NotSerialized)
>     {
>         While ((\_SB.PCI0.P0LS < 0x03))
>         {
>             Sleep (One)
>         }
>
> So basically it polls P0LS register infinitely if the returned value is
> less than 3. I suspect this is the issue and it then makes the other
> like wifi to fail to execute its methods.
>
> P0LS comes from this operation region:
>
>     OperationRegion (OPG0, SystemMemory, (XBAS + 0x8000), 0x1000)
>     Field (OPG0, AnyAcc, NoLock, Preserve)
>     {
>         ...
>         Offset (0x216),
>         P0LS, 4,
>
> This is some host bridge register but not sure which because XBAS value
> cannot be determined from the acpidump.

Oh I think XBAS is in SSDT4:

    OperationRegion (SANV, SystemMemory, 0x4FBF7018, 0x01F4)
    Field (SANV, AnyAcc, Lock, Preserve)
    {
        ASLB, 32, IMON, 8, IGDS, 8, IBTT, 8,
        IPAT, 8, IPSC, 8, IBIA, 8, ISSC, 8,
        IDMS, 8, IF1E, 8, HVCO, 8, GSMI, 8,
        PAVP, 8, CADL, 8, CSTE, 16, NSTE, 16,
        NDID, 8, DID1, 32, DID2, 32, DID3, 32,
        DID4, 32, DID5, 32, DID6, 32, DID7, 32,
        DID8, 32, DID9, 32, DIDA, 32, DIDB, 32,
        DIDC, 32, DIDD, 32, DIDE, 32, DIDF, 32,
        DIDX, 32, NXD1, 32, NXD2, 32, NXD3, 32,
        NXD4, 32, NXD5, 32, NXD6, 32, NXD7, 32,
        NXD8, 32, NXDX, 32, LIDS, 8, KSV0, 32,
        KSV1, 8, BRTL, 8, ALSE, 8, ALAF, 8,
        LLOW, 8, LHIH, 8, ALFP, 8, IPTP, 8,
        EDPV, 8, SGMD, 8, SGFL, 8, SGGP, 8,
        HRE0, 8, HRG0, 32, HRA0, 8, PWE0, 8,
        PWG0, 32, PWA0, 8, P1GP, 8, HRE1, 8,
        HRG1, 32, HRA1, 8, PWE1, 8, PWG1, 32,
        PWA1, 8, P2GP, 8, HRE2, 8, HRG2, 32,
        HRA2, 8, PWE2, 8, PWG2, 32, PWA2, 8,
        DLPW, 16, DLHR, 16, EECP, 8, XBAS, 32,
        GBAS, 16, NVGA, 32, NVHA, 32, AMDA, 32,
        LTRX, 8, OBFX, 8, LTRY, 8, OBFY, 8,
        LTRZ, 8, OBFZ, 8, LTRW, 8, OBFA, 8,
        SMSL, 16, SNSL, 16, P0UB, 8, P1UB, 8,
        P2UB, 8, P3UB, 8, PCSL, 8, PBGE, 8,
        M64B, 64, M64L, 64, CPEX, 32, EEC1, 8,
        EEC2, 8, SBN0, 8, SBN1, 8, SBN2, 8,
        M32B, 32, M32L, 32, P0WK, 32, P1WK, 32,
        P2WK, 32, VTDS, 8, VTB1, 32, VTB2, 32,
        VTB3, 32, VE1V, 16, VE2V, 16, SBN3, 8,
        P3GP, 8, HRE3, 8, HRG3, 32, HRA3, 8,
        PWE3, 8, PWG3, 32, PWA3, 8, P3WK, 32,
        EEC3, 8, RPIN, 8, RPBA, 32,
        Offset (0x1F4)
    }

If my math is correct, this is offset 1456 bits, i.e. 0xb6
bytes, and so 0x4fbf70ce

XBAS + 0x8000 is 0x4fbff0ce ?

cat /proc/iomem shows that this is
4ee5f000-4fca0fff : ACPI Non-volatile Storage

--
MST
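The offset arithmetic above can be re-checked mechanically. The following is only a minimal Python sketch: the field names and widths are transcribed from the SANV listing up to and including XBAS, while `field_offsets` and the variable names are made up for illustration. One caveat, stated with some caution: the result is the address at which the XBAS field is stored. In ASL a field name evaluates to the field's contents, so the base of OPG0 would be the 32-bit value read from that address plus 0x8000, and that value is not visible in the listing itself.

```python
# Base address of the SANV operation region, from the SSDT4 listing.
SANV_BASE = 0x4FBF7018

# "NAME WIDTH_IN_BITS" pairs, transcribed from the listing up to XBAS.
SANV_FIELDS = """
ASLB 32  IMON 8   IGDS 8   IBTT 8   IPAT 8   IPSC 8   IBIA 8   ISSC 8
IDMS 8   IF1E 8   HVCO 8   GSMI 8   PAVP 8   CADL 8   CSTE 16  NSTE 16
NDID 8   DID1 32  DID2 32  DID3 32  DID4 32  DID5 32  DID6 32  DID7 32
DID8 32  DID9 32  DIDA 32  DIDB 32  DIDC 32  DIDD 32  DIDE 32  DIDF 32
DIDX 32  NXD1 32  NXD2 32  NXD3 32  NXD4 32  NXD5 32  NXD6 32  NXD7 32
NXD8 32  NXDX 32  LIDS 8   KSV0 32  KSV1 8   BRTL 8   ALSE 8   ALAF 8
LLOW 8   LHIH 8   ALFP 8   IPTP 8   EDPV 8   SGMD 8   SGFL 8   SGGP 8
HRE0 8   HRG0 32  HRA0 8   PWE0 8   PWG0 32  PWA0 8   P1GP 8   HRE1 8
HRG1 32  HRA1 8   PWE1 8   PWG1 32  PWA1 8   P2GP 8   HRE2 8   HRG2 32
HRA2 8   PWE2 8   PWG2 32  PWA2 8   DLPW 16  DLHR 16  EECP 8   XBAS 32
"""


def field_offsets(spec):
    """Map each field name to its starting bit offset within the region."""
    tokens = spec.split()
    offsets = {}
    bit = 0
    for name, width in zip(tokens[0::2], tokens[1::2]):
        offsets[name] = bit
        bit += int(width)
    return offsets


offsets = field_offsets(SANV_FIELDS)
xbas_bit = offsets["XBAS"]            # bit offset of XBAS inside SANV
xbas_byte = xbas_bit // 8             # all fields here are byte-aligned
xbas_addr = SANV_BASE + xbas_byte     # where the XBAS value is stored

print(xbas_bit, hex(xbas_byte), hex(xbas_addr))
# The OPG0 base would be (value read at xbas_addr) + 0x8000,
# not xbas_addr + 0x8000.
```

This agrees with the 1456-bit / 0xb6-byte figure and the 0x4fbf70ce address quoted above.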
Karol Herbst
2018-Nov-28 23:21 UTC
This was already debugged and there is no point in searching inside
the firmware. It's not a firmware bug or anything.

The proper fix is to do something inside Nouveau so that we don't
upset the device and are able to runtime resume it again.

The initial thing we do inside Nouveau to cause those issues is to run
that so-called "DEVINIT" script inside the vbios to initialize the
GPU. The problem is that it changes something in the PCIe configuration
so that the GPU isn't able to runtime resume anymore. I am in contact
with Nvidia about that issue and hopefully we get the proper answers.
When I was digging into that myself I was able to make the situation
more stable by setting the PCIe link speed to the boot defaults, but
that was still pretty unstable.

Anyway, because the binary driver fails here as well (through
bumblebee and so on) there isn't much reverse engineering we can do
besides guessing and trying it on literally every piece of hardware
until it works.

We also have an upstream bug for this issue:
https://bugzilla.kernel.org/show_bug.cgi?id=156341

On Wed, Nov 28, 2018 at 9:30 PM Michael S. Tsirkin <mst at redhat.com> wrote:
> [snip]
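For anyone wanting to compare the PCIe link speed before and after nouveau has run DEVINIT, the kernel exposes the negotiated and maximum link parameters as sysfs attributes on PCI devices. The following is a rough Python sketch for reading them, not something taken from this thread: `read_link_state` is a made-up helper, and the device address `0000:01:00.0` is purely a placeholder that has to be replaced with the real address of the GPU or its root port.

```python
from pathlib import Path

# sysfs attributes describing the PCIe link of a device.
LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_state(device_dir):
    """Return the PCIe link attributes found in a device's sysfs directory."""
    state = {}
    for name in LINK_ATTRS:
        path = Path(device_dir) / name
        try:
            state[name] = path.read_text().strip()
        except OSError:
            # Attribute missing, or the read failed (e.g. device powered off).
            continue
    return state


if __name__ == "__main__":
    # Placeholder address (assumption); check lspci for the real one.
    print(read_link_state("/sys/bus/pci/devices/0000:01:00.0"))
```

Running it once right after boot and once after nouveau has loaded gives two snapshots to diff. Note that reading these attributes touches config space, so doing it while the device is runtime suspended may wake it or fail.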
Michael S. Tsirkin
2018-Nov-29 01:29 UTC
On Thu, Nov 29, 2018 at 12:21:31AM +0100, Karol Herbst wrote:
> this was already debugged and there is no point in searching inside
> the Firmware. It's not a firmware bug or anything.
>
> The proper fix is to do something inside Nouveau so that we don't
> upset the device and being able to runtime resume it again.
>
> The initial thing we do inside Nouveau to cause those issues is to run
> that so called "DEVINIT" script inside the vbios to initialize the
> GPU, problem is, it changes something on the PCIe configuration so
> that the GPU isn't able to runtime resume anymore. I am in contact
> with Nvidia about that issue and hopefully we get the proper answers.
> When I was digging into that myself I was able to make the situation
> more stable by setting the PCIE link speed to the boot defaults, but
> that was still pretty unstable.
>
> Anyway, because the binary driver fails here as well (through
> bumblebee and so on) there isn't much of reverse engineering we can do
> besides guessing and trying it on literally every hardware until it
> works.
>
> We also have an upstream bug for this issue:
> https://bugzilla.kernel.org/show_bug.cgi?id=156341

If you like I can probably dump the PCIe registers on the card and/or the
PCIe port under Windows. The card works there :)

Let me know.

--
MST
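If such a register dump does get collected, the link-related fields can be pulled out of the raw config space bytes regardless of which OS produced them. Below is a small Python sketch under a few assumptions: the dump is a plain byte string of the device's config space (on Linux it could come from the `config` file in the device's sysfs directory), and `find_capability` and `link_info` are names invented here. The register offsets follow the standard PCI capability list and PCI Express capability layout. It also treats an all-ones vendor ID as "no device", which is what a powered-off or unreachable function typically returns.

```python
import struct

PCI_STATUS = 0x06            # Status register
PCI_STATUS_CAP_LIST = 0x10   # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # Pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCAP = 0x0C        # Link Capabilities (32 bit), relative to cap
PCI_EXP_LNKSTA = 0x12        # Link Status (16 bit), relative to cap


def find_capability(cfg, cap_id):
    """Walk the capability list; return the offset of cap_id or None."""
    if cfg[0:2] == b"\xff\xff":
        return None  # all-ones vendor ID: device not responding
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    # Capabilities live above the 64-byte header; guard against loops
    # and against truncated dumps.
    while pos >= 0x40 and pos not in seen and pos + 1 < len(cfg):
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def link_info(cfg):
    """Return raw link speed/width codes from the PCIe capability."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None or cap + PCI_EXP_LNKSTA + 2 > len(cfg):
        return None
    (lnkcap,) = struct.unpack_from("<I", cfg, cap + PCI_EXP_LNKCAP)
    (lnksta,) = struct.unpack_from("<H", cfg, cap + PCI_EXP_LNKSTA)
    # Speed codes are conventionally 1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s.
    return {
        "max_speed": lnkcap & 0xF,
        "max_width": (lnkcap >> 4) & 0x3F,
        "cur_speed": lnksta & 0xF,
        "cur_width": (lnksta >> 4) & 0x3F,
    }
```

Comparing the output for a dump taken under Windows with one taken under Linux after nouveau has loaded would show whether the negotiated link speed or width differs between the two.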