Ben Skeggs
2023-Jan-30 01:09 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On Sat, 28 Jan 2023 at 21:29, Chris Clayton <chris2553 at googlemail.com> wrote:
>
> On 28/01/2023 05:42, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> > On 27.01.23 20:46, Chris Clayton wrote:
> >> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
> >>
> >> Thanks Thorsten.
> >>
> >> I did try to revert but it didn't revert cleanly and I don't have the knowledge to fix it up.
> >>
> >> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
> >> back.
> >
> > You are free to do so, but there is no need for that from my side. I
> > only wanted to know if a simple revert would do the trick; if it
> > doesn't, in my experience it is often best to leave things to the
> > developers of the code in question,
>
> Sound advice, Thorsten. Way too many conflicts for me to resolve.

Hey,

This is a complete shot-in-the-dark, as I don't see this behaviour on
*any* of my boards. Could you try the attached patch please?

Thanks,
Ben.

> > as they know it best and thus have a
> > better idea which hidden side effect a more complex revert might have.
> >
> > Ciao, Thorsten
> >
> >> On 27/01/2023 11:20, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> >>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
> >>> to make this easily accessible to everyone.
> >>>
> >>> @nouveau-maintainers, did anyone take a look at this? The report is
> >>> already 8 days old and I don't see a single reply. Sure, we'll likely
> >>> get a -rc8, but still it would be good to not fix this on the finish line.
> >>>
> >>> Chris, btw, did you try if you can revert the commit on top of latest
> >>> mainline? And if so, does it fix the problem?
> >>>
> >>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>> --
> >>> Everything you wanna know about Linux kernel regression tracking:
> >>> https://linux-regtracking.leemhuis.info/about/#tldr
> >>> If I did something stupid, please tell me, as explained on that page.
> >>>
> >>> #regzbot poke
> >>>
> >>> On 19.01.23 15:33, Linux kernel regression tracking (Thorsten Leemhuis)
> >>> wrote:
> >>>> [adding various lists and the two other nouveau maintainers to the list
> >>>> of recipients]
> >>>
> >>>> On 18.01.23 21:59, Chris Clayton wrote:
> >>>>> Hi.
> >>>>>
> >>>>> I built and installed the latest development kernel earlier this week. I've found that when I try to shut the laptop down (or
> >>>>> reboot it), it hangs right at the end of closing the current session. The last line I see on the screen when rebooting is:
> >>>>>
> >>>>> sd 4:0:0:0: [sda] Synchronizing SCSI cache
> >>>>>
> >>>>> when closing down I see one additional line:
> >>>>>
> >>>>> sd 4:0:0:0: [sda] Stopping disk
> >>>>>
> >>>>> In both cases the machine then hangs and I have to hold down the power button for a few seconds to switch it off.
> >>>>>
> >>>>> Linux 6.1 is OK but 6.2-rc1 hangs, so I bisected between these two and landed on:
> >>>>>
> >>>>> # first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs
> >>>>> (VPR scrubber)
> >>>>>
> >>>>> I built and installed a kernel with f15cde64b66161bfa74fb58f4e5697d8265b802e (the parent of the bad commit) checked out
> >>>>> and that shuts down and reboots fine. I then did the same with the bad commit checked out and that does indeed hang, so
> >>>>> I'm confident the bisect outcome is OK.
> >>>>>
> >>>>> Kernels 6.1.6 and 5.15.88 are also OK.
> >>>>>
> >>>>> My system has dual GPUs - one Intel and one NVidia. Relevant extracts from 'lspci -v' are:
> >>>>>
> >>>>> 00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD Graphics] (rev 05) (prog-if 00 [VGA controller])
> >>>>>         Subsystem: CLEVO/KAPOK Computer CometLake-H GT2 [UHD Graphics]
> >>>>>         Flags: bus master, fast devsel, latency 0, IRQ 142
> >>>>>         Memory at c2000000 (64-bit, non-prefetchable) [size=16M]
> >>>>>         Memory at a0000000 (64-bit, prefetchable) [size=256M]
> >>>>>         I/O ports at 5000 [size=64]
> >>>>>         Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
> >>>>>         Capabilities: [40] Vendor Specific Information: Len=0c <?>
> >>>>>         Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
> >>>>>         Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
> >>>>>         Capabilities: [d0] Power Management version 2
> >>>>>         Kernel driver in use: i915
> >>>>>         Kernel modules: i915
> >>>>>
> >>>>> 01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Ti Mobile] (rev a1) (prog-if 00 [VGA controller])
> >>>>>         Subsystem: CLEVO/KAPOK Computer TU117M [GeForce GTX 1650 Ti Mobile]
> >>>>>         Flags: bus master, fast devsel, latency 0, IRQ 141
> >>>>>         Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
> >>>>>         Memory at b0000000 (64-bit, prefetchable) [size=256M]
> >>>>>         Memory at c0000000 (64-bit, prefetchable) [size=32M]
> >>>>>         I/O ports at 4000 [size=128]
> >>>>>         Expansion ROM at c3000000 [disabled] [size=512K]
> >>>>>         Capabilities: [60] Power Management version 3
> >>>>>         Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> >>>>>         Capabilities: [78] Express Legacy Endpoint, MSI 00
> >>>>>         Kernel driver in use: nouveau
> >>>>>         Kernel modules: nouveau
> >>>>>
> >>>>> DRI_PRIME=1 is exported in one of my init scripts (yes, I am still using sysvinit).
> >>>>>
> >>>>> I've attached the bisect.log, but please let me know if I can provide any other diagnostics. Please cc me as I'm not
> >>>>> subscribed.
> >>>>
> >>>> Thanks for the report. To be sure the issue doesn't fall through the
> >>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> >>>> tracking bot:
> >>>>
> >>>> #regzbot ^introduced e44c2170876197
> >>>> #regzbot title drm: nouveau: hangs on poweroff/reboot
> >>>> #regzbot ignore-activity
> >>>>
> >>>> This isn't a regression? This issue or a fix for it are already
> >>>> discussed somewhere else? It was fixed already? You want to clarify when
> >>>> the regression started to happen? Or point out I got the title or
> >>>> something else totally wrong? Then just reply and tell me -- ideally
> >>>> while also telling regzbot about it, as explained by the page listed in
> >>>> the footer of this mail.
> >>>>
> >>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> >>>> to the report (the parent of this mail). See page linked in footer for
> >>>> details.
> >>>>
> >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>> --
> >>>> Everything you wanna know about Linux kernel regression tracking:
> >>>> https://linux-regtracking.leemhuis.info/about/#tldr
> >>>> That page also explains what to do if mails like this annoy you.
> >>
> >>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nvdec0-reset.diff
Type: text/x-patch
Size: 849 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230130/a52ad70d/attachment.bin>
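
The bisection in the original report (6.1 good, 6.2-rc1 bad, narrowed to a first bad commit whose parent is good) follows the binary-search idea behind `git bisect`. A minimal Python sketch of that idea, assuming a linear history ordered oldest to newest; real kernel history is not linear, and `git bisect` handles merges that this sketch does not. The `is_bad` callback is a hypothetical stand-in for "build, boot, and try to power off", and the history list is invented for illustration apart from the two abbreviated hashes quoted in the report:

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known-good, hi: oldest known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical linear history; "c1".."c3" are placeholders.
    history = ["v6.1", "c1", "f15cde64b661", "0e44c2170876", "c2", "c3", "v6.2-rc1"]
    bad_from = history.index("0e44c2170876")
    print(first_bad(history, lambda c: history.index(c) >= bad_from))
```

Testing the parent of the reported commit separately, as the reporter did, is a useful sanity check on the result: the loop above only ever ends with `lo` and `hi` adjacent, so a good parent and a bad child is exactly what a correct bisection should show.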
Chris Clayton
2023-Jan-30 20:19 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
Thanks, Ben.

On 30/01/2023 01:09, Ben Skeggs wrote:
> On Sat, 28 Jan 2023 at 21:29, Chris Clayton <chris2553 at googlemail.com> wrote:
>>
>> On 28/01/2023 05:42, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
>>> On 27.01.23 20:46, Chris Clayton wrote:
>>>> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
>>>>
>>>> Thanks Thorsten.
>>>>
>>>> I did try to revert but it didn't revert cleanly and I don't have the knowledge to fix it up.
>>>>
>>>> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
>>>> back.
>>>
>>> You are free to do so, but there is no need for that from my side. I
>>> only wanted to know if a simple revert would do the trick; if it
>>> doesn't, in my experience it is often best to leave things to the
>>> developers of the code in question,
>>
>> Sound advice, Thorsten. Way too many conflicts for me to resolve.
> Hey,
>
> This is a complete shot-in-the-dark, as I don't see this behaviour on
> *any* of my boards. Could you try the attached patch please?

Unfortunately, the patch made no difference.

I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
problem?

I'll have a fiddle and see what I can work out.

Chris

>
> Thanks,
> Ben.
>
Chris Clayton
2023-Jan-30 23:09 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
Hi again.

On 30/01/2023 20:19, Chris Clayton wrote:
> Thanks, Ben.

<snip>

>> Hey,
>>
>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>> *any* of my boards. Could you try the attached patch please?
>
> Unfortunately, the patch made no difference.
>
> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
> what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
> I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
> problem?
>
> I'll have a fiddle and see what I can work out.
>
> Chris
>
>>
>> Thanks,
>> Ben.
>>

Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117 firmware directory tree that are symlinks to actual files in its tu116 counterpart, so I deleted all of those too. Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start.

I've reinstated all the links except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said the scrubber firmware was needed seems to have been wrong. I do get a new message (nouveau 0000:01:00.0: fb: VPR locked, but no scrubber binary!), but, hey, we can't have everything.

If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect you might want to, because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.

Thanks,

Chris

<snip>
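
For anyone wanting to check whether their own system has the same kind of cross-chip firmware links (tu117 entries pointing into tu116), here is a minimal Python sketch. It assumes the firmware lives uncompressed under /lib/firmware/nvidia/tu117, as in the message above; the function name `cross_chip_symlinks` is hypothetical, and the script only lists links without changing anything:

```python
import os
from pathlib import Path


def cross_chip_symlinks(chip_dir: Path) -> list[tuple[Path, Path]]:
    """List (link, resolved target) pairs for symlinks under chip_dir
    whose targets resolve to somewhere outside chip_dir."""
    root = chip_dir.resolve()
    found = []
    # os.walk does not follow symlinked directories by default, so a
    # link to a directory is reported once and not descended into.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                target = path.resolve()
                if not target.is_relative_to(root):  # Python 3.9+
                    found.append((path, target))
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the thread; adjust if your distribution differs.
    for link, target in cross_chip_symlinks(Path("/lib/firmware/nvidia/tu117")):
        print(f"{link} -> {target}")
```

On a setup like the one described, the output would be expected to include the nvdec/scrubber.bin link, which is the only one that had to be removed to get poweroff and reboot working again.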