Linux regression tracking (Thorsten Leemhuis)
2023-Feb-10 19:33 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On 10.02.23 20:01, Karol Herbst wrote:
> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> Leemhuis) <regressions at leemhuis.info> wrote:
>>
>> On 08.02.23 09:48, Chris Clayton wrote:
>>>
>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>
>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>> time to fix it and there is one thing I wonder:
>>
>> Did any of the nouveau developers look at the netconsole captures Chris
>> posted more than a week ago to check if they somehow help to track down
>> the root of this problem?
>
> I did now and I can't spot anything. I think at this point it would
> make sense to dump the active tasks/threads via sysrq keys to see if
> any is in a weird state preventing the machine from shutting down.

Many thx for looking into it!

Ciao, Thorsten

>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>>> Consequently, I've
>>> implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
>>> the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
>>>
>>> Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
>>> shutdown process, I may need help on how to go about capturing.
>>>
>>> Chris
>>>
>>> On 02/02/2023 20:45, Chris Clayton wrote:
>>>>
>>>>
>>>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>>>
>>>>>
>>>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at googlemail.com> wrote:
>>>>>>>
>>>>>>> Hi again.
>>>>>>>
>>>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>>>> Thanks, Ben.
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>>>>> *any* of my boards. Could you try the attached patch please?
>>>>>>>>
>>>>>>>> Unfortunately, the patch made no difference.
>>>>>>>>
>>>>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>>>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>>>>>>> what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
>>>>>>>> I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
>>>>>>>> problem?
>>>>>>>>
>>>>>>>> I'll have a fiddle and see what I can work out.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben.
>>>>>>>>>
>>>>>>>
>>>>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>>>>>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>>>>>>> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>>>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>>>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>>>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>>>>>>> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
>>>>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>>>>>
>>>>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>>>>>> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>>>>>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
>>>>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>>>>> we use symlinks in linux-firmware to represent this.
>>>>>>
>>>>>> I don't really have any ideas how/why this patch causes issues with
>>>>>> shutdown - it's a path that only gets executed during initialisation.
>>>>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>>>>> over ssh? netconsole?), and see if there's any relevant messages
>>>>>> providing a hint at what's going on? Alternatively, you could try
>>>>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>>>>> first) and seeing if that hangs too.
>>>>>>
>>>>>> Ben.
>>>>>
>>>>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
>>>>> produced a log with nothing unusual in it.
>>>>>
>>>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>>>
>>>>> So, I rebuilt rc6+ after a pull from Linus' tree this morning and set the nouveau debug level to 7. I then booted to a
>>>>> console before doing a reboot (with Ctrl+Alt+Del). As expected, the machine locked up just before it would ordinarily
>>>>> restart. The last few lines on the console might be helpful:
>>>>>
>>>>> ...
>>>>> nouveau 0000:01:00.0 fifo: preinit running...
>>>>> nouveau 0000:01:00.0 fifo: preinit completed in 4us
>>>>> nouveau 0000:01:00.0 gr: preinit running...
>>>>> nouveau 0000:01:00.0 gr: preinit completed in 0us
>>>>> nouveau 0000:01:00.0 nvdec0: preinit running...
>>>>> nouveau 0000:01:00.0 nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00.0 nvdec0: preinit running...
>>>>> nouveau 0000:01:00.0 nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00.0 sec2: preinit running...
>>>>> nouveau 0000:01:00.0 sec2: preinit completed in 0us
>>>>> nouveau 0000:01:00.0 fb: VPR locked, running scrubber binary
>>>>>
>>>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>>>>>
>>>>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
>>>>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
>>>>> all three result in the same outcome because invocations of halt, poweroff and reboot (without the -f argument) from a
>>>>> runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
>>>>> messages from nouveau followed by the lockup.
>>>>>
>>>>> Let me know if you need any additional diagnostics.
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>> I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
>>>> prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
>>>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
>>>>
>>>> Chris
>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> <snip>
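For anyone wanting to reproduce the netconsole captures discussed in this thread: netconsole needs a `netconsole=` module parameter on the machine under test and a UDP listener on a second machine. The sketch below only assembles and prints that parameter. The IP addresses, interface name and MAC address are hypothetical placeholders, the helper name is made up for illustration, and 6665/6666 are netconsole's documented default source and target ports.

```shell
#!/bin/sh
# Build a netconsole= parameter of the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# $1 = local IP, $2 = local interface, $3 = receiver IP, $4 = receiver MAC
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$1" "$2" 6666 "$3" "$4"
}

# Hypothetical addresses; substitute the values for your own network.
param=$(netconsole_param 192.168.1.10 eth0 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Print the command rather than running it, since loading the module needs root.
echo "modprobe netconsole netconsole=$param"

# On the receiving machine a UDP listener collects the log, for example:
#   nc -u -l 6666 > netconsole.log
# (option syntax differs between netcat variants; some need "-p 6666").
```

As noted in the thread, the network has to stay up until the very end of shutdown for the capture to include the final messages.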
Chris Clayton
2023-Feb-11 13:38 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 10.02.23 20:01, Karol Herbst wrote:
>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>> Leemhuis) <regressions at leemhuis.info> wrote:
>>>
>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>
>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>
>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>> time to fix it and there is one thing I wonder:
>>>
>>> Did any of the nouveau developers look at the netconsole captures Chris
>>> posted more than a week ago to check if they somehow help to track down
>>> the root of this problem?
>>
>> I did now and I can't spot anything. I think at this point it would
>> make sense to dump the active tasks/threads via sysrq keys to see if
>> any is in a weird state preventing the machine from shutting down.
>
> Many thx for looking into it!

Yes, thanks Karol.

Attached is the output from dmesg when this block of code:

/bin/mount /dev/sda7 /mnt/sda7
/bin/mountpoint /proc || /bin/mount /proc
/bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
/bin/echo t > /proc/sysrq-trigger
/bin/sleep 1
/bin/sync
/bin/sleep 1
kill $(pidof dmesg)
/bin/umount /mnt/sda7

is executed immediately before /sbin/reboot is called as the final step of rebooting my system.

I hope this is what you were looking for, but if not, please let me know what you need.

Chris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sysrq-t.dmesg.log
Type: text/x-log
Size: 219569 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230211/3ac5c620/attachment-0001.bin>
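A note on the two ways of triggering the task dump: the keyboard combination (Alt+SysRq+T) is only honoured if the kernel.sysrq setting allows it, whereas writing `t` to /proc/sysrq-trigger as root, as in the commands above, is not subject to that mask. The sketch below checks the mask, assuming the documented meaning of the value (1 enables every function, otherwise bit 8 covers debugging dumps of processes). The function name is made up for illustration.

```shell
#!/bin/sh
# Succeed if the given kernel.sysrq value permits a keyboard-triggered
# task dump: 1 enables all functions, otherwise bit 8 must be set.
sysrq_allows_task_dump() {
    [ "$1" -eq 1 ] || [ $(( $1 & 8 )) -ne 0 ]
}

# Report the state of the running system, if the sysctl is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    current=$(cat /proc/sys/kernel/sysrq)
    if sysrq_allows_task_dump "$current"; then
        echo "kernel.sysrq=$current: Alt+SysRq+T is enabled"
    else
        echo "kernel.sysrq=$current: Alt+SysRq+T is disabled (try 'sysctl kernel.sysrq=1' as root)"
    fi
fi
```

If the keyboard route is disabled, the /proc/sysrq-trigger approach used in the message above remains available to root.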
Dave Airlie
2023-Feb-13 02:57 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On Sun, 12 Feb 2023 at 00:43, Chris Clayton <chris2553 at googlemail.com> wrote:
>
>
>
> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 10.02.23 20:01, Karol Herbst wrote:
> >> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >> Leemhuis) <regressions at leemhuis.info> wrote:
> >>>
> >>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>
> >>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
> >>>
> >>> Yeah, looks like it. That's unfortunate, but happens. But there is still
> >>> time to fix it and there is one thing I wonder:
> >>>
> >>> Did any of the nouveau developers look at the netconsole captures Chris
> >>> posted more than a week ago to check if they somehow help to track down
> >>> the root of this problem?
> >>
> >> I did now and I can't spot anything. I think at this point it would
> >> make sense to dump the active tasks/threads via sysrq keys to see if
> >> any is in a weird state preventing the machine from shutting down.
> >
> > Many thx for looking into it!
>
> Yes, thanks Karol.
>
> Attached is the output from dmesg when this block of code:
>
> /bin/mount /dev/sda7 /mnt/sda7
> /bin/mountpoint /proc || /bin/mount /proc
> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
> /bin/echo t > /proc/sysrq-trigger
> /bin/sleep 1
> /bin/sync
> /bin/sleep 1
> kill $(pidof dmesg)
> /bin/umount /mnt/sda7
>
> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>
> I hope this is what you were looking for, but if not, please let me know what you need

Another shot in the dark, but does nouveau.runpm=0 help at all?

Dave.
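For readers trying Dave's suggestion: `runpm` is a nouveau module parameter (0 disables runtime power management), and one way to set it is to append `nouveau.runpm=0` to the kernel command line in the boot loader. Another is a modprobe configuration line such as `options nouveau runpm=0`. The sketch below is only a way to confirm what the running kernel was booted with. The helper name is hypothetical, and it assumes the option was passed on the command line rather than via modprobe configuration.

```shell
#!/bin/sh
# Print the value of the last nouveau.runpm= option found in the
# command-line string given as $1; fail if the option is absent.
runpm_from_cmdline() {
    value=$(printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^nouveau\.runpm=//p' | tail -n 1)
    [ -n "$value" ] && printf '%s\n' "$value"
}

# Check the command line of the running kernel, if it is readable.
if [ -r /proc/cmdline ]; then
    if runpm=$(runpm_from_cmdline "$(cat /proc/cmdline)"); then
        echo "booted with nouveau.runpm=$runpm"
    else
        echo "nouveau.runpm not given on the command line"
    fi
fi
```

A reboot is needed after changing the command line before repeating the poweroff/reboot test.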