Chris Clayton
2023-Feb-01 13:51 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On 30/01/2023 23:27, Ben Skeggs wrote:
> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at googlemail.com> wrote:
>>
>> Hi again.
>>
>> On 30/01/2023 20:19, Chris Clayton wrote:
>>> Thanks, Ben.
>>
>> <snip>
>>
>>>> Hey,
>>>>
>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>> *any* of my boards. Could you try the attached patch please?
>>>
>>> Unfortunately, the patch made no difference.
>>>
>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>> what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different
>>> card. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
>>> problem?
>>>
>>> I'll have a fiddle and see what I can work out.
>>>
>>> Chris
>>>
>>>>
>>>> Thanks,
>>>> Ben.
>>>>
>>>>>
>>
>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>
>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
> The symlinks are correct - whole groups of GPUs share the same FW, and
> we use symlinks in linux-firmware to represent this.
>
> I don't really have any ideas how/why this patch causes issues with
> shutdown - it's a path that only gets executed during initialisation.
> Can you try and capture the kernel log during shutdown ("dmesg -w"
> over ssh? netconsole?), and see if there's any relevant messages
> providing a hint at what's going on? Alternatively, you could try
> unloading the module (you will have to stop X/wayland/gdm/etc/etc
> first) and seeing if that hangs too.
>
> Ben.

Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh, and netconsole produced a log with nothing unusual in it.

Simply stopping Xorg and removing the nouveau module succeeds.

So, I rebuilt rc6+ after a pull from Linus' tree this morning and set the nouveau debug level to 7. I then booted to a console before doing a reboot (with Ctrl+Alt+Del). As expected, the machine locked up just before it would ordinarily restart. The last few lines on the console might be helpful:

...
nouveau 0000:01:00:0 fifo: preinit running...
nouveau 0000:01:00:0 fifo: preinit completed in 4us
nouveau 0000:01:00:0 gr: preinit running...
nouveau 0000:01:00:0 gr: preinit completed in 0us
nouveau 0000:01:00:0 nvdec0: preinit running...
nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
nouveau 0000:01:00:0 nvdec0: preinit running...
nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
nouveau 0000:01:00:0 sec2: preinit running...
nouveau 0000:01:00:0 sec2: preinit completed in 0us
nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary

These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.

After the "running scrubber" line appears, the machine is locked and I have to hold down the power button to recover. I get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that all three result in the same outcome, because invocations of halt, poweroff and reboot (without the -f argument) from a runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same messages from nouveau followed by the lockup.

Let me know if you need any additional diagnostics.

Chris
>
>>
>> Thanks,
>>
>> Chris
>>
>> <snip>
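
For anyone repeating the netconsole capture suggested above: netconsole sends kernel messages as UDP datagrams, so the receiving machine only needs something that listens on a UDP port and writes what arrives to a file. Netcat is the usual tool; the following is a minimal Python sketch of the same idea, not what was actually used in this thread. The function names, the bind address and the output file name are all hypothetical, and 6666 is assumed only because it is netconsole's documented default target port.

```python
import socket


def open_listener(port=6666, bind_addr="0.0.0.0"):
    """Return a UDP socket bound to the port netconsole is sending to.

    The port and address are assumptions; they must match the target
    given in the sender's netconsole= parameter.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def capture(sock, log, max_packets=None, timeout=None):
    """Write each received datagram to `log` (a binary file object).

    Every datagram is flushed immediately so the log on disk is complete
    up to the moment the sending machine stops talking. Returns the
    number of datagrams written. Stops after `max_packets` datagrams, or
    when nothing arrives for `timeout` seconds (None waits forever).
    """
    sock.settimeout(timeout)
    received = 0
    while max_packets is None or received < max_packets:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break
        log.write(data)
        log.flush()
        received += 1
    return received
```

A typical run on the receiving machine would be `capture(open_listener(), open("netconsole.log", "ab"))`, started before the reboot or poweroff is triggered on the machine under test. As noted later in the thread, the capture is only useful if the shutdown scripts leave the network up long enough for the final messages to be sent.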
Chris Clayton
2023-Feb-02 20:45 UTC
[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected
On 01/02/2023 13:51, Chris Clayton wrote:
> <snip>
>
> Let me know if you need any additional diagnostics.
>
> Chris
>

I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+ (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.

Chris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.1.9.log
Type: text/x-log
Size: 442957 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.2.0-rc6+.log
Type: text/x-log
Size: 619445 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0003.bin>
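
With one capture from a kernel that shuts down cleanly and one from a kernel that hangs, a quick first pass is to list the nouveau messages that appear only in the failing log. The sketch below is one hedged way to do that in Python. It assumes each relevant log line contains "nouveau", then a device address, then the message, as in the console excerpt quoted earlier in the thread; the actual layout of the attached logs has not been checked, and the function names are hypothetical. Timings such as "4us" are masked so that otherwise identical lines compare equal.

```python
import re

# "nouveau <device address> <message>"; the address is skipped as one token.
_NOUVEAU = re.compile(r"nouveau \S+ (.*)")
# Durations such as "4us" vary from run to run, so mask them before comparing.
_TIMING = re.compile(r"\b\d+us\b")


def nouveau_messages(log_text):
    """Return the nouveau messages in a log, in order, with timings masked."""
    messages = []
    for line in log_text.splitlines():
        match = _NOUVEAU.search(line)
        if match:
            messages.append(_TIMING.sub("<t>us", match.group(1).strip()))
    return messages


def only_in_second(first_log, second_log):
    """Return nouveau messages present in second_log but absent from first_log.

    Order of first appearance in second_log is kept; repeats are dropped.
    """
    seen = set(nouveau_messages(first_log))
    unique = []
    for message in nouveau_messages(second_log):
        if message not in seen and message not in unique:
            unique.append(message)
    return unique
```

Reading netconsole-6.1.9.log and netconsole-6.2.0-rc6+.log into strings and passing them to `only_in_second` in that order would list what is new in the 6.2.0-rc6+ capture. This only narrows down where to look; it does not replace reading the two logs side by side.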