thr3ads.net - Nouveau - [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected [Feb 2023]

Linux regression tracking (Thorsten Leemhuis)

2023-Feb-10 19:33 UTC

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 10.02.23 20:01, Karol Herbst wrote:
> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> Leemhuis) <regressions at leemhuis.info> wrote:
>>
>> On 08.02.23 09:48, Chris Clayton wrote:
>>>
>>> I'm assuming that we are not going to see a fix for this regression
>>> before 6.2 is released.
>>
>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>> time to fix it and there is one thing I wonder:
>>
>> Did any of the nouveau developers look at the netconsole captures Chris
>> posted more than a week ago to check if they somehow help to track down
>> the root of this problem?
>
> I did now and I can't spot anything. I think at this point it would
> make sense to dump the active tasks/threads via sysrq keys to see if
> any is in a weird state preventing the machine from shutting down.

Many thx for looking into it!

Ciao, Thorsten

>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>>> Consequently, I've implemented a (very simple) workaround. All that
>>> happens is that in the (sysv) init script that starts and stops SDDM,
>>> the nouveau module is removed once SDDM is stopped. With that in place,
>>> my system no longer freezes on reboot or poweroff.
>>>
>>> Let me know if I can provide any additional diagnostics although, with
>>> the problem seemingly occurring so late in the shutdown process, I may
>>> need help on how to go about capturing.
>>>
>>> Chris
>>>
>>> On 02/02/2023 20:45, Chris Clayton wrote:
>>>>
>>>>
>>>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>>>
>>>>>
>>>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at googlemail.com> wrote:
>>>>>>>
>>>>>>> Hi again.
>>>>>>>
>>>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>>>> Thanks, Ben.
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>>>>> *any* of my boards.  Could you try the attached patch please?
>>>>>>>>
>>>>>>>> Unfortunately, the patch made no difference.
>>>>>>>>
>>>>>>>> I've been looking at how the graphics on my laptop is set up, and have
>>>>>>>> a bit of a worry about whether the firmware might be playing a part in
>>>>>>>> this problem. In order to offload video decoding to the NVidia TU117
>>>>>>>> GPU, it seems the scrubber firmware must be available, but as far as I
>>>>>>>> know, that has not been released by NVidia. To get it to work, I
>>>>>>>> followed what Ubuntu have done and the scrubber in
>>>>>>>> /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the
>>>>>>>> firmware being loaded is for a different card. I note that processing
>>>>>>>> related to firmware is being changed in the patch. Might my set up be
>>>>>>>> at the root of my problem?
>>>>>>>>
>>>>>>>> I'll have a fiddle and see what I can work out.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> Well, my fiddling has got my system rebooting and shutting down
>>>>>>> successfully again. I found that if I delete the symlink to the
>>>>>>> scrubber firmware, reboot and shutdown work again. There are, however,
>>>>>>> a number of other files in the tu117 firmware directory tree that are
>>>>>>> symlinks to actual files in its tu116 counterpart. So I deleted all of
>>>>>>> those too. Unfortunately, the absence of one or more of those symlinks
>>>>>>> causes Xorg to fail to start. I've reinstated all the links except
>>>>>>> scrubber and I now have a system that works as it did until I tried to
>>>>>>> run a kernel that includes the bad commit I identified in my bisection.
>>>>>>> That includes offloading video decoding to the NVidia card, so whatever
>>>>>>> I read that said the scrubber firmware was needed seems to have been
>>>>>>> wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR locked, but
>>>>>>> no scrubber binary!), but, hey, we can't have everything.
>>>>>>>
>>>>>>> If you still want to get to the bottom of this, let me know what you
>>>>>>> need me to provide and I'll do my best. I suspect you might want to
>>>>>>> because there will be an awful lot of Ubuntu-based systems out there
>>>>>>> with that scrubber.bin symlink in place. On the other hand, it could be
>>>>>>> quite a while before Ubuntu are deploying 6.2 or later kernels.
>>>>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>>>>> we use symlinks in linux-firmware to represent this.
>>>>>>
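
To see which files in a firmware directory are symlinks and where they point, something like the sketch below could be used. It assumes `find` and `readlink` are available; the directory argument is whatever tree you want to inspect (on the laptop discussed here that would be /lib/firmware/nvidia/tu117), and the demo tree it builds is a throwaway stand-in with dummy files, not real firmware.

```shell
#!/bin/sh
# List every symlink under a directory together with its target.
list_firmware_symlinks() {
    find "$1" -type l | sort | while IFS= read -r link; do
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}

# Demo on a scratch tree with one tu117 -> tu116 link (hypothetical layout,
# empty dummy file instead of a firmware image).
demo=$(mktemp -d)
mkdir -p "$demo/tu116/nvdec" "$demo/tu117/nvdec"
: > "$demo/tu116/nvdec/scrubber.bin"
ln -s ../../tu116/nvdec/scrubber.bin "$demo/tu117/nvdec/scrubber.bin"

links=$(list_firmware_symlinks "$demo/tu117")
echo "$links"
rm -rf "$demo"
```

Running the function against the real directory only reads it, so it is safe to try before deleting or reinstating any links.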
>>>>>> I don't really have any ideas how/why this patch causes issues with
>>>>>> shutdown - it's a path that only gets executed during initialisation.
>>>>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>>>>> over ssh? netconsole?), and see if there's any relevant messages
>>>>>> providing a hint at what's going on? Alternatively, you could try
>>>>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>>>>> first) and seeing if that hangs too.
>>>>>>
>>>>>> Ben.
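
For readers who have not used netconsole, a minimal sketch of the kind of capture suggested here might look like the following. The ports, IP addresses, interface name and MAC address are placeholders, not values from this thread; the parameter layout (source port, source IP and device, then target port, target IP and target MAC) follows the kernel's netconsole documentation. The script only prints the modprobe command instead of running it.

```shell
#!/bin/sh
# Build the value for the netconsole= module parameter:
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
    src_port=$1; src_ip=$2; dev=$3
    tgt_port=$4; tgt_ip=$5; tgt_mac=$6
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# All six values below are hypothetical examples.
param=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Command to run as root on the machine that hangs:
echo "modprobe netconsole netconsole=$param"

# On the receiving machine, a UDP listener collects the messages, e.g.
#   nc -u -l 6666 | tee netconsole.log
# (the exact flags differ between netcat variants).
```

The capture only helps if the network interface stays up until the very end of shutdown, so the shutdown scripts may need adjusting to leave it alone.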
>>>>>
>>>>> Sorry for the delay - I've been learning about netconsole and netcat.
>>>>> However, I had no success with ssh and netconsole produced a log with
>>>>> nothing unusual in it.
>>>>>
>>>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>>>
>>>>> So, I rebuilt rc6+ after a pull from Linus' tree this morning and set
>>>>> the nouveau debug level to 7. I then booted to a console before doing a
>>>>> reboot (with Ctrl+Alt+Del). As expected, the machine locked up just
>>>>> before it would ordinarily restart. The last few lines on the console
>>>>> might be helpful:
>>>>>
>>>>> ...
>>>>> nouveau 0000:01:00.0  fifo: preinit running...
>>>>> nouveau 0000:01:00.0  fifo: preinit completed in 4us
>>>>> nouveau 0000:01:00.0  gr: preinit running...
>>>>> nouveau 0000:01:00.0  gr: preinit completed in 0us
>>>>> nouveau 0000:01:00.0  nvdec0: preinit running...
>>>>> nouveau 0000:01:00.0  nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00.0  nvdec0: preinit running...
>>>>> nouveau 0000:01:00.0  nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00.0  sec2: preinit running...
>>>>> nouveau 0000:01:00.0  sec2: preinit completed in 0us
>>>>> nouveau 0000:01:00.0  fb: VPR locked, running scrubber binary
>>>>>
>>>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I
>>>>> reported in my initial email.
>>>>>
>>>>> After the "running scrubber" line appears the machine is locked and I
>>>>> have to hold down the power button to recover. I get the same outcome
>>>>> from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I
>>>>> guess it's no surprise that all three result in the same outcome
>>>>> because invocations of halt, poweroff and reboot (without the -f
>>>>> argument) from a runlevel other than 0 result in shutdown being run.
>>>>> Switching to runlevel 0 with "telinit 0" results in the same messages
>>>>> from nouveau followed by the lockup.
>>>>>
>>>>> Let me know if you need any additional diagnostics.
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>> I've done some more investigation and found that I hadn't sufficiently
>>>> amended the scripts run at shutdown to prevent the network being shut
>>>> down. I've now got netconsole captures for 6.2.0-rc6+
>>>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9.
>>>> These two logs are attached.
>>>>
>>>> Chris
>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> <snip>
>>>
>>>
>>
> 
> 
>

Chris Clayton

2023-Feb-11 13:38 UTC

On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 10.02.23 20:01, Karol Herbst wrote:
>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>> Leemhuis) <regressions at leemhuis.info> wrote:
>>>
>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>
>>>> I'm assuming that we are not going to see a fix for this regression
>>>> before 6.2 is released.
>>>
>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>> time to fix it and there is one thing I wonder:
>>>
>>> Did any of the nouveau developers look at the netconsole captures Chris
>>> posted more than a week ago to check if they somehow help to track down
>>> the root of this problem?
>>
>> I did now and I can't spot anything. I think at this point it would
>> make sense to dump the active tasks/threads via sysrq keys to see if
>> any is in a weird state preventing the machine from shutting down.
>
> Many thx for looking into it!

Yes, thanks Karol.

Attached is the output from dmesg when this block of code:

        /bin/mount /dev/sda7 /mnt/sda7
        /bin/mountpoint /proc || /bin/mount /proc
        /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
        /bin/echo t > /proc/sysrq-trigger
        /bin/sleep 1
        /bin/sync
        /bin/sleep 1
        kill $(pidof dmesg)
        /bin/umount /mnt/sda7

is executed immediately before /sbin/reboot is called as the final step of
rebooting my system.
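
When reading a task dump like this one, a quick first pass is to pick out the tasks reported in uninterruptible sleep (state D), since those are the usual suspects for something that never finishes. The sketch below assumes the log uses the `task:<name> state:<char>` header line that recent kernels print for each task in a sysrq-t dump. The two sample lines are invented for illustration and are not taken from the attached log.

```shell
#!/bin/sh
# Print the header line of every task the dump reports in state D
# (uninterruptible sleep).
blocked_tasks() {
    grep -E 'task:.* state:D' "$1"
}

# Illustrative input only: both lines are made up.
log=$(mktemp)
cat > "$log" <<'EOF'
[  100.000001] task:systemd         state:S stack:0     pid:1     ppid:0      flags:0x00000002
[  100.000002] task:example-worker  state:D stack:0     pid:42    ppid:2      flags:0x00004000
EOF

blocked=$(blocked_tasks "$log")
echo "$blocked"
rm -f "$log"
```

On a real capture the function would be pointed at the saved file, for example /mnt/sda7/sysrq.dmesg.log from the block above; the stack trace that follows each matching header in the full log is what shows where the task is stuck.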

I hope this is what you were looking for, but if not, please let me know what
you need.

Chris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sysrq-t.dmesg.log
Type: text/x-log
Size: 219569 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230211/3ac5c620/attachment-0001.bin>

Dave Airlie

2023-Feb-13 02:57 UTC

On Sun, 12 Feb 2023 at 00:43, Chris Clayton <chris2553 at googlemail.com> wrote:
>
> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 10.02.23 20:01, Karol Herbst wrote:
> >> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >> Leemhuis) <regressions at leemhuis.info> wrote:
> >>>
> >>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>
> >>>> I'm assuming that we are not going to see a fix for this
> >>>> regression before 6.2 is released.
> >>>
> >>> Yeah, looks like it. That's unfortunate, but happens. But there is
> >>> still time to fix it and there is one thing I wonder:
> >>>
> >>> Did any of the nouveau developers look at the netconsole captures
> >>> Chris posted more than a week ago to check if they somehow help to
> >>> track down the root of this problem?
> >>
> >> I did now and I can't spot anything. I think at this point it would
> >> make sense to dump the active tasks/threads via sysrq keys to see if
> >> any is in a weird state preventing the machine from shutting down.
> >
> > Many thx for looking into it!
>
> Yes, thanks Karol.
>
> Attached is the output from dmesg when this block of code:
>
>         /bin/mount /dev/sda7 /mnt/sda7
>         /bin/mountpoint /proc || /bin/mount /proc
>         /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>         /bin/echo t > /proc/sysrq-trigger
>         /bin/sleep 1
>         /bin/sync
>         /bin/sleep 1
>         kill $(pidof dmesg)
>         /bin/umount /mnt/sda7
>
> is executed immediately before /sbin/reboot is called as the final step of
> rebooting my system.
>
> I hope this is what you were looking for, but if not, please let me know
> what you need

Another shot in the dark, but does nouveau.runpm=0 help at all?

Dave.
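
For anyone wanting to try this, one way to set the parameter persistently is a modprobe options file; a rough sketch follows. The file name nouveau-runpm.conf is a made-up example, and the function takes the target directory as an argument so it can be tried on a scratch directory first. On a real system the directory would normally be /etc/modprobe.d (root required), and if nouveau is loaded from an initramfs that image may need regenerating for the option to take effect.

```shell
#!/bin/sh
# Write a modprobe options file that sets nouveau's runpm parameter to 0.
# conf_dir is normally /etc/modprobe.d; the file name is hypothetical.
write_runpm_conf() {
    conf_dir=$1
    printf 'options nouveau runpm=0\n' > "$conf_dir/nouveau-runpm.conf"
}

# Try it on a scratch directory rather than the live system.
scratch=$(mktemp -d)
write_runpm_conf "$scratch"
conf=$(cat "$scratch/nouveau-runpm.conf")
echo "$conf"
rm -rf "$scratch"

# For a one-off test, nouveau.runpm=0 can instead be appended to the kernel
# command line in the boot loader. After booting, root can check the value:
#   cat /sys/module/nouveau/parameters/runpm
```

The kernel command line route is the quicker of the two for a single test boot, since nothing on disk has to be changed or undone afterwards.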
