thr3ads.net - Nouveau - [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected [Feb 2023]

Chris Clayton

2023-Feb-01 13:51 UTC

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 30/01/2023 23:27, Ben Skeggs wrote:
> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at googlemail.com> wrote:
>>
>> Hi again.
>>
>> On 30/01/2023 20:19, Chris Clayton wrote:
>>> Thanks, Ben.
>>
>> <snip>
>>
>>>> Hey,
>>>>
>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>> *any* of my boards.  Could you try the attached patch please?
>>>
>>> Unfortunately, the patch made no difference.
>>>
>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>> what Ubuntu have done, and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
>>> I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
>>> problem?
>>>
>>> I'll have a fiddle and see what I can work out.
>>>
>>> Chris
>>>
>>>>
>>>> Thanks,
>>>> Ben.
>>>>
>>>>>
>>
>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>
>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
> The symlinks are correct - whole groups of GPUs share the same FW, and
> we use symlinks in linux-firmware to represent this.
> 
> I don't really have any ideas how/why this patch causes issues with
> shutdown - it's a path that only gets executed during initialisation.
> Can you try and capture the kernel log during shutdown ("dmesg -w"
> over ssh? netconsole?), and see if there's any relevant messages
> providing a hint at what's going on?  Alternatively, you could try
> unloading the module (you will have to stop X/wayland/gdm/etc/etc
> first) and seeing if that hangs too.
> 
> Ben.

Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
produced a log with nothing unusual in it.
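
For anyone following along, here is a minimal sketch of a receiver that could stand in for netcat when collecting netconsole output. It is an illustration only, not what was used in this thread: the function names open_listener and receive_messages are hypothetical, and port 6666 is assumed because it is the usual netconsole target port.

```python
import socket


def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket; netconsole sends kernel messages as plain UDP datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, max_datagrams=None, timeout=None):
    """Yield the text of each datagram until the limit is reached or the socket times out."""
    sock.settimeout(timeout)
    received = 0
    while max_datagrams is None or received < max_datagrams:
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            return  # nothing arrived within the timeout, stop listening
        received += 1
        # Kernel logs are usually ASCII; replace anything undecodable rather than fail.
        yield data.decode("utf-8", errors="replace")
```

Looping over receive_messages(open_listener()) and printing each item as it arrives keeps the last messages visible even if the sending machine locks up.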

Simply stopping Xorg and removing the nouveau module succeeds.

So, I rebuilt rc6+ after a pull from Linus' tree this morning and set the nouveau debug level to 7. I then booted to a
console before doing a reboot (with Ctrl+Alt+Del). As expected, the machine locked up just before it would ordinarily
restart. The last few lines on the console might be helpful:

...
nouveau 0000:01:00:0  fifo: preinit running...
nouveau 0000:01:00:0  fifo: preinit completed in 4us
nouveau 0000:01:00:0  gr: preinit running...
nouveau 0000:01:00:0  gr: preinit completed in 0us
nouveau 0000:01:00:0  nvdec0: preinit running...
nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
nouveau 0000:01:00:0  nvdec0: preinit running...
nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
nouveau 0000:01:00:0  sec2: preinit running...
nouveau 0000:01:00:0  sec2: preinit completed in 0us
nouveau 0000:01:00:0  fb: VPR locked, running scrubber binary

These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.

After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
all three result in the same outcome because invocations of halt, poweroff and reboot (without the -f argument) from a
runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
messages from nouveau followed by the lockup.
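
As an illustration of how the last nouveau message before a lockup could be picked out of a longer capture, here is a small sketch. The regular expression and the name last_nouveau_message are assumptions made for this example; the pattern is only meant to match lines shaped like the ones quoted above.

```python
import re

# Assumed line shape: "nouveau <pci address> <unit>: <text>"
NOUVEAU_LINE = re.compile(r"nouveau\s+\S+\s+(?P<unit>\w+):\s*(?P<text>.*)")


def last_nouveau_message(lines):
    """Return (unit, text) for the final nouveau line seen, or None if there is none."""
    last = None
    for line in lines:
        match = NOUVEAU_LINE.search(line)
        if match:
            last = (match.group("unit"), match.group("text").strip())
    return last


# Sample input copied by hand from the console output in this message.
console_tail = [
    "sd 4:0:0:0 [sda] Stopping disk",
    "nouveau 0000:01:00:0  sec2: preinit running...",
    "nouveau 0000:01:00:0  sec2: preinit completed in 0us",
    "nouveau 0000:01:00:0  fb: VPR locked, running scrubber binary",
]
print(last_nouveau_message(console_tail))
```

On the sample this reports the fb unit and the "running scrubber binary" text, which is the point at which the machine stops responding.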

Let me know if you need any additional diagnostics.

Chris
> 
>>
>> Thanks,
>>
>> Chris
>>
>> <snip>
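
The firmware symlinks discussed in this message can be listed with a short script. This is a sketch under assumptions: find_symlinks is a hypothetical helper, and the directory it is pointed at (for example /lib/firmware/nvidia/tu117) depends on the distribution's firmware layout.

```python
import os


def find_symlinks(root):
    """Return sorted (path relative to root, link target) pairs for every symlink under root."""
    links = []
    # os.walk does not follow symlinked directories by default, so they are
    # reported in dirnames and never descended into.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append((os.path.relpath(path, root), os.readlink(path)))
    return sorted(links)
```

Printing the result for the tu117 directory shows which files are shared with another GPU's firmware and which are real files, without deleting anything.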

Chris Clayton

2023-Feb-02 20:45 UTC

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
(9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.

Chris
>>
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> <snip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.1.9.log
Type: text/x-log
Size: 442957 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.2.0-rc6+.log
Type: text/x-log
Size: 619445 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0003.bin>
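
One way two captures like these could be compared is to keep only the nouveau lines and report those that appear in the second log but not the first. This is a sketch, not the method used in the thread: the helper names are hypothetical, and the timestamp pattern assumes the usual "[   12.345678]" printk prefix, which may not be present in every capture.

```python
import re

# Assumed printk timestamp prefix, e.g. "[   12.345678] "
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def nouveau_lines(lines):
    """Keep lines mentioning nouveau, with any leading timestamp and trailing newline removed."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines if "nouveau" in line]


def only_in_second(first, second):
    """Return nouveau lines from `second` that never occur in `first`, in their original order."""
    seen = set(nouveau_lines(first))
    return [line for line in nouveau_lines(second) if line not in seen]
```

With both attachments saved locally, passing the open file objects for netconsole-6.1.9.log and netconsole-6.2.0-rc6+.log as first and second gives the messages unique to the newer kernel. Lines that embed varying durations, such as "completed in 4us", will show up as differences even when nothing meaningful changed.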
