thr3ads.net - Nouveau - [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected [Feb 2023]

Chris Clayton

2023-Feb-02 20:45 UTC

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 01/02/2023 13:51, Chris Clayton wrote:
>
> On 30/01/2023 23:27, Ben Skeggs wrote:
>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at googlemail.com> wrote:
>>>
>>> Hi again.
>>>
>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>> Thanks, Ben.
>>>
>>> <snip>
>>>
>>>>> Hey,
>>>>>
>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>> *any* of my boards.  Could you try the attached patch please?
>>>>
>>>> Unfortunately, the patch made no difference.
>>>>
>>>> I've been looking at how the graphics on my laptop are set up, and I'm a bit worried that the firmware might
>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>>> what Ubuntu have done, and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different
>>>> card. I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
>>>> problem?
>>>>
>>>> I'll have a fiddle and see what I can work out.
>>>>
>>>> Chris
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ben.
>>>>>
>>>>>>
>>>
>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>>> firmware directory tree that are symlinks to actual files in its tu116 counterpart, so I deleted all of those too.
>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>>> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>
>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>> you might want to, because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
>> The symlinks are correct - whole groups of GPUs share the same FW, and
>> we use symlinks in linux-firmware to represent this.
>>
>> I don't really have any ideas how/why this patch causes issues with
>> shutdown - it's a path that only gets executed during initialisation.
>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>> over ssh? netconsole?), and see if there's any relevant messages
>> providing a hint at what's going on?  Alternatively, you could try
>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>> first) and seeing if that hangs too.
>>
>> Ben.
> 
> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh, and netconsole
> produced a log with nothing unusual in it.
>
> Simply stopping Xorg and removing the nouveau module succeeds.
>
> So, I rebuilt rc6+ after a pull from Linus' tree this morning and set the nouveau debug level to 7. I then booted to a
> console before doing a reboot (with Ctrl+Alt+Del). As expected, the machine locked up just before it would ordinarily
> restart. The last few lines on the console might be helpful:
> 
> ...
> nouveau 0000:01:00:0  fifo: preinit running...
> nouveau 0000:01:00:0  fifo: preinit completed in 4us
> nouveau 0000:01:00:0  gr: preinit running...
> nouveau 0000:01:00:0  gr: preinit completed in 0us
> nouveau 0000:01:00:0  nvdec0: preinit running...
> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
> nouveau 0000:01:00:0  nvdec0: preinit running...
> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
> nouveau 0000:01:00:0  sec2: preinit running...
> nouveau 0000:01:00:0  sec2: preinit completed in 0us
> nouveau 0000:01:00:0  fb: VPR locked, running scrubber binary
> 
> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>
> After the "running scrubber" line appears, the machine is locked and I have to hold down the power button to recover. I
> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
> all three result in the same outcome, because invocations of halt, poweroff and reboot (without the -f argument) from a
> runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
> messages from nouveau followed by the lockup.
> 
> Let me know if you need any additional diagnostics.
> 
> Chris
> 
I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
(9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.

Chris
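
For anyone wanting to compare the two captures, a minimal Python sketch along these lines could pull out the last nouveau messages from each log. The helper names last_nouveau_lines and compare_tails are made up for illustration; only the two log file names come from the attachments, and matching on the substring "nouveau" is an assumption about what is worth looking at.

```python
from collections import deque
from pathlib import Path


def last_nouveau_lines(log_text, count=10):
    """Return the final `count` lines that mention nouveau, in log order."""
    tail = deque(maxlen=count)  # keeps only the most recent matches
    for line in log_text.splitlines():
        if "nouveau" in line:
            tail.append(line.strip())
    return list(tail)


def compare_tails(good_log, bad_log, count=10):
    """Print the last nouveau messages of each capture, one block per log."""
    for path in (Path(good_log), Path(bad_log)):
        # errors="replace" because console captures may contain stray bytes
        text = path.read_text(errors="replace")
        print(f"== {path.name} ==")
        for line in last_nouveau_lines(text, count):
            print(line)
```

With both attachments saved in the current directory, calling compare_tails("netconsole-6.1.9.log", "netconsole-6.2.0-rc6+.log") would print the two tails one after the other, which should make it easier to see where the 6.2.0-rc6+ shutdown diverges.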
>>
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> <snip>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.1.9.log
Type: text/x-log
Size: 442957 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: netconsole-6.2.0-rc6+.log
Type: text/x-log
Size: 619445 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230202/3c823f17/attachment-0003.bin>

Chris Clayton

2023-Feb-08 08:48 UTC

Hi.

I'm assuming that we are not going to see a fix for this regression before 6.2 is released. Consequently, I've
implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.

Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
shutdown process, I may need help on how to go about capturing them.

Chris
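
The init script itself is not shown in this thread, so the following is only a rough Python sketch of the order of operations described above: stop SDDM first, then unload nouveau. The STOP_SDDM command is an assumption (the real script may stop SDDM differently), and shutdown_steps and run_steps are hypothetical names. "modprobe -r nouveau" is the standard way to remove the module.

```python
import subprocess

# Assumption: how SDDM gets stopped on this machine; not taken from the thread.
STOP_SDDM = ["killall", "sddm"]
# Unload the nouveau kernel module once nothing is using the GPU any more.
UNLOAD_NOUVEAU = ["modprobe", "-r", "nouveau"]


def shutdown_steps(stop_cmd=STOP_SDDM):
    """Return the commands in the order the workaround runs them."""
    return [list(stop_cmd), list(UNLOAD_NOUVEAU)]


def run_steps(steps, dry_run=True):
    """Run each command in turn; with dry_run=True only print them."""
    for cmd in steps:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops at the first failing command, so the module
            # is not unloaded if the display manager could not be stopped.
            subprocess.run(cmd, check=True)
```

run_steps(shutdown_steps()) only prints the two commands; passing dry_run=False as root would actually execute them.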


Linux regression tracking (Thorsten Leemhuis)

2023-Feb-10 18:35 UTC

On 08.02.23 09:48, Chris Clayton wrote:
>
> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
Yeah, looks like it. That's unfortunate, but it happens. But there is still
time to fix it and there is one thing I wonder:

Did any of the nouveau developers look at the netconsole captures Chris
posted more than a week ago to check if they somehow help to track down
the root of this problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
