thr3ads.net - Nouveau - [Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected [Feb 2023]

If this information is useful, please help other people find it:
Share via:

Linux regression tracking (Thorsten Leemhuis)

2023-Feb-10 18:35 UTC

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 08.02.23 09:48, Chris Clayton wrote:> 
> I'm assuming  that we are not going to see a fix for this regression
before 6.2 is released.
Yeah, looks like it. That's unfortunate, but happens. But there is still
time to fix it and there is one thing I wonder:

Did any of the nouveau developers look at the netconsole captures Chris
posted more than a week ago to check if they somehow help to track down
the root of this problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker'
hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
> Consequently, I've
> implemented a (very simple) workaround. All that happens is that in the
(sysv) init script that starts and stops SDDM,
> the nouveau module is removed once SDDM is stopped. With that in place, my
system no longer freezes on reboot or poweroff.
> 
> Let me know if I can provide any additional diagnostics although, with the
problem seemingly occurring so late in the
> shutdown process, I may need help on how to go about capturing.
> 
> Chris
> 
> On 02/02/2023 20:45, Chris Clayton wrote:
>>
>>
>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>
>>>
>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553 at
googlemail.com> wrote:
>>>>>
>>>>> Hi again.
>>>>>
>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>> Thanks, Ben.
>>>>>
>>>>> <snip>
>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> This is a complete shot-in-the-dark, as I don't
see this behaviour on
>>>>>>> *any* of my boards.  Could you try the attached
patch please?
>>>>>>
>>>>>> Unfortunately, the patch made no difference.
>>>>>>
>>>>>> I've been looking at how the graphics on my laptop
is set up, and have a bit of a worry about whether the firmware might
>>>>>> be playing a part in this problem. In order to offload
video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>>> firmware must be available, but as far as I know,that
has not been released by NVidia. To get it to work, I followed
>>>>>> what ubuntu have done and the scrubber in
/lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>> ../../tu116/nvdev/scrubber.bin. That, of course, means
that some of the firmware loaded is for a different card is being
>>>>>> loaded. I note that processing related to firmware is
being changed in the patch. Might my set up be at the root of my
>>>>>> problem?
>>>>>>
>>>>>> I'll have a fiddle an see what I can work out.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben.
>>>>>>>
>>>>>>>>
>>>>>
>>>>> Well, my fiddling has got my system rebooting and shutting
down successfully again. I found that if I delete the symlink
>>>>> to the scrubber firmware, reboot and shutdown work again.
There are however, a number of other files in the tu117
>>>>> firmware directory tree that that are symlinks to actual
files in its tu116 counterpart. So I deleted all of those too.
>>>>> Unfortunately, the absence of one or more of those symlinks
causes Xorg to fail to start. I've reinstated all the links
>>>>> except scrubber and I now have a system that works as it
did until I tried to run a kernel that includes the bad commit
>>>>> I identified in my bisection. That includes offloading
video decoding to the NVidia card, so what ever I read that said
>>>>> the scrubber firmware was needed seems to have been wrong.
I get a new message that (nouveau 0000:01:00.0: fb: VPR
>>>>> locked, but no scrubber binary!), but, hey, we can't
have everything.
>>>>>
>>>>> If you still want to get to the bottom of this, let me know
what you need me to provide and I'll do my best. I suspect
>>>>> you might want to because there will a n awful lot of
Ubuntu-based systems out there with that scrubber.bin symlink in
>>>>> place. On the other hand,m it could but quite a while
before ubuntu are deploying 6.2 or later kernels.
>>>> The symlinks are correct - whole groups of GPUs share the same
FW, and
>>>> we use symlinks in linux-firmware to represent this.
>>>>
>>>> I don't really have any ideas how/why this patch causes
issues with
>>>> shutdown - it's a path that only gets executed during
initialisation.
>>>> Can you try and capture the kernel log during shutdown
("dmesg -w"
>>>> over ssh? netconsole?), and see if there's any relevant
messages
>>>> providing a hint at what's going on?  Alternatively, you
could try
>>>> unloading the module (you will have to stop
X/wayland/gdm/etc/etc
>>>> first) and seeing if that hangs too.
>>>>
>>>> Ben.
>>>
>>> Sorry for the delay - I've been learning about netconsole and
netcat. However, I had no success with ssh and netconsole
>>> produced a log with nothing unusual in it.
>>>
>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>
>>> So, I rebuilt rc6+ after a pull from linus' tree this morning
and set the nouveau debug level to 7. I then booted to a
>>> console before doing a reboot (with Ctl+Alt+Del). As expected the
machine locked up just before it would ordinarily
>>> restart. The last few lines on the console might be helpful:
>>>
>>> ...
>>> nouveau 0000:01:00:0  fifo: preinit running...
>>> nouveau 0000:01:00:0  fifo: preinit completed in 4us
>>> nouveau 0000:01:00:0  gr: preinit running...
>>> nouveau 0000:01:00:0  gr: preinit completed in 0us
>>> nouveau 0000:01:00:0  nvdec0: preinit running...
>>> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
>>> nouveau 0000:01:00:0  nvdec0: preinit running...
>>> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
>>> nouveau 0000:01:00:0  sec2: preinit running...
>>> nouveau 0000:01:00:0  sec2: preinit completed in 0us
>>> nouveau 0000:01:00:0  fb:.VPR locked, running scrubber binary
>>>
>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping
disk" I reported in my initial email.
>>>
>>> After the "running scrubber" line appears the machine is
locked and I have to hold down the power button to recover. I
>>> get the same outcome from running "halt -dip",
"poweroff -di" and "shutdown -h -P now". I guess it's no
surprise that
>>> all three result in the same outcome because invocations halt,
poweroff and reboot (without the -f argument)from a
>>> runlevel other than 0 resukt in shutdown being run. switching to
runlevel 0 with "telenit 0" results in the same
>>> messages from nouveau followed by the lockup.
>>>
>>> Let me know if you need any additional diagnostics.
>>>
>>> Chris
>>>
>>
>> I've done some more investigation and found that I hadn't done
sufficient amemdment the scripts run at shutdown to
>> prevent the network being shutdown. I've now got netconsole
captures for 6.2.0-rc6+
>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9.
These two logs are attached.
>>
>> Chris
>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>> <snip>
> 
>

Karol Herbst

2023-Feb-10 19:01 UTC

head link

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
Leemhuis) <regressions at leemhuis.info> wrote:>
> On 08.02.23 09:48, Chris Clayton wrote:
> >
> > I'm assuming  that we are not going to see a fix for this
regression before 6.2 is released.
>
> Yeah, looks like it. That's unfortunate, but happens. But there is
still
> time to fix it and there is one thing I wonder:
>
> Did any of the nouveau developers look at the netconsole captures Chris
> posted more than a week ago to check if they somehow help to track down
> the root of this problem?
>
I did now and I can't spot anything. I think at this point it would
make sense to dump the active tasks/threads via sqsrq keys to see if
any is in a weird state preventing the machine from shutting down.
> Ciao, Thorsten (wearing his 'the Linux kernel's regression
tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> > Consequently, I've
> > implemented a (very simple) workaround. All that happens is that in
the (sysv) init script that starts and stops SDDM,
> > the nouveau module is removed once SDDM is stopped. With that in
place, my system no longer freezes on reboot or poweroff.
> >
> > Let me know if I can provide any additional diagnostics although, with
the problem seemingly occurring so late in the
> > shutdown process, I may need help on how to go about capturing.
> >
> > Chris
> >
> > On 02/02/2023 20:45, Chris Clayton wrote:
> >>
> >>
> >> On 01/02/2023 13:51, Chris Clayton wrote:
> >>>
> >>>
> >>> On 30/01/2023 23:27, Ben Skeggs wrote:
> >>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <chris2553
at googlemail.com> wrote:
> >>>>>
> >>>>> Hi again.
> >>>>>
> >>>>> On 30/01/2023 20:19, Chris Clayton wrote:
> >>>>>> Thanks, Ben.
> >>>>>
> >>>>> <snip>
> >>>>>
> >>>>>>> Hey,
> >>>>>>>
> >>>>>>> This is a complete shot-in-the-dark, as I
don't see this behaviour on
> >>>>>>> *any* of my boards.  Could you try the
attached patch please?
> >>>>>>
> >>>>>> Unfortunately, the patch made no difference.
> >>>>>>
> >>>>>> I've been looking at how the graphics on my
laptop is set up, and have a bit of a worry about whether the firmware might
> >>>>>> be playing a part in this problem. In order to
offload video decoding to the NVidia TU117 GPU, it seems the scrubber
> >>>>>> firmware must be available, but as far as I
know,that has not been released by NVidia. To get it to work, I followed
> >>>>>> what ubuntu have done and the scrubber in
/lib/firmware/nvidia/tu117/nvdec/ is a symlink to
> >>>>>> ../../tu116/nvdev/scrubber.bin. That, of course,
means that some of the firmware loaded is for a different card is being
> >>>>>> loaded. I note that processing related to firmware
is being changed in the patch. Might my set up be at the root of my
> >>>>>> problem?
> >>>>>>
> >>>>>> I'll have a fiddle an see what I can work out.
> >>>>>>
> >>>>>> Chris
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ben.
> >>>>>>>
> >>>>>>>>
> >>>>>
> >>>>> Well, my fiddling has got my system rebooting and
shutting down successfully again. I found that if I delete the symlink
> >>>>> to the scrubber firmware, reboot and shutdown work
again. There are however, a number of other files in the tu117
> >>>>> firmware directory tree that that are symlinks to
actual files in its tu116 counterpart. So I deleted all of those too.
> >>>>> Unfortunately, the absence of one or more of those
symlinks causes Xorg to fail to start. I've reinstated all the links
> >>>>> except scrubber and I now have a system that works as
it did until I tried to run a kernel that includes the bad commit
> >>>>> I identified in my bisection. That includes offloading
video decoding to the NVidia card, so what ever I read that said
> >>>>> the scrubber firmware was needed seems to have been
wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
> >>>>> locked, but no scrubber binary!), but, hey, we
can't have everything.
> >>>>>
> >>>>> If you still want to get to the bottom of this, let me
know what you need me to provide and I'll do my best. I suspect
> >>>>> you might want to because there will a n awful lot of
Ubuntu-based systems out there with that scrubber.bin symlink in
> >>>>> place. On the other hand,m it could but quite a while
before ubuntu are deploying 6.2 or later kernels.
> >>>> The symlinks are correct - whole groups of GPUs share the
same FW, and
> >>>> we use symlinks in linux-firmware to represent this.
> >>>>
> >>>> I don't really have any ideas how/why this patch
causes issues with
> >>>> shutdown - it's a path that only gets executed during
initialisation.
> >>>> Can you try and capture the kernel log during shutdown
("dmesg -w"
> >>>> over ssh? netconsole?), and see if there's any
relevant messages
> >>>> providing a hint at what's going on?  Alternatively,
you could try
> >>>> unloading the module (you will have to stop
X/wayland/gdm/etc/etc
> >>>> first) and seeing if that hangs too.
> >>>>
> >>>> Ben.
> >>>
> >>> Sorry for the delay - I've been learning about netconsole
and netcat. However, I had no success with ssh and netconsole
> >>> produced a log with nothing unusual in it.
> >>>
> >>> Simply stopping Xorg and removing the nouveau module succeeds.
> >>>
> >>> So, I rebuilt rc6+ after a pull from linus' tree this
morning and set the nouveau debug level to 7. I then booted to a
> >>> console before doing a reboot (with Ctl+Alt+Del). As expected
the machine locked up just before it would ordinarily
> >>> restart. The last few lines on the console might be helpful:
> >>>
> >>> ...
> >>> nouveau 0000:01:00:0  fifo: preinit running...
> >>> nouveau 0000:01:00:0  fifo: preinit completed in 4us
> >>> nouveau 0000:01:00:0  gr: preinit running...
> >>> nouveau 0000:01:00:0  gr: preinit completed in 0us
> >>> nouveau 0000:01:00:0  nvdec0: preinit running...
> >>> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
> >>> nouveau 0000:01:00:0  nvdec0: preinit running...
> >>> nouveau 0000:01:00:0  nvdec0: preinit completed in 0us
> >>> nouveau 0000:01:00:0  sec2: preinit running...
> >>> nouveau 0000:01:00:0  sec2: preinit completed in 0us
> >>> nouveau 0000:01:00:0  fb:.VPR locked, running scrubber binary
> >>>
> >>> These messages appear after the "sd 4:0:0:0 [sda]
Stopping disk" I reported in my initial email.
> >>>
> >>> After the "running scrubber" line appears the
machine is locked and I have to hold down the power button to recover. I
> >>> get the same outcome from running "halt -dip",
"poweroff -di" and "shutdown -h -P now". I guess it's no
surprise that
> >>> all three result in the same outcome because invocations halt,
poweroff and reboot (without the -f argument)from a
> >>> runlevel other than 0 resukt in shutdown being run. switching
to runlevel 0 with "telenit 0" results in the same
> >>> messages from nouveau followed by the lockup.
> >>>
> >>> Let me know if you need any additional diagnostics.
> >>>
> >>> Chris
> >>>
> >>
> >> I've done some more investigation and found that I hadn't
done sufficient amemdment the scripts run at shutdown to
> >> prevent the network being shutdown. I've now got netconsole
captures for 6.2.0-rc6+
> >> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison,
6.1.9. These two logs are attached.
> >>
> >> Chris
> >>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Chris
> >>>>>
> >>>>> <snip>
> >
> >
>

Nouveau - Feb 2023 - linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[Nouveau] linux-6.2-rc4+ hangs on poweroff/reboot: Bisected