I am starting to find a way to investigate the fifo: SCHED_ERROR 0a
[CTXSW_TIMEOUT] errors.
See https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/339
I believe this affects mostly Fermi, Kepler and Maxwell1 graphics cards.
I'd like to first describe how I proceed, then discuss splitting
this issue into several.
I am working on Debian Testing with GNOME.
This environment reacts well (no freeze) when programs or XWayland are killed
by the error.
I am using drm-misc, from: https://cgit.freedesktop.org/drm-misc/tree/
which I got with git clone git://cgit.freedesktop.org/drm-misc/tree/
and compiled.
My kernel command line in /etc/default/grub has:
GRUB_CMDLINE_LINUX="pcie_aspm=off nouveau.debug=info nouveau.noaccel=0
drm.debug=0 log_buf_len=8M"
I am not sure whether I am the only one who needs pcie_aspm=off to remove some
AER errors on the PCIe bus.
The first thing I do is:
su -
The - starts a login shell, so root's PATH includes the programs in /usr/sbin.
dmesg --console-off
because I will generate a lot of messages, and I want them only in the log
files, not on the console screen.
I launch Firefox; I get most of the bugs while browsing the web.
When ready to debug I do:
echo 255 > /sys/module/drm/parameters/debug
[At first I was using 2, then 1 as suggested by /usr/sbin/modinfo drm, but then
I concluded that 255, for all categories, is the best way to catch all the
cases that could cause the timeout.]
I browse the web.
When Firefox stops, or everything goes away and returns to gdm (the login
screen), the first thing I do is:
echo 0 > /sys/module/drm/parameters/debug
to stop logging so many messages.
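The two echo commands above could be wrapped in a small function, so the
same path is used for switching on and off. This is only a sketch:
set_drm_debug and the DRM_DEBUG_PARAM override are made-up names, not
something provided by the kernel or any package.

```shell
#!/bin/sh
# Hypothetical helper. DRM_DEBUG_PARAM is an assumed override so the
# function can be tried on a scratch file; by default it writes to the
# same sysfs file as the echo commands above (needs root).
set_drm_debug() {
    param="${DRM_DEBUG_PARAM:-/sys/module/drm/parameters/debug}"
    printf '%s\n' "$1" > "$param"
}

# Usage, as root:
#   set_drm_debug 255   # before reproducing the problem
#   set_drm_debug 0     # right after the timeout shows up
```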
Then I do:
journalctl -b -g SCHED
to find at which second the CTXSW_TIMEOUT message is.
Suppose it is at 08:21:14.
journalctl -b --since 08:21:13 --until 08:21:14
I repeat this, moving --since back by one second each time, until the
CTXSW_TIMEOUT message is no longer the first line.
Let's say I get back to 08:21:09:
journalctl -b --since 08:21:09 --until 08:21:14 -o short-monotonic > err.txt
cp err.txt /home/paul
mv /home/paul/err.txt /home/paul/journalctl_no1.txt
chown paul:paul /home/paul/journalctl_no1.txt
And then, as normal user paul:
gnome-text-editor journalctl_no1.txt &
and I search for SCHED again,
and I look at the lines before it to try to figure out the cause of the timeout.
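As an alternative to widening the --since window one second at a time, the
lines leading up to the first timeout could be cut out by line count. This is
only a sketch: context_before is a made-up name, it relies on GNU grep (-m and
-B), and the number of lines to keep is a guess that has to be tuned, since a
line count is not the same thing as a time window.

```shell
#!/bin/sh
# Hypothetical helper: read a log on stdin and print the first line
# matching $1 together with up to $2 lines before it (default 200).
# -m 1 stops at the first match, -B keeps the preceding lines.
context_before() {
    grep -m 1 -B "${2:-200}" -e "$1"
}

# Usage, as root (2000 is an arbitrary starting point):
#   journalctl -b -o short-monotonic | context_before CTXSW_TIMEOUT 2000 > err.txt
```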
If you take a look at
https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues/339
you can see that what comes before varies quite a bit each time.
I suspect there are many causes that can result in an MMU error on the GPU and
so cause a timeout.
There is also the possibility of an unrelated memory corruption, I suppose.
But if not, it would make some sense to open a different issue for each
different thing happening before the timeout message.
I am not sure if it is really the right thing to do, which is partly why I am
writing this message: to ask for opinions.