Innocenti Maresin
2015-Oct-05 08:30 UTC
[Nouveau] Struggle with GPU lockups and console deadlock using kernel-space modifications
Hello.

I have a poorly functioning GeForce 8600 GTS (rev a1) video card that causes many problems for the box where it’s installed, primarily GPU lockups (sometimes unprovoked), several times a day. Without intervention, a GPU lockup leaves the system console unusable, keyboard included, because switching from Xorg to TUI becomes obstructed (see below). Upgrading Linux from 3.16 to 4.3 and improving the cooling reduced some minor problems (such as snow), but didn’t prevent the lockups. So here are some experiences and reflections on mitigating GPU lockups.

First, I hoped to use the bus hardware reset control, found at /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/reset on my system. The #nouveau channel on freenode offered some insights afterwards. But maneuvers with unloading and reloading the nouveau module after a lockup did not succeed, for several reasons.

I envisaged the following sequence that could recover a computer from a GPU lockup, with the system console usable again and applications not disturbed too much.

1. Suspend all applications using the video card.
2. If necessary, perform a hardware reset on the bus.
3. If necessary, run the initialization functions for the video card.
4. Restore a usable video mode (can be achieved by switching virtual consoles Xorg ⇔ TUI, for example).
5. Resume work.

Implementation became a challenge. First, loading nouveau with config=NvForcePost=1 doesn’t result in a usable console, either during Linux startup or later. With my hardware it produces no video signal at all. I tested it with at least three different versions of nouveau on both Linux 3.16 and Linux 4.3.

There are major problems with performing step 4 from kernel mode. The Linux kernel has the «set_console(nr)» function (from vt.c), but doesn’t export it. Moreover, in modern kernels (apparently since Linux 3) even this internal kernel function checks «vt_dont_switch», hence a deadlock can ensue.
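For comparison, steps 2 and 4 of the sequence above have user-space counterparts: the sysfs reset attribute and the VT_ACTIVATE / VT_WAITACTIVE ioctls that chvt(1) uses. Below is a minimal Python sketch of both, assuming root privileges and a Linux system. The function names (pci_reset_path, pci_reset, switch_vt) are illustrative only and belong neither to nouveau nor to the kernel. Nothing here is claimed to work on a card that is already locked up.

```python
import fcntl
import os

# ioctl request numbers from <linux/vt.h>
VT_ACTIVATE = 0x5606    # make a virtual terminal the active one
VT_WAITACTIVE = 0x5607  # block until that terminal is active


def pci_reset_path(bdf, sysfs_root="/sys"):
    """Return the sysfs reset attribute for a PCI device, e.g. 0000:01:00.0."""
    return os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "reset")


def pci_reset(bdf, sysfs_root="/sys"):
    """Step 2: ask the kernel to reset the device by writing 1 to 'reset'."""
    with open(pci_reset_path(bdf, sysfs_root), "w") as f:
        f.write("1")


def switch_vt(number, tty="/dev/tty0"):
    """Step 4: switch to a virtual console, the way chvt(1) does."""
    fd = os.open(tty, os.O_RDWR)
    try:
        fcntl.ioctl(fd, VT_ACTIVATE, number)
        # Waits for the switch to complete; may block if the switch stalls.
        fcntl.ioctl(fd, VT_WAITACTIVE, number)
    finally:
        os.close(fd)
```

A possible use would be pci_reset("0000:01:00.0") followed by switch_vt(1) and then a switch back to the Xorg console. Which VT number Xorg occupies is an assumption to check per system. The sysfs_root parameter exists only so the reset helper can be tried against a scratch directory instead of the real /sys.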
Other functions are not exported either, even such high-level ones as «suspend_console()» and «resume_console()». I was not even able to find out their real addresses in memory using /proc/kallsyms: it shows only zeros (that wasn’t the case with Linux 2.6).

Some partial experiments, made using remote shell access, are described below.

With any module and no lockup:
• stop Xorg; echo 0 >/sys/class/vtconsole/vtcon1/bind; rmmod nouveau; modprobe nouveau: success.

With standard nouveau module on Linux 4.3:
• stop Xorg; reset device; modprobe nouveau: the module won’t initialize.
• stop Xorg; reset device; reload module (as above): won’t work, symptoms differ from case to case.

With modified nouveau modules, after a lockup:
• «nvkm_device_init(⧦);» (with ⧦->devinit->post = false): no effect.
• reset device while Xorg runs: system crash or deadlock, nothing in the logs.
• «⧦->devinit->post = true; nvkm_device_init(⧦);» without reset: to be tested.
(«⧦» points to the card’s «struct nvkm_device» object.)

My modified nouveau module was derived from git://people.freedesktop.org/~darktama/nouveau

Proposals and suggestions? Please think generally and don’t focus too much on my particular case. My video card (and, possibly, some related stuff on the motherboard) almost certainly malfunctions.

Regards, Incnis Mrsi