Marcin Zajączkowski
2019-Dec-16 17:42 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
On 2019-12-16 18:08, Ilia Mirkin wrote:
> Hi Marcin,
>
> You should do a git bisect rather than guessing about commits. I
> suspect that searching for "kernel git bisect fedora" should prove
> instructive if you're not sure how to do this.

Thanks for your suggestion. I realize that I can do it at the Git level and that it is ultimately the way to go. However, building the kernel from source takes some time (on top of the usual time needed to install/restart/verify, which I already experienced while narrowing the range down to "just" ~250 commits).

Therefore, I would be really thankful for a suggestion as to which commits would be good to check first - having 2-4 is better than 8-10 (assuming someone is right :) ).

Marcin

> On Mon, Dec 16, 2019 at 11:42 AM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>>
>> Hi,
>>
>> I've encountered a severe regression in TU116 (probably also TU117)
>> introduced in 5.3-rc4 (valid also for recent 5.4.2) [1]. The system
>> usually hangs on the subsequent graphic mode related operation (calling
>> xrandr after login is enough) with the following error:
>>
>>> kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 08 []
>> ...
>>> kernel: nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
>>> kernel: nouveau 0000:01:00.0: i2c: aux 0007: begin idle timeout ffffffff
>>> kernel: nouveau 0000:01:00.0: tmr: stalled at ffffffffffffffff
>>> kernel: ------------[ cut here ]------------
>>> kernel: nouveau 0000:01:00.0: timeout
>>> kernel: WARNING: CPU: 10 PID: 384 at drivers/gpu/drm/nouveau/nvkm/subdev/bar/g84.c:35 g84_bar_flush+0xcf/0xe0 [nouveau]
>>
>> (detailed log in a corresponding issue - [1])
>>
>> With earlier kernels there was no hardware acceleration for NVidia GTX
>> 1660 Ti, but at least I could use nouveau to disable it (to save
>> battery, trees and lower temperature) or even have an external output
>> (with Wayland). Now, the system is unusable with nouveau :(.
>>
>> I spent some time trying to narrow the scope using the existing
>> kernel builds for Fedora. I was able to determine that the problem was
>> introduced between 5.3.0-0.rc3.git1.1 (commit 33920f1ec5bf - works fine)
>> and 5.3.0-0.rc4.git0.1 (tag v5.3-rc4 - fails with errors).
>>
>> It's just a few days (7-11 Aug) and "only" around 250 commits. I went
>> through them, but (based on the commit names) I haven't seen any nouveau
>> related changes and in general no obviously suspicious drm related changes.
>>
>>> git log 33920f1ec5bf..v5.3-rc4 --stat
>>
>> Maybe some of the more nouveau/drm-experienced developers could take a look
>> at that to determine which commit could break it (to make it easier to
>> find out what should be fixed to prevent that regression)?
>>
>> [1] -
>> https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/516
>>
>> Thanks in advance
>> Marcin
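As a rough guide to the trade-off discussed in this message: a bisection halves the candidate range with every build, so the number of build-and-test rounds grows with the base-2 logarithm of the range size. The sketch below only illustrates that arithmetic. `bisect_steps` is a hypothetical helper, not a git command, and 250 is the approximate commit count quoted in the message; on a non-linear history the real number of rounds can differ slightly.

```shell
#!/bin/sh
# bisect_steps N: smallest k with 2^k >= N, i.e. roughly the worst-case
# number of build-and-test rounds needed to bisect a range of N commits.
# Hypothetical helper for illustration only; not part of git.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# ~250 commits is the size of the range quoted in the thread.
bisect_steps 250
```

In a kernel checkout, `git rev-list --count 33920f1ec5bf..v5.3-rc4` would give the exact size of the range to feed into such an estimate.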
Ilia Mirkin
2019-Dec-16 18:45 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
The obvious candidate based on a quick scan is 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that messes with PCI stuff, and there lie dragons. You could try building that commit, and if things still work, then I have no idea (and you've narrowed the range).

Also I'd recommend ensuring that the good kernel is really good and the bad kernel is really bad -- boot them a few times.

Cheers,

-ilia
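For readers unfamiliar with the bisect workflow mentioned in this thread, here is a toy sketch of `git bisect run` in a throwaway repository. Everything in it is an assumption made for illustration: the `state` and `counter` files, the eight commits, and the choice of commit 6 as the one that breaks things. It does not model the kernel tree, where each round would involve building and booting a kernel by hand rather than running an automated check.

```shell
#!/bin/sh
# Toy demonstration of `git bisect run` in a temporary repository.
# File names, commit count and the breaking commit are made up.
set -e

demo_bisect() {
    repo=$(mktemp -d)
    (
        cd "$repo"
        git init -q .
        git config user.name "Bisect Demo"
        git config user.email "demo@example.invalid"

        # Commits 1-5 are "good"; commit 6 introduces the regression.
        for i in 1 2 3 4 5 6 7 8; do
            if [ "$i" -lt 6 ]; then echo good > state; else echo bad > state; fi
            echo "$i" > counter
            git add state counter
            git commit -q -m "commit $i"
        done

        good=$(git rev-list --max-parents=0 HEAD)   # the very first commit
        # Syntax: git bisect start <bad> <good>
        git bisect start HEAD "$good" >/dev/null
        # The test command's exit status 0 marks a commit good, 1 marks it bad.
        git bisect run grep -q good state >/dev/null
        # After the run, refs/bisect/bad points at the first bad commit.
        git log -1 --format=%s refs/bisect/bad
    )
    rm -rf "$repo"
}

demo_bisect
```

With the endpoints named in this thread, the equivalent starting point in a kernel checkout would be `git bisect start v5.3-rc4 33920f1ec5bf`, followed by a manual `git bisect good` or `git bisect bad` after each build and boot test.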
Marcin Zajączkowski
2019-Dec-19 20:27 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
On 2019-12-16 19:45, Ilia Mirkin wrote:
> The obvious candidate based on a quick scan is
> 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that
> messes with PCI stuff, and there lie dragons. You could try building
> that commit, and if things still work, then I have no idea (and you've

Nice shot, Ilia! I managed to build a kernel from the suspected commit 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee and it fails miserably (as previously described). The build from the previous commit, 86a04561920b, works fine.

> narrowed the range). Also I'd recommend ensuring that the good kernel
> is really good and the bad kernel is really bad -- boot them a few
> times.

Well, this problem is 100% reproducible with newer kernels. I see the errors in the boot logs, and after logging in to GNOME Shell the first execution of xrandr (or opening the lid) hangs the system (the graphics card). On the other hand, I haven't seen the problem with any earlier kernel, so the situation is rather clear in my case. Nevertheless, I will stay with the self-built good kernel (5.3.0-0.rc3 + git) to check it further.

How do you see it, Ilia? Is there anything in nouveau that needs to be adjusted to those changes, or do those changes break something in nouveau such that it would be best to fix/revert them (in which case it would be good to let the committer know about the problem)?

Marcin