Marcin Zajączkowski
2019-Dec-19 20:27 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
On 2019-12-16 19:45, Ilia Mirkin wrote:
> The obvious candidate based on a quick scan is
> 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that
> messes with PCI stuff, and there lie dragons. You could try building
> that commit, and if things still work, then I have no idea (and you've

Nice shot, Ilia!

I managed to build a kernel from the suspected bd112af5b8ee and it fails
miserably (as previously described). The build from the previous commit,
86a04561920b, works fine.

> narrowed the range). Also I'd recommend ensuring that the good kernel
> is really good and the bad kernel is really bad -- boot them a few
> times.

Well, this problem is 100% reproducible in newer kernels. I see the
errors in the boot logs, and after logging in to GNOME Shell the first
execution of xrandr (or opening the lid) hangs the system (the graphics
card). On the other hand, I haven't seen that problem in any earlier
kernel. Therefore, the situation is rather clear in my case.
Nevertheless, I will stay with that self-built good kernel
(5.3.0-0.rc3 + git) to check it further.

How do you see it, Ilia? Is there anything in nouveau that needs to be
adjusted to those changes, or do those changes break something in
nouveau, so that it would be best to fix or revert them (and it would
be good to let the committer know about the problem)?

Marcin

> On Mon, Dec 16, 2019 at 12:42 PM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>>
>> On 2019-12-16 18:08, Ilia Mirkin wrote:
>>> Hi Marcin,
>>>
>>> You should do a git bisect rather than guessing about commits. I
>>> suspect that searching for "kernel git bisect fedora" should prove
>>> instructive if you're not sure how to do this.
>>
>> Thanks for your suggestion. I realize that I can do it at the Git level
>> and it is the ultimate way to go. However, building the kernel version
>> from sources takes some time (in addition to the regular time needed to
>> install/restart/verify, which I already experienced narrowing it down to
>> "just" ~250 commits).
>>
>> Therefore, I would be really thankful for a suggestion which commits
>> could be good to check first - having 2, 4 is better than 8-10 (assuming
>> someone is right :) ).
>>
>> Marcin
>>
>>> On Mon, Dec 16, 2019 at 11:42 AM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've encountered a severe regression in TU116 (probably also TU117)
>>>> introduced in 5.3-rc4 (valid also for recent 5.4.2) [1]. The system
>>>> usually hangs on the subsequent graphic mode related operation (calling
>>>> xrandr after login is enough) with the following error:
>>>>
>>>>> kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 08 []
>>>> ...
>>>>> kernel: nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
>>>>> kernel: nouveau 0000:01:00.0: i2c: aux 0007: begin idle timeout ffffffff
>>>>> kernel: nouveau 0000:01:00.0: tmr: stalled at ffffffffffffffff
>>>>> kernel: ------------[ cut here ]------------
>>>>> kernel: nouveau 0000:01:00.0: timeout
>>>>> kernel: WARNING: CPU: 10 PID: 384 at drivers/gpu/drm/nouveau/nvkm/subdev/bar/g84.c:35 g84_bar_flush+0xcf/0xe0 [nouveau]
>>>>
>>>> (detailed log in a corresponding issue - [1])
>>>>
>>>> With earlier kernels there was no hardware acceleration for NVidia GTX
>>>> 1660 Ti, but at least I could use nouveau to disable it (to save
>>>> battery, trees and lower temperature) or even have an external output
>>>> (with Wayland). Now, the system is unusable with nouveau :(.
>>>>
>>>> I spent some time trying to narrow the scope using the existing
>>>> kernel builds for Fedora. I was able to determine that the problem was
>>>> introduced between 5.3.0-0.rc3.git1.1 (commit 33920f1ec5bf - works fine)
>>>> and 5.3.0-0.rc4.git0.1 (tag v5.3-rc4 - fails with errors).
>>>>
>>>> It's just a few days (7-11 Aug) and "only" around 250 commits. I went
>>>> through them, but (based on the commit names) I haven't seen any
>>>> nouveau-related changes and in general no particularly suspicious
>>>> drm-related changes.
>>>>
>>>>> git log 33920f1ec5bf..v5.3-rc4 --stat
>>>>
>>>> Maybe some of the more nouveau/drm-experienced developers could take a
>>>> look at that to determine which commit could break it (to make it easier
>>>> to find out what should be fixed to prevent that regression)?
>>>>
>>>> [1] -
>>>> https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/516
>>>>
>>>> Thanks in advance
>>>> Marcin
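For readers following the bisect suggestion quoted above: with around 250 commits between the good and the bad build, a bisect should need roughly eight build-and-boot rounds, which is in line with the "8-10" mentioned in the thread. The sketch below is only an illustration under that assumption. `start_bisect` and `bisect_rounds` are hypothetical helper names (not part of git), and the git commands assume a clone of the mainline kernel tree.

```shell
#!/bin/sh
# Sketch only: nothing here builds or boots a kernel.

# Start a bisect session. First argument: known-bad revision,
# second argument: known-good revision.
start_bisect() {
    git bisect start "$1" "$2"
}

# Rough number of build/boot rounds for a range of N commits,
# computed as the smallest k with 2^k >= N (hypothetical helper).
bisect_rounds() {
    n=$1
    rounds=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}

# Typical use with the endpoints from the thread (not executed here):
#   start_bisect v5.3-rc4 33920f1ec5bf
#   ... build, install, boot, run xrandr ...
#   git bisect good      # or: git bisect bad
#   git bisect reset     # when finished
```

Calling `bisect_rounds 250` prints 8. On a history with many merges the real count can differ by a round or two.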
Ilia Mirkin
2019-Dec-19 20:38 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
Let's add Mika and Rafael, as they were responsible for that commit.
Mika/Rafael - any ideas? The commit in question is

0617bdede5114a0002298b12cd0ca2b0cfd0395d

Marcin -- would be nice if you could confirm that taking a recent
kernel + "git revert 0617bdede5114a0002298b12cd0ca2b0cfd0395d" works
well for you.

On Thu, Dec 19, 2019 at 3:27 PM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>
> On 2019-12-16 19:45, Ilia Mirkin wrote:
> > The obvious candidate based on a quick scan is
> > 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that
> > messes with PCI stuff, and there lie dragons. You could try building
> > that commit, and if things still work, then I have no idea (and you've
>
> Nice shot, Ilia!
>
> I managed to build a kernel from the suspected bd112af5b8ee and it fails

Took me a while, but this is the end of the hash. Normally you list
the start of the hash (and that's what all the git tools accept). In
this case this is commit

0acf5676dc0ffe0683543a20d5ecbd112af5b8ee

> miserably (as previously described). The build from the previous commit,
> 86a04561920b, works fine.

e577dc152e232c78e5774e4c9b5486a04561920b
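The point about abbreviated hashes can be illustrated with a small sketch. Git resolves an abbreviation only when it is a leading part of the full commit ID, so a trailing fragment such as bd112af5b8ee has to be matched against the list of full IDs by hand. Both helper names below are hypothetical; `git rev-list --all` is a standard command, and the second helper assumes it is run inside a clone that contains the commit.

```shell
#!/bin/sh
# Report whether an abbreviation is the start or the end of a full
# commit ID (pure string comparison, no repository needed).
classify_abbrev() {
    full=$1
    short=$2
    case "$full" in
        "$short"*) echo prefix ;;   # what git tools accept
        *"$short") echo suffix ;;   # needs a manual lookup
        *)         echo none ;;
    esac
}

# Look up full commit IDs that end with the given fragment.
# Assumes the current directory is inside a git repository.
find_by_suffix() {
    git rev-list --all | grep -- "$1\$"
}

# Not executed here:
#   find_by_suffix bd112af5b8ee
```

With the two IDs from this thread, `classify_abbrev 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee bd112af5b8ee` prints `suffix`, and the same holds for 86a04561920b against e577dc152e232c78e5774e4c9b5486a04561920b.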
Marcin Zajączkowski
2019-Dec-19 21:58 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
On 2019-12-19 21:38, Ilia Mirkin wrote:
> Let's add Mika and Rafael, as they were responsible for that commit.
> Mika/Rafael - any ideas? The commit in question is
>
> 0617bdede5114a0002298b12cd0ca2b0cfd0395d
>
> Marcin -- would be nice if you could confirm that taking a recent
> kernel + "git revert 0617bdede5114a0002298b12cd0ca2b0cfd0395d" works
> well for you.

I gave it a try; however, there were subsequent changes in the
neighborhood and I'm not sure how to resolve the conflicts (as of
today's master). Nevertheless, I should be able to test a provided
patch to verify that some assumptions are right.

Marcin

> On Thu, Dec 19, 2019 at 3:27 PM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>>
>> On 2019-12-16 19:45, Ilia Mirkin wrote:
>>> The obvious candidate based on a quick scan is
>>> 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that
>>> messes with PCI stuff, and there lie dragons. You could try building
>>> that commit, and if things still work, then I have no idea (and you've
>>
>> Nice shot, Ilia!
>>
>> I managed to build a kernel from the suspected bd112af5b8ee and it fails
>
> Took me a while, but this is the end of the hash. Normally you list
> the start of the hash (and that's what all the git tools accept). In
> this case this is commit

What a bummer, I knew that...

> 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee
Mika Westerberg
2019-Dec-20 06:05 UTC
[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed
On Thu, Dec 19, 2019 at 03:38:10PM -0500, Ilia Mirkin wrote:
> Let's add Mika and Rafael, as they were responsible for that commit.
> Mika/Rafael - any ideas? The commit in question is
>
> 0617bdede5114a0002298b12cd0ca2b0cfd0395d

This seems to be

  Revert "PCI: Add missing link delays required by the PCIe spec"

Can you try v5.5-rcX without any additional changes? It should include
the same fix done a bit differently (trying to avoid breaking the
systems which caused us to revert the previous one):

  4827d63891b6 PCI/PM: Add pcie_wait_for_link_delay()
  ad9001f2f411 PCI/PM: Add missing link delays required by the PCIe spec
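Before building a release candidate for this test, it may be worth confirming that the tree actually contains the two commits listed above. A minimal sketch, assuming a kernel clone with the relevant tag fetched: `report_fixes` is a hypothetical helper name, while `git merge-base --is-ancestor` is a standard git command that exits with status 0 when the first revision is reachable from the second.

```shell
#!/bin/sh
# Print, for each commit given, whether it is reachable from a tag or
# branch. First argument: the tag or branch, remaining arguments: commits.
# Assumes the current directory is inside a git repository.
report_fixes() {
    rev=$1
    shift
    for c in "$@"; do
        if git merge-base --is-ancestor "$c" "$rev" 2>/dev/null; then
            echo "$c: included in $rev"
        else
            # Also reached when the commit or tag is unknown to this clone.
            echo "$c: not in $rev"
        fi
    done
}

# Not executed here; TAG stands for whichever v5.5 release candidate
# is being tested:
#   report_fixes "$TAG" 4827d63891b6 ad9001f2f411
```

A "not in" result only says the commit is not reachable in the local clone, so a stale or shallow checkout can give a false negative.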