Karol Herbst
2023-Jan-14 03:27 UTC
[Nouveau] [REGRESSION] GM20B probe fails after commit 2541626cfb79
On Fri, Jan 13, 2023 at 2:19 PM Linux kernel regression tracking (Thorsten Leemhuis) <regressions at leemhuis.info> wrote:
>
> [CCing Daniel]
>
> On 05.01.23 13:28, Thorsten Leemhuis wrote:
> > [adding Karol and Lyude to the list of recipients]
> >
> > On 28.12.22 15:49, Diogo Ivo wrote:
> >> Hello,
> >>
> >> Commit 2541626cfb79 breaks GM20B probe with
> >> the following kernel log:
> >
> > Just wondering: is anyone looking on this? The report was posted more
> > than a week ago and didn't even get a single reply yet afaics. This of
> > course can happen at this time of the year, but I nevertheless thought a
> > quick status inquiry might be a good idea at this point.
>
> Hmmm, the report is now more that two weeks old and didn't get a single
> reply. My prodding about a week ago also didn't help. Then I guess I
> have to bring this to Linus attention, unless something happens in the
> next 2 days.
>

I tried to look into it, but my jetson nano, just constantly behaves
in very strange ways. I tried to compile and install a 6.1 kernel onto
it, but any kernel just refuses to boot and I have no idea what's up
with that device. The kernel starts to boot and it just stops in the
middle. From what I can tell is that most of the tegra devices never
worked reliably in the first place and there are a couple of random
and strange bugs around. I've attached my dmesg, so if anybody has any
clues why the kernel just stops doing anything, it would really help
me.

But maybe it would be for the best to just pull tegra support out of
nouveau, because in the current situation we really can't spare much
time dealing with them and we are already busy enough just dealing
with the desktop GPUs. And the firmware we got from Nvidia is so
ancient and different from the desktop GPU ones, that without actually
having all those boards available and properly tested, we can't be
sure to not break them.

And afaik there are almost no _actual_ users, just distribution folks
wanting to claim "support" for those devices, but then ending up using
Nvidia's out of tree Tegra driver in deployments anyway.

If there are actual users using them for their daily life, I'd like to
know, because I'm aware of none.

If there are companies/entities actually caring about those devices
running _nouveau_, I'd be happy to keep supporting them, but then only
with proper kernel CI, because the current situation is just not
sustainable.

Ben, Lyude, Dave, Daniel, any thoughts on that?

> Diogo, for that it would be really helpful to known: is the issue still
> happening with latest mainline? Is it possible to revert 2541626cfb79
> easily? And if so: do things work afterwards again?
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> > #regzbot poke
> >
> >> [ 2.153892] ------------[ cut here ]------------
> >> [ 2.153897] WARNING: CPU: 1 PID: 36 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgf100.c:273 gf100_vmm_valid+0x2c4/0x390
> >> [ 2.153916] Modules linked in:
> >> [ 2.153922] CPU: 1 PID: 36 Comm: kworker/u8:1 Not tainted 6.1.0+ #1
> >> [ 2.153929] Hardware name: Google Pixel C (DT)
> >> [ 2.153933] Workqueue: events_unbound deferred_probe_work_func
> >> [ 2.153943] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >> [ 2.153950] pc : gf100_vmm_valid+0x2c4/0x390
> >> [ 2.153959] lr : gf100_vmm_valid+0xb4/0x390
> >> [ 2.153966] sp : ffffffc009e134b0
> >> [ 2.153969] x29: ffffffc009e134b0 x28: 0000000000000000 x27: ffffffc008fd44c8
> >> [ 2.153979] x26: 00000000ffffffea x25: ffffffc0087b98d0 x24: ffffff8080f89038
> >> [ 2.153987] x23: ffffff8081fadc08 x22: 0000000000000000 x21: 0000000000000000
> >> [ 2.153995] x20: ffffff8080f8a000 x19: ffffffc009e13678 x18: 0000000000000000
> >> [ 2.154003] x17: f37a8b93418958e6 x16: ffffffc009f0d000 x15: 0000000000000000
> >> [ 2.154011] x14: 0000000000000002 x13: 000000000003a020 x12: ffffffc008000000
> >> [ 2.154019] x11: 0000000102913000 x10: 0000000000000000 x9 : 0000000000000000
> >> [ 2.154026] x8 : ffffffc009e136d8 x7 : ffffffc008fd44c8 x6 : ffffff80803d0f00
> >> [ 2.154034] x5 : 0000000000000000 x4 : ffffff8080f88c00 x3 : 0000000000000010
> >> [ 2.154041] x2 : 000000000000000c x1 : 00000000ffffffea x0 : 00000000ffffffea
> >> [ 2.154050] Call trace:
> >> [ 2.154053] gf100_vmm_valid+0x2c4/0x390
> >> [ 2.154061] nvkm_vmm_map_valid+0xd4/0x204
> >> [ 2.154069] nvkm_vmm_map_locked+0xa4/0x344
> >> [ 2.154076] nvkm_vmm_map+0x50/0x84
> >> [ 2.154083] nvkm_firmware_mem_map+0x84/0xc4
> >> [ 2.154094] nvkm_falcon_fw_oneinit+0xc8/0x320
> >> [ 2.154101] nvkm_acr_oneinit+0x428/0x5b0
> >> [ 2.154109] nvkm_subdev_oneinit_+0x50/0x104
> >> [ 2.154114] nvkm_subdev_init_+0x3c/0x12c
> >> [ 2.154119] nvkm_subdev_init+0x60/0xa0
> >> [ 2.154125] nvkm_device_init+0x14c/0x2a0
> >> [ 2.154133] nvkm_udevice_init+0x60/0x9c
> >> [ 2.154140] nvkm_object_init+0x48/0x1b0
> >> [ 2.154144] nvkm_ioctl_new+0x168/0x254
> >> [ 2.154149] nvkm_ioctl+0xd0/0x220
> >> [ 2.154153] nvkm_client_ioctl+0x10/0x1c
> >> [ 2.154162] nvif_object_ctor+0xf4/0x22c
> >> [ 2.154168] nvif_device_ctor+0x28/0x70
> >> [ 2.154174] nouveau_cli_init+0x150/0x590
> >> [ 2.154180] nouveau_drm_device_init+0x60/0x2a0
> >> [ 2.154187] nouveau_platform_device_create+0x90/0xd0
> >> [ 2.154193] nouveau_platform_probe+0x3c/0x9c
> >> [ 2.154200] platform_probe+0x68/0xc0
> >> [ 2.154207] really_probe+0xbc/0x2dc
> >> [ 2.154211] __driver_probe_device+0x78/0xe0
> >> [ 2.154216] driver_probe_device+0xd8/0x160
> >> [ 2.154221] __device_attach_driver+0xb8/0x134
> >> [ 2.154226] bus_for_each_drv+0x78/0xd0
> >> [ 2.154230] __device_attach+0x9c/0x1a0
> >> [ 2.154234] device_initial_probe+0x14/0x20
> >> [ 2.154239] bus_probe_device+0x98/0xa0
> >> [ 2.154243] deferred_probe_work_func+0x88/0xc0
> >> [ 2.154247] process_one_work+0x204/0x40c
> >> [ 2.154256] worker_thread+0x230/0x450
> >> [ 2.154261] kthread+0xc8/0xcc
> >> [ 2.154266] ret_from_fork+0x10/0x20
> >> [ 2.154273] ---[ end trace 0000000000000000 ]---
> >> [ 2.154278] nouveau 57000000.gpu: pmu: map -22
> >> [ 2.154285] nouveau 57000000.gpu: acr: one-time init failed, -22
> >> [ 2.154559] nouveau 57000000.gpu: init failed with -22
> >> [ 2.154564] nouveau: DRM-master:00000000:00000080: init failed with -22
> >> [ 2.154574] nouveau 57000000.gpu: DRM-master: Device allocation failed: -22
> >> [ 2.162905] nouveau: probe of 57000000.gpu failed with error -22
> >>
> >> #regzbot introduced: 2541626cfb79
> >>
> >> Thanks,
> >>
> >> Diogo Ivo
> >>
> >>
> >
> > #regzbot poke
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg
Type: application/octet-stream
Size: 18397 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230114/eb4f9645/attachment-0001.obj>
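Editorial note: every failure line in the quoted log ends in -22, and registers x0, x1 and x26 in the register dump all hold 00000000ffffffea. The sketch below, which is not part of the original thread, shows how those two values relate. The helper names `as_signed32` and `describe_kernel_error` are made up for illustration; only Python's standard `errno` and `os` modules are used. Whether x0 really carried the function's return value when the warning fired is an assumption, not something the log states.

```python
import errno
import os


def as_signed32(value: int) -> int:
    """Interpret the low 32 bits of a register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe_kernel_error(code: int) -> str:
    """Map a kernel-style negative return code to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name} ({os.strerror(number)})"


# Register value copied from the quoted trace (x0, x1 and x26).
register_value = 0x00000000FFFFFFEA

code = as_signed32(register_value)
print(code)
print(describe_kernel_error(code))
```

Read this way, "pmu: map -22" and the later "init failed with -22" messages all report the same EINVAL propagating up from the failed mapping.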
Diogo Ivo
2023-Jan-14 16:05 UTC
[Nouveau] [REGRESSION] GM20B probe fails after commit 2541626cfb79
On Sat, Jan 14, 2023 at 04:27:38AM +0100, Karol Herbst wrote:
> I tried to look into it, but my jetson nano, just constantly behaves
> in very strange ways. I tried to compile and install a 6.1 kernel onto
> it, but any kernel just refuses to boot and I have no idea what's up
> with that device. The kernel starts to boot and it just stops in the
> middle. From what I can tell is that most of the tegra devices never
> worked reliably in the first place and there are a couple of random
> and strange bugs around. I've attached my dmesg, so if anybody has any
> clues why the kernel just stops doing anything, it would really help
> me.

Hello,

Thank you for looking into this! I have seen this type of hang in
mainline on this SoC, and it was due to a reset not being deasserted.
Would you mind getting a log with initcall_debug enabled to pinpoint
where the hang occurs? I would be happy to help if I can.

> But maybe it would be for the best to just pull tegra support out of
> nouveau, because in the current situation we really can't spare much
> time dealing with them and we are already busy enough just dealing
> with the desktop GPUs. And the firmware we got from Nvidia is so
> ancient and different from the desktop GPU ones, that without actually
> having all those boards available and properly tested, we can't be
> sure to not break them.
>
> And afaik there are almost no _actual_ users, just distribution folks
> wanting to claim "support" for those devices, but then ending up using
> Nvidia's out of tree Tegra driver in deployments anyway.

> If there are actual users using them for their daily life, I'd like to
> know, because I'm aware of none.

For what it's worth, I consider myself a user of nouveau. Granted, I'm
using it as a hobby project, but in its current state it is not far from
a usable desktop experience on the Pixel C.

Diogo
Karol Herbst
2023-Jan-14 18:56 UTC
[Nouveau] [REGRESSION] GM20B probe fails after commit 2541626cfb79
On Sat, Jan 14, 2023 at 5:07 PM Diogo Ivo <diogo.ivo at tecnico.ulisboa.pt> wrote:
>
> On Sat, Jan 14, 2023 at 04:27:38AM +0100, Karol Herbst wrote:
> > I tried to look into it, but my jetson nano, just constantly behaves
> > in very strange ways. I tried to compile and install a 6.1 kernel onto
> > it, but any kernel just refuses to boot and I have no idea what's up
> > with that device. The kernel starts to boot and it just stops in the
> > middle. From what I can tell is that most of the tegra devices never
> > worked reliably in the first place and there are a couple of random
> > and strange bugs around. I've attached my dmesg, so if anybody has any
> > clues why the kernel just stops doing anything, it would really help
> > me.
>
> Hello,
>
> Thank you for looking into this! I have seen this type of hang in
> mainline on this SoC, and it was due to a reset not being deasserted.
> Would you mind getting a log with initcall_debug enabled to pinpoint
> where the hang occurs? I would be happy to help if I can.
>

the last thing printed is:
[ 20.517642] calling clk_disable_unused+0x0/0xe0 @ 1

> > But maybe it would be for the best to just pull tegra support out of
> > nouveau, because in the current situation we really can't spare much
> > time dealing with them and we are already busy enough just dealing
> > with the desktop GPUs. And the firmware we got from Nvidia is so
> > ancient and different from the desktop GPU ones, that without actually
> > having all those boards available and properly tested, we can't be
> > sure to not break them.
> >
> > And afaik there are almost no _actual_ users, just distribution folks
> > wanting to claim "support" for those devices, but then ending up using
> > Nvidia's out of tree Tegra driver in deployments anyway.
>
> > If there are actual users using them for their daily life, I'd like to
> > know, because I'm aware of none.
>
> For what it's worth, I consider myself a user of nouveau. Granted, I'm
> using it as a hobby project, but in its current state it is not far from
> a usable desktop experience on the Pixel C.
>

okay. I mean, I'm happy to keep fixing regressions and figuring out
what's wrong with booting the devices and such if regular users come
around and file bugs. And until today I wasn't really aware of anybody :)

It's just not worth my time, if there are no users using them at all.
Or rather.. if there would only be commercial users (like.. companies
deploying those for money), then they could get involved and help us
out, because I wouldn't be willing to spend my time on this, if that
would be the case.

> Diogo
>
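
Editorial note: the initcall_debug approach discussed above boils down to finding a "calling ..." line that never gets a matching "initcall ... returned" line. The following is a minimal sketch of that check, not something posted in the thread. The function name `unfinished_initcalls` and the first two sample log lines (`example_init`) are invented for illustration; only the last sample line is taken from the message above. It assumes the usual "calling fn+0x.../0x... @ pid" and "initcall fn+0x.../0x... returned N after M usecs" log format.

```python
import re

# "calling  fn+0x0/0xe0 @ 1" marks the start of an initcall.
CALLING = re.compile(r"calling\s+([^\s+]+)\+0x")
# "initcall fn+0x0/0xe0 returned 0 after 12 usecs" marks its completion.
RETURNED = re.compile(r"initcall\s+([^\s+]+)\+0x.* returned (-?\d+)")


def unfinished_initcalls(log_text: str) -> list[str]:
    """Return the initcalls that were started but never reported a return."""
    pending: list[str] = []
    for line in log_text.splitlines():
        started = CALLING.search(line)
        if started:
            pending.append(started.group(1))
            continue
        finished = RETURNED.search(line)
        if finished and finished.group(1) in pending:
            pending.remove(finished.group(1))
    return pending


# The first two lines are made-up filler; the last is quoted from the thread.
SAMPLE_LOG = """\
[ 20.510000] calling example_init+0x0/0x40 @ 1
[ 20.511000] initcall example_init+0x0/0x40 returned 0 after 12 usecs
[ 20.517642] calling clk_disable_unused+0x0/0xe0 @ 1
"""

print(unfinished_initcalls(SAMPLE_LOG))
```

If clk_disable_unused is indeed where the boot stalls, one general follow-up (not proposed in this thread) is a test boot with the clk_ignore_unused kernel parameter to see whether gating an unused clock is involved.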