Monty Montgomery
2023-Apr-21 17:36 UTC
[Nouveau] Panic report and patch against master (Quadro FX)
Hiya folks, The new nouveau event/notify code in 6.2.x panics early in device setup here on my Quadro FX hardware, resulting in device setup failure and a black screen. The direct cause appears to be i2c aux IDs no longer being truncated to 8 bits before being used as an index to event->ref[] by nvkm_event_put, nvkm_event_get, and nvkm_event_send as of 773eb04d14a11552b2c3953097ed09cde2ab4831. The now-eliminated nvkm_i2c_intr_ctor built an nvkm_notify struct that stored 'index' as u8, truncating the integer ID to a usable index. The new code no longer constructs a notification ahead of time, simply passing the integer id of the i2c bus/aux/pad/etc directly to nvkm_event_send as int. The ID value routinely stores offests/flags/markers(?) in the upper bytes, and when used directly as an array index, it goes way off the end of event->ref[] and triggers a panic. There's no code to check/guard out-of-range access in ref lookup, so I also added guard/trace code to 6.1 to verify this is a new bug, and did not record any event->ref[] index bounds errors in 6.1.15. I don't understand the actual device setup procedure for an nVidia card, so this surface analysis may be woefully incomplete :-) At the point of panic, we see the i2c code calling event code with an index of 269 (0x010d). Apr 17 01:07:53 boatanchor kernel: nouveau: DRM:00000006:8000000e: ioctl: mthd size 8 Apr 17 01:07:53 boatanchor kernel: nouveau: DRM:00000006:8000000e: ioctl: mthd vers 0 mthd 00 Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event: ntfy allow 00000003 on 269 Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event: ntfy state changed Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event: incr 00000003 on 269 Apr 17 01:07:53 boatanchor kernel: nouveau 0000:01:00.0: i2c: event: allowing 0 on 269 Apr 17 01:07:53 boatanchor kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000 Apr 17 01:07:53 boatanchor kernel: #PF: supervisor instruction fetch in kernel mode Apr 17 01:07:53 boatanchor kernel: #PF: error_code(0x0010) - not-present page Apr 17 01:07:53 boatanchor kernel: PGD 0 P4D 0 Apr 17 01:07:53 boatanchor kernel: Oops: 0010 [#1] PREEMPT SMP PTI Apr 17 01:07:53 boatanchor kernel: CPU: 1 PID: 427 Comm: (udev-worker) Not tainted 6.2.10-200.fc37.x86_64+debug #1 Apr 17 01:07:53 boatanchor kernel: Hardware name: LENOVO 25003BU/25003BU, BIOS 6KET60WW (1.30 ) 10/24/2012 Apr 17 01:07:53 boatanchor kernel: RIP: 0010:0x0 Apr 17 01:07:53 boatanchor kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6. Apr 17 01:07:53 boatanchor kernel: RSP: 0018:ffff967ec1b9f8c0 EFLAGS: 00010086 Apr 17 01:07:53 boatanchor kernel: RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000 Apr 17 01:07:53 boatanchor kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8b69120fc400 Apr 17 01:07:53 boatanchor kernel: RBP: ffff8b69120fc568 R08: ffff8b69120fc400 R09: ffff967ec1b9f690 Apr 17 01:07:53 boatanchor kernel: R10: 0000000000000003 R11: ffffffffb1564968 R12: ffff8b631362ca88 Apr 17 01:07:53 boatanchor kernel: R13: 000000000000010d R14: 0000000000000001 R15: 0000000000000001 Apr 17 01:07:53 boatanchor kernel: FS: 00007fbc0af0cb40(0000) GS:ffff8b6a1e200000(0000) knlGS:0000000000000000 Apr 17 01:07:53 boatanchor kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Apr 17 01:07:53 boatanchor kernel: CR2: ffffffffffffffd6 CR3: 00000001133ea000 CR4: 00000000000006e0 Apr 17 01:07:53 boatanchor kernel: Call Trace: Apr 17 01:07:53 boatanchor kernel: <TASK> Apr 17 01:07:53 boatanchor kernel: nvkm_event_ntfy_state+0x184/0x250 [nouveau] Apr 17 01:07:53 boatanchor kernel: nvkm_event_ntfy_allow+0x5f/0xc0 [nouveau] Apr 17 01:07:53 boatanchor kernel: nvkm_uevent_mthd+0x49/0x70 [nouveau] Apr 17 01:07:53 boatanchor kernel: nvkm_ioctl+0x10a/0x240 [nouveau] Apr 17 01:07:53 boatanchor kernel: nvif_object_mthd+0xcb/0x200 [nouveau] Apr 17 01:07:53 boatanchor kernel: nvif_event_allow+0x26/0xa0 [nouveau] Apr 17 01:07:53 boatanchor kernel: nouveau_display_init+0x71/0x110 [nouveau] Apr 17 01:07:53 boatanchor kernel: nouveau_drm_device_init+0x1d8/0x9b0 [nouveau] Apr 17 01:07:53 boatanchor kernel: ? pci_bus_read_config_word+0x49/0x80 Apr 17 01:07:53 boatanchor kernel: nouveau_drm_probe+0x128/0x280 [nouveau] Apr 17 01:07:53 boatanchor kernel: local_pci_probe+0x41/0x90 Apr 17 01:07:53 boatanchor kernel: pci_device_probe+0xc3/0x230 Apr 17 01:07:53 boatanchor kernel: really_probe+0x1b9/0x410 Apr 17 01:07:53 boatanchor kernel: __driver_probe_device+0x78/0x170 Apr 17 01:07:53 boatanchor kernel: driver_probe_device+0x1f/0x90 Apr 17 01:07:53 boatanchor kernel: __driver_attach+0xd2/0x1c0 Apr 17 01:07:53 boatanchor kernel: ? __pfx___driver_attach+0x10/0x10 Apr 17 01:07:53 boatanchor kernel: bus_for_each_dev+0x8a/0xd0 Apr 17 01:07:53 boatanchor kernel: bus_add_driver+0x141/0x230 Apr 17 01:07:53 boatanchor kernel: driver_register+0x77/0x120 Apr 17 01:07:53 boatanchor kernel: ? __pfx_init_module+0x10/0x10 [nouveau] Apr 17 01:07:53 boatanchor kernel: do_one_initcall+0x70/0x290 Apr 17 01:07:53 boatanchor kernel: do_init_module+0x4a/0x220 Apr 17 01:07:53 boatanchor kernel: __do_sys_init_module+0x192/0x1c0 Apr 17 01:07:53 boatanchor kernel: do_syscall_64+0x5b/0x80 Apr 17 01:07:53 boatanchor kernel: ? lock_release+0x15d/0x400 Apr 17 01:07:53 boatanchor kernel: ? preempt_count_add+0x47/0xa0 Apr 17 01:07:53 boatanchor kernel: ? __up_read+0x98/0x220 Apr 17 01:07:53 boatanchor kernel: ? do_user_addr_fault+0x202/0x730 Apr 17 01:07:53 boatanchor kernel: ? exc_page_fault+0xfc/0x200 Apr 17 01:07:53 boatanchor kernel: ? lockdep_hardirqs_off+0x9c/0xe0 Apr 17 01:07:53 boatanchor kernel: ? asm_exc_page_fault+0x22/0x30 Apr 17 01:07:53 boatanchor kernel: ? lockdep_hardirqs_on+0x7d/0x100 Apr 17 01:07:53 boatanchor kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc Apr 17 01:07:53 boatanchor kernel: RIP: 0033:0x7fbc0b97800e Apr 17 01:07:53 boatanchor kernel: Code: 48 8b 0d 25 5e 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f2 5d 0c 00 f7 d8 64 89 01 48 Apr 17 01:07:53 boatanchor kernel: RSP: 002b:00007fff5dc02b58 EFLAGS: 00000246 ORIG_RAX: 00000000000000af Apr 17 01:07:53 boatanchor kernel: RAX: ffffffffffffffda RBX: 00005613c37bee10 RCX: 00007fbc0b97800e Apr 17 01:07:53 boatanchor kernel: RDX: 00005613c36b7bb0 RSI: 00000000007258f6 RDI: 00005613c3fc1b70 Apr 17 01:07:53 boatanchor kernel: RBP: 00005613c36b7bb0 R08: 00005613c36af860 R09: 00007fff5dc002f6 Apr 17 01:07:53 boatanchor kernel: R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000 Apr 17 01:07:53 boatanchor kernel: R13: 00005613c36ab4e0 R14: 0000000000000000 R15: 00005613c3689610 Apr 17 01:07:53 boatanchor kernel: </TASK> Apr 17 01:07:53 boatanchor kernel: Modules linked in: nouveau(+) raid0 wacom uas usb_storage crc32c_intel serio_raw e1000e sha5 12_ssse3 sdhci_pci cqhci sdhci mmc_core drm_ttm_helper firewire_ohci ttm mxm_wmi firewire_core drm_display_helper crc_itu_t cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse Apr 17 01:07:53 boatanchor kernel: CR2: 0000000000000000 nvkm/core/event.c:nvkm_event_get() is inlined, that's where the panic is happening. Noting that 269 == 0x10d, which matches an i2c pad ID allocated earlier during CCB enumeration, and that earlier we logged only 16 CCBs: Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 00: type 05 drive 00 sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0100: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0000: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 01: type 05 drive 01 sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0101: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0001: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 02: type 05 drive 03 sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0102: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 0002: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 03: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0103: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 04: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0104: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 05: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0105: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 06: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0106: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 07: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0107: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 08: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0108: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 09: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 0109: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0a: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010a: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0b: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010b: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0c: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010c: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0d: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010d: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0e: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010e: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: ccb 0f: type ff drive ff sense ff share ff auxch ff Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 010f: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: dcb 08 drv 00 unknown Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: pad 020d: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: aux 010d: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 010d: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: aux 010d: ctor Apr 17 01:07:51 boatanchor kernel: nouveau 0000:01:00.0: i2c: bus 010d: ctor I added additional debugging logs to determine that the index_nr for the relevant event is only 16, so an index of 269 to event->ref[] is obviously wrong. It appears that the simplest maybe-correct fix is to reinstate the u8 truncation in this argument path, eg: diff -urp a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c --- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c 2023-04-06 06:12:48.000000000 -0400 +++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c 2023-04-21 09:07:59.864884896 -0400 @@ -86,7 +86,7 @@ nvkm_uconn_uevent(struct nvkm_object *ob if (args->v0.types & NVIF_CONN_EVENT_V0_UNPLUG) bits |= NVKM_I2C_UNPLUG; if (args->v0.types & NVIF_CONN_EVENT_V0_IRQ ) bits |= NVKM_I2C_IRQ; - return nvkm_uevent_add(uevent, &device->i2c->event, outp->dp.aux->id, bits, + return nvkm_uevent_add(uevent, &device->i2c->event, (outp->dp.aux->id)&0xff, bits, nvkm_uconn_uevent_aux); } } This assumes the truncation was itself correct/intentional, and not due to some other mistake which should also be corrected. However, a slightly larger patch I've attached (which contains the above fix + a little extra guard/logging) has me back up and running under 6.2.10. Cheers, Monty -------------- next part -------------- A non-text attachment was scrubbed... Name: linux-6.2.10-nouveau_event.patch Type: text/x-patch Size: 4149 bytes Desc: not available URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20230421/191ce0c1/attachment.bin>