Eugene Grosbein
2017-Jul-23 13:02 UTC
stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock
Hi! Long story short: stable/11 r321371 started to panic at the moment of smartd invocation after my SSD died. I have Intel motherboard with graid-supported pseudo-raid. I use it in RAID1 mode with one HDD and one SSD. Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel (timeouts). This went unnoticed by me as graid just disconnected it on-the-fly: kernel: ahcich5: Timeout on slot 24 port 0 kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40 serr 00000000 cmd 0000d817 kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout kernel: (ada1:ahcich5:0:0:0): Retrying command kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) [skip] kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0 kernel: ada1: <LITEON CV1-MB512-11 2803> s/n JYKJ550855860139 detached [skip] kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=269389066240, length=32768)] kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to FAILED. kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from ACTIVE to FAILED. kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to DEGRADED. kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to OFFLINE. kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from FAILED to NONE. kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=270699851776, length=32768)] kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong state (OFFLINE)! Unaware of that, I've performed standard source upgrade from 11.1-PRERELEASE r318692 to stable/11 r321371 that went smooth. After reboot, BIOS was unable to detect SSD, reported degraded state of the mirror and booted FreeBSD using second mirror component (HDD). After long timeout, the kernel could not detect dead SSD too and continued to run with degraded mirror just fine: the system went multiuser mode and almost finished loading when rcNG started smartd. The kernel panices that moment. This is repeatable: I can cold-boot to single user mode, start smartd and get same panic. This is debugging kernel and I managed to obtain crashdump. kgdb session follows: <118>Starting smartd. Kernel page fault with the following non-sleepable locks held: exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766 stack backtrace: #0 0xffffffff80a12620 at witness_debugger+0x70 #1 0xffffffff80a13a4e at witness_warn+0x45e #2 0xffffffff80e4b693 at trap_pfault+0x53 #3 0xffffffff80e4ae3e at trap+0x29e #4 0xffffffff80e2ed91 at calltrap+0x8 #5 0xffffffff8033873a at passsendccb+0x6a #6 0xffffffff80337836 at passdoioctl+0x3c6 #7 0xffffffff80337052 at passioctl+0x22 #8 0xffffffff80878c78 at devfs_ioctl_f+0x138 #9 0xffffffff80a18184 at kern_ioctl+0x2c4 #10 0xffffffff80a17e4f at sys_ioctl+0x16f #11 0xffffffff80e4c05a at amd64_syscall+0x53a #12 0xffffffff80e2f07b at Xfast_syscall+0xfb Fatal trap 12: page fault while in kernel mode cpuid = 0; apic id = 00 fault virtual address = 0xa fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80e494e1 stack pointer = 0x28:0xfffffe04675ff670 frame pointer = 0x28:0xfffffe04675ff670 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 1387 (smartd) trap number = 12 panic: page fault cpuid = 0 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04675ff250 vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0 panic() at panic+0x43/frame 0xfffffe04675ff330 trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380 trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0 trap() at trap+0x29e/frame 0xfffffe04675ff5a0 calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0 --- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp = 0xfffffe04675ff670 --- copyin() at copyin+0x41/frame 0xfffffe04675ff670 passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0 passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0 passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0 devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840 kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0 sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980 amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0 --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp = 0x7fffffff0308, rbp = 0x7fffffff0900 --- Uptime: 2m46s Dumping 723 out of 16285 MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91% (kgdb) bt #0 doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299 #1 0xffffffff809b1ab1 in kern_reboot (howto=260) at /home/src/sys/kern/kern_shutdown.c:376 #2 0xffffffff809b2100 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:770 #3 0xffffffff809b2143 in panic (fmt=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:701 #4 0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at /home/src/sys/amd64/amd64/trap.c:801 #5 0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0, usermode=0) at pcpu.h:222 #6 0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at /home/src/sys/amd64/amd64/trap.c:421 #7 0xffffffff80e2ed91 in calltrap () at /home/src/sys/amd64/amd64/exception.S:236 #8 0xffffffff80e494e1 in copyin () at /home/src/sys/amd64/amd64/support.S:304 #9 0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00, ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at /home/src/sys/cam/scsi/scsi_pass.c:2172 #10 0xffffffff80337836 in passdoioctl (dev=<value optimized out>, cmd=<value optimized out>, addr=0xfffff80066d91000 "", flag=<value optimized out>, td=<value optimized out>) at /home/src/sys/cam/scsi/scsi_pass.c:1823 #11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00, cmd=3303020802, addr=0xfffff80066d91000 "", flag=3, td=0xfffff8000e1a3000) at /home/src/sys/cam/scsi/scsi_pass.c:1751 #12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0, com=3303020802, data=0xfffff80066d91000, cred=0xfffff800066c7200, td=0xfffff8000e1a3000) at /home/src/sys/fs/devfs/devfs_vnops.c:791 #13 0xffffffff80a18184 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>, data=<value optimized out>) at file.h:323 #14 0xffffffff80a17e4f in sys_ioctl (td=<value optimized out>, uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745 #15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0) at subr_syscall.c:131 #16 0xffffffff80e2f07b in Xfast_syscall () at /home/src/sys/amd64/amd64/exception.S:396 #17 0x000000080174c18a in ?? () Previous frame inner to this frame (corrupt stack?) I can keep dead SSD in place to debug this. Looking for help.
Eugene Grosbein
2017-Jul-23 13:20 UTC
stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock
On 23.07.2017 20:02, Eugene Grosbein wrote:> Fatal trap 12: page fault while in kernel mode > cpuid = 0; apic id = 00 > fault virtual address = 0xa > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff80e494e1 > stack pointer = 0x28:0xfffffe04675ff670 > frame pointer = 0x28:0xfffffe04675ff670 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 1387 (smartd) > trap number = 12 > panic: page fault > cpuid = 0I also have a screenshot of another case of same panic that notes lock order reversal: (Giant after non-sleepable): http://www.grosbein.net/freebsd/crash.png
Eugene Grosbein
2017-Jul-24 17:39 UTC
stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock
On 23.07.2017 20:02, Eugene Grosbein wrote:> Hi! > > Long story short: stable/11 r321371 started to panic at the moment of smartd invocation > after my SSD died. > > I have Intel motherboard with graid-supported pseudo-raid. > I use it in RAID1 mode with one HDD and one SSD. > > Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel (timeouts). > This went unnoticed by me as graid just disconnected it on-the-fly: > > kernel: ahcich5: Timeout on slot 24 port 0 > kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40 serr 00000000 cmd 0000d817 > kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 > kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout > kernel: (ada1:ahcich5:0:0:0): Retrying command > kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) > [skip] > kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0 > kernel: ada1: <LITEON CV1-MB512-11 2803> s/n JYKJ550855860139 detached > [skip] > kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 > kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout > kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated > kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=269389066240, length=32768)] > kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to FAILED. > kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from ACTIVE to FAILED. > kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to DEGRADED. > kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to OFFLINE. > kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from FAILED to NONE. > kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=270699851776, length=32768)] > kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong state (OFFLINE)! > > Unaware of that, I've performed standard source upgrade from 11.1-PRERELEASE r318692 > to stable/11 r321371 that went smooth. After reboot, BIOS was unable to detect SSD, > reported degraded state of the mirror and booted FreeBSD using second mirror component (HDD). > > After long timeout, the kernel could not detect dead SSD too and continued to run with degraded mirror > just fine: the system went multiuser mode and almost finished loading when rcNG started smartd. > The kernel panices that moment. This is repeatable: I can cold-boot to single user mode, start smartd > and get same panic. This is debugging kernel and I managed to obtain crashdump. > > kgdb session follows: > > <118>Starting smartd. > Kernel page fault with the following non-sleepable locks held: > exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766 > stack backtrace: > #0 0xffffffff80a12620 at witness_debugger+0x70 > #1 0xffffffff80a13a4e at witness_warn+0x45e > #2 0xffffffff80e4b693 at trap_pfault+0x53 > #3 0xffffffff80e4ae3e at trap+0x29e > #4 0xffffffff80e2ed91 at calltrap+0x8 > #5 0xffffffff8033873a at passsendccb+0x6a > #6 0xffffffff80337836 at passdoioctl+0x3c6 > #7 0xffffffff80337052 at passioctl+0x22 > #8 0xffffffff80878c78 at devfs_ioctl_f+0x138 > #9 0xffffffff80a18184 at kern_ioctl+0x2c4 > #10 0xffffffff80a17e4f at sys_ioctl+0x16f > #11 0xffffffff80e4c05a at amd64_syscall+0x53a > #12 0xffffffff80e2f07b at Xfast_syscall+0xfb > > Fatal trap 12: page fault while in kernel mode > cpuid = 0; apic id = 00 > fault virtual address = 0xa > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff80e494e1 > stack pointer = 0x28:0xfffffe04675ff670 > frame pointer = 0x28:0xfffffe04675ff670 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 1387 (smartd) > trap number = 12 > panic: page fault > cpuid = 0 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04675ff250 > vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0 > panic() at panic+0x43/frame 0xfffffe04675ff330 > trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380 > trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0 > trap() at trap+0x29e/frame 0xfffffe04675ff5a0 > calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0 > --- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp = 0xfffffe04675ff670 --- > copyin() at copyin+0x41/frame 0xfffffe04675ff670 > passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0 > passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0 > passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0 > devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840 > kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0 > sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980 > amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0 > --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp = 0x7fffffff0308, rbp = 0x7fffffff0900 --- > Uptime: 2m46s > Dumping 723 out of 16285 MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91% > > (kgdb) bt > #0 doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299 > #1 0xffffffff809b1ab1 in kern_reboot (howto=260) at /home/src/sys/kern/kern_shutdown.c:376 > #2 0xffffffff809b2100 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:770 > #3 0xffffffff809b2143 in panic (fmt=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:701 > #4 0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at /home/src/sys/amd64/amd64/trap.c:801 > #5 0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0, usermode=0) at pcpu.h:222 > #6 0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at /home/src/sys/amd64/amd64/trap.c:421 > #7 0xffffffff80e2ed91 in calltrap () at /home/src/sys/amd64/amd64/exception.S:236 > #8 0xffffffff80e494e1 in copyin () at /home/src/sys/amd64/amd64/support.S:304 > #9 0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00, ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at /home/src/sys/cam/scsi/scsi_pass.c:2172 > #10 0xffffffff80337836 in passdoioctl (dev=<value optimized out>, cmd=<value optimized out>, addr=0xfffff80066d91000 "", flag=<value optimized out>, > td=<value optimized out>) at /home/src/sys/cam/scsi/scsi_pass.c:1823 > #11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00, cmd=3303020802, addr=0xfffff80066d91000 "", flag=3, td=0xfffff8000e1a3000) > at /home/src/sys/cam/scsi/scsi_pass.c:1751 > #12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0, com=3303020802, data=0xfffff80066d91000, cred=0xfffff800066c7200, td=0xfffff8000e1a3000) > at /home/src/sys/fs/devfs/devfs_vnops.c:791 > #13 0xffffffff80a18184 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>, data=<value optimized out>) at file.h:323 > #14 0xffffffff80a17e4f in sys_ioctl (td=<value optimized out>, uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745 > #15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0) at subr_syscall.c:131 > #16 0xffffffff80e2f07b in Xfast_syscall () at /home/src/sys/amd64/amd64/exception.S:396 > #17 0x000000080174c18a in ?? () > Previous frame inner to this frame (corrupt stack?) > > I can keep dead SSD in place to debug this. Looking for help.I managed to obtain crashdump for the case and filled PR with details: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220972