thr3ads.net - freebsd stable - stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock [Jul 2017]

If this information is useful, please help other people find it:
Share via:

Eugene Grosbein

2017-Jul-23 13:02 UTC

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

Hi!

Long story short: stable/11 r321371 started to panic at the moment of smartd
invocation
after my SSD died.

I have Intel motherboard with graid-supported pseudo-raid.
I use it in RAID1 mode with one HDD and one SSD.

Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel
(timeouts).
This went unnoticed by me as graid just disconnected it on-the-fly:

kernel: ahcich5: Timeout on slot 24 port 0
kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40 serr
00000000 cmd 0000d817
kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00
00 00 00 00
kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
kernel: (ada1:ahcich5:0:0:0): Retrying command
kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
[skip]
kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0
kernel: ada1: <LITEON CV1-MB512-11 2803> s/n JYKJ550855860139 detached
[skip]
kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00
00 00 00 00
kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated
kernel: GEOM_RAID: Write failed: failing subdisk.
ada1[WRITE(offset=269389066240, length=32768)]
kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to
FAILED.
kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from ACTIVE
to FAILED.
kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to
DEGRADED.
kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to
OFFLINE.
kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from
FAILED to NONE.
kernel: GEOM_RAID: Write failed: failing subdisk.
ada1[WRITE(offset=270699851776, length=32768)]
kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong
state (OFFLINE)!

Unaware of that, I've performed standard source upgrade from 11.1-PRERELEASE
r318692
to stable/11 r321371 that went smooth. After reboot, BIOS was unable to detect
SSD,
reported degraded state of the mirror and booted FreeBSD using second mirror
component (HDD).

After long timeout, the kernel could not detect dead SSD too and continued to
run with degraded mirror
just fine: the system went multiuser mode and almost finished loading when rcNG
started smartd.
The kernel panices that moment. This is repeatable: I can cold-boot to single
user mode, start smartd
and get same panic. This is debugging kernel and I managed to obtain crashdump.

kgdb session follows:

<118>Starting smartd.
Kernel page fault with the following non-sleepable locks held:
exclusive sleep mutex CAM device lock (CAM device lock) r = 0
(0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766
stack backtrace:
#0 0xffffffff80a12620 at witness_debugger+0x70
#1 0xffffffff80a13a4e at witness_warn+0x45e
#2 0xffffffff80e4b693 at trap_pfault+0x53
#3 0xffffffff80e4ae3e at trap+0x29e
#4 0xffffffff80e2ed91 at calltrap+0x8
#5 0xffffffff8033873a at passsendccb+0x6a
#6 0xffffffff80337836 at passdoioctl+0x3c6
#7 0xffffffff80337052 at passioctl+0x22
#8 0xffffffff80878c78 at devfs_ioctl_f+0x138
#9 0xffffffff80a18184 at kern_ioctl+0x2c4
#10 0xffffffff80a17e4f at sys_ioctl+0x16f
#11 0xffffffff80e4c05a at amd64_syscall+0x53a
#12 0xffffffff80e2f07b at Xfast_syscall+0xfb

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xa
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80e494e1
stack pointer           = 0x28:0xfffffe04675ff670
frame pointer           = 0x28:0xfffffe04675ff670
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1387 (smartd)
trap number             = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04675ff250
vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0
panic() at panic+0x43/frame 0xfffffe04675ff330
trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380
trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0
trap() at trap+0x29e/frame 0xfffffe04675ff5a0
calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0
--- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp =
0xfffffe04675ff670 ---
copyin() at copyin+0x41/frame 0xfffffe04675ff670
passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0
passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0
passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0
devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840
kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0
sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980
amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp =
0x7fffffff0308, rbp = 0x7fffffff0900 ---
Uptime: 2m46s
Dumping 723 out of 16285 MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91%

(kgdb) bt
#0  doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299
#1  0xffffffff809b1ab1 in kern_reboot (howto=260) at
/home/src/sys/kern/kern_shutdown.c:376
#2  0xffffffff809b2100 in vpanic (fmt=<value optimized out>, ap=<value
optimized out>) at /home/src/sys/kern/kern_shutdown.c:770
#3  0xffffffff809b2143 in panic (fmt=<value optimized out>) at
/home/src/sys/kern/kern_shutdown.c:701
#4  0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at
/home/src/sys/amd64/amd64/trap.c:801
#5  0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0, usermode=0) at
pcpu.h:222
#6  0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at
/home/src/sys/amd64/amd64/trap.c:421
#7  0xffffffff80e2ed91 in calltrap () at
/home/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff80e494e1 in copyin () at /home/src/sys/amd64/amd64/support.S:304
#9  0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00,
ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at
/home/src/sys/cam/scsi/scsi_pass.c:2172
#10 0xffffffff80337836 in passdoioctl (dev=<value optimized out>,
cmd=<value optimized out>, addr=0xfffff80066d91000 "",
flag=<value optimized out>,
    td=<value optimized out>) at /home/src/sys/cam/scsi/scsi_pass.c:1823
#11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00, cmd=3303020802,
addr=0xfffff80066d91000 "", flag=3, td=0xfffff8000e1a3000)
    at /home/src/sys/cam/scsi/scsi_pass.c:1751
#12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0, com=3303020802,
data=0xfffff80066d91000, cred=0xfffff800066c7200, td=0xfffff8000e1a3000)
    at /home/src/sys/fs/devfs/devfs_vnops.c:791
#13 0xffffffff80a18184 in kern_ioctl (td=<value optimized out>,
fd=<value optimized out>, com=<value optimized out>, data=<value
optimized out>) at file.h:323
#14 0xffffffff80a17e4f in sys_ioctl (td=<value optimized out>,
uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745
#15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0) at
subr_syscall.c:131
#16 0xffffffff80e2f07b in Xfast_syscall () at
/home/src/sys/amd64/amd64/exception.S:396
#17 0x000000080174c18a in ?? ()
Previous frame inner to this frame (corrupt stack?)

I can keep dead SSD in place to debug this. Looking for help.

Eugene Grosbein

2017-Jul-23 13:20 UTC

head link

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

On 23.07.2017 20:02, Eugene Grosbein wrote:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0xa
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80e494e1
> stack pointer           = 0x28:0xfffffe04675ff670
> frame pointer           = 0x28:0xfffffe04675ff670
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1387 (smartd)
> trap number             = 12
> panic: page fault
> cpuid = 0
I also have a screenshot of another case of same panic that notes
lock order reversal: (Giant after non-sleepable):

http://www.grosbein.net/freebsd/crash.png

Eugene Grosbein

2017-Jul-24 17:39 UTC

head link

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

On 23.07.2017 20:02, Eugene Grosbein wrote:> Hi!
> 
> Long story short: stable/11 r321371 started to panic at the moment of
smartd invocation
> after my SSD died.
> 
> I have Intel motherboard with graid-supported pseudo-raid.
> I use it in RAID1 mode with one HDD and one SSD.
> 
> Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel
(timeouts).
> This went unnoticed by me as graid just disconnected it on-the-fly:
> 
> kernel: ahcich5: Timeout on slot 24 port 0
> kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40
serr 00000000 cmd 0000d817
> kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b
00 00 00 00 00
> kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
> kernel: (ada1:ahcich5:0:0:0): Retrying command
> kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd =
00000080)
> [skip]
> kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0
> kernel: ada1: <LITEON CV1-MB512-11 2803> s/n JYKJ550855860139
detached
> [skip]
> kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b
00 00 00 00 00
> kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
> kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated
> kernel: GEOM_RAID: Write failed: failing subdisk.
ada1[WRITE(offset=269389066240, length=32768)]
> kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to
FAILED.
> kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from
ACTIVE to FAILED.
> kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to
DEGRADED.
> kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to
OFFLINE.
> kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed
from FAILED to NONE.
> kernel: GEOM_RAID: Write failed: failing subdisk.
ada1[WRITE(offset=270699851776, length=32768)]
> kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a
wrong state (OFFLINE)!
> 
> Unaware of that, I've performed standard source upgrade from
11.1-PRERELEASE r318692
> to stable/11 r321371 that went smooth. After reboot, BIOS was unable to
detect SSD,
> reported degraded state of the mirror and booted FreeBSD using second
mirror component (HDD).
> 
> After long timeout, the kernel could not detect dead SSD too and continued
to run with degraded mirror
> just fine: the system went multiuser mode and almost finished loading when
rcNG started smartd.
> The kernel panices that moment. This is repeatable: I can cold-boot to
single user mode, start smartd
> and get same panic. This is debugging kernel and I managed to obtain
crashdump.
> 
> kgdb session follows:
> 
> <118>Starting smartd.
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex CAM device lock (CAM device lock) r = 0
(0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766
> stack backtrace:
> #0 0xffffffff80a12620 at witness_debugger+0x70
> #1 0xffffffff80a13a4e at witness_warn+0x45e
> #2 0xffffffff80e4b693 at trap_pfault+0x53
> #3 0xffffffff80e4ae3e at trap+0x29e
> #4 0xffffffff80e2ed91 at calltrap+0x8
> #5 0xffffffff8033873a at passsendccb+0x6a
> #6 0xffffffff80337836 at passdoioctl+0x3c6
> #7 0xffffffff80337052 at passioctl+0x22
> #8 0xffffffff80878c78 at devfs_ioctl_f+0x138
> #9 0xffffffff80a18184 at kern_ioctl+0x2c4
> #10 0xffffffff80a17e4f at sys_ioctl+0x16f
> #11 0xffffffff80e4c05a at amd64_syscall+0x53a
> #12 0xffffffff80e2f07b at Xfast_syscall+0xfb
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0xa
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80e494e1
> stack pointer           = 0x28:0xfffffe04675ff670
> frame pointer           = 0x28:0xfffffe04675ff670
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1387 (smartd)
> trap number             = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
0xfffffe04675ff250
> vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0
> panic() at panic+0x43/frame 0xfffffe04675ff330
> trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380
> trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0
> trap() at trap+0x29e/frame 0xfffffe04675ff5a0
> calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0
> --- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp =
0xfffffe04675ff670 ---
> copyin() at copyin+0x41/frame 0xfffffe04675ff670
> passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0
> passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0
> passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0
> devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840
> kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0
> sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980
> amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0
> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0
> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp =
0x7fffffff0308, rbp = 0x7fffffff0900 ---
> Uptime: 2m46s
> Dumping 723 out of 16285
MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91%
> 
> (kgdb) bt
> #0  doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299
> #1  0xffffffff809b1ab1 in kern_reboot (howto=260) at
/home/src/sys/kern/kern_shutdown.c:376
> #2  0xffffffff809b2100 in vpanic (fmt=<value optimized out>,
ap=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:770
> #3  0xffffffff809b2143 in panic (fmt=<value optimized out>) at
/home/src/sys/kern/kern_shutdown.c:701
> #4  0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at
/home/src/sys/amd64/amd64/trap.c:801
> #5  0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0,
usermode=0) at pcpu.h:222
> #6  0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at
/home/src/sys/amd64/amd64/trap.c:421
> #7  0xffffffff80e2ed91 in calltrap () at
/home/src/sys/amd64/amd64/exception.S:236
> #8  0xffffffff80e494e1 in copyin () at
/home/src/sys/amd64/amd64/support.S:304
> #9  0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00,
ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at
/home/src/sys/cam/scsi/scsi_pass.c:2172
> #10 0xffffffff80337836 in passdoioctl (dev=<value optimized out>,
cmd=<value optimized out>, addr=0xfffff80066d91000 "",
flag=<value optimized out>,
>     td=<value optimized out>) at
/home/src/sys/cam/scsi/scsi_pass.c:1823
> #11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00,
cmd=3303020802, addr=0xfffff80066d91000 "", flag=3,
td=0xfffff8000e1a3000)
>     at /home/src/sys/cam/scsi/scsi_pass.c:1751
> #12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0,
com=3303020802, data=0xfffff80066d91000, cred=0xfffff800066c7200,
td=0xfffff8000e1a3000)
>     at /home/src/sys/fs/devfs/devfs_vnops.c:791
> #13 0xffffffff80a18184 in kern_ioctl (td=<value optimized out>,
fd=<value optimized out>, com=<value optimized out>, data=<value
optimized out>) at file.h:323
> #14 0xffffffff80a17e4f in sys_ioctl (td=<value optimized out>,
uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745
> #15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0)
at subr_syscall.c:131
> #16 0xffffffff80e2f07b in Xfast_syscall () at
/home/src/sys/amd64/amd64/exception.S:396
> #17 0x000000080174c18a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> 
> I can keep dead SSD in place to debug this. Looking for help.
I managed to obtain crashdump for the case and filled PR with details:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220972

freebsd stable - Jul 2017 - stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock