Steven Hartland
2012-Oct-27 13:00 UTC
mfi panic: recursed on non-recursive mutex MFI I/O lock
Testing a new machine which is based on 8.3-RELEASE with the mfi driver from 8-STABLE and just got a panic.

The following is a transcription, hand-copied from the console:

mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfisyspd5: hard error cmd=write 90827650-90827905
mfi0: I/O error, status= 46 scsi_status= 240
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfisyspd5: hard error cmd=write 90827394-90827649
mfi0: I/O error, status= 46 scsi_status= 240
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfisyspd5: hard error cmd=write 90827138-90827393
mfi0: I/O error, status= 46 scsi_status= 240
mfi0: sense error 0, sense_key 0, asc 0, ascq 0
mfisyspd5: hard error cmd=write 90826882-90827137
mfi0: I/O error, status= 2 scsi_status= 2
mfi0: sense error 112, sense_key 6, asc 41, ascq 0
mfisyspd4: hard error cmd=write 90830466-90830721
mfi0: I/O error, status= 2 scsi_status= 2
mfi0: sense error 112, sense_key 6, asc 41, ascq 0
mfisyspd5: hard error cmd=write 90830722-90830977
mfi0: Adapter RESET condition detected
mfi0: First state FW reset initiated...
mfi0: ADP_RESET_TBOLT: HostDiag=a0
mfi0: first state of reset complete, second state initiated...
mfi0: Second state FW reset initiated...
panic: _mtx_lock_sleep: recursed on non-recursive mutex MFI I/O lock @ /usr/src/sys/dev/mfi/mfi_tbolt:346
cpuid = 6
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
kdb_backtrace() at kdb_backtrace+0x37
panic() at panic+0x178
_mtx_lock_sleep() at _mtx_lock_sleep+0x152
_mtx_lock_flags() at _mtx_lock_flags+0x80
mfi_tbolt_init_MFI_queue() at mfi_tbolt_init_MFI_queue+0x72
mfi_timeout() at mfi_timeout+0x27
softclock() at softclock+0x2aa
intr_event_execute_handlers() at intr_event_execute_handlers+0x66
ithread_loop() at ithread_loop+0xb2
fork_exit() at fork_exit+0x135
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff80005ccd00, rbp = 0 ---
KDB: enter panic
[thread pid 12 tid 100020 ]
Stopped at kdb_enter+0x3b: movq $0,0x51cb32(%rip)
db>

So, questions:

1. What are the "hard error" errors? The machine was testing I/O with dd, but because of the panic I can't tell whether that was the cause.
2. Looking at the code, it seems the reset was triggered by a firmware bug; is that the case?
3. Is the fix for the panic a simple one we can test?

Regards
Steve

===============================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster at multiplay.co.uk.
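The panic text itself says that the thread running mfi_timeout() already owned the MFI I/O lock when mfi_tbolt_init_MFI_queue() tried to take it again. The pattern can be reproduced in userland as a rough analogy, assuming a POSIX error-checking mutex in place of the kernel mutex. Every name prefixed with demo_ is hypothetical and is not taken from the driver; where the lock is first acquired in the real call path is not shown in the trace, so the "locked" caller below is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the driver softc; only the lock matters here. */
struct demo_softc {
	pthread_mutex_t io_lock;	/* plays the role of "MFI I/O lock" */
};

/* Create an error-checking mutex: relocking by the owner fails with EDEADLK. */
static int
demo_softc_init(struct demo_softc *sc)
{
	pthread_mutexattr_t attr;
	int error;

	error = pthread_mutexattr_init(&attr);
	if (error != 0)
		return (error);
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (error == 0)
		error = pthread_mutex_init(&sc->io_lock, &attr);
	pthread_mutexattr_destroy(&attr);
	return (error);
}

/* Stand-in for the queue init routine in the backtrace: it locks internally. */
static int
demo_init_queue(struct demo_softc *sc)
{
	int error;

	error = pthread_mutex_lock(&sc->io_lock);
	if (error != 0)
		return (error);	/* EDEADLK: the caller already owns the lock */
	/* ... queue re-initialisation would happen here ... */
	pthread_mutex_unlock(&sc->io_lock);
	return (0);
}

/* Assumed shape of the failing path: the handler holds the lock, then calls in. */
static int
demo_timeout_locked(struct demo_softc *sc)
{
	int error;

	error = pthread_mutex_lock(&sc->io_lock);
	assert(error == 0);
	error = demo_init_queue(sc);	/* second acquisition by the same thread */
	pthread_mutex_unlock(&sc->io_lock);
	return (error);
}

/* Same call made without the lock held: the inner acquisition succeeds. */
static int
demo_timeout_unlocked(struct demo_softc *sc)
{
	return (demo_init_queue(sc));
}
```

With a struct demo_softc initialised by demo_softc_init(), demo_timeout_locked() returns EDEADLK where the kernel would panic, and demo_timeout_unlocked() returns 0. Whether the right change in the driver is to drop the lock in the caller or to stop taking it in the callee depends on the other callers of the init routine, which this sketch does not model.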
Steven Hartland
2012-Nov-05 16:55 UTC
mfi panic: recursed on non-recursive mutex MFI I/O lock
I've managed to get the machine to reproduce this fairly regularly now.

Without a debug kernel it still panics, just at a later stage, or so I believe. The non-debug panic message is "command not in queue". In each non-debug panic I've seen, cm_flags indicates that the command being dequeued is on the busy queue rather than on the free or ready queue being processed at the time.

The trigger seems to be the adapter reset code run from mfi_timeout.

I've had a good look but can't see how a cm could be on one queue yet have its cm_flags set to those of a different queue, as all manipulation seems to be done via the "mfi_<method> ## name" macros, which all correctly maintain the queue / cm_flags relationship.

At this point I believe it could be a thread being interrupted by a timeout partway through processing a queue request, leaving the queue and cm_flags out of sync.

Any pointers on how to debug this further or fix it would be most appreciated.

Regards
Steve

----- Original Message -----
From: "Steven Hartland"
> Testing a new machine which is based on 8.3-RELEASE with the mfi
> driver from 8-STABLE and just got a panic.
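The queue / cm_flags relationship described in the follow-up can be sketched in userland with the TAILQ macros from sys/queue.h. This is a minimal illustration, not the driver's code: the demo_ names and flag values are hypothetical, and the functions return -1 where the kernel helpers would panic. The demo_audit() helper is one possible debugging aid: called at the points where the lock is taken or released, it would show which queue first holds a command whose flags disagree with it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical flag bits, one per queue; the real driver's names may differ. */
#define DEMO_ON_FREE	0x1u
#define DEMO_ON_READY	0x2u
#define DEMO_ON_BUSY	0x4u
#define DEMO_ON_MASK	(DEMO_ON_FREE | DEMO_ON_READY | DEMO_ON_BUSY)

struct demo_cmd {
	TAILQ_ENTRY(demo_cmd) link;	/* queue linkage */
	unsigned int flags;		/* which queue the command claims to be on */
};
TAILQ_HEAD(demo_cmdq, demo_cmd);

/* Insert cm on q and record membership; refuse if it is already queued. */
static int
demo_enqueue(struct demo_cmdq *q, struct demo_cmd *cm, unsigned int bit)
{
	assert(bit != 0 && (bit & ~DEMO_ON_MASK) == 0);
	if ((cm->flags & DEMO_ON_MASK) != 0)
		return (-1);	/* already on some queue */
	TAILQ_INSERT_TAIL(q, cm, link);
	cm->flags |= bit;
	return (0);
}

/* Remove cm from q; refuse if its flags say it is not on this queue. */
static int
demo_remove(struct demo_cmdq *q, struct demo_cmd *cm, unsigned int bit)
{
	if ((cm->flags & bit) == 0)
		return (-1);	/* analogous to "command not in queue" */
	TAILQ_REMOVE(q, cm, link);
	cm->flags &= ~bit;
	return (0);
}

/* Debug aid: count entries on q whose flags disagree with the queue's bit. */
static int
demo_audit(struct demo_cmdq *q, unsigned int bit)
{
	struct demo_cmd *cm;
	int bad = 0;

	TAILQ_FOREACH(cm, q, link) {
		if ((cm->flags & DEMO_ON_MASK) != bit)
			bad++;
	}
	return (bad);
}
```

If a command sits on the free queue while its flags are overwritten to the busy bit, demo_audit() on the free queue reports one mismatch and demo_remove() with the free bit fails, which mirrors the symptom reported above. The sketch is single-threaded, so it only demonstrates the invariant; it does not reproduce the suspected interruption by the timeout.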