Attila Nagy
2012-Oct-20 21:39 UTC
mpt doesn't propagate read errors and dies on a single sector?
Hi,

I have a Sun X4540 with LSI C1068E based SAS controllers (FW version: 1.27.02.00-IT).

My problem is that if one drive starts to fail with read errors, the machine becomes completely unusable (running stable/9 with ZFS), because, it seems, ZFS never sees that there are read errors on the device; the mpt driver (controller? kernel?) keeps re-issuing the operation endlessly.

Here is a verbose (dev.mpt.0.debug=7 level) dump:

mpt0: Address Reply: SCSI IO Request Reply @ 0xffffff87ffcfdc00
        IOC Status       Success
        IOCLogInfo       0x00000000
        MsgLength        0x09
        MsgFlags         0x00
        MsgContext       0x000200eb
        Bus:             0
        TargetID         3
        CDBLength        10
        SCSI Status:     Check Condition
        SCSI State:      (0x00000001)AutoSense_Valid
        TransferCnt      0x20000
        SenseCnt         0x0012
        ResponseInfo     0x00000000
(da3:mpt0:0:3:0): READ(10). CDB: 28 0 3a 38 5d e 0 1 0 0
(da3:mpt0:0:3:0): CAM status: SCSI Status Error
(da3:mpt0:0:3:0): SCSI status: Check Condition
(da3:mpt0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da3:mpt0:0:3:0): Info: 0x3a385d1a
(da3:mpt0:0:3:0): Error 5, Unretryable error
SCSI IO Request @ 0xffffff80003046f0
        Chain Offset      0x00
        MsgFlags          0x00
        MsgContext        0x000200ea
        Bus:              0
        TargetID          3
        SenseBufferLength 32
        LUN:              0x0
        Control           0x02000000 READ SIMPLEQ
        DataLength        0x00020000
        SenseBufAddr      0x0c65d5e0
        CDB[0:10]         28 00 3a 38 5e 0e 00 01 00 00
        SE64 0xffffff87ffd1c430: Addr=0x000000010e858000 FlagsLength=0xd3020000
                64_BIT_ADDRESSING LAST_ELEMENT END_OF_BUFFER END_OF_LIST
mpt0: Address Reply: SCSI IO Request Reply @ 0xffffff87ffcfdd00
        IOC Status       Success
        IOCLogInfo       0x00000000
        MsgLength        0x09
        MsgFlags         0x00
        MsgContext       0x000200ea
        Bus:             0
        TargetID         3
        CDBLength        10
        SCSI Status:     Check Condition
        SCSI State:      (0x00000001)AutoSense_Valid
        TransferCnt      0x20000
        SenseCnt         0x0012
        ResponseInfo     0x00000000

And I get these Check Condition SCSI errors endlessly.

If ZFS is enabled at boot, the machine can't even start because of this (zpool import never finishes). If I boot without ZFS and try to import by hand, the zpool command gets stuck in the vdev_g state:

 1163 root          1  20    0 35440K  5200K vdev_g  6   0:01  0.10% zpool

procstat -k 1163
  PID    TID COMM             TDNAME           KSTACK
 1163 100116 zpool            -                mi_switch sleepq_timedwait _sleep biowait vdev_geom_read_guid vdev_geom_open vdev_open vdev_open_children vdev_raidz_open vdev_open vdev_open_children vdev_root_open vdev_open spa_load spa_tryimport zfs_ioc_pool_tryimport zfsdev_ioctl devfs_ioctl_f

Could it be that GEOM/ZFS doesn't receive this read error and waits indefinitely for the command to complete?
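
For what it's worth, my (simplified) understanding of the read that zpool is sleeping in is the usual synchronous GEOM bio pattern below. This is only a sketch I put together to illustrate the question, not the actual vdev_geom.c code, so the function name and the wait-message string are mine:

/*
 * Sketch of a synchronous GEOM read: if the bio never completes, or
 * completes without bio_error being set on a media error, the caller
 * sleeps in biowait() forever -- which would match the vdev_g wait
 * channel shown by top and the biowait frame in the procstat output.
 */
#include <sys/param.h>
#include <sys/bio.h>
#include <geom/geom.h>

static int
read_one_block(struct g_consumer *cp, off_t offset, void *buf, off_t size)
{
	struct bio *bp;
	int error;

	bp = g_alloc_bio();
	bp->bio_cmd = BIO_READ;
	bp->bio_offset = offset;
	bp->bio_length = size;
	bp->bio_data = buf;
	bp->bio_done = NULL;		/* no callback, we wait synchronously */

	g_io_request(bp, cp);		/* hand the bio down to the provider */
	error = biowait(bp, "vdev_g");	/* sleep until the bio is completed */

	g_destroy_bio(bp);
	return (error);			/* EIO would be expected for an unrecovered read */
}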