David W. Smith
2007-Mar-14 01:19 UTC
[zfs-discuss] Question: Zpool replace on a disk which is getting errors
I have a large pool and I started getting the following errors on one of the LUNs:

Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk at g600a0b8000115ea20000fedd45e81306 (sd337):
Mar 13 17:52:36 gdo-node-2      Error for Command: write(10)    Error Level: Retryable
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 15782    Error Block: 15782
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK    Serial Number:
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        ASC: 0x84 (<vendor unique code 0x84>), ASCQ: 0x0, FRU: 0x0
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk at g600a0b8000115ea20000fedd45e81306 (sd337):
Mar 13 17:52:37 gdo-node-2      Error for Command: write(10)    Error Level: Retryable
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 885894    Error Block: 885894
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK    Serial Number:
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        ASC: 0x84 (<vendor unique code 0x84>), ASCQ: 0x0, FRU: 0x0
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk at g600a0b8000115ea20000fedd45e81306 (sd337):
Mar 13 17:52:37 gdo-node-2      Error for Command: write(10)    Error Level: Retryable
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 15779    Error Block: 15779
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK    Serial Number:
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error

There were others which were at a "Fatal" error level.

From the hardware side of things this LUN has failed as well. The LUN is actually composed of just a single disk, with the entire disk made into the LUN: 1 LUN / volume / disk. I'm testing various configurations on the hardware side, from R5 volumes to these single-disk volumes.

Back to the issue... I was hoping that the hot spare would kick in, but since that didn't seem to be the case I thought I would try to replace the disk manually. I did the following on this disk, but the errors just keep coming:

zpool replace -f gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0 \
    c8t600A0B800011399600007D6F45E8149Bd0

Originally the replacement disk was one of the spares for this pool, which I think is why I had to use -f. I had removed the disk from the spares just prior to the above zpool replace:

zpool remove gdo-pool-01 c8t600A0B800011399600007D6F45E8149Bd0
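As an aside, I have not yet dug into why the spare never activated on its own. My understanding (an assumption on my part) is that automatic spare activation depends on FMA actually diagnosing the device as faulted, so I would presumably check whether that ever happened with something like:

fmadm faulty
fmdump -e

If FMA never flagged the LUN as faulted, that would at least explain why the spare stayed idle.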
After the replacement the raidz2 group looked like:

bash-3.00# zpool status gdo-pool-01
  pool: gdo-pool-01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed with 0 errors on Tue Mar 13 17:38:21 2007
config:

... <several raidz2 group listings deleted to make this email shorter> ...

          raidz2                                     ONLINE       0     0     0
            c8t600A0B800011399600007CDD45E80D31d0    ONLINE       0     0     0
            spare                                    ONLINE       0     0     0
              c8t600A0B8000115EA20000FEDD45E81306d0  ONLINE      14 121.9     0
              c8t600A0B800011399600007D6F45E8149Bd0  ONLINE       0     0     0
            c8t600A0B800011399600007D0745E80F03d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEF945E814DEd0    ONLINE       0     0     0
            c8t600A0B800011399600007D3145E810E9d0    ONLINE       0     0     0
            c8t600A0B800011399600007D4F45E81263d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FF1F45E8183Ed0    ONLINE       0     0     0
            c8t600A0B800011399600007D6B45E81471d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FE8B45E80D46d0    ONLINE       0     0     0
            c8t600A0B800011399600007C6F45E80927d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEA745E80ED4d0    ONLINE       0     0     0
            c8t600A0B800011399600007C9945E80ABDd0    ONLINE       0     0     0
            c8t600A0B800011399600007CB545E80B81d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEC345E8114Ed0    ONLINE       0     0     0
            c8t600A0B800011399600007CDF45E80D3Fd0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEDF45E81316d0    ONLINE       0     0     0

So even after the replace, the read and write errors continue to accumulate in the zpool status output, and I continue to see errors in /var/adm/messages.

This system is an x4600 running Solaris 10 Update 3, with fairly recent patches applied.

Any advice on what I should have done, or what I can do to make the system stop using the bad LUN, would be appreciated.

Thank you,
David
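P.S. My working assumption for getting ZFS to stop touching the bad LUN, given that the resilver reported 0 errors, is that I still need to detach the original device so the replacement becomes permanent, and then clear the stale error counters:

zpool detach gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0
zpool clear gdo-pool-01

(Or, as a stopgap, 'zpool offline' the device so no further I/O is sent to it.) I haven't run these yet, so if that's the wrong approach please say so.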