Justin Daniel Meyer
2010-Jun-21 16:21 UTC
[zfs-discuss] Many checksum errors during resilver.
I've decided to upgrade my home server capacity by replacing the disks
in one of my mirror vdevs. The procedure appeared to work out, but during
the resilver, a couple million checksum errors were logged on the new device.
I've read through quite a bit of the archive and searched around a bit,
but cannot find anything definitive to ease my mind on whether to proceed.
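(For context, the swap itself was the standard in-place replace, reconstructed here from the status output below, so treat it as illustrative rather than a transcript:)

```shell
# Swap the old 640 GB disk for the new 1 TB disk in the same slot,
# then tell ZFS to rebuild that mirror member onto the new device:
zpool replace tank c1t6d0

# Watch resilver progress:
zpool status tank
```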
SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:
        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED   215     0     0
              c1t6d0s0/o  FAULTED      0     0     0  corrupted data
              c1t6d0      ONLINE       0     0   215  3.73M resilvered
            c1t2d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
        logs
          c8t1d0p1        ONLINE       0     0     0
        cache
          c2t1d0p2        ONLINE       0     0     0
During the resilver, the cache device and the ZIL were both removed for errors
(1-2k each). (Despite the c2/c8 discrepancy, they are partitions on the same
OCZ Vertex II device.)
# zpool status -xv tank
pool: tank
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with
'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:
        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          mirror      ONLINE       0     0     0
            c1t6d0    ONLINE       0     0 2.69M  539G resilvered
            c1t2d0    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
        logs
          c8t1d0p1    REMOVED      0     0     0
        cache
          c2t1d0p2    REMOVED      0     0     0
I cleared the errors (about 5000/GB resilvered!), removed the cache device, and
replaced the ZIL partition with the whole device. After 3 pool scrubs with no
errors, I would like a second opinion on whether it appears okay to replace the
second drive in this mirror vdev. The one thing I have not tried is a large
file transfer to the server, as I am also dealing with an NFS mount problem
which popped up suspiciously close to my most recent patch update.
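(For reference, the cleanup sequence was roughly the following; device names are the ones in the status output above, and the exact syntax may vary by Solaris 10 update:)

```shell
# Reset the error counters accumulated during the resilver:
zpool clear tank

# Remove the flaky cache (L2ARC) partition:
zpool remove tank c2t1d0p2

# Swap the ZIL partition for the whole SSD:
zpool replace tank c8t1d0p1 c0t0d0

# Scrub and re-check; repeated three times with no errors:
zpool scrub tank
zpool status -v tank
```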
# zpool status -v tank
pool: tank
state: ONLINE
scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
        logs
          c0t0d0    ONLINE       0     0     0
errors: No known data errors
/var/adm/messages is positively over-run with these triplets/quadruplets, not
all of which end up as "fatal" type.
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Fatal
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Requested Block: 26721062    Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]  Sense Key: Aborted Command
In the past this kern.notice ID has come up as "informational" for
others, and in my case it _only_ occurred during the initial resilver. One
last point of interest: the new drive is the WD Green WD10EARS, and the old
drives are WD Green WD6400AACS (all of which I have tested on another system
with the WD read-test utility). I know these drives get their share of ridicule
(and occasional praise/satisfaction), but I'd appreciate any thoughts on
proceeding with the mirror upgrade. [Backups are a check.]
Justin
Cindy Swearingen
2010-Jun-21 16:47 UTC
[zfs-discuss] Many checksum errors during resilver.
Hi Justin,

This looks like an older Solaris 10 release. If so, you are hitting a zpool status display bug: the checksum errors appear to be charged to the replacement device, but they are not actually occurring there.

I would review the steps described in the hardware section of the ZFS troubleshooting wiki to confirm that the new disk is working as expected:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Then follow the steps in the "Notify FMA That Device Replacement is Complete" section to reset FMA, and start monitoring the replacement device with fmdump to see if any new activity occurs on this device.

Thanks,

Cindy
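(The FMA reset and monitoring steps Cindy describes can be sketched as follows; the UUID below is a placeholder that would come from the `fmadm faulty` output, and the authoritative procedure is the wiki section she cites:)

```shell
# List outstanding faults to find the UUID for the replaced disk:
fmadm faulty

# Tell FMA the repair is complete (substitute the real UUID):
fmadm repair <uuid-from-fmadm-faulty>

# Then watch the fault-management error log for new activity
# against the replacement device:
fmdump -e          # one-line summary of error events
fmdump -eV         # verbose detail, including the device path
```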