Justin Daniel Meyer
2010-Jun-21 16:21 UTC
[zfs-discuss] Many checksum errors during resilver.
I've decided to upgrade my home server capacity by replacing the disks in one of my mirror vdevs. The procedure appeared to work out, but during the resilver a couple million checksum errors were logged on the new device. I've read through quite a bit of the archive and searched around a bit, but cannot find anything definitive to ease my mind on whether to proceed.

SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED   215     0     0
              c1t6d0s0/o  FAULTED      0     0     0  corrupted data
              c1t6d0      ONLINE       0     0   215  3.73M resilvered
            c1t2d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
        logs
          c8t1d0p1        ONLINE       0     0     0
        cache
          c2t1d0p2        ONLINE       0     0     0

During the resilver, the cache device and the ZIL were both removed for errors (1-2k each). (Despite the c2/c8 discrepancy, they are partitions on the same OCZ Vertex II device.)

# zpool status -xv tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0     0
          mirror        ONLINE       0     0     0
            c1t6d0      ONLINE       0     0 2.69M  539G resilvered
            c1t2d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c1t5d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
        logs
          c8t1d0p1      REMOVED      0     0     0
        cache
          c2t1d0p2      REMOVED      0     0     0

I cleared the errors (about 5000/GB resilvered!), removed the cache device, and replaced the ZIL partition with the whole device. After 3 pool scrubs with no errors, I want to check with someone else that it appears okay to replace the second drive in this mirror vdev. The one thing I have not tried is a large file transfer to the server, as I am also dealing with an NFS mount problem which popped up suspiciously close to my most recent patch update.

# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t6d0      ONLINE       0     0     0
            c1t2d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c1t5d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
        logs
          c0t0d0        ONLINE       0     0     0

errors: No known data errors

/var/adm/messages is positively over-run with these triplets/quadruplets, not all of which end up as the "fatal" type.
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Fatal
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command

In the past this kern.notice ID has come up as "informational" for others, and in my case it _only_ occurred during the initial resilver. One last point of interest: the new drive is a WD Green WD10EARS, and the old drives are WD Green WD6400AACS (all of which I have tested on another system with the WD read-test utility).
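A quick way to gauge how much of that /var/adm/messages flood actually ended at the "fatal" level is to count the error-level lines. This is a hedged sketch: the three sample lines below mimic the triplet shown above, and on the real host you would point grep at /var/adm/messages instead.

```shell
# Count Retryable vs Fatal sd14 write errors. The inline sample stands in
# for the real log; on the server, use: grep 'Error Level' /var/adm/messages
sample='Error for Command: write(10)    Error Level: Retryable
Error for Command: write(10)    Error Level: Retryable
Error for Command: write(10)    Error Level: Fatal'

retryable=$(printf '%s\n' "$sample" | grep -c 'Error Level: Retryable')
fatal=$(printf '%s\n' "$sample" | grep -c 'Error Level: Fatal')
echo "retryable=${retryable} fatal=${fatal}"
```

For the sample above this prints retryable=2 fatal=1; a large fatal count against the same block, as seen here, points at the transport or drive rather than at ZFS.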
I know these drives get their share of ridicule (and occasional praise/satisfaction), but I'd appreciate any thoughts on proceeding with the mirror upgrade. [Backups are a check.]

Justin
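For reference, the second-disk swap being contemplated reduces to a short command sequence. This is a sketch only: the device name (c1t2d0) is taken from the zpool output above, and the commands are echoed rather than executed so the sequence can be reviewed before running it on the live pool.

```shell
# Dry-run wrapper: print each command instead of running it.
# Change to run() { "$@"; } on the live system.
run() { echo "+ $*"; }

run zpool offline tank c1t2d0   # take the remaining old disk out of the mirror
# ...physically swap in the new drive, then...
run zpool replace tank c1t2d0   # start the resilver onto the new disk
run zpool status -v tank        # watch the 'replacing' vdev until it completes
```

Given the checksum storm on the first swap, it would be prudent to let the first resilver's root cause be understood before degrading the mirror again.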
Cindy Swearingen
2010-Jun-21 16:47 UTC
[zfs-discuss] Many checksum errors during resilver.
Hi Justin,

This looks like an older Solaris 10 release. If so, this looks like a zpool status display bug, where it appears that the checksum errors are occurring on the replacement device, but they are not.

I would review the steps described in the hardware section of the ZFS troubleshooting wiki to confirm that the new disk is working as expected:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Then, follow the steps in the "Notify FMA That Device Replacement is Complete" section to reset FMA, and start monitoring the replacement device with fmdump to see whether any new activity occurs on this device.

Thanks,

Cindy

On 06/21/10 10:21, Justin Daniel Meyer wrote:
> I've decided to upgrade my home server capacity by replacing the disks in one of my mirror vdevs. The procedure appeared to work out, but during resilver, a couple million checksum errors were logged on the new device. [...]
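Cindy's suggested follow-up might look something like the sketch below. fmadm and fmdump are the standard Solaris fault-management tools, but treat this as an outline rather than an exact recipe: the commands are echoed instead of executed, and fault-specific arguments (such as the FMRI reported by fmadm faulty) are omitted.

```shell
# Dry-run wrapper: print each command instead of running it.
run() { echo "+ $*"; }

run fmadm faulty              # list any outstanding faults against the old disk
run zpool clear tank c1t6d0   # reset the stale checksum counters on the new disk
run fmdump -e                 # fault-management error log; re-check after heavy I/O
```

If fmdump -e stays quiet through a scrub and some sustained write load, the display-bug explanation becomes much more plausible.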
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss