Justin Daniel Meyer
2010-Jun-21 16:21 UTC
[zfs-discuss] Many checksum errors during resilver.
I've decided to upgrade my home server capacity by replacing the disks in one of my mirror vdevs. The procedure appeared to work out, but during the resilver a couple million checksum errors were logged on the new device. I've read through quite a bit of the archive and searched around a bit, but cannot find anything definitive to ease my mind on whether to proceed.

SunOS deepthought 5.10 Generic_142901-13 i86pc i386 i86pc

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 691h28m to go
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          mirror          DEGRADED     0     0     0
            replacing     DEGRADED   215     0     0
              c1t6d0s0/o  FAULTED      0     0     0  corrupted data
              c1t6d0      ONLINE       0     0   215  3.73M resilvered
            c1t2d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t1d0        ONLINE       0     0     0
            c1t5d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c1t0d0        ONLINE       0     0     0
            c1t4d0        ONLINE       0     0     0
        logs
          c8t1d0p1        ONLINE       0     0     0
        cache
          c2t1d0p2        ONLINE       0     0     0

During the resilver, the cache device and the ZIL were both removed for errors (1-2k each). (Despite the c2/c8 discrepancy, they are partitions on the same OCZ Vertex II device.)

# zpool status -xv tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 9h20m with 0 errors on Sat Jun 19 22:07:27 2010
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     0     0     0
          mirror        ONLINE       0     0     0
            c1t6d0      ONLINE       0     0 2.69M  539G resilvered
            c1t2d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c1t5d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
        logs
          c8t1d0p1      REMOVED      0     0     0
        cache
          c2t1d0p2      REMOVED      0     0     0

I cleared the errors (about 5000/GB resilvered!), removed the cache device, and replaced the ZIL partition with the whole device. After 3 pool scrubs with no errors, I want to check with someone else that it appears okay to replace the second drive in this mirror vdev. The one thing I have not tried is a large file transfer to the server, as I am also dealing with an NFS mount problem which popped up suspiciously close to my most recent patch update.

# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: scrub completed after 3h26m with 0 errors on Mon Jun 21 01:45:00 2010
config:

        NAME            STATE     READ WRITE CKSUM
        tank            ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t6d0      ONLINE       0     0     0
            c1t2d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t1d0      ONLINE       0     0     0
            c1t5d0      ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c1t0d0      ONLINE       0     0     0
            c1t4d0      ONLINE       0     0     0
        logs
          c0t0d0        ONLINE       0     0     0

errors: No known data errors

/var/adm/messages is positively over-run with these triplets/quadruplets, not all of which end up as the "fatal" type.
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       ASC: 0x8 (LUN communication failure), ASCQ: 0x0, FRU: 0x0
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Retryable
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1043,815a@7/disk@1,0 (sd14):
Jun 19 21:43:19 deepthought     Error for Command: write(10)    Error Level: Fatal
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Requested Block: 26721062  Error Block: 26721062
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Vendor: ATA    Serial Number:
Jun 19 21:43:19 deepthought scsi: [ID 107833 kern.notice]       Sense Key: Aborted Command

In the past this kern.notice ID has come up as "informational" for others, and in my case it _only_ occurred during the initial resilver. One last point of interest: the new drive is a WD Green WD10EARS, and the old drives are WD Green WD6400AACS (all of which I have tested on another system with the WD read-test utility).
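A quick way to gauge how much of that /var/adm/messages flood actually ended at the "fatal" level is to count the error-level lines. This is a hedged sketch: the three sample lines below mimic the triplet shown above, and on the real host you would point grep at /var/adm/messages instead.

```shell
# Count Retryable vs Fatal sd14 write errors. The inline sample stands in
# for the real log; on the server, use: grep 'Error Level' /var/adm/messages
sample='Error for Command: write(10)    Error Level: Retryable
Error for Command: write(10)    Error Level: Retryable
Error for Command: write(10)    Error Level: Fatal'

retryable=$(printf '%s\n' "$sample" | grep -c 'Error Level: Retryable')
fatal=$(printf '%s\n' "$sample" | grep -c 'Error Level: Fatal')
echo "retryable=${retryable} fatal=${fatal}"
```

For the sample above this prints retryable=2 fatal=1; a large fatal count against the same block, as seen here, points at the transport or drive rather than at ZFS.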
I know these drives get their share of ridicule (and occasional praise/satisfaction), but I'd appreciate any thoughts on proceeding with the mirror upgrade. [Backups are a check.]

Justin
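For reference, the second-disk swap being contemplated reduces to a short command sequence. This is a sketch only: the device name (c1t2d0) is taken from the zpool output above, and the commands are echoed rather than executed so the sequence can be reviewed before running it on the live pool.

```shell
# Dry-run wrapper: print each command instead of running it.
# Change to run() { "$@"; } on the live system.
run() { echo "+ $*"; }

run zpool offline tank c1t2d0   # take the remaining old disk out of the mirror
# ...physically swap in the new drive, then...
run zpool replace tank c1t2d0   # start the resilver onto the new disk
run zpool status -v tank        # watch the 'replacing' vdev until it completes
```

Given the checksum storm on the first swap, it would be prudent to let the first resilver's root cause be understood before degrading the mirror again.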
Cindy Swearingen
2010-Jun-21 16:47 UTC
[zfs-discuss] Many checksum errors during resilver.
Hi Justin,

This looks like an older Solaris 10 release. If so, this looks like a zpool status display bug, where it appears that the checksum errors are occurring on the replacement device, but they are not.

I would review the steps described in the hardware section of the ZFS troubleshooting wiki to confirm that the new disk is working as expected:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Then, follow the steps in the "Notify FMA That Device Replacement is Complete" section to reset FMA, and start monitoring the replacement device with fmdump to see whether any new activity occurs on this device.

Thanks,

Cindy

On 06/21/10 10:21, Justin Daniel Meyer wrote:
> I've decided to upgrade my home server capacity by replacing the disks in one of my mirror vdevs. The procedure appeared to work out, but during resilver, a couple million checksum errors were logged on the new device. [...]
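Cindy's suggested follow-up might look something like the sketch below. fmadm and fmdump are the standard Solaris fault-management tools, but treat this as an outline rather than an exact recipe: the commands are echoed instead of executed, and fault-specific arguments (such as the FMRI reported by fmadm faulty) are omitted.

```shell
# Dry-run wrapper: print each command instead of running it.
run() { echo "+ $*"; }

run fmadm faulty              # list any outstanding faults against the old disk
run zpool clear tank c1t6d0   # reset the stale checksum counters on the new disk
run fmdump -e                 # fault-management error log; re-check after heavy I/O
```

If fmdump -e stays quiet through a scrub and some sustained write load, the display-bug explanation becomes much more plausible.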
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss