Hi,

today I issued a scrub on one of my zpools and after some time I noticed
that one of the vdevs became degraded due to some drive having cksum
errors. The spare kicked in and the drive got resilvered, but why does the
spare drive now also show almost the same number of cksum errors as the
degraded drive?

root at solaris11c:~# zpool status obelixData
  pool: obelixData
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: resilvered 1,12T in 10h50m with 0 errors on Sun May 27 21:15:32 2012
config:

        NAME                         STATE     READ WRITE CKSUM
        obelixData                   DEGRADED     0     0     0
          mirror-0                   ONLINE       0     0     0
            c9t2100001378AC02DDd1    ONLINE       0     0     0
            c9t2100001378AC02F4d1    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c9t2100001378AC02F4d0    ONLINE       0     0     0
            c9t2100001378AC02DDd0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c9t2100001378AC02DDd2    ONLINE       0     0     0
            c9t2100001378AC02F4d2    ONLINE       0     0     0
          mirror-3                   ONLINE       0     0     0
            c9t2100001378AC02DDd3    ONLINE       0     0     0
            c9t2100001378AC02F4d3    ONLINE       0     0     0
          mirror-4                   ONLINE       0     0     0
            c9t2100001378AC02DDd5    ONLINE       0     0     0
            c9t2100001378AC02F4d5    ONLINE       0     0     0
          mirror-5                   ONLINE       0     0     0
            c9t2100001378AC02DDd4    ONLINE       0     0     0
            c9t2100001378AC02F4d4    ONLINE       0     0     0
          mirror-6                   ONLINE       0     0     0
            c9t2100001378AC02DDd6    ONLINE       0     0     0
            c9t2100001378AC02F4d6    ONLINE       0     0     0
          mirror-7                   ONLINE       0     0     0
            c9t2100001378AC02DDd7    ONLINE       0     0     0
            c9t2100001378AC02F4d7    ONLINE       0     0     0
          mirror-8                   ONLINE       0     0     0
            c9t2100001378AC02DDd8    ONLINE       0     0     0
            c9t2100001378AC02F4d8    ONLINE       0     0     0
          mirror-9                   DEGRADED     0     0     0
            c9t2100001378AC02DDd9    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0    10
              c9t2100001378AC02F4d9  DEGRADED     0     0    22  too many errors
              c9t2100001378AC02BFd1  ONLINE       0     0    23
          mirror-10                  ONLINE       0     0     0
            c9t2100001378AC02DDd10   ONLINE       0     0     0
            c9t2100001378AC02F4d10   ONLINE       0     0     0
          mirror-11                  ONLINE       0     0     0
            c9t2100001378AC02DDd11   ONLINE       0     0     0
            c9t2100001378AC02F4d11   ONLINE       0     0     0
          mirror-12                  ONLINE       0     0     0
            c9t2100001378AC02DDd12   ONLINE       0     0     0
            c9t2100001378AC02F4d12   ONLINE       0     0     0
          mirror-13                  ONLINE       0     0     0
            c9t2100001378AC02DDd13   ONLINE       0     0     0
            c9t2100001378AC02F4d13   ONLINE       0     0     0
          mirror-14                  ONLINE       0     0     0
            c9t2100001378AC02DDd14   ONLINE       0     0     0
            c9t2100001378AC02F4d14   ONLINE       0     0     0
        logs
          mirror-15                  ONLINE       0     0     0
            c9t2100001378AC02D9d0    ONLINE       0     0     0
            c9t2100001378AC02BFd0    ONLINE       0     0     0
        spares
          c9t2100001378AC02BFd1      INUSE     currently in use

What would be the best way to proceed? The drive c9t2100001378AC02BFd1 is
the spare drive, which is tagged as ONLINE, but it shows 23 cksum errors,
while the drive that became degraded only shows 22 cksum errors.

What would be the best procedure to continue? Would one now first run
another scrub and detach the degraded drive afterwards, or detach the
degraded drive immediately and run a scrub afterwards?

Thanks,
budy

-- 
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Frank Wilhelm, Stephan Budach (stellv.)
AG HH HRB 98380
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Stephan Budach
>
> today I issued a scrub on one of my zpools and after some time I noticed
> that one of the vdevs became degraded due to some drive having cksum
> errors. The spare kicked in and the drive got resilvered, but why does
> the spare drive now also show almost the same number of cksum errors as
> the degraded drive?
>
> What would be the best way to proceed? The drive c9t2100001378AC02BFd1
> is the spare drive, which is tagged as ONLINE, but it shows 23 cksum
> errors, while the drive that became degraded only shows 22 cksum errors.
>
> What would be the best procedure to continue? Would one now first run
> another scrub and detach the degraded drive afterwards, or detach the
> degraded drive immediately and run a scrub afterwards?

Either you have two bad disks, or you have (or had) a problem somewhere
that can span disks, such as the bus, the host bus adapter, or RAM.
Remember, you don't necessarily need to have the problem now - if there
was a problem in the past and corrupted data got written to disk, then
later (meaning now) you would run your scrub and get cksum errors, because
of having previously written corrupted data.

Your first step should be to look for obvious hardware conditions, like
overheating or huge piles of dust. Assuming you find none, get a new disk,
and zpool replace it. If the problem persists, you have to assume you have
(or had) a problem with something that spans multiple disks.
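In case it helps, a minimal sketch of that replacement step (c9tNEWd9 is a
placeholder for whatever name the new disk gets, not one of the existing
devices):

    zpool replace obelixData c9t2100001378AC02F4d9 c9tNEWd9   # c9tNEWd9 = placeholder for the new disk
    zpool status -v obelixData      # watch the resilver, then re-check the error counters
    zpool clear obelixData          # reset the counters once the hardware is trusted again

My understanding is that once the replacement has resilvered, the in-use
hot spare should detach back to the spares list on its own.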
On May 27, 2012, at 12:52 PM, Stephan Budach wrote:

> Hi,
>
> today I issued a scrub on one of my zpools and after some time I noticed
> that one of the vdevs became degraded due to some drive having cksum
> errors. The spare kicked in and the drive got resilvered, but why does
> the spare drive now also show almost the same number of cksum errors as
> the degraded drive?

The answer is not available via zpool status. You will need to look at
the FMA diagnosis:
    fmadm faulty

More clues can be found in the FMA error reports:
    fmdump -eV

 -- richard

> [... quoted zpool status output and signature trimmed; see the original
> post above ...]

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
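For the archive: the two commands can be chained. Take the EVENT-ID that
fmadm faulty prints and feed it to fmdump, which then shows only the error
reports behind that particular diagnosis:

    fmadm faulty
    fmdump -eV -u <EVENT-ID>    # <EVENT-ID> = the UUID from the fmadm faulty output

(This is exactly what Stephan does further down in the thread.)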
Am 28.05.12 00:35, schrieb Richard Elling:
>
> On May 27, 2012, at 12:52 PM, Stephan Budach wrote:
>
>> Hi,
>>
>> today I issued a scrub on one of my zpools and after some time I
>> noticed that one of the vdevs became degraded due to some drive
>> having cksum errors. The spare kicked in and the drive got
>> resilvered, but why does the spare drive now also show almost the
>> same number of cksum errors as the degraded drive?
>
> The answer is not available via zpool status. You will need to look at
> the FMA diagnosis:
>     fmadm faulty
>
> More clues can be found in the FMA error reports:
>     fmdump -eV

Thanks - I had taken a look at the FMA diagnosis, but hadn't shared it in
my first post. FMA only shows one instance as of yesterday:

root at solaris11c:~# fmadm faulty |less
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mai 27 10:24:24 f0601f5f-cb8b-67bc-bd63-e71948ea8428  ZFS-8000-GH    Major

Host        : solaris11c
Platform    : SUN-FIRE-X4170-M2-SERVER   Chassis_id  : 1046FMM0NH
Product_sn  : 1046FMM0NH

Fault class : fault.fs.zfs.vdev.checksum
Affects     : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service
Problem in  : zfs://pool=obelixData/vdev=52e3ca377dbdbec9
                  faulted but still providing degraded service

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded. An attempt will be
              made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Mär 15 16:34:52 5ad04cb0-af03-e84b-cd8a-a07aff7aec2c  PCIEX-8000-J5  Major

I thought this to be the instance when the vdev initially got degraded,
and there have been no more errors afterwards while the resilver took
place, so I tend to think that the spare drive is indeed okay.

Thanks,
budy

> -- richard
>
> [... quoted zpool status output trimmed; see the original post above ...]
Hi all,

just to wrap this issue up: as FMA didn't report any other error than the
one which led to the degradation of the one mirror, I detached the
original drive from the zpool, which flagged the mirror vdev as ONLINE
(although there was still a cksum error count of 23 on the spare drive).

Afterwards I attached the formerly degraded drive again to the good drive
in that mirror and let the resilver finish, which didn't show any errors
at all. Finally I detached the former spare drive and re-added it as a
spare drive again.

Now, I will run a scrub once more to verify the zpool.

Cheers,
budy
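For anyone finding this thread later, the steps described above map
roughly onto the following commands - a sketch reconstructed from the
description, with device names taken from the zpool status output earlier
in the thread, not a verbatim transcript:

    zpool detach obelixData c9t2100001378AC02F4d9       # drop the degraded original; the spare keeps serving mirror-9
    zpool attach obelixData c9t2100001378AC02DDd9 c9t2100001378AC02F4d9
                                                        # re-attach it to the good drive and let it resilver
    zpool detach obelixData c9t2100001378AC02BFd1       # release the former spare from the mirror
    zpool add obelixData spare c9t2100001378AC02BFd1    # put it back on the spares list
    zpool scrub obelixData                              # verify the whole pool again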
On May 28, 2012, at 9:21 PM, Stephan Budach wrote:

> Hi all,
>
> just to wrap this issue up: as FMA didn't report any other error than
> the one which led to the degradation of the one mirror, I detached the
> original drive from the zpool, which flagged the mirror vdev as ONLINE
> (although there was still a cksum error count of 23 on the spare drive).

You showed the result of the FMA diagnosis, but not the error reports.
One feature of the error reports on modern Solaris is that the expected
and reported bit images are described, showing the nature and extent of
the corruption.

> Afterwards I attached the formerly degraded drive again to the good
> drive in that mirror and let the resilver finish, which didn't show any
> errors at all. Finally I detached the former spare drive and re-added
> it as a spare drive again.

Good. Perhaps they were transient.

> Now, I will run a scrub once more to verify the zpool.

Good idea.
 -- richard

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Hi Richard,

Am 29.05.12 06:54, schrieb Richard Elling:
>
> On May 28, 2012, at 9:21 PM, Stephan Budach wrote:
>
>> Hi all,
>>
>> just to wrap this issue up: as FMA didn't report any other error than
>> the one which led to the degradation of the one mirror, I detached
>> the original drive from the zpool, which flagged the mirror vdev as
>> ONLINE (although there was still a cksum error count of 23 on the
>> spare drive).
>
> You showed the result of the FMA diagnosis, but not the error reports.
> One feature of the error reports on modern Solaris is that the expected
> and reported bit images are described, showing the nature and extent of
> the corruption.

Are you referring to these errors:

root at solaris11c:~# fmdump -e -u f0601f5f-cb8b-67bc-bd63-e71948ea8428
TIME                 CLASS
Mai 27 10:24:23.3654 ereport.fs.zfs.checksum
Mai 27 10:24:23.3652 ereport.fs.zfs.checksum
Mai 27 10:24:23.3650 ereport.fs.zfs.checksum
Mai 27 10:24:23.3648 ereport.fs.zfs.checksum
Mai 27 10:24:23.3646 ereport.fs.zfs.checksum
Mai 27 10:24:23.2696 ereport.fs.zfs.checksum
Mai 27 10:24:23.2694 ereport.fs.zfs.checksum
Mai 27 10:24:23.2692 ereport.fs.zfs.checksum
Mai 27 10:24:23.2690 ereport.fs.zfs.checksum
Mai 27 10:24:23.2688 ereport.fs.zfs.checksum
Mai 27 10:24:23.2686 ereport.fs.zfs.checksum

And to pick one in detail:

root at solaris11c:~# fmdump -eV -u f0601f5f-cb8b-67bc-bd63-e71948ea8428
TIME                           CLASS
Mai 27 2012 10:24:23.365451280 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xdfb23b0bc9700001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x855ebc6738ef6dd6
                vdev = 0x52e3ca377dbdbec9
        (end detector)

        pool = obelixData
        pool_guid = 0x855ebc6738ef6dd6
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x52e3ca377dbdbec9
        vdev_type = disk
        vdev_path = /dev/dsk/c9t2100001378AC02F4d9s0
        vdev_devid = id1,sd@n2047001378ac02f4/a
        parent_guid = 0x695bf14bdabd6714
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2d8b974600
        zio_size = 0x20000
        zio_objset = 0x81ea9
        zio_object = 0x5594
        zio_level = 0
        zio_blkid = 0x3c
        cksum_expected = 0x12869460bd5d 0x49e4661395e6973
            0xc974c2622ce7a035 0x81fe9ef14082a245
        cksum_actual = 0x1bba2b185478 0x707883eac587dd3
            0x54de998365cc6a8d 0x6822e5f4add45237
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x20000
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x357a5
        bad_range_clears = 0x3935b
        bad_set_histogram = 0x8f3 0xdd4 0x52c 0x13d0 0xd76 0xea0 0xec1
            0x100f 0x8f0 0xdc7 0x51e 0x13e7 0xd6b 0xe87 0xf30 0xf9c 0x8cd
            0xddc 0x51a 0x1458 0xd93 0xf0a 0xf04 0x102d 0x8b4 0xdea 0x51a
            0x141d 0xdd3 0xefc 0xf18 0x1003 0x8bc 0xde9 0x52f 0x13a4 0xdd9
            0xf07 0xea2 0x100d 0x8c1 0xdf4 0x4e6 0x1368 0xdce 0xed9 0xf27
            0x1002 0x8bf 0xdf4 0x4fe 0x1396 0xd7d 0xee0 0xf2b 0xfcc 0x8d8
            0xdd7 0x4fc 0x13b8 0xd8e 0xe8b 0xedb 0x100e
        bad_cleared_histogram = 0x0 0x46 0x211a 0xc77 0x124f 0x1146 0x113b
            0x1020 0x0 0x35 0x20df 0xc9f 0x12dc 0x110c 0x10fc 0x1018 0x0
            0x37 0x2103 0xcbb 0x12a9 0x113d 0x1100 0xf8d 0x0 0x35 0x210d
            0xc6e 0x121a 0x1171 0x108f 0x1020 0x0 0x46 0x20ec 0xc3f 0x12ba
            0x10ce 0x1172 0x1009 0x0 0x47 0x20a4 0xc5e 0x129f 0x1102
            0x112e 0x1031 0x0 0x4a 0x20d1 0xc64 0x126b 0x1159 0x111c
            0x1074 0x0 0x3a 0x20ed 0xc5b 0x1245 0x1160 0x111c 0xfc0
        __ttl = 0x1
        __tod = 0x4fc1e4b7 0x15c85810

They were all from the same vdev_path and ranged through these block IDs:

        zio_blkid = 0x3c
        zio_blkid = 0x3e
        zio_blkid = 0x40
        zio_blkid = 0x3a
        zio_blkid = 0x3d
        zio_blkid = 0xf
        zio_blkid = 0xc
        zio_blkid = 0x10
        zio_blkid = 0x12
        zio_blkid = 0x14
        zio_blkid = 0x11

I really was a bit surprised by the cksum errors on the spare drive,
especially since no errors had been logged for the spare drive while it
was resilvering. We'll see what the scrub will tell us.

Thanks,
budy
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Stephan Budach
>
> Now, I will run a scrub once more to verify the zpool.

If you have a drive (or two drives) with bad sectors, they will only be
detected if the bad sectors actually get used. Given that your pool is
less than 100% full, you might still have bad hardware going undetected
even if you pass your scrub.

You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
bs=1024k) and then, when you're out of disk space, scrubbing again.
(Obviously, you would be unable to make new writes to the pool as long as
it's filled...)

And since certain types of checksum errors will only occur when you
*change* the bits on disk, repeat the same test:

    rm bigfile.junk ; dd if=/dev/urandom of=bigfile.junk bs=1024k

and then scrub again.
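Put together, the cycle would look roughly like this - a sketch only,
assuming the pool is mounted at /obelixData (a guess) and that the
zero-filled blocks actually get written out to disk:

    dd if=/dev/zero of=/obelixData/bigfile.junk bs=1024k      # fill until the pool runs out of space
    zpool scrub obelixData                                    # scrub with (nearly) every sector in use
    rm /obelixData/bigfile.junk
    dd if=/dev/urandom of=/obelixData/bigfile.junk bs=1024k   # rewrite the same space with changing bits
    zpool scrub obelixData
    rm /obelixData/bigfile.junk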
Hi--

You don't say what release this is, but I think that seeing the checksum
error accumulation on the spare was a zpool status formatting bug that I
have seen myself. This is fixed in a later Solaris release.

Thanks,

Cindy

On 05/28/12 22:21, Stephan Budach wrote:
> Hi all,
>
> just to wrap this issue up: as FMA didn't report any other error than
> the one which led to the degradation of the one mirror, I detached the
> original drive from the zpool, which flagged the mirror vdev as ONLINE
> (although there was still a cksum error count of 23 on the spare drive).
>
> Afterwards I attached the formerly degraded drive again to the good
> drive in that mirror and let the resilver finish, which didn't show any
> errors at all. Finally I detached the former spare drive and re-added it
> as a spare drive again.
>
> Now, I will run a scrub once more to verify the zpool.
>
> Cheers,
> budy
On May 29, 2012, at 8:12 AM, Cindy Swearingen wrote:

> Hi--
>
> You don't say what release this is, but I think that seeing the checksum
> error accumulation on the spare was a zpool status formatting bug that I
> have seen myself. This is fixed in a later Solaris release.

Once again, Cindy beats me to it :-)

Verify that the ereports are logged against the original device and not
the spare. If there are no ereports for the spare, then Cindy gets the
prize :-)
 -- richard

> Thanks,
>
> Cindy
>
> [... earlier quoted wrap-up trimmed; see Stephan's post above ...]

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
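One quick way to eyeball that - a sketch, relying on the fact that every
ereport.fs.zfs.checksum record carries a vdev_path field, as seen in the
fmdump output Stephan posted earlier:

    fmdump -eV | grep vdev_path | sort | uniq -c

If the spare's device path never shows up, the 23 errors shown against it
would indeed point to the zpool status formatting bug Cindy mentioned
rather than to a second bad drive.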
Am 29.05.12 18:59, schrieb Richard Elling:
> On May 29, 2012, at 8:12 AM, Cindy Swearingen wrote:
>
>> Hi--
>>
>> You don't say what release this is, but I think that seeing the
>> checksum error accumulation on the spare was a zpool status formatting
>> bug that I have seen myself. This is fixed in a later Solaris release.
>
> Once again, Cindy beats me to it :-)
>
> Verify that the ereports are logged against the original device and not
> the spare. If there are no ereports for the spare, then Cindy gets the
> prize :-)
>  -- richard

Yeah, I verified that the error reports were only logged against the
original device, as I stated earlier, so Cindy wins! :)

Now if I only knew how to get the actual S11 release level of my box.
Neither uname -a nor cat /etc/release gives me a clue, since they display
all the same data when run on different hosts that are on different
updates.

Thanks,
budy
On 2012-May-29 22:04:39 +1000, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> If you have a drive (or two drives) with bad sectors, they will only be
> detected if the bad sectors actually get used. Given that your pool is
> less than 100% full, you might still have bad hardware going undetected
> even if you pass your scrub.

One way around this is to 'dd' each drive to /dev/null (or do a "long"
test using smartmontools). This ensures that the drive thinks all sectors
are readable.

> You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
> bs=1024k) and then, when you're out of disk space, scrubbing again.
> (Obviously, you would be unable to make new writes to the pool as long
> as it's filled...)

I'm not sure how ZFS handles "no large free blocks", so you might need to
repeat this more than once to fill the disk. This could leave your drive
seriously fragmented. If you do try this, I'd recommend creating a
snapshot first and then rolling back to it, rather than just deleting the
junk file. Also, this (obviously) won't work at all on a filesystem with
compression enabled.

-- 
Peter Jeremy
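For example - a sketch only, using one of the pool's disks as the target;
the smartctl options (in particular -d scsi) may need adjusting for the
drive type and whether smartmontools is installed on this box at all:

    dd if=/dev/rdsk/c9t2100001378AC02F4d9s0 of=/dev/null bs=1024k    # read every sector once
    smartctl -d scsi -t long /dev/rdsk/c9t2100001378AC02F4d9s0       # or let the drive test itself
    smartctl -d scsi -l selftest /dev/rdsk/c9t2100001378AC02F4d9s0   # check the self-test result later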
In message <4FC509E8.8080600 at jvm.de>, Stephan Budach writes:
> Now if I only knew how to get the actual S11 release level of my box.
> Neither uname -a nor cat /etc/release gives me a clue, since they
> display all the same data when run on different hosts that are on
> different updates.

$ pkg info entire

John
groenveld at acm.org