thr3ads.net - zfs discuss - [zfs-discuss] What about this status report [Mar 2010]

Harry Putnam

2010-Mar-27 16:18 UTC

[zfs-discuss] What about this status report

What to do with a status report like the one included below?

What does it mean to have an unrecoverable error but no data errors? 

Is it just a matter of `clearing' this device?  But what would have
prompted such a report then?

Also note the numeral 7 in the CKSUM column for device c3d1s0.  What
does it mean?

-------        ---------       ---=---       ---------      -------- 

 zpool status -vx rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 4h44m with 0 errors on Sat Mar 27 07:48:20 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c3d1s0  ONLINE       0     0     7

errors: No known data errors

Bob Friesenhahn

2010-Mar-27 17:00 UTC

[zfs-discuss] What about this status report

On Sat, 27 Mar 2010, Harry Putnam wrote:
> What to do with a status report like the one included below?
>
> What does it mean to have an unrecoverable error but no data errors?
I think that this summary means that the zfs scrub did not encounter 
any reported read/write errors from the disks, but on one of the 
disks, 7 of the returned blocks had a computed checksum error.  This 
could be a problem with the data that the disk previously wrote. 
Perhaps there was an undetected data transfer error, the drive 
firmware glitched, the drive experienced a cache memory glitch, or the 
drive wrote/read data from the wrong track.

If you clear the error information, make sure you keep a record of it 
in case it happens again.
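
For keeping such a record, here is a minimal sketch (not an official tool) that pulls the per-device counters out of saved `zpool status` text and lists the devices with a non-zero READ, WRITE or CKSUM count. The function name `nonzero_counters` and the regular expression are illustrative assumptions; the pattern only handles plain integer counters laid out like the report above.

```python
import re

# Sample text copied from the status report at the top of this thread.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c3d1s0  ONLINE       0     0     7

errors: No known data errors
"""

# One config row: indented name, state, then three integer counters.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def nonzero_counters(status_text):
    """Return {device: counters} for every row with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = {"READ": int(read), "WRITE": int(write), "CKSUM": int(cksum)}
        if any(counts.values()):
            flagged[name] = counts
    return flagged


if __name__ == "__main__":
    print(nonzero_counters(SAMPLE))
```

Saving the result with a date before running 'zpool clear' gives you something to compare against if the counters climb again.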

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Harry Putnam

2010-Mar-27 21:02 UTC

[zfs-discuss] What about this status report

Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
> On Sat, 27 Mar 2010, Harry Putnam wrote:
>
>> What to do with a status report like the one included below?
>>
>> What does it mean to have an unrecoverable error but no data errors?
>
> I think that this summary means that the zfs scrub did not encounter
> any reported read/write errors from the disks, but on one of the
> disks, 7 of the returned blocks had a computed checksum error.  This
> could be a problem with the data that the disk previously
> wrote. Perhaps there was an undetected data transfer error, the drive
> firmware glitched, the drive experienced a cache memory glitch, or the
> drive wrote/read data from the wrong track.
>
> If you clear the error information, make sure you keep a record of it
> in case it happens again.
Thanks.

So it's not a serious matter?  Or maybe more of a potentially serious
matter?

Is there specific documentation somewhere that tells how to read these
status reports?

Giovanni Tirloni

2010-Mar-27 21:13 UTC

[zfs-discuss] What about this status report

On Sat, Mar 27, 2010 at 6:02 PM, Harry Putnam <reader at newsguy.com> wrote:
> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
> > On Sat, 27 Mar 2010, Harry Putnam wrote:
> >
> >> What to do with a status report like the one included below?
> >>
> >> What does it mean to have an unrecoverable error but no data errors?
> >
> > I think that this summary means that the zfs scrub did not encounter
> > any reported read/write errors from the disks, but on one of the
> > disks, 7 of the returned blocks had a computed checksum error.  This
> > could be a problem with the data that the disk previously
> > wrote. Perhaps there was an undetected data transfer error, the drive
> > firmware glitched, the drive experienced a cache memory glitch, or the
> > drive wrote/read data from the wrong track.
> >
> > If you clear the error information, make sure you keep a record of it
> > in case it happens again.
>
> Thanks.
>
> So its not a serious matter?  Or maybe more of a potentially serious
> matter?
>
Not really. That's exactly the kind of problem ZFS is designed to catch.
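
As a rough illustration only (a toy model, not ZFS code), the idea behind a checksum error on one side of a mirror with no data error can be sketched like this: each copy is checked against a stored checksum, a good copy is returned, and any bad copy is rewritten from the good one. The names `checksum` and `mirror_read` are made up for this example, and SHA-256 is just a stand-in; ZFS's actual checksum algorithms and on-disk layout are not modelled here.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum for the toy model."""
    return hashlib.sha256(data).hexdigest()


def mirror_read(copies, expected):
    """Return (data, repaired) for a list of mirror copies.

    copies   -- mutable list of bytes, one entry per mirror side
    expected -- checksum recorded when the block was written
    repaired -- indices of the sides that failed and were rewritten
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # No side verifies: this is the case that would be a data error.
        raise IOError("all copies fail the checksum")
    repaired = []
    for index, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[index] = good  # heal the bad side from the good one
            repaired.append(index)
    return good, repaired


if __name__ == "__main__":
    block = b"some file data"
    sides = [block, b"some file dat\x00"]  # second side is corrupted
    data, repaired = mirror_read(sides, checksum(block))
    print(data == block, repaired)
```

In this model the application still gets correct data and the damaged side is counted and fixed, which matches a non-zero CKSUM count alongside "No known data errors".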

>
> Is there specific documentation somewhere that tells how to read these
> status reports?
>
Your pool is not degraded so I don't think anything will show up in
fmdump.

But check 'fmdump -eV' and see the actual errors that got created. You
could find something there.

-- 
Giovanni

Ian Collins

2010-Mar-27 21:26 UTC

[zfs-discuss] What about this status report

On 03/28/10 10:02 AM, Harry Putnam wrote:
> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>
>> On Sat, 27 Mar 2010, Harry Putnam wrote:
>>
>>> What to do with a status report like the one included below?
>>>
>>> What does it mean to have an unrecoverable error but no data errors?
>>>
>> I think that this summary means that the zfs scrub did not encounter
>> any reported read/write errors from the disks, but on one of the
>> disks, 7 of the returned blocks had a computed checksum error.  This
>> could be a problem with the data that the disk previously
>> wrote. Perhaps there was an undetected data transfer error, the drive
>> firmware glitched, the drive experienced a cache memory glitch, or the
>> drive wrote/read data from the wrong track.
>>
>> If you clear the error information, make sure you keep a record of it
>> in case it happens again.
>>      
> Thanks.
>
> So its not a serious matter?  Or maybe more of a potentially serious
> matter?
>
Not really.  The error has been corrected.

> Is there specific documentation somewhere that tells how to read these
> status reports?
>
If you run a scrub on a pool and an error condition is fixed, the report
will give you a URL to check.

-- 
Ian.

Bob Friesenhahn

2010-Mar-27 22:50 UTC

[zfs-discuss] What about this status report

On Sat, 27 Mar 2010, Harry Putnam wrote:
>
> So its not a serious matter?  Or maybe more of a potentially serious
> matter?
It is difficult to say if this is a serious matter or not.  It should 
not have happened.  The severity depends on the cause of the problem 
(which may be difficult to figure out).   Perhaps you will find out 
what the problem is some day.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Ethan

2010-Mar-28 00:17 UTC

[zfs-discuss] What about this status report

On Sat, Mar 27, 2010 at 18:50, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Sat, 27 Mar 2010, Harry Putnam wrote:
>
>>
>> So its not a serious matter?  Or maybe more of a potentially serious
>> matter?
>>
>
> It is difficult to say if this is a serious matter or not.  It should not
> have happened.  The severity depends on the cause of the problem (which may
> be difficult to figure out).   Perhaps you will find out what the problem
> is some day.
>
> Bob
> --
> Bob Friesenhahn
>
Assuming your drives support SMART, I'd install smartmontools and see if
there are any SMART errors on the drive. While the absence of SMART errors
doesn't mean the drive isn't about to fail, the presence of them can be a
good indicator that the drive is failing.

So, if there are significant SMART errors, replace the drive. If there
aren't any, then I'd keep going and see if you get more checksum errors. If
you do, replace the drive. If you don't, chalk it up to freak random
bit-flipping and forget about it.

I've had trouble getting smartmontools to work with some of my
controllers/drives in opensolaris, and have had better luck just booting
into a linux live cd, sometimes, so that may be something to keep in mind.
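
The rule of thumb above can be written down as a small sketch. The function `advise` and its parameters are hypothetical names for this example; deciding what counts as "significant" SMART errors is left to whoever reads the smartctl output.

```python
def advise(smart_errors_significant: bool, new_cksum_errors: int) -> str:
    """Return "replace" or "ignore" following the rule of thumb above.

    smart_errors_significant -- your own judgement of the SMART report
    new_cksum_errors         -- checksum errors seen since the first report
    """
    if smart_errors_significant:
        return "replace"
    if new_cksum_errors > 0:
        # No SMART errors, but the checksum errors keep coming.
        return "replace"
    # One-off event: treat it as a random bit flip.
    return "ignore"


if __name__ == "__main__":
    print(advise(smart_errors_significant=False, new_cksum_errors=0))
```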

-Ethan

Harry Putnam

2010-Mar-28 14:58 UTC

[zfs-discuss] What about this status report

Ethan <notethan at gmail.com> writes:
> Assuming your drives support SMART, I'd install smartmontools and see if
> there are any SMART errors on the drive. While the absence of SMART errors
[...]
> I've had trouble getting smartmontools to work with some of my
> controllers/drives in opensolaris, and have had better luck just booting
> into a linux live cd, sometimes, so that may be something to keep in mind.
Did you ever get it working on opensolaris?

Tonmaus

2010-Mar-28 20:08 UTC

[zfs-discuss] What about this status report

Yes. Basically working here. All fine under ahci, some problems under mpt
(smartctl says that WD1002fbys wouldn't allow to store smart events,
which I think is probably nonsense.)

Regards,

Tonmaus
-- 
This message posted from opensolaris.org

Harry Putnam

2010-Mar-29 12:24 UTC

[zfs-discuss] What about this status report

Harry Putnam <reader at newsguy.com> writes:
> Ethan <notethan at gmail.com> writes:
>
>> Assuming your drives support SMART, I'd install smartmontools and see if
>> there are any SMART errors on the drive. While the absence of SMART errors
>
> [...]
>
>> I've had trouble getting smartmontools to work with some of my
>> controllers/drives in opensolaris, and have had better luck just booting
>> into a linux live cd, sometimes, so that may be something to keep in mind.
>
> Did you ever get it working on opensolaris?

Tonmaus <sequoiamobil at gmx.net> writes:
> Yes. Basically working here. All fine under ahci, some problems
> under mpt (smartctl says that WD1002fbys wouldn't allow to store
> smart events, which I think is probably nonsense.)

Thanks...   what are ahci and mpt?

Tonmaus

2010-Mar-29 14:58 UTC

[zfs-discuss] What about this status report

Both are driver modules for storage adapters.
Properties can be reviewed in the documentation:
ahci: http://docs.sun.com/app/docs/doc/816-5177/ahci-7d?a=view
mpt: http://docs.sun.com/app/docs/doc/816-5177/mpt-7d?a=view
ahci has a man entry on b133, as well.

cheers,

Tonmaus

Harry Putnam

2010-Mar-29 21:39 UTC

[zfs-discuss] What about this status report

Just to apologize

This not only sounds lame but IS pretty lame.

Somehow in reading the output of `zpool status POOL', I just blew right
by the URL included there:
  http://www.sun.com/msg/ZFS-8000-9P

Which has quite a decent discussion of what it means.
