thr3ads.net - zfs discuss - [zfs-discuss] Drive Checksum error [Dec 2008]

Glaser, David

2008-Dec-16 18:05 UTC

[zfs-discuss] Drive Checksum error

Hi all,

A few weeks ago I asked the group how often to run zfs scrubs of the
pools on our X4500s. It figures that the first time I try a monthly
scrub of our pools, one of the three machines throws an error. On one
of the machines, one disk has registered one checksum error. Sun lists
it as an 'unrecoverable I/O error'. Is it really an unrecoverable
error? Is the drive really bad (i.e., does it warrant a call to Sun for
an RMA of the drive)? Researching the error message, I found that you
can set the threshold of checksum errors before it throws an error, but
I'd figure that one is too many.

So, is there a way to see if it is a bad disk, or just zfs being a pain? Should
I reset the checksum error counter and re-run the scrub?

Thanks
Dave

Tim

2008-Dec-16 18:26 UTC

On Tue, Dec 16, 2008 at 12:05 PM, Glaser, David <dsglaser at umich.edu> wrote:
> Hi all,
>
> A few weeks ago I was inquiring of the group on how often to do zfs scrubs
> of pools on our x4500's. Figures that the first time I try to do a monthly
> scrub of our pools, we get one of the three machines to throw an error. On
> one of the machines, there's one disk that has registered one Checksum
> error. Sun lists it as an 'unrecoverable I/O error'. Is it really an
> unrecoverable error? Is the drive really bad (i.e. warrant a call to SUN for
> an RMA of the drive?)  Researching the error message says that you can set
> the plateau of checksum errors before it throws an error, but I'd figure
> that one is too many.
>
> So, is there a way to see if it is a bad disk, or just zfs being a pain?
> Should I reset the checksum error counter and re-run the scrub?
>

Well, I believe something as simple as a bad block can cause a checksum
error (someone feel free to correct me if I'm wrong).  So while one isn't
necessarily going to kill you, if you see it repeatedly on the same drive,
the drive is likely going to let go.

There shouldn't ever be an instance where zfs would report a checksum error
when the drive really didn't return one.  If there were, I'd consider that a
serious flaw.

--Tim

Jonathan

2008-Dec-16 19:27 UTC

Glaser, David wrote:
> Hi all,
[snipped]
> So, is there a way to see if it is a bad disk, or just zfs being a pain?
> Should I reset the checksum error counter and re-run the scrub?

You could try using smartctl to query the disk directly, although I
don't recall if it works on the x4500.  Normally 1 error is not a big
deal.  Clearing the errors and re-running the scrub would not hurt
anything, and if you get errors again then it may be worth checking the
disk further, perhaps by swapping it with a known good drive to make sure
the disk is the problem and not the cable.

If you start seeing hundreds of errors be sure to check things like the
cable.  I had a SATA cable come loose on a home ZFS fileserver and scrub
was throwing 100s of errors even though the drive itself was fine. I
don't want to think about what could have happened with UFS...

Hope that helps,
Jonathan

Richard Elling

2008-Dec-17 01:04 UTC

Glaser, David wrote:
> Hi all,
>
> A few weeks ago I was inquiring of the group on how often to do zfs
> scrubs of pools on our x4500's. Figures that the first time I try
> to do a monthly scrub of our pools, we get one of the three machines
> to throw an error. On one of the machines, there's one disk that has
> registered one Checksum error. Sun lists it as an 'unrecoverable I/O
> error'. Is it really an unrecoverable error? Is the drive really bad
> (i.e. warrant a call to SUN for an RMA of the drive?)  Researching
> the error message says that you can set the plateau of checksum
> errors before it throws an error, but I'd figure that one is too many.

I presume you mean that a "zpool status" shows a data error?
If so, try "zpool status -xv" to see which file(s) are affected.
If ZFS is managing the redundancy, it should be able to recover
the data.

Depending on the drive, disk drive vendors spec an unrecoverable error
rate (UER) of 1 error for every 1e15 bits read. So it is not really all
that unlikely to see one on a system the size of an X4500, which can
hold ~3.8e14 bits.
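
As a rough check on the figures above, the expected number of unrecoverable read errors for one full read of the system is the number of bits read divided by the bits-per-error spec. The sketch below uses only the two numbers quoted (1 error per 1e15 bits, ~3.8e14 bits) and assumes, as a simplification not stated in the thread, that errors are independent and occur at a constant per-bit rate, so the count per pass is roughly Poisson-distributed. The function name is illustrative.

```python
import math


def uer_estimate(bits_read, bits_per_error):
    """Return (expected errors, probability of at least one error).

    Assumes independent errors at a constant per-bit rate (Poisson model).
    This is a simplification; real drives do not fail this uniformly.
    """
    expected = bits_read / bits_per_error
    p_at_least_one = 1.0 - math.exp(-expected)
    return expected, p_at_least_one


# Figures quoted above: ~3.8e14 bits on the system, 1 error per 1e15 bits read.
expected, p_any = uer_estimate(3.8e14, 1e15)
print(f"expected errors per full read: {expected:.2f}")
print(f"chance of at least one error:  {p_any:.0%}")
```

Under these assumptions, one full pass over a full system expects about 0.38 errors, or roughly a one-in-three chance of seeing at least one. A scrub only reads allocated blocks, so the real figure depends on how full the pools are.
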
> So, is there a way to see if it is a bad disk, or just zfs being a
> pain? Should I reset the checksum error counter and re-run the scrub?

Don't kill the canary!  Check the error logs for more details, and also
make sure you are up-to-date on Marvell SATA controller patches.

Jonathan wrote:
> If you start seeing hundreds of errors be sure to check things like the
> cable.  I had a SATA cable come loose on a home ZFS fileserver and scrub
> was throwing 100's of errors even though the drive itself was fine, I
> don't want to think about what could have happened with UFS...

X4500s don't have any SATA cables :-)
  -- richard

Glaser, David

2008-Dec-17 01:34 UTC

Thanks for the responses.

Richard,

Yes, zpool status returns an error:

# zpool status -xv
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Tue Dec  2 10:50:47 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zpool1      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
	<snip>
          raidz1    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     1
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0
errors: No known data errors

So, there don't appear to be any data errors, probably because the
raidz redundancy has saved the data (and there was much rejoicing).
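
With 48 disks across three machines, picking the one non-zero counter out of the config table by eye gets tedious. The sketch below is one way to flag devices with non-zero READ, WRITE or CKSUM counts, using an excerpt of the output above as sample input. It assumes the five-column layout shown and plain integer counters; rows that do not fit that shape are skipped. The function name is illustrative and not part of any ZFS tool.

```python
# Excerpt of the config table from the status output above.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        zpool1      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     1
            c5t6d0  ONLINE       0     0     0
"""


def devices_with_errors(config_text):
    """Return (name, read, write, cksum) for rows with any non-zero counter."""
    flagged = []
    for line in config_text.splitlines():
        fields = line.split()
        # Expect: NAME STATE READ WRITE CKSUM; skip the header and other rows.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, *counters = fields
        # Assumption: counters are plain integers; skip anything else.
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            flagged.append((name, read, write, cksum))
    return flagged


print(devices_with_errors(SAMPLE))
```

For the excerpt, only c1t6d0 is reported, with a single checksum error.
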

I wasn't really attempting to kill the canary, just making sure it
hadn't merely fallen asleep. By clearing the error and re-running the
scrub, I was hoping to see whether the error was just transient or a
real hardware I/O issue.
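
The clear-and-rescrub check described here can be sketched as a small wrapper around the three zpool subcommands involved. The pool and device names come from the status output above; the function names and the dry_run flag are illustrative. By default the sketch only prints the commands. Note that 'zpool scrub' returns immediately and the scrub runs in the background, so the status command has to be repeated once the scrub has finished.

```python
import subprocess


def recheck_commands(pool, device):
    """Build the commands for clearing one device's counters and re-scrubbing."""
    return [
        ["zpool", "clear", pool, device],  # reset error counters for one device
        ["zpool", "scrub", pool],          # start a new scrub (runs in background)
        ["zpool", "status", "-v", pool],   # show scrub progress and counters
    ]


def recheck(pool, device, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in recheck_commands(pool, device):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Names taken from the status output above; prints the commands only.
recheck("zpool1", "c1t6d0")
```

If the CKSUM counter for the same device climbs again after the new scrub, that points toward a persistent hardware problem and away from a one-off event.
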

I looked through the logs, but Solaris logs are worse than Linux logs for
tracking down hardware errors, heh. Nothing appears to point to issues with
the drives (aside from a couple of entries from pulling a USB CD-ROM from the
machine a couple of weeks ago).

The machine was updated (Solaris 10 U5) as of Nov 22nd, which was our last
scheduled maintenance day. Our next is January 24th. Hopefully then we will be
going to U6.

I guess I'm more wondering how best to determine if it's a
hardware problem on the disk and needs to be replaced.

And I noticed the SATA cable comment, but I wasn't going to point it
out. :)

Dave

