Hi all,

A few weeks ago I asked the group how often to run zfs scrubs of the pools on our x4500's. It figures that the first time I try a monthly scrub of our pools, one of the three machines throws an error. On one of the machines, there's one disk that has registered one Checksum error. Sun lists it as an 'unrecoverable I/O error'. Is it really an unrecoverable error? Is the drive really bad (i.e. does it warrant a call to Sun for an RMA of the drive)? Researching the error message suggests that you can set the threshold of checksum errors before it throws an error, but I'd figure that one is too many.

So, is there a way to see if it is a bad disk, or just zfs being a pain? Should I reset the checksum error counter and re-run the scrub?

Thanks,
Dave
On Tue, Dec 16, 2008 at 12:05 PM, Glaser, David <dsglaser at umich.edu> wrote:
> Hi all,
>
> A few weeks ago I was inquiring of the group on how often to do zfs scrubs
> of pools on our x4500's. Figures that the first time I try to do a monthly
> scrub of our pools, we get one of the three machines to throw an error. On
> one of the machines, there's one disk that has registered one Checksum
> error. Sun lists it as an 'unrecoverable I/O error'. Is it really an
> unrecoverable error? Is the drive really bad (i.e. warrant a call to SUN for
> an RMA of the drive?) Researching the error message says that you can set
> the plateau of checksum errors before it throws an error, but I'd figure
> that one is too many.
>
> So, is there a way to see if it is a bad disk, or just zfs being a pain?
> Should I reset the checksum error counter and re-run the scrub?

Well, I believe something as simple as a bad block can cause a checksum error (someone feel free to correct me if I'm wrong). So while one isn't necessarily going to kill you, if you see it repeatedly on the same drive, the drive is likely going to let go.

There shouldn't ever be an instance where zfs would report a checksum error when the drive really didn't return one. If there were, I'd consider that a serious flaw.

--Tim
Glaser, David wrote:
> Hi all,
[snipped]
> So, is there a way to see if it is a bad disk, or just zfs being a pain?
> Should I reset the checksum error counter and re-run the scrub?

You could try using smartctl to query the disk directly, although I don't recall if it works on the x4500.

Normally 1 error is not a big deal. Clearing the errors and re-running the scrub would not hurt anything, and if you get errors again then it may be worth checking the disk further, perhaps by swapping it with a known good drive to make sure the disk is the problem and not the cable.

If you start seeing hundreds of errors be sure to check things like the cable. I had a SATA cable come loose on a home ZFS fileserver and scrub was throwing 100's of errors even though the drive itself was fine. I don't want to think about what could have happened with UFS...

Hope that helps,
Jonathan
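The "clear, re-scrub, re-check" cycle described above could be sketched roughly as follows. This is only an illustration: the helper names `run` and `recheck` and the `DRY_RUN` switch are made up for this sketch, and the pool and device names in the usage line are the ones that appear in the zpool status output elsewhere in this thread. Only `zpool clear`, `zpool scrub` and `zpool status -v` are real commands.

```shell
#!/bin/sh
# Sketch only: clear the counters on one device, re-scrub, then re-check.
# Set DRY_RUN=1 to print the commands instead of running them.

# run: hypothetical wrapper that either echoes or executes a command.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recheck POOL DEVICE (hypothetical helper name)
recheck() {
    pool=$1
    disk=$2
    # Reset the error counters for the suspect device only.
    run zpool clear "$pool" "$disk"
    # Start a new scrub; the command returns while the scrub keeps running.
    run zpool scrub "$pool"
    # Show per-device counters; look again once the scrub has completed.
    run zpool status -v "$pool"
}
```

For example, `DRY_RUN=1 recheck zpool1 c1t6d0` prints the three commands without touching the pool. If the CKSUM counter on the same device climbs again after the second scrub, that would point toward the disk (or its path) rather than a one-off event.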
Glaser, David wrote:
> Hi all,
>
> A few weeks ago I was inquiring of the group on how often to do zfs
> scrubs of pools on our x4500's. Figures that the first time I try
> to do a monthly scrub of our pools, we get one of the three machines
> to throw an error. On one of the machines, there's one disk that has
> registered one Checksum error. Sun lists it as an 'unrecoverable I/O
> error'. Is it really an unrecoverable error? Is the drive really bad
> (i.e. warrant a call to SUN for an RMA of the drive?) Researching
> the error message says that you can set the plateau of checksum
> errors before it throws an error, but I'd figure that one is too many.

I presume you mean that a "zpool status" shows a data error? If so, try "zpool status -xv" to see which file(s) are affected. If ZFS is managing the redundancy, it should be able to recover the data.

Depending on the drive, disk drive vendors spec 1 UER for every 1e15 bits read. So it is not really all that unlikely to see them on a system the size of an X4500, which can hold ~3.8e14 bits.

> So, is there a way to see if it is a bad disk, or just zfs being a
> pain? Should I reset the checksum error counter and re-run the scrub?

Don't kill the canary! Check the error logs for more details, and also make sure you are up-to-date on Marvell SATA controller patches.

Jonathan wrote:
> If you start seeing hundreds of errors be sure to check things like the
> cable. I had a SATA cable come loose on a home ZFS fileserver and scrub
> was throwing 100's of errors even though the drive itself was fine, I
> don't want to think about what could have happened with UFS...

X4500s don't have any SATA cables :-)
 -- richard
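The two figures quoted above can be turned into a rough estimate. The calculation below assumes that unrecoverable errors occur independently at the quoted rate and that one pass reads every bit the system can hold. Both are simplifying assumptions for illustration, not claims made in the thread, and a scrub of a partly filled pool reads less, so this is closer to an upper bound.

```latex
\lambda \;=\; \frac{3.8 \times 10^{14}\ \text{bits read}}{10^{15}\ \text{bits per error}} \;=\; 0.38
\qquad\Longrightarrow\qquad
P(\text{at least one error}) \;=\; 1 - e^{-\lambda} \;=\; 1 - e^{-0.38} \;\approx\; 0.32
```

Under those assumptions, roughly one full read in three would be expected to hit at least one such error somewhere in the system, which is consistent with the point that a single event is not surprising at this scale.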
Thanks for the responses.

Richard,

Yes, zpool status returns an error:

# zpool status -xv
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Tue Dec  2 10:50:47 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zpool1      ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
        <snip>
          raidz1    ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     1
            c5t6d0  ONLINE       0     0     0
            c6t6d0  ONLINE       0     0     0
            c7t6d0  ONLINE       0     0     0
            c8t6d0  ONLINE       0     0     0

errors: No known data errors

So, there don't appear to be any data errors, probably because the raiding has saved the data (and there was much rejoicing).

I wasn't really attempting to kill the canary, just making sure it didn't just fall asleep. By clearing the error and re-running the scrub I was hoping to see whether the error was just transient or a real hardware I/O issue.

I looked through the logs, but Solaris logs are worse than Linux logs at trying to figure out hardware errors, heh. Nothing appears to point to issues with the drives (aside from a couple of entries from pulling a USB cdrom from the machine a couple of weeks ago).

The machine was updated (Solaris 10 U5) as of Nov 22nd, which was our last scheduled maintenance day. Our next is January 24th. Hopefully then we will be going to U6.

I guess I'm more wondering how best to determine if it's a hardware problem on the disk and it needs to be replaced.

And I noticed the SATA cable comment, but I wasn't going to point it out.
:)

Dave

-----Original Message-----
From: Richard.Elling at Sun.COM [mailto:Richard.Elling at Sun.COM]
Sent: Tuesday, December 16, 2008 8:04 PM
To: Jonathan
Cc: Glaser, David; zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Drive Checksum error

[snipped]
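One way to keep an eye on whether a single checksum error stays a one-off is to pull the non-zero counters out of the zpool status output after each scrub and compare them over time. The filter below is a minimal sketch, assuming the config table has the NAME / STATE / READ / WRITE / CKSUM layout shown in this thread. The function name `list_device_errors` is hypothetical, and other ZFS releases may lay out the table differently.

```shell
#!/bin/sh
# list_device_errors (hypothetical name): read `zpool status` output on stdin
# and print every config row whose READ, WRITE or CKSUM counter is non-zero.
list_device_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The trailing "errors:" line marks the end of the table.
        /^errors:/ { in_config = 0 }
        # Rows with all five columns and any non-zero counter are reported.
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

Typical use would be `zpool status zpool1 | list_device_errors`. A device that keeps reappearing in this list across monthly scrubs, or whose counters grow, is a stronger replacement candidate than one that showed a single error once.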