What should I do with a status report like the one included below?

What does it mean to have an unrecoverable error but no data errors?

Is it just a matter of 'clearing' this device? But then what would have prompted such a report?

Also note the numeral 7 in the CKSUM column for device c3d1s0. What does it mean?

------- --------- ---=--- --------- --------

zpool status -vx rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 4h44m with 0 errors on Sat Mar 27 07:48:20 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c3d1s0  ONLINE       0     0     7

errors: No known data errors
On Sat, 27 Mar 2010, Harry Putnam wrote:
> What to do with a status report like the one included below?
>
> What does it mean to have an unrecoverable error but no data errors?

I think that this summary means that the zfs scrub did not encounter any reported read/write errors from the disks, but on one of the disks, 7 of the returned blocks had a computed checksum error. This could be a problem with the data that the disk previously wrote. Perhaps there was an undetected data transfer error, the drive firmware glitched, the drive experienced a cache memory glitch, or the drive wrote/read data from the wrong track.

If you clear the error information, make sure you keep a record of it in case it happens again.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
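For keeping that record, here is a minimal sketch in Python that picks out the devices with nonzero counters from a saved copy of the report. It assumes the counters in the config table are plain integers (zpool may abbreviate large counts, which this does not handle). The names devices_with_errors, ROW and STATUS_TEXT are illustrative and not part of any ZFS tool; the sample text is an excerpt of the report at the top of this thread.

```python
import re

# Excerpt of the report quoted at the top of this thread.
STATUS_TEXT = """\
  pool: rpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c3d0s0  ONLINE       0     0     0
            c3d1s0  ONLINE       0     0     7

errors: No known data errors
"""

# A config row: name, state, then three integer counters.
# Assumption: counters are plain integers, not abbreviated values.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return (name, read, write, cksum) for rows with a nonzero counter."""
    flagged = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, blank line, or a non-table line
        name, _state, rd, wr, ck = match.groups()
        counts = (int(rd), int(wr), int(ck))
        if any(counts):
            flagged.append((name,) + counts)
    return flagged


if __name__ == "__main__":
    for name, rd, wr, ck in devices_with_errors(STATUS_TEXT):
        print(f"{name}: read={rd} write={wr} cksum={ck}")
```

Appending that output, with a date, to a text file before running 'zpool clear' would be one way to keep the history Bob suggests.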
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
> I think that this summary means that the zfs scrub did not encounter
> any reported read/write errors from the disks, but on one of the
> disks, 7 of the returned blocks had a computed checksum error.
>
> [...]
>
> If you clear the error information, make sure you keep a record of it
> in case it happens again.

Thanks.

So it's not a serious matter? Or maybe more of a potentially serious matter?

Is there specific documentation somewhere that tells how to read these status reports?
On Sat, Mar 27, 2010 at 6:02 PM, Harry Putnam <reader at newsguy.com> wrote:
> So it's not a serious matter? Or maybe more of a potentially serious
> matter?

Not really. That's exactly the kind of problem ZFS is designed to catch.

> Is there specific documentation somewhere that tells how to read these
> status reports?

Your pool is not degraded, so I don't think anything will show up in fmdump. But check 'fmdump -eV' to see the actual errors that were logged. You could find something there.

-- 
Giovanni
On 03/28/10 10:02 AM, Harry Putnam wrote:
> So it's not a serious matter? Or maybe more of a potentially serious
> matter?

Not really. The error has been corrected.

> Is there specific documentation somewhere that tells how to read these
> status reports?

If you run a scrub on a pool and an error condition is fixed, the report will give you a URL to check.

-- 
Ian.
On Sat, 27 Mar 2010, Harry Putnam wrote:
> So it's not a serious matter? Or maybe more of a potentially serious
> matter?

It is difficult to say if this is a serious matter or not. It should not have happened. The severity depends on the cause of the problem (which may be difficult to figure out). Perhaps you will find out what the problem is some day.

Bob
On Sat, Mar 27, 2010 at 18:50, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> It is difficult to say if this is a serious matter or not. It should not
> have happened. The severity depends on the cause of the problem (which may
> be difficult to figure out). Perhaps you will find out what the problem is
> some day.

Assuming your drives support SMART, I'd install smartmontools and see if there are any SMART errors on the drive. While the absence of SMART errors doesn't mean the drive isn't about to fail, the presence of them can be a good indicator that the drive is failing.

So, if there are significant SMART errors, replace the drive. If there aren't any, then I'd keep going and see if you get more checksum errors. If you do, replace the drive. If you don't, chalk it up to freak random bit-flipping and forget about it.

I've had trouble getting smartmontools to work with some of my controllers/drives in OpenSolaris, and have sometimes had better luck just booting into a Linux live CD, so that may be something to keep in mind.

-Ethan
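The rule of thumb in that message can be written down as a small sketch. This is only one reading of Ethan's advice, not an official procedure; the function name next_step and its inputs are hypothetical, and the values are assumed to be gathered by hand from a SMART tool and from two 'zpool status' reports taken some time apart, without clearing the counters in between.

```python
def next_step(smart_errors_found, cksum_before, cksum_now):
    """Suggest an action following the rule of thumb described above.

    smart_errors_found: True if a SMART tool reported significant errors.
    cksum_before, cksum_now: CKSUM counts for the same device from an
    earlier and a later status report (hypothetical, hand-collected inputs).
    """
    if smart_errors_found:
        # Significant SMART errors are treated as a sign of a failing drive.
        return "replace the drive"
    if cksum_now > cksum_before:
        # More checksum errors appeared while watching the device.
        return "replace the drive"
    # No SMART errors and no new checksum errors: treat it as a one-off.
    return "no action, treat as a one-off"


if __name__ == "__main__":
    # The device in this thread showed 7 checksum errors; suppose a later
    # report still shows 7 and SMART reports nothing significant.
    print(next_step(False, 7, 7))
```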
Ethan <notethan at gmail.com> writes:
> Assuming your drives support SMART, I'd install smartmontools and see if
> there are any SMART errors on the drive. While the absence of SMART errors

[...]

> I've had trouble getting smartmontools to work with some of my
> controllers/drives in opensolaris, and have had better luck just booting
> into a linux live cd, sometimes, so that may be something to keep in mind.

Did you ever get it working on OpenSolaris?
Yes. Basically working here. All fine under ahci, some problems under mpt (smartctl says that the WD1002fbys doesn't allow SMART events to be stored, which I think is probably nonsense).

Regards,
Tonmaus
-- 
This message posted from opensolaris.org
Harry Putnam <reader at newsguy.com> writes:
> Did you ever get it working on opensolaris?

Tonmaus <sequoiamobil at gmx.net> writes:
> Yes. Basically working here. All fine under ahci, some problems
> under mpt (smartctl says that WD1002fbys wouldn't allow to store
> smart events, which I think is probably nonsense.)

Thanks... what are ahci and mpt?
Both are driver modules for storage adapters. Their properties can be reviewed in the documentation:

ahci: http://docs.sun.com/app/docs/doc/816-5177/ahci-7d?a=view
mpt: http://docs.sun.com/app/docs/doc/816-5177/mpt-7d?a=view

ahci has a man entry on b133, as well.

cheers,
Tonmaus
Just to apologize: this not only sounds lame but IS pretty lame.

Somehow, in reading the output of 'zpool status POOL', I just blew right by the URL included there:

http://www.sun.com/msg/ZFS-8000-9P

It has quite a decent discussion of what the report means.
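To avoid missing that line in a saved report, the URL can be pulled out with a few lines of Python. This is a rough sketch that assumes the report carries a "see:" line laid out as in the one at the top of this thread; the function name see_url and the SAMPLE text are illustrative only.

```python
import re

# Two lines in the layout of the report at the top of this thread.
SAMPLE = """\
action: Determine if the device needs to be replaced, and clear the errors
   see: http://www.sun.com/msg/ZFS-8000-9P
"""


def see_url(status_text):
    """Return the URL from the 'see:' line, or None if there is no such line."""
    match = re.search(r"^\s*see:\s*(\S+)", status_text, re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(see_url(SAMPLE))
```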