Hey folks,

Well, I've had a disk fail in my home server, so I've had my first experience of hunting down the faulty drive and replacing it (damn sight easier on Sun kit than on a home built box, I can tell you!).

All seemed well. I replaced the faulty drive, imported the pool again, and kicked off the repair with:

# zpool replace zfspool c1t1d0

But then a few minutes later I noticed this:

# zpool status
  pool: zfspool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h5m, 1.35% done, 6h46m to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     DEGRADED     0     0     0
          raidz2                    DEGRADED     0     0     0
            replacing               DEGRADED     0     0 68.0K
              15299378891435382892  FAULTED      0  212K     0  was /dev/dsk/c1t1d0s0/old
              c1t1d0                ONLINE       0     0     0  2.89G resilvered
            c1t2d0                  ONLINE       0     0     0
            c1t3d0                  ONLINE       0     0     0
            c1t4d0                  ONLINE       0     0     0
            c1t5d0                  ONLINE       0     0     1  43K resilvered

errors: No known data errors

A checksum error on one of the other disks! Thank god I went with raid-z2.

Ross
-- 
This message posted from opensolaris.org
Lucky one there Ross! Makes me glad I also upgraded to RAID-Z2 ;-)

Simon
I'm curious, how often do you scrub the pool?

On Mon, 2009-06-22 at 15:33, Ross wrote:
> A checksum error on one of the other disks! Thank god I went with raid-z2.

-- 
Ed Spencer
UNIX System Administrator
Information Services and Technology, Infrastructure Group
The University of Manitoba
Winnipeg, Manitoba, Canada R3T 2N2
EMail: Ed_Spencer at UManitoba.CA
telephone: (204) 474-8311
On Mon, 22 Jun 2009, Ed Spencer wrote:
> I'm curious, how often do you scrub the pool?

Once a week for me. Early every Monday morning so that if something goes wrong, it is at the start of the week.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
To be honest, never. It's a cheap server sat at home, and I never got around to writing a script to scrub it and report errors.

I'm going to write one now though! Look at how the resilver finished:

# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0  188G resilvered
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       3     0     0  128K resilvered
            c1t4d0  ONLINE       0     0    11  473K resilvered
            c1t5d0  ONLINE       0     0    23  986K resilvered

errors: No known data errors
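Editorial aside: the "script to scrub it and report errors" mentioned above was never posted in this thread. As a rough illustration only, the reporting half might look something like the Python sketch below. It scans the config table of `zpool status` text and lists any device that is not ONLINE or has a non-zero counter. The names `ROW`, `SAMPLE`, `devices_with_errors` and `read_status` are made up for this example, the list of state names in the pattern is an assumption, and counters are kept as strings because `zpool status` abbreviates large values (e.g. 68.0K).

```python
import re
import subprocess

# Sample taken from the resilver output posted above.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0  188G resilvered
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       3     0     0  128K resilvered
            c1t4d0  ONLINE       0     0    11  473K resilvered
            c1t5d0  ONLINE       0     0    23  986K resilvered

errors: No known data errors
"""

# Assumption: a config row is "<name> <state> <read> <write> <cksum> [note]",
# indented, with the state drawn from this (possibly incomplete) list.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for every suspicious row."""
    bad = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or non-table text
        name, state, read, write, cksum = m.groups()
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            bad.append((name, state, read, write, cksum))
    return bad


def read_status(pool):
    """Fetch live `zpool status <pool>` text (needs ZFS on the host)."""
    result = subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    for row in devices_with_errors(SAMPLE):
        print(*row)
```

On a real system, `devices_with_errors(read_status("zfspool"))` could be run from a scheduled job some time after `zpool scrub zfspool`, with the result mailed out if the list is not empty.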
On Tue, Jun 23, 2009 at 1:13 PM, Ross<no-reply at opensolaris.org> wrote:
> Look at how the resilver finished:
>
>             c1t3d0  ONLINE       3     0     0  128K resilvered
>             c1t4d0  ONLINE       0     0    11  473K resilvered
>             c1t5d0  ONLINE       0     0    23  986K resilvered

Compared with your initial post, both c1t4d0 and c1t5d0 have experienced MORE checksum errors, while c1t3d0 has experienced read errors (?). You might want to look at that. Possibly the cause is something other than the disks (disk controller, memory, power supply, etc.)

-- 
Fajar
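Editorial aside: the comparison Fajar makes by eye can be written down as a small diff of the per-device counters. The Python sketch below is illustrative only. The numbers are copied from the two `zpool status` listings earlier in the thread (READ, WRITE, CKSUM for the disks that were not being replaced), and `counter_growth` is a made-up helper name.

```python
# (READ, WRITE, CKSUM) per device, five minutes into the resilver.
BEFORE = {
    "c1t2d0": (0, 0, 0),
    "c1t3d0": (0, 0, 0),
    "c1t4d0": (0, 0, 0),
    "c1t5d0": (0, 0, 1),
}

# The same devices after the resilver completed.
AFTER = {
    "c1t2d0": (0, 0, 0),
    "c1t3d0": (3, 0, 0),
    "c1t4d0": (0, 0, 11),
    "c1t5d0": (0, 0, 23),
}


def counter_growth(before, after):
    """Return {device: (dREAD, dWRITE, dCKSUM)} for devices whose counters rose."""
    growth = {}
    for dev, new in after.items():
        old = before.get(dev, (0, 0, 0))  # unseen device: assume it started at zero
        delta = tuple(n - o for n, o in zip(new, old))
        if any(d > 0 for d in delta):
            growth[dev] = delta
    return growth


if __name__ == "__main__":
    for dev, delta in counter_growth(BEFORE, AFTER).items():
        print(dev, delta)
```

Three of the four surviving disks show growth between the two snapshots, which is the pattern that makes a shared component a more plausible suspect than any single drive.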
On Mon, 22 Jun 2009, Ross wrote:
> All seemed well, I replaced the faulty drive, imported the pool again, and kicked off the repair with:
> # zpool replace zfspool c1t1d0

What build are you running? Between builds 105 and 113 inclusive there's a bug in the resilver code which causes it to miss the parity data. If you're running one of those builds, do a scrub after the resilver completes in order to be safe.

Regards,
markm
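Editorial aside: for anyone folding this advice into a maintenance script, the build range Mark gives can be expressed as a simple check. This Python sketch assumes the build number is already known as an integer; how to obtain it is not covered in the thread and is left out here. `AFFECTED_BUILDS` and `scrub_recommended` are made-up names.

```python
# Builds 105 through 113 inclusive, as stated in the post above.
AFFECTED_BUILDS = range(105, 114)


def scrub_recommended(build):
    """True if a scrub after resilver is advised for this build number."""
    return build in AFFECTED_BUILDS


if __name__ == "__main__":
    for build in (104, 105, 113, 114):
        print(build, scrub_recommended(build))
```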
"scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 2009"

Zero errors, even though other parts of the message definitely show errors? This is described here:
http://docs.sun.com/app/docs/doc/819-5461/gbcve?a=view

Device errors do not necessarily mean pool errors when redundancy is present.

Change suggestion to the ZFS programmers: insert the phrase 'unrecoverable pool data' or 'pool data', or the word 'data', into the message, like this:

"scrub: resilver completed after 5h50m with 0 unrecoverable pool data errors on Tue Jun 23 05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 pool data errors on Tue Jun 23 05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 data errors on Tue Jun 23 05:04:18 2009"

That would clarify why this line disagrees with the status message above it (which reports device errors) and with the numbers in the config table below it (which shows detailed device errors: 3+11+23), and it would match the last line, which says "No known data errors".
Thanks Mark, it looks like that was good advice. It also appears that, as suggested, it's not the drive that's faulty... anybody have any thoughts as to how I find what's actually the problem?

# zpool status
  pool: zfspool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 6h35m with 0 errors on Wed Jun 24 02:46:58 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c1t1d0  DEGRADED     0     0 2.59M  too many errors
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       3     0    16  688K repaired
            c1t4d0  ONLINE       0     0    29  774K repaired
            c1t5d0  ONLINE       0     0    23

errors: No known data errors
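Editorial aside: counters such as 2.59M and 68.0K are abbreviated in `zpool status`, which gets in the way of comparing them numerically in a script. The Python sketch below turns such a field into an approximate integer. It assumes the suffixes are binary multiples (K = 1024, and so on), and since the displayed value is already rounded, the result is only an estimate. `SUFFIXES` and `approx_count` are made-up names.

```python
# Assumption: suffixes are powers of 1024.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def approx_count(field):
    """Convert a counter field like '23', '68.0K' or '2.59M' to an int estimate."""
    field = field.strip()
    if field and field[-1] in SUFFIXES:
        return int(float(field[:-1]) * SUFFIXES[field[-1]])
    return int(field)


if __name__ == "__main__":
    for raw in ("23", "68.0K", "2.59M"):
        print(raw, approx_count(raw))
```

With that in place, the checksum count on c1t1d0 can be compared directly against the two-digit counts on the other disks.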
Ok, this is getting weird. I just ran a zpool clear, and now it says:

# zpool clear zfspool
# zpool status
  pool: zfspool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 6h35m with 0 errors on Wed Jun 24 02:46:58 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0  107G repaired
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0  688K repaired
            c1t4d0  ONLINE       0     0     0  774K repaired
            c1t5d0  ONLINE       0     0     0

errors: No known data errors