thr3ads.net - zfs discuss - [zfs-discuss] Narrow escape! [Jun 2009]

Ross

2009-Jun-22 20:33 UTC

[zfs-discuss] Narrow escape!

Hey folks,

Well, I've had a disk fail in my home server, so I've had my first
experience of hunting down the faulty drive and replacing it (a damn sight
easier on Sun kit than on a home-built box, I can tell you!).

All seemed well: I replaced the faulty drive, imported the pool again, and
kicked off the repair with:
# zpool replace zfspool c1t1d0

But then a few minutes later I noticed this:

# zpool status
  pool: zfspool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h5m, 1.35% done, 6h46m to go
config:

        NAME                        STATE     READ WRITE CKSUM
        zfspool                     DEGRADED     0     0     0
          raidz2                    DEGRADED     0     0     0
            replacing               DEGRADED     0     0 68.0K
              15299378891435382892  FAULTED      0  212K     0  was /dev/dsk/c1t1d0s0/old
              c1t1d0                ONLINE       0     0     0  2.89G resilvered
            c1t2d0                  ONLINE       0     0     0
            c1t3d0                  ONLINE       0     0     0
            c1t4d0                  ONLINE       0     0     0
            c1t5d0                  ONLINE       0     0     1  43K resilvered

errors: No known data errors


A checksum error on one of the other disks!  Thank god I went with raid-z2.
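
For anyone who would rather not spot this by eye, here is a rough sketch of pulling the non-zero error counters out of `zpool status` text like the output above. The names `parse_count` and `devices_with_errors` are made up for illustration, and the assumption that the K/M suffixes are powers of 1024 should be treated as approximate.

```python
import re

# Assumption: abbreviated counters such as 68.0K or 2.59M use powers of 1024.
SUFFIX = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

# One config-table row: indented name, state, then READ WRITE CKSUM.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)"
)


def parse_count(text):
    """Turn '0', '212K' or '68.0K' into an (approximate) integer."""
    if text[-1] in SUFFIX:
        return int(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def devices_with_errors(status_text):
    """Map device name -> (read, write, cksum) for rows with any non-zero counter."""
    found = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, blank lines, 'pool:' / 'status:' lines, etc.
        counts = tuple(parse_count(match.group(i)) for i in (3, 4, 5))
        if any(counts):
            found[match.group(1)] = counts
    return found
```

Fed the status output above, this would flag the `replacing` vdev, the faulted old disk, and c1t5d0 with its single checksum error.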

Ross
-- 
This message posted from opensolaris.org

Simon Breden

2009-Jun-22 21:06 UTC

Lucky one there Ross!

Makes me glad I also upgraded to RAID-Z2 ;-)

Simon

Ed Spencer

2009-Jun-22 21:34 UTC

I'm curious, how often do you scrub the pool?

-- 
Ed Spencer                      Information Services and Technology
UNIX System Administrator       Infrastructure Group
EMail: Ed_Spencer at UManitoba.CA  The University of Manitoba
telephone: (204) 474-8311       Winnipeg, Manitoba, Canada R3T 2N2

Bob Friesenhahn

2009-Jun-22 23:00 UTC

On Mon, 22 Jun 2009, Ed Spencer wrote:
> I'm curious, how often do you scrub the pool?

Once a week for me.  Early every Monday morning so that if something
goes wrong, it is at the start of the week.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Ross

2009-Jun-23 06:13 UTC

To be honest, never.  It's a cheap server sat at home, and I never got
around to writing a script to scrub it and report errors.

I'm going to write one now though!  Look at how the resilver finished:

# zpool status
  pool: zfspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0  188G resilvered
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       3     0     0  128K resilvered
            c1t4d0  ONLINE       0     0    11  473K resilvered
            c1t5d0  ONLINE       0     0    23  986K resilvered

errors: No known data errors
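
The "script to scrub it and report errors" mentioned above could be sketched roughly as follows. This is only an outline: `run`, `start_scrub` and `health_report` are hypothetical helper names, and it assumes `zpool status -x` prints "all pools are healthy" when nothing needs attention.

```python
import subprocess


def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def start_scrub(pool, runner=run):
    """Kick off a scrub; zpool returns at once and the scrub runs in the background."""
    runner(["zpool", "scrub", pool])


def health_report(runner=run):
    """Return None if every pool is healthy, otherwise the `zpool status -x` text."""
    out = runner(["zpool", "status", "-x"]).strip()
    return None if out == "all pools are healthy" else out
```

Because the scrub runs in the background, `start_scrub("zfspool")` and `health_report()` would normally be separate scheduled jobs, for example a weekly scrub early on Monday as Bob describes, with the report run (and mailed if it is not None) some hours later.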

Fajar A. Nugraha

2009-Jun-23 10:08 UTC

On Tue, Jun 23, 2009 at 1:13 PM, Ross <no-reply at opensolaris.org> wrote:
> Look at how the resilver finished:
>
>             c1t3d0  ONLINE       3     0     0  128K resilvered
>             c1t4d0  ONLINE       0     0    11  473K resilvered
>             c1t5d0  ONLINE       0     0    23  986K resilvered

Comparing with your initial post, both c1t4d0 and c1t5d0 have experienced
MORE checksum errors, while c1t3d0 has experienced read errors (?). You
might want to look into that. Possibly the cause is something other than
the disks (disk controller, memory, power supply, etc.).
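
To make that comparison concrete, here is a small sketch that diffs two snapshots of the per-device counters. The function name `error_growth` is made up; the numbers are the (read, write, cksum) values from Ross's first status output and from the one taken after the resilver.

```python
def error_growth(before, after):
    """Return device -> (read, write, cksum) increase between two snapshots."""
    growth = {}
    for dev, counts in after.items():
        old = before.get(dev, (0, 0, 0))  # a device not seen before starts at zero
        diff = tuple(new - prev for new, prev in zip(counts, old))
        if any(diff):
            growth[dev] = diff
    return growth


# Counters five minutes into the resilver (first post).
before = {"c1t2d0": (0, 0, 0), "c1t3d0": (0, 0, 0),
          "c1t4d0": (0, 0, 0), "c1t5d0": (0, 0, 1)}
# Counters once the resilver had completed.
after = {"c1t2d0": (0, 0, 0), "c1t3d0": (3, 0, 0),
         "c1t4d0": (0, 0, 11), "c1t5d0": (0, 0, 23)}

print(error_growth(before, after))
# {'c1t3d0': (3, 0, 0), 'c1t4d0': (0, 0, 11), 'c1t5d0': (0, 0, 22)}
```

Errors growing on three of the four surviving disks at once is what points towards a shared component rather than a single bad drive.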

-- 
Fajar

Mark J Musante

2009-Jun-23 12:27 UTC

On Mon, 22 Jun 2009, Ross wrote:
> All seemed well, I replaced the faulty drive, imported the pool again, and
> kicked off the repair with:
> # zpool replace zfspool c1t1d0

What build are you running?  Between builds 105 and 113 inclusive there's
a bug in the resilver code which causes it to miss the parity data.  If
you're running one of those builds, do a scrub after the resilver
completes in order to be safe.
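
As a rough illustration of that check, the sketch below tests whether a build string falls in the 105 to 113 range. It assumes the build is reported in the form `snv_NNN` (as `uname -v` is expected to print on these builds); the function names are hypothetical.

```python
import re

# Builds 105 through 113 inclusive, per the advice above.
AFFECTED_BUILDS = range(105, 114)


def build_number(version_string):
    """Extract NNN from a string containing 'snv_NNN'; None if it is absent."""
    match = re.search(r"snv_(\d+)", version_string)
    return int(match.group(1)) if match else None


def needs_scrub_after_resilver(version_string):
    """True/False for a recognised build, None when the build cannot be determined."""
    build = build_number(version_string)
    if build is None:
        return None
    return build in AFFECTED_BUILDS
```

For example, `needs_scrub_after_resilver("snv_111b")` is True, while a string with no `snv_` build number gives None, so the caller can decide to scrub anyway.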



Regards,
markm

Haudy Kazemi

2009-Jun-23 22:25 UTC

"scrub: resilver completed after 5h50m with 0 errors on Tue Jun 23 05:04:18 2009"

Zero errors even though other parts of the message definitely show errors?  This
is described here: http://docs.sun.com/app/docs/doc/819-5461/gbcve?a=view
Device errors do not necessarily mean pool data errors when redundancy is
present.
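
A tiny sketch of that distinction, using the counters from Ross's post-resilver table: the device-level counters add up to 37, yet the pool-level "errors:" line is clean. The name `summarize_errors` is made up for illustration.

```python
def summarize_errors(device_counts, errors_line):
    """Return (total device-level errors, True if the pool reports no data errors)."""
    total = sum(read + write + cksum
                for read, write, cksum in device_counts.values())
    data_clean = errors_line.strip() == "errors: No known data errors"
    return total, data_clean


# (read, write, cksum) per device after the resilver completed.
resilver_counts = {
    "c1t1d0": (0, 0, 0),
    "c1t2d0": (0, 0, 0),
    "c1t3d0": (3, 0, 0),
    "c1t4d0": (0, 0, 11),
    "c1t5d0": (0, 0, 23),
}

print(summarize_errors(resilver_counts, "errors: No known data errors"))
# (37, True)
```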

Change suggestion to the ZFS programmers:
insert the phrase 'unrecoverable pool data' or 'pool data', or the word
'data', into the message, like this:
"scrub: resilver completed after 5h50m with 0 unrecoverable pool data errors on Tue Jun 23 05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 pool data errors on Tue Jun 23 05:04:18 2009"
"scrub: resilver completed after 5h50m with 0 data errors on Tue Jun 23 05:04:18 2009"

That would clarify why this line doesn't agree with the status message above
it (which shows device errors) or with the numbers in the config table below
it (which shows detailed device errors: 3+11+23), and it would match the last
line, which says "No known data errors".



Ross

2009-Jun-24 17:51 UTC

Thanks Mark, it looks like that was good advice.  It also appears that, as
suggested, it's not the drive that's faulty... anybody have any thoughts as
to how I find what's actually the problem?

# zpool status
  pool: zfspool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 6h35m with 0 errors on Wed Jun 24 02:46:58 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c1t1d0  DEGRADED     0     0 2.59M  too many errors
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       3     0    16  688K repaired
            c1t4d0  ONLINE       0     0    29  774K repaired
            c1t5d0  ONLINE       0     0    23

errors: No known data errors

Ross

2009-Jun-24 17:52 UTC

Ok, this is getting weird.  I just ran a zpool clear, and now it says:

# zpool clear zfspool
# zpool status
  pool: zfspool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: scrub completed after 6h35m with 0 errors on Wed Jun 24 02:46:58 2009
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0  107G repaired
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0  688K repaired
            c1t4d0  ONLINE       0     0     0  774K repaired
            c1t5d0  ONLINE       0     0     0

errors: No known data errors
