Michael Stalnaker
2008-Oct-29 19:33 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
All;
I have a large ZFS pool (tank1) with four raidz2 groups in it. Each of these
groups is 11 disks, and I have four hot-spare disks in the system. The system
is running OpenSolaris build snv_90. One of these groups has had a disk
failure, which the OS correctly detected; it replaced the failed disk with one
of the hot spares and began resilvering.
Now it gets interesting. The resilver runs for about an hour, then stops. If
I put zpool status -v in a while loop with a 10-minute sleep, I see the
repair proceed; then, with no messages of ANY kind, it'll silently quit and
start over. I'm attaching the output of zpool status -v from an hour ago and
from just now below. Has anyone seen this, or have any ideas as to the
cause? Is there a timeout or priority I need to change in a tunable or
something?
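For reference, the loop I'm running is essentially the following (a rough
sketch; the date call is only there so a restart shows up as the percent-done
figure dropping back instead of climbing):

  # Every 10 minutes, print a timestamp and the resilver progress line.
  while true; do
      date
      zpool status -v tank1 | grep 'resilver in progress'
      sleep 600
  done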
--Mike
One Hour Ago:
mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h46m, 3.96% done, 18h39m to go
config:

        NAME          STATE     READ WRITE CKSUM
        tank1         DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c1t13d0   ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c1t14d0 FAULTED      0     0     0  too many errors
              c1t11d0 ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t18d0   ONLINE       0     0     0
            c1t19d0   ONLINE       0     0     0
            c1t20d0   ONLINE       0     0     0
            c1t21d0   ONLINE       0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t13d0   ONLINE       0     0     0
            c2t14d0   ONLINE       0     0     0
            c2t15d0   ONLINE       0     0     0
            c2t16d0   ONLINE       0     0     0
            c2t17d0   ONLINE       0     0     0
            c2t18d0   ONLINE       0     0     0
            c2t19d0   ONLINE       0     0     0
            c2t20d0   ONLINE       0     0     0
            c2t21d0   ONLINE       0     0     0
            c2t22d0   ONLINE       0     0     0
            c2t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t0d0    ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
            c1t24d0   ONLINE       0     0     0
            c2t5d0    ONLINE       0     0     0
            c2t6d0    ONLINE       0     0     0
            c2t7d0    ONLINE       0     0     0
            c2t8d0    ONLINE       0     0     0
            c2t9d0    ONLINE       0     0     0
            c2t10d0   ONLINE       0     0     0
        spares
          c1t11d0     INUSE     currently in use
          c2t24d0     AVAIL
          c2t11d0     AVAIL
          c2t4d0      AVAIL

errors: No known data errors
Just Now:
mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h24m, 2.23% done, 17h51m to go
config:

        NAME          STATE     READ WRITE CKSUM
        tank1         DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c1t13d0   ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c1t14d0 FAULTED      0     0     0  too many errors
              c1t11d0 ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t18d0   ONLINE       0     0     0
            c1t19d0   ONLINE       0     0     0
            c1t20d0   ONLINE       0     0     0
            c1t21d0   ONLINE       0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t13d0   ONLINE       0     0     0
            c2t14d0   ONLINE       0     0     0
            c2t15d0   ONLINE       0     0     0
            c2t16d0   ONLINE       0     0     0
            c2t17d0   ONLINE       0     0     0
            c2t18d0   ONLINE       0     0     0
            c2t19d0   ONLINE       0     0     0
            c2t20d0   ONLINE       0     0     0
            c2t21d0   ONLINE       0     0     0
            c2t22d0   ONLINE       0     0     0
            c2t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t0d0    ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
            c1t24d0   ONLINE       0     0     0
            c2t5d0    ONLINE       0     0     0
            c2t6d0    ONLINE       0     0     0
            c2t7d0    ONLINE       0     0     0
            c2t8d0    ONLINE       0     0     0
            c2t9d0    ONLINE       0     0     0
            c2t10d0   ONLINE       0     0     0
        spares
          c1t11d0     INUSE     currently in use
          c2t24d0     AVAIL
          c2t11d0     AVAIL
          c2t4d0      AVAIL

errors: No known data errors
Tomas Ögren
2008-Oct-29 21:02 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
On 29 October, 2008 - Michael Stalnaker sent me these 32K bytes:

> All;
>
> I have a large ZFS pool (tank1) with four raidz2 groups in it. Each of these
> groups is 11 disks, and I have four hot-spare disks in the system. The system
> is running OpenSolaris build snv_90. One of these groups has had a disk
> failure, which the OS correctly detected; it replaced the failed disk with one
> of the hot spares and began resilvering.
>
> Now it gets interesting. The resilver runs for about an hour, then stops. If
> I put zpool status -v in a while loop with a 10-minute sleep, I see the
> repair proceed; then, with no messages of ANY kind, it'll silently quit and
> start over. I'm attaching the output of zpool status -v from an hour ago and
> from just now below. Has anyone seen this, or have any ideas as to the
> cause? Is there a timeout or priority I need to change in a tunable or
> something?

Snapshots every hour? That will currently restart resilvering. I think there
has been a recent fix for that, or one is coming soon. There has also been a
bug where running 'zpool status' as root restarts the resilver; running
'zpool status' as non-root doesn't.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
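Following up on the snapshot suggestion: here is a rough sketch of how to check
whether hourly snapshots are actually being created, and how to pause them for
the duration of the resilver. The SMF service name below is the one used by the
OpenSolaris zfs-auto-snapshot (Time Slider) services and is an assumption; it
may be absent or named differently on snv_90.

  # Do new snapshots show up roughly one hour apart?
  zfs list -t snapshot -o name,creation -s creation | tail

  # If they come from the auto-snapshot SMF services (assumed name below),
  # disable the hourly instance until the resilver completes:
  svcs -a | grep auto-snapshot
  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:hourly

  # ...and re-enable it afterwards:
  svcadm enable svc:/system/filesystem/zfs/auto-snapshot:hourly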