Michael Stalnaker
2008-Oct-29 19:33 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
All;
I have a large ZFS pool (tank1) with four raidz2 groups in it. Each of these
groups is 11 disks, and I have four hot-spare disks in the system. The system
is running OpenSolaris build snv_90. One of these groups has had a disk
failure, which the OS correctly detected; it replaced the failed disk with one
of the hot spares and began resilvering.
Now it gets interesting. The resilver runs for about an hour, then stops. If
I put zpool status -v in a while loop with a 10-minute sleep, I see the
repair proceed; then, with no messages of ANY kind, it'll silently quit and
start over. I'm attaching the output of zpool status -v from an hour ago and
from just now below. Has anyone seen this, or have any ideas as to the
cause? Is there a timeout or priority I need to change in a tunable or
something?
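For reference, the loop I'm running is essentially the following (a rough
sketch; the date call is only there so a restart shows up as the percent-done
figure dropping back instead of climbing):

  # Every 10 minutes, print a timestamp and the resilver progress line.
  while true; do
      date
      zpool status -v tank1 | grep 'resilver in progress'
      sleep 600
  done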
--Mike
One Hour Ago:
mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h46m, 3.96% done, 18h39m to go
config:

        NAME          STATE     READ WRITE CKSUM
        tank1         DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c1t13d0   ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c1t14d0 FAULTED      0     0     0  too many errors
              c1t11d0 ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t18d0   ONLINE       0     0     0
            c1t19d0   ONLINE       0     0     0
            c1t20d0   ONLINE       0     0     0
            c1t21d0   ONLINE       0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t13d0   ONLINE       0     0     0
            c2t14d0   ONLINE       0     0     0
            c2t15d0   ONLINE       0     0     0
            c2t16d0   ONLINE       0     0     0
            c2t17d0   ONLINE       0     0     0
            c2t18d0   ONLINE       0     0     0
            c2t19d0   ONLINE       0     0     0
            c2t20d0   ONLINE       0     0     0
            c2t21d0   ONLINE       0     0     0
            c2t22d0   ONLINE       0     0     0
            c2t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t0d0    ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
            c1t24d0   ONLINE       0     0     0
            c2t5d0    ONLINE       0     0     0
            c2t6d0    ONLINE       0     0     0
            c2t7d0    ONLINE       0     0     0
            c2t8d0    ONLINE       0     0     0
            c2t9d0    ONLINE       0     0     0
            c2t10d0   ONLINE       0     0     0
        spares
          c1t11d0     INUSE     currently in use
          c2t24d0     AVAIL
          c2t11d0     AVAIL
          c2t4d0      AVAIL

errors: No known data errors
Just Now:
mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h24m, 2.23% done, 17h51m to go
config:

        NAME          STATE     READ WRITE CKSUM
        tank1         DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c1t13d0   ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c1t14d0 FAULTED      0     0     0  too many errors
              c1t11d0 ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t18d0   ONLINE       0     0     0
            c1t19d0   ONLINE       0     0     0
            c1t20d0   ONLINE       0     0     0
            c1t21d0   ONLINE       0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t13d0   ONLINE       0     0     0
            c2t14d0   ONLINE       0     0     0
            c2t15d0   ONLINE       0     0     0
            c2t16d0   ONLINE       0     0     0
            c2t17d0   ONLINE       0     0     0
            c2t18d0   ONLINE       0     0     0
            c2t19d0   ONLINE       0     0     0
            c2t20d0   ONLINE       0     0     0
            c2t21d0   ONLINE       0     0     0
            c2t22d0   ONLINE       0     0     0
            c2t23d0   ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c2t0d0    ONLINE       0     0     0
            c2t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
            c1t24d0   ONLINE       0     0     0
            c2t5d0    ONLINE       0     0     0
            c2t6d0    ONLINE       0     0     0
            c2t7d0    ONLINE       0     0     0
            c2t8d0    ONLINE       0     0     0
            c2t9d0    ONLINE       0     0     0
            c2t10d0   ONLINE       0     0     0
        spares
          c1t11d0     INUSE     currently in use
          c2t24d0     AVAIL
          c2t11d0     AVAIL
          c2t4d0      AVAIL

errors: No known data errors
Tomas Ögren
2008-Oct-29 21:02 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
On 29 October, 2008 - Michael Stalnaker sent me these 32K bytes:

> All;
>
> I have a large ZFS pool (tank1) with four raidz2 groups in it. Each of these
> groups is 11 disks, and I have four hot-spare disks in the system. The system
> is running OpenSolaris build snv_90. One of these groups has had a disk
> failure, which the OS correctly detected; it replaced the failed disk with one
> of the hot spares and began resilvering.
>
> Now it gets interesting. The resilver runs for about an hour, then stops. If
> I put zpool status -v in a while loop with a 10-minute sleep, I see the
> repair proceed; then, with no messages of ANY kind, it'll silently quit and
> start over. I'm attaching the output of zpool status -v from an hour ago and
> from just now below. Has anyone seen this, or have any ideas as to the
> cause? Is there a timeout or priority I need to change in a tunable or
> something?

Snapshots every hour? That will currently restart resilvering. I think there
has been a recent fix for that, or one is coming soon. There has also been a
bug where running 'zpool status' as root restarts the resilver; running
'zpool status' as non-root doesn't.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
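Following up on the snapshot suggestion: here is a rough sketch of how to check
whether hourly snapshots are actually being created, and how to pause them for
the duration of the resilver. The SMF service name below is the one used by the
OpenSolaris zfs-auto-snapshot (Time Slider) services and is an assumption; it
may be absent or named differently on snv_90.

  # Do new snapshots show up roughly one hour apart?
  zfs list -t snapshot -o name,creation -s creation | tail

  # If they come from the auto-snapshot SMF services (assumed name below),
  # disable the hourly instance until the resilver completes:
  svcs -a | grep auto-snapshot
  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:hourly

  # ...and re-enable it afterwards:
  svcadm enable svc:/system/filesystem/zfs/auto-snapshot:hourly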