Michael Stalnaker
2008-Oct-29 19:33 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
All;

I have a large zfs tank with four raidz2 groups in it. Each of these groups is 11 disks, and I have four hot spare disks in the system. The system is running OpenSolaris build snv_90. One of these groups has had a disk failure, which the OS correctly detected; it replaced the failed disk with one of the hot spares and began rebuilding.

Now it gets interesting. The resilver runs for about 1 hour, then stops. If I put zpool status -v in a while loop with a 10 minute sleep, I see the repair proceed, then, with no messages of ANY kind, it'll silently quit and start over. I'm attaching the output of zpool status -v from an hour ago and from just now below. Has anyone seen this, or have any ideas as to the cause? Is there a timeout or priority I need to change in a tunable or something?

--Mike

One Hour Ago:

mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h46m, 3.96% done, 18h39m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank1          DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c1t13d0    ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c1t14d0  FAULTED      0     0     0  too many errors
              c1t11d0  ONLINE       0     0     0
            c1t15d0    ONLINE       0     0     0
            c1t16d0    ONLINE       0     0     0
            c1t17d0    ONLINE       0     0     0
            c1t18d0    ONLINE       0     0     0
            c1t19d0    ONLINE       0     0     0
            c1t20d0    ONLINE       0     0     0
            c1t21d0    ONLINE       0     0     0
            c1t22d0    ONLINE       0     0     0
            c1t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c1t0d0     ONLINE       0     0     0
            c1t1d0     ONLINE       0     0     0
            c1t2d0     ONLINE       0     0     0
            c1t3d0     ONLINE       0     0     0
            c1t4d0     ONLINE       0     0     0
            c1t5d0     ONLINE       0     0     0
            c1t6d0     ONLINE       0     0     0
            c1t7d0     ONLINE       0     0     0
            c1t8d0     ONLINE       0     0     0
            c1t9d0     ONLINE       0     0     0
            c1t10d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t13d0    ONLINE       0     0     0
            c2t14d0    ONLINE       0     0     0
            c2t15d0    ONLINE       0     0     0
            c2t16d0    ONLINE       0     0     0
            c2t17d0    ONLINE       0     0     0
            c2t18d0    ONLINE       0     0     0
            c2t19d0    ONLINE       0     0     0
            c2t20d0    ONLINE       0     0     0
            c2t21d0    ONLINE       0     0     0
            c2t22d0    ONLINE       0     0     0
            c2t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t0d0     ONLINE       0     0     0
            c2t1d0     ONLINE       0     0     0
            c2t2d0     ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            c1t24d0    ONLINE       0     0     0
            c2t5d0     ONLINE       0     0     0
            c2t6d0     ONLINE       0     0     0
            c2t7d0     ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c2t9d0     ONLINE       0     0     0
            c2t10d0    ONLINE       0     0     0
        spares
          c1t11d0      INUSE     currently in use
          c2t24d0      AVAIL
          c2t11d0      AVAIL
          c2t4d0       AVAIL

errors: No known data errors

Just Now:

mstalnak at zfs1:~$ zpool status -v
  pool: tank1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: resilver in progress for 0h24m, 2.23% done, 17h51m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank1          DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            c1t13d0    ONLINE       0     0     0
            spare      DEGRADED     0     0     0
              c1t14d0  FAULTED      0     0     0  too many errors
              c1t11d0  ONLINE       0     0     0
            c1t15d0    ONLINE       0     0     0
            c1t16d0    ONLINE       0     0     0
            c1t17d0    ONLINE       0     0     0
            c1t18d0    ONLINE       0     0     0
            c1t19d0    ONLINE       0     0     0
            c1t20d0    ONLINE       0     0     0
            c1t21d0    ONLINE       0     0     0
            c1t22d0    ONLINE       0     0     0
            c1t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c1t0d0     ONLINE       0     0     0
            c1t1d0     ONLINE       0     0     0
            c1t2d0     ONLINE       0     0     0
            c1t3d0     ONLINE       0     0     0
            c1t4d0     ONLINE       0     0     0
            c1t5d0     ONLINE       0     0     0
            c1t6d0     ONLINE       0     0     0
            c1t7d0     ONLINE       0     0     0
            c1t8d0     ONLINE       0     0     0
            c1t9d0     ONLINE       0     0     0
            c1t10d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t13d0    ONLINE       0     0     0
            c2t14d0    ONLINE       0     0     0
            c2t15d0    ONLINE       0     0     0
            c2t16d0    ONLINE       0     0     0
            c2t17d0    ONLINE       0     0     0
            c2t18d0    ONLINE       0     0     0
            c2t19d0    ONLINE       0     0     0
            c2t20d0    ONLINE       0     0     0
            c2t21d0    ONLINE       0     0     0
            c2t22d0    ONLINE       0     0     0
            c2t23d0    ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t0d0     ONLINE       0     0     0
            c2t1d0     ONLINE       0     0     0
            c2t2d0     ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            c1t24d0    ONLINE       0     0     0
            c2t5d0     ONLINE       0     0     0
            c2t6d0     ONLINE       0     0     0
            c2t7d0     ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c2t9d0     ONLINE       0     0     0
            c2t10d0    ONLINE       0     0     0
        spares
          c1t11d0      INUSE     currently in use
          c2t24d0      AVAIL
          c2t11d0      AVAIL
          c2t4d0       AVAIL

errors: No known data errors
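For reference, the monitoring loop described above can be written roughly as follows (a minimal sketch of what the post describes, assuming the pool name tank1 and an arbitrary log file path):

  #!/bin/sh
  # Capture 'zpool status -v' for tank1 every 10 minutes; a resilver
  # restart shows up as the "resilver in progress" percentage dropping
  # back toward zero between samples.
  while true; do
      date                  >> /var/tmp/tank1-resilver.log
      zpool status -v tank1 >> /var/tmp/tank1-resilver.log
      sleep 600
  done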
Tomas Ögren
2008-Oct-29 21:02 UTC
[zfs-discuss] Resilvering after drive failure - keeps restarting - any guesses why/how to fix?
On 29 October, 2008 - Michael Stalnaker sent me these 32K bytes:

> All;
>
> I have a large zfs tank with four raidz2 groups in it. Each of these groups
> is 11 disks, and I have four hot spare disks in the system. The system is
> running OpenSolaris build snv_90. One of these groups has had a disk
> failure, which the OS correctly detected; it replaced the failed disk with
> one of the hot spares and began rebuilding.
>
> Now it gets interesting. The resilver runs for about 1 hour, then stops. If
> I put zpool status -v in a while loop with a 10 minute sleep, I see the
> repair proceed, then, with no messages of ANY kind, it'll silently quit and
> start over. I'm attaching the output of zpool status -v from an hour ago and
> from just now below. Has anyone seen this, or have any ideas as to the
> cause? Is there a timeout or priority I need to change in a tunable or
> something?

Snapshots every hour? That will currently restart resilvering. I think a fix for that has either landed recently or is coming soon. There has also been a bug where running 'zpool status' as root restarts the resilver; running it as a non-root user doesn't.

/Tomas

--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
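If hourly snapshots are the trigger, one way to check is to look at snapshot creation times on the pool and, as a test, pause whatever is creating them while the resilver runs. A sketch, assuming the snapshots come from the OpenSolaris auto-snapshot SMF service (they could just as well come from a cron job, in which case comment out the crontab entry instead):

  # Recent snapshots and their creation times; hourly timestamps that
  # line up with the resilver restarts would support the snapshot theory.
  zfs list -t snapshot -o name,creation -s creation -r tank1 | tail -20

  # List any auto-snapshot service instances and temporarily disable
  # the hourly one until the resilver completes.
  svcs -a | grep auto-snapshot
  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:hourly

  # Re-enable it afterwards:
  # svcadm enable svc:/system/filesystem/zfs/auto-snapshot:hourly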