I've got a Thumper running snv_57 and a large ZFS pool. I recently
noticed a drive throwing some read errors, so I did the right thing
and replaced it with a spare using 'zpool replace'.

Everything went well, but the resilvering process seems to be taking
an eternity:
# zpool status
  pool: bigpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress, 4.66% done, 12h16m to go
config:
        NAME          STATE     READ WRITE CKSUM
        bigpool       ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c4t4d0    ONLINE       0     0     0
            c7t4d0    ONLINE       0     0     0
            c6t4d0    ONLINE       0     0     0
            c1t4d0    ONLINE       0     0     0
            c0t4d0    ONLINE       0     0     0
            c4t0d0    ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c6t0d0    ONLINE       0     0     0
            c1t0d0    ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c5t5d0    ONLINE       0     0     0
            c4t5d0    ONLINE       0     0     0
            c7t5d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c1t5d0    ONLINE       0     0     0
            c0t5d0    ONLINE       0     0     0
** Here's the resilver:
            spare     ONLINE       0     0     0
              c4t1d0  ONLINE      18     0     0
              c5t1d0  ONLINE       0     0     0
**
            c7t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     0
            c1t1d0    ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c5t6d0    ONLINE       0     0     0
            c4t6d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
            c6t6d0    ONLINE       0     0     0
            c1t6d0    ONLINE       0     0     0
            c0t6d0    ONLINE       0     0     0
            c4t2d0    ONLINE       0     0     0
            c7t2d0    ONLINE       0     0     0
            c6t2d0    ONLINE       0     0     0
            c1t2d0    ONLINE       0     0     0
            c0t2d0    ONLINE       0     0     0
          raidz2      ONLINE       0     0     0
            c5t7d0    ONLINE       0     0     0
            c4t7d0    ONLINE       0     0     0
            c7t7d0    ONLINE       0     0     0
            c6t7d0    ONLINE       0     0     0
            c1t7d0    ONLINE       0     0     0
            c0t7d0    ONLINE       0     0     0
            c5t3d0    ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c7t3d0    ONLINE       0     0     0
            c6t3d0    ONLINE       0     0     0
            c1t3d0    ONLINE       0     0     0
            c0t3d0    ONLINE       0     0     0
        spares
          c5t1d0      INUSE     currently in use
          c5t2d0      AVAIL   
Looks just fine, except it's been running for 3 days already! These are
500 GB drives.

Should I have removed the bad drive and replaced it directly, instead
of trying to swap in a spare? Is there some sort of contention issue
because the spare and the original drive are both still up?

Not sure what to think here...
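For reference, here's roughly how the two options compare, using the
device names from the pool above.

What I did, swap in the hot spare (the flaky disk and the spare both
stay online while the spare resilvers):

# zpool replace bigpool c4t1d0 c5t1d0

Once the resilver completes, detaching the old disk lets the spare take
its place permanently:

# zpool detach bigpool c4t1d0

The alternative would have been to physically swap c4t1d0 and resilver
straight onto the new disk, with no spare involved:

# zpool replace bigpool c4t1d0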
On 19 June, 2009 - Joe Kearney sent me these 3,8K bytes:

> I've got a Thumper running snv_57 and a large ZFS pool. I recently
> noticed a drive throwing some read errors, so I did the right thing
> and replaced it with a spare using 'zpool replace'.

Are you taking snapshots periodically? If so, you're using a build old
enough to restart resilver/scrub whenever a snapshot is taken.

There has also been a bug where 'zpool status' as root restarts
resilver/scrub as well. Try as non-root.

/Tomas
-- 
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Richard Elling
2009-Jun-20  15:42 UTC
[zfs-discuss] x4500 resilvering spare taking forever?
Also, b57 is about 2 years old and misses the improvements in
performance, especially in scrub performance.
 -- richard

Tomas Ögren wrote:
> On 19 June, 2009 - Joe Kearney sent me these 3,8K bytes:
>
>> I've got a Thumper running snv_57 and a large ZFS pool. I recently
>> noticed a drive throwing some read errors, so I did the right thing
>> and replaced it with a spare using 'zpool replace'.
>
> Are you taking snapshots periodically? If so, you're using a build old
> enough to restart resilver/scrub whenever a snapshot is taken.
>
> There has also been a bug where 'zpool status' as root restarts
> resilver/scrub as well. Try as non-root.
>
> /Tomas
UPDATE: It's now back down to 0.9% complete. Does anyone have a clue as
to what's happening here or where I can look for problems?
> Are you taking snapshots periodically? If so, you're using a build old
> enough to restart resilver/scrub whenever a snapshot is taken.

Actually yes, I take snapshots once an hour of various things. I'll try
disabling them for the time being and see how far along it gets.

Thanks!
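Depending on what drives the hourly snapshots, "disabling them" means
commenting out the cron entry or disabling the auto-snapshot SMF
service. The service name below comes from the zfs-auto-snapshot
package shipped with later builds, so it may not exist on snv_57:

# crontab -e
    (comment out the 'zfs snapshot' entry, if cron-driven)

# svcadm disable svc:/system/filesystem/zfs/auto-snapshot:hourly
    (if SMF-driven; 'svcadm enable' it again once the resilver finishes)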
> Also, b57 is about 2 years old and misses the improvements in
> performance, especially in scrub performance.

Yep, I know. I'll upgrade them at some point down the road, but they've
been serving our needs nicely so far.

Thanks!
Richard Elling
2009-Jun-21  19:10 UTC
[zfs-discuss] x4500 resilvering spare taking forever?
Joe Kearney wrote:
>> Also, b57 is about 2 years old and misses the improvements in
>> performance, especially in scrub performance.
>
> Yep, I know. I'll upgrade them at some point down the road, but
> they've been serving our needs nicely so far.

Yep, it also suffers from the bug that restarts resilvers when you take
a snapshot. This was fixed in b94.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667
 -- richard
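As an aside, 'uname -v' reports which build a box is running, which is
the quick way to tell whether a machine is past the b94 fix; Joe's
Thumper would report:

# uname -v
snv_57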
Andrew Gabriel
2009-Jun-21  22:49 UTC
[zfs-discuss] x4500 resilvering spare taking forever?
Joe Kearney wrote:
> UPDATE: It's now back down to 0.9% complete. Does anyone have a clue
> as to what's happening here or where I can look for problems?

There was also a bug which restarted resilvers each time you issue a
'zpool status' command as a privileged user. Make sure to check the
progress by issuing 'zpool status' as a non-privileged user.

-- 
Andrew
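In practice that means polling from an ordinary user account; something
like the loop below works, with the '$' prompt marking a non-root shell
(the sample line is just the scrub status from the output above):

$ while true; do zpool status bigpool | grep scrub; sleep 600; done
 scrub: resilver in progress, 4.66% done, 12h16m to go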
> Yep, it also suffers from the bug that restarts resilvers when you
> take a snapshot. This was fixed in b94.
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667
> -- richard

Hats off to Richard for saving the day. This was exactly the issue. I
shut off my automatic snapshots and 3 days later my resilver is done.

Joe