Michael Donaghy
2010-May-17 08:26 UTC
[zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?
Hi,

I recently moved to a FreeBSD/ZFS system for the sake of data integrity, after losing my data on Linux. I've now had my first hard disk failure; the BIOS refused to even boot with the failed drive (ad18) connected, so I removed it. I have another drive, ad16, which had enough space to replace the failed one, so I partitioned it and attempted to use "zpool replace" to replace the failed partitions with new ones, i.e. "zpool replace tank ad18s1d ad16s4d". This seemed to simply hang, with no processor or disk use; any "zpool status" commands also hung. Eventually I attempted to reboot the system, which also eventually hung; after waiting a while, having no other option, rightly or wrongly, I hard-rebooted. Exactly the same behaviour happened with the other zpool replace.

Now, my zpool status looks like:

    arcueid ~ $ zpool status
      pool: tank
     state: DEGRADED
     scrub: none requested
    config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            ad4s1d     ONLINE       0     0     0
            ad6s1d     ONLINE       0     0     0
            ad9s1d     ONLINE       0     0     0
            ad17s1d    ONLINE       0     0     0
            replacing  DEGRADED     0     0     0
              ad18s1d  UNAVAIL      0 9.62K     0  cannot open
              ad16s4d  ONLINE       0     0     0
            ad20s1d    ONLINE       0     0     0
          raidz2       DEGRADED     0     0     0
            ad4s1e     ONLINE       0     0     0
            ad6s1e     ONLINE       0     0     0
            ad17s1e    ONLINE       0     0     0
            replacing  DEGRADED     0     0     0
              ad18s1e  UNAVAIL      0 11.2K     0  cannot open
              ad16s4e  ONLINE       0     0     0
            ad20s1e    ONLINE       0     0     0

    errors: No known data errors

It looks like the replace has taken in some sense, but ZFS doesn't seem to be resilvering as it should. Attempting to zpool offline doesn't work:

    arcueid ~ # zpool offline tank ad18s1d
    cannot offline ad18s1d: no valid replicas

Attempting to scrub causes a similar hang to before. Data is still readable (from the zvol which is the only thing actually on this filesystem), although slowly.

What should I do to recover this / trigger a proper replace of the failed partitions?

Many thanks,
Michael
Michael Donaghy
2010-May-21 16:32 UTC
[zfs-discuss] zpool replace lockup / replace process now stalled, how to fix?
For the record, in case anyone else experiences this behaviour: I tried various things which failed, and finally, as a last-ditch effort, upgraded my FreeBSD, giving me zpool v14 rather than v13, and now it's resilvering as it should.

Michael
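For anyone wanting to check for the same symptom, here is a minimal sketch in Python that scans saved "zpool status" text for "replacing" vdevs while no resilver is reported. It is not an official tool: the function name stalled_replacements and the SAMPLE text are hypothetical, the sample is abbreviated from the output posted above, and its indentation is assumed because the archive flattened the original. The sketch also assumes that a running resilver shows up as the phrase "resilver in progress" somewhere in the status text.

```python
# Abbreviated from the status output in this thread; indentation is assumed.
SAMPLE = """\
  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz2       DEGRADED     0     0     0
            ad4s1d     ONLINE       0     0     0
            replacing  DEGRADED     0     0     0
              ad18s1d  UNAVAIL      0 9.62K     0  cannot open
              ad16s4d  ONLINE       0     0     0
            ad20s1d    ONLINE       0     0     0

errors: No known data errors
"""


def stalled_replacements(status_text):
    """Return (old, new) device pairs listed under 'replacing' vdevs
    when the status text does not report a resilver in progress."""
    lines = status_text.splitlines()

    # Assumption: an active resilver is announced with this phrase.
    if any("resilver in progress" in line for line in lines):
        return []

    pairs = []
    i = 0
    while i < len(lines):
        line = lines[i]
        fields = line.split()
        if fields and fields[0] == "replacing":
            indent = len(line) - len(line.lstrip())
            children = []
            j = i + 1
            # Children of the 'replacing' vdev are indented deeper than it.
            while j < len(lines):
                child = lines[j]
                if not child.strip():
                    break
                if len(child) - len(child.lstrip()) <= indent:
                    break
                children.append(child.split()[0])
                j += 1
            if len(children) == 2:
                pairs.append((children[0], children[1]))
            i = j
        else:
            i += 1
    return pairs


if __name__ == "__main__":
    print(stalled_replacements(SAMPLE))
```

A non-empty result only means a replace is pending with no resilver reported at that moment; it does not say why. In this thread the cause turned out to be specific to the older pool version and went away after the FreeBSD upgrade.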