Dersaidin
2009-Aug-03 10:35 UTC
[zfs-discuss] Pool stuck in FAULTED / DEGRADED after disk replacement
Hello,

Recently, one of the disks in a raidz1 on my OpenSolaris (snv_118) file server failed. The pool continued operating in a DEGRADED state for a day or so until I noticed, at which point I removed the faulted disk and powered the machine back on (to confirm I had removed the correct disk). When I replaced the disk, I wasn't able to bring the pool out of FAULTED. I know I replaced the correct disk, and I can see that the disk I removed really was bad, as it generates hard errors if I reconnect it. (On a side note, those disk hard errors really kill system performance - if the rpool disk is still fine, why do the errors have such a performance impact?)

Hardware:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 4-core
8GB RAM (2G + 2G + 2G + 2G); 8G maximum
Gigabyte EP45-DS3 motherboard
Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller
Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller

The pool originally had 5 Western Digital 1TB disks in raidz1 (all SATA2 disks). I replaced one with an equivalent Seagate model (due to stock at the shop). My rpool is c8d0.

The pool:

        NAME        STATE     READ WRITE CKSUM
        data1       FAULTED      0     0     1  bad intent log
          raidz1    DEGRADED     0     0     6
            c8d1    ONLINE       0     0     0
            c10d0   UNAVAIL      0     0     0  cannot open
            c11d0   ONLINE       0     0     0
            c9d0    ONLINE       0     0     0
            c9d1    ONLINE       0     0     0

As the raidz1 state is DEGRADED and 4/5 of my disks are ONLINE, I'm confident no data has been lost (yet) except those few CKSUM errors. Losing the 6 files with failed checksums while the raidz was missing parity is acceptable, especially compared to losing the entire pool.

`zpool clear` is the action suggested in the documentation, as I am not interested in preserving the intent log. When I tried this, the command failed. It also marked all the devices in the pool FAULTED until I rebooted, after which they went back to the state shown above.

The new disk is c0d0 instead of c10d0. The core of the problem is this:

`zpool clear data1` - fails because c10d0 is unavailable; this puts all the devices into FAULTED.
`zpool replace data1 c10d0 c0d0` - fails because the pool is faulted and therefore inaccessible.

I do not know whether these failures might instead be caused by those CKSUM errors. Perhaps something along the lines of a `zpool clear -f data1` is required?

I found a couple of threads on the mailing list where a very similar issue was described:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025481.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025574.html

Things I have tried:
- zpool clear data1 - fails because c10d0 is unavailable.
- zpool online data1 c10d0 - fails because the pool is faulted and therefore inaccessible.
- zpool replace data1 c10d0 c0d0 - fails because the pool is faulted and therefore inaccessible.
- zpool replace -f data1 c10d0 c0d0 - fails as above.
- Restarting, then replacing or clearing - fails as above.
- ln -s c0d0* c10d0*, then replacing - fails as above.
- Removing /etc/zfs/zpool.cache, then restarting - fails to import (`zpool import` shows all devices as FAULTED).
- Restarting with the new hard disk not attached.

Some output from these attempts:
http://dersaidin.ath.cx/other/osol/commands.txt

Output from zdb -l for each disk:
http://dersaidin.ath.cx/other/osol/zdbout.txt

Thanks,
Andrew Browne
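
P.S. A minimal sketch of how the per-disk label dump linked above could be reproduced - the /dev/dsk paths and the s0 slice are assumptions, so substitute whatever device nodes the disks actually appear under:

    # dump the ZFS labels of every device in the raidz1 (plus the new c0d0)
    for d in c8d1 c0d0 c11d0 c9d0 c9d1; do
        echo "== ${d} =="
        zdb -l /dev/dsk/${d}s0
    done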