Dersaidin
2009-Aug-03 10:35 UTC
[zfs-discuss] Pool stuck in FAULTED / DEGRADED after disk replacement
Hello,
Recently, one of the disks in a raidz1 on my OpenSolaris (snv_118)
file server failed.
It continued operating in a DEGRADED state for a day or so until I noticed,
at which point I shut the machine down, removed the faulted disk, and
powered it back on (to confirm I had removed the correct disk).
When I replaced the disk, I wasn't able to bring the pool out of the
FAULTED state.
I know I replaced the correct disk, and I can see that the disk (now
replaced) was bad, as it generates hard errors if I reconnect it.
(On a side note, those disk hard errors really kill the system's
performance - if the rpool disk is still fine, why do they have such an
impact?)
Hardware:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 4-core
8GB (2G + 2G + 2G + 2G); 8G maximum
Gigabyte EP45-DS3 Motherboard
Intel Corporation 82801JI (ICH10 Family) 2 port SATA IDE Controller
Intel Corporation 82801JI (ICH10 Family) 4 port SATA IDE Controller
The pool originally had 5 Western Digital 1TB disks in raidz1 (all SATA II disks).
I replaced the failed one with an equivalent Seagate model (due to stock at the shop).
My rpool is c8d0.
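For reference, the device names above can be confirmed by listing the
disks non-interactively, e.g. something like:

  echo | format    # prints format's AVAILABLE DISK SELECTIONS menu and exits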
The pool:
        NAME       STATE     READ WRITE CKSUM
        data1      FAULTED      0     0     1  bad intent log
          raidz1   DEGRADED     0     0     6
            c8d1   ONLINE       0     0     0
            c10d0  UNAVAIL      0     0     0  cannot open
            c11d0  ONLINE       0     0     0
            c9d0   ONLINE       0     0     0
            c9d1   ONLINE       0     0     0
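The listing above is trimmed from the output of something like:

  zpool status -v data1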
As the raidz1 state is DEGRADED and 4 of the 5 disks are ONLINE, I'm
confident no data has been lost (yet) beyond those few CKSUM errors.
Losing the 6 files with failed checksums while the raidz was missing its
parity is acceptable, especially compared to losing the entire pool.
`zpool clear` is the action suggested in the documentation, as I am not
interested in preserving the intent log.
When I tried this, the command failed; it also marked all the devices in
the pool as FAULTED until I rebooted, after which they went back to the
states shown above.
The new disk is c0d0 instead of c10d0.
The core of the problem is this:
`zpool clear data1` - fails because c10d0 is unavailable; this also marks
all the devices as FAULTED.
`zpool replace data1 c10d0 c0d0` - fails because the pool is faulted and
therefore inaccessible.
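In other words, the two commands depend on each other; a sketch of the
circle, with comments on where each step breaks:

  zpool clear data1                 # fails - c10d0 cannot be opened, and every device flips to FAULTED
  zpool replace data1 c10d0 c0d0    # fails - the pool itself is now FAULTED and inaccessible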
I do not know if these failures might be caused by those CKSUM errors instead.
Perhaps something along the lines of a `zpool clear -f data1` is required?
I found a couple of threads on the mailing list where a very similar
issue was described:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025481.html
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/025574.html
Things I have tried:
- zpool clear data1 - fails due to c10d0 being unavailable.
- zpool online data1 c10d0 - fails due to the pool being faulted and
therefore inaccessible.
- zpool replace data1 c10d0 c0d0 - fails due to the pool being faulted
and therefore inaccessible.
- zpool replace -f data1 c10d0 c0d0 - fails as above
- restarting, then replacing or clearing - fails as above
- symlinking the new device nodes to the old names (ln -s c0d0* c10d0*),
then replacing - fails as above
- removing /etc/zfs/zpool.cache and restarting (sketched after this list)
- fails to import (`zpool import` shows all devices as FAULTED)
- restarting with the new hard disk not attached
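The zpool.cache attempt was essentially the following (as best I remember
the exact order; some of the output is in the commands.txt link below):

  rm /etc/zfs/zpool.cache    # make ZFS forget the cached pool configuration
  reboot
  zpool import               # the pool is found, but every device is reported as FAULTED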
Some output from these attempts:
http://dersaidin.ath.cx/other/osol/commands.txt
Output from zdb -l for each disk.
http://dersaidin.ath.cx/other/osol/zdbout.txt
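The labels in that file were read with zdb -l against each disk's device
node, something like the following (the s0 slice name is my assumption -
use whichever slice holds the pool data):

  zdb -l /dev/dsk/c8d1s0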
Thanks,
Andrew Browne