I've got ZFS running on Solaris s10x_u3wos_10 X86 on a v40z, which has
two PCI SCSI controllers, each connected to its own external HP
Diskarray (MSA30) with 7 disks + hot spare.
Both controllers are:
LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI
The disks are a mix of:
COMPAQ-BD3008A4C6-HPB4-279.40GB
COMPAQ-BD30089BBA-HPB1-279.40GB
COMPAQ-BD3008856C-HPB2-279.40GB
For the past few months, we've seen behavior from ZFS that we wouldn't
expect. In previous incidents, a particular disk's service time went
through the roof (while other disks in the same pool were idle), and we
had to reboot because the pool locked up.
The most recent issue happened today: the ZFS pool locked up and we
couldn't do anything about it besides reboot the system. We couldn't run
zpool status or df -k; every command related to I/O just seemed to hang.
When the system came back up, ZFS showed one of the disks as UNAVAIL.
# zpool status
  pool: dbzpool
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Mon Feb 4 17:16:39 2008
config:
        NAME         STATE     READ WRITE CKSUM
        dbzpool      DEGRADED     0     0     0
          mirror     ONLINE       0     0     0
            c2t0d0   ONLINE       0     0     0
            c3t0d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t1d0   ONLINE       0     0     0
            c3t1d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
            c3t2d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c3t3d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c3t4d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c3t5d0   ONLINE       0     0     0
          mirror     DEGRADED     0     0     0
            c2t8d0   ONLINE       0     0     0
            c3t8d0   UNAVAIL      0     0     0  cannot open
        spares
          c2t15d0    AVAIL
          c3t15d0    AVAIL
errors: No known data errors
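With this many vdevs it is easy to miss the one bad line, so here is a
minimal sketch of a filter that prints only the unhealthy entries from
zpool status output. The function name unhealthy_vdevs is made up for
illustration, and it assumes the device table sits between the
"config:" and "errors:" lines as in the output above; older Solaris
releases may need nawk in place of awk.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): reads `zpool status` text on stdin
# and prints the name and state of every vdev that is not healthy.
unhealthy_vdevs() {
    awk '
        $1 == "config:" { in_config = 1; next }   # device table starts here
        $1 == "errors:" { in_config = 0 }         # device table ends here
        # Inside the table, field 1 is the vdev name and field 2 its state.
        in_config && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            print $1, $2
        }
    '
}
```

It would be used as: zpool status dbzpool | unhealthy_vdevs. Restricting
the match to the config section keeps the top-level "state: DEGRADED"
line from being reported as a device.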
I've tried:
# zpool offline dbzpool c3t8d0
cannot offline c3t8d0: no valid replicas
# zpool replace dbzpool c3t8d0
cannot replace c3t8d0 with c3t8d0: c3t8d0 is busy
# zpool online dbzpool c3t8d0
Bringing device c3t8d0 online
Note that even though the last command seems fruitful, the disk's status
remains UNAVAIL.
I've also tried writing to the disk directly, both before and after the
above zpool commands.
# dd if=/dev/zero of=/dev/rdsk/c3t8d0s0 bs=1024 count=1048576
1048576+0 records in
1048576+0 records out
# smartctl -H /dev/rdsk/c3t8d0s0
smartctl version 5.37 [i386-pc-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
SMART Health Status: OK
# iostat -nx 5 2 | grep c3t8
0.5 203.7 6.0 1648.2 0.0 0.0 0.0 0.1 0 3 c3t8d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c3t8d0
After all this data, my questions are as follows:
1. What do I have to do (short of replacing the seemingly good disk) to
get c3t8d0 back online?
2. Is there an alternative to the seemingly necessary reboot when the
zfs pool locks?
3. Is the pool locking due to a possible problem in u3 that is addressed
in u4 and beyond?
--
Jeremy Kister
http://jeremy.kister.net./
On 2/4/2008 5:40 PM, Jeremy Kister wrote:
> 1. What do I have to do (short of replacing the seemingly good disk) to
> get c3t8d0 back online?
I did find a related thread:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-June/032179.html
but the thread ended without a resolution.
Also, format seems to think that c3t8d0 doesn't belong to the zpool.
# format
Searching for disks...done
[...]
16. c3t8d0 <COMPAQ-BD3008856C-HPB2-279.40GB>
    /pci@1d,0/pci1022,7450@2/pci1000,10c0@1/sd@8,0
17. c3t15d0 <COMPAQ-BD3008A4C6-HPB4-279.40GB>
    /pci@1d,0/pci1022,7450@2/pci1000,10c0@1/sd@f,0
Specify disk (enter its number): 17
selecting c3t15d0
[disk formatted]
/dev/dsk/c3t15d0s0 is reserved as a hot spare for ZFS pool dbzpool.
Please see zpool(1M).
[...]
Specify disk (enter its number)[17]: 16
selecting c3t8d0
[disk formatted]
format> quit
Suggestions?
--
Jeremy Kister
http://jeremy.kister.net./
On 2/5/2008 2:45 PM, Jeremy Kister wrote:
>> 1. What do I have to do (short of replacing the seemingly good disk) to
>> get c3t8d0 back online?
I ended up applying patches 124205-05 and 118855-36. Things are much
better now, but there are still [at least] two issues remaining.
With my zpool in good condition, I yanked out c2t2d0 and c3t3d0. See
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-February/045646.html
for configuration details.
After the two spares came online and the array was resilvered, I ran
'cp /dev/zero /dbzpool/bigfile' to simulate I/O load, pushed c2t2d0 back
in, and in another terminal typed zpool replace dbzpool c2t2d0.
This caused the whole system to lock up: neither of my terminals was
accepting commands, I couldn't establish new ssh sessions (the sockets
opened, but I couldn't get a shell after entering my password), and the
postgres database that was running on the machine was not answering
queries from remote hosts.
After 8 hours of this behavior, I power cycled the v40z. When it came
back, zpool status showed:
# zpool status
  pool: dbzpool
 state: ONLINE
 scrub: none requested
config:
        NAME           STATE     READ WRITE CKSUM
        dbzpool        ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t0d0     ONLINE       0     0     0
            c3t0d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t1d0     ONLINE       0     0     0
            c3t1d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c2t2d0   ONLINE       0     0     0
              c3t15d0  ONLINE       0     0     0
            c3t2d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c3t3d0   ONLINE       0     0     0
              c2t15d0  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t4d0     ONLINE       0     0     0
            c3t4d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t5d0     ONLINE       0     0     0
            c3t5d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c3t8d0     ONLINE       0     0     0
        spares
          c2t15d0      INUSE     currently in use
          c3t15d0      INUSE     currently in use
errors: No known data errors
Question 1: How do I get the spares to go back into 'spare' mode?
Question 2: Why did the system lock up when I issued the zpool replace?
--
Jeremy Kister
http://jeremy.kister.net./