I've got ZFS running on Solaris s10x_u3wos_10 X86 on a v40z, which has two PCI SCSI controllers, each connected to its own external HP disk array (MSA30) with 7 disks + hot spare.

Both controllers are:
  LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI

The disks are a mix of:
  COMPAQ-BD3008A4C6-HPB4-279.40GB
  COMPAQ-BD30089BBA-HPB1-279.40GB
  COMPAQ-BD3008856C-HPB2-279.40GB

For the past few months, we've seen behavior from ZFS that we wouldn't expect. We've had previous issues where a particular disk's service time went through the roof (while other disks in the same pool were idle), and we had to reboot because the pool locked up.

The most recent issue happened today: the pool locked up and we couldn't do anything about it besides reboot the system. We couldn't run zpool status, we couldn't run df -k; every command related to I/O just seemed to hang.

When the system came back up, ZFS showed one of the disks as UNAVAIL.

# zpool status
  pool: dbzpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Mon Feb  4 17:16:39 2008
config:

        NAME        STATE     READ WRITE CKSUM
        dbzpool     DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c2t8d0  ONLINE       0     0     0
            c3t8d0  UNAVAIL      0     0     0  cannot open
        spares
          c2t15d0   AVAIL
          c3t15d0   AVAIL

errors: No known data errors

I've tried:

# zpool offline dbzpool c3t8d0
cannot offline c3t8d0: no valid replicas

# zpool replace dbzpool c3t8d0
cannot replace c3t8d0 with c3t8d0: c3t8d0 is busy

# zpool online dbzpool c3t8d0
Bringing device c3t8d0 online

Note that even though the last command seems to succeed, the disk's status remains UNAVAIL.

I've also tried writing to the disk directly, both before and after the above zpool commands.

# dd if=/dev/zero of=/dev/rdsk/c3t8d0s0 bs=1024 count=1048576
1048576+0 records in
1048576+0 records out

# smartctl -H /dev/rdsk/c3t8d0s0
smartctl version 5.37 [i386-pc-solaris2.10] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

# iostat -nx 5 2 | grep c3t8
    0.5  203.7    6.0 1648.2  0.0  0.0    0.0    0.1   0   3 c3t8d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3t8d0

After all this data, my questions are as follows:

1. What do I have to do (short of replacing the seemingly good disk) to get c3t8d0 back online?

2. Is there an alternative to the seemingly necessary reboot when the ZFS pool locks?

3. Is the pool locking due to a possible problem in u3 that is addressed in u4 and beyond?

-- 
Jeremy Kister
http://jeremy.kister.net./
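
For anyone skimming a long `zpool status` listing like the one in this message, here is a minimal sketch that picks out the rows whose state is neither ONLINE nor AVAIL. It is not part of the original post. It assumes the status text is in its usual one-row-per-line layout, and the names `unhealthy_devices` and `SAMPLE` are made up for illustration. It only reads text and runs no zpool commands.

```python
# Hypothetical helper: scan the "config:" table of `zpool status` text and
# report every row whose STATE column is not ONLINE or AVAIL.
# SAMPLE is a shortened copy of the output quoted in this message.

SAMPLE = """\
  pool: dbzpool
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        dbzpool     DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c2t8d0  ONLINE       0     0     0
            c3t8d0  UNAVAIL      0     0     0  cannot open
        spares
          c2t15d0   AVAIL
          c3t15d0   AVAIL

errors: No known data errors
"""


def unhealthy_devices(status_text):
    """Return (name, state) pairs for config rows not ONLINE or AVAIL."""
    healthy = {"ONLINE", "AVAIL"}
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # blank line
        if fields[0] == "config:":
            in_config = True              # the device table starts here
            continue
        if fields[0] == "errors:":
            break                         # the device table ends here
        if not in_config or fields[0] == "NAME" or len(fields) < 2:
            continue                      # header rows such as "spares"
        name, state = fields[0], fields[1]
        if state not in healthy:
            rows.append((name, state))
    return rows


if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE):
        print(name, state)
```

The pool row and the degraded mirror row are reported along with the leaf device, since all three carry a non-ONLINE state in the table.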
On 2/4/2008 5:40 PM, Jeremy Kister wrote:
> 1. What do I have to do (short of replacing the seemingly good disk) to
> get c3t8d0 back online?

I did find a related thread:
http://mail.opensolaris.org/pipermail/zfs-discuss/2006-June/032179.html
but the thread ended without a resolution.

Also, format seems to think that c3t8d0 doesn't belong to the zpool: it reports c3t15d0 as a hot spare for dbzpool, but prints no ZFS message at all for c3t8d0.

# format
Searching for disks...done
[...]
      16. c3t8d0 <COMPAQ-BD3008856C-HPB2-279.40GB>
          /pci@1d,0/pci1022,7450@2/pci1000,10c0@1/sd@8,0
      17. c3t15d0 <COMPAQ-BD3008A4C6-HPB4-279.40GB>
          /pci@1d,0/pci1022,7450@2/pci1000,10c0@1/sd@f,0
Specify disk (enter its number): 17
selecting c3t15d0
[disk formatted]
/dev/dsk/c3t15d0s0 is reserved as a hot spare for ZFS pool dbzpool.  Please see zpool(1M).
[...]
Specify disk (enter its number)[17]: 16
selecting c3t8d0
[disk formatted]
format> quit

Suggestions?

-- 
Jeremy Kister
http://jeremy.kister.net./
On 2/5/2008 2:45 PM, Jeremy Kister wrote:
>> 1. What do I have to do (short of replacing the seemingly good disk) to
>> get c3t8d0 back online?

I ended up applying patches 124205-05 and 118855-36. Things are much better now, but there are still [at least] two issues remaining.

With my zpool in good condition, I yanked out c2t2d0 and c3t3d0. See
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-February/045646.html
for configuration details.

After the two spares came online and the array was resilvered, I ran 'cp /dev/zero /dbzpool/bigfile' to simulate I/O load, pushed c2t2d0 back in, and in another terminal typed zpool replace dbzpool c2t2d0.

This caused the whole system to lock up. Neither of my terminals was accepting commands, I couldn't establish new ssh sessions (the sockets opened, but I couldn't get a shell after entering my password), and the postgres database that was running on the machine was not answering queries from remote hosts.

After 8 hours of this behavior, I power cycled the v40z. When it came back, zpool status showed:

# zpool status
  pool: dbzpool
 state: ONLINE
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        dbzpool        ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t0d0     ONLINE       0     0     0
            c3t0d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t1d0     ONLINE       0     0     0
            c3t1d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c2t2d0   ONLINE       0     0     0
              c3t15d0  ONLINE       0     0     0
            c3t2d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c3t3d0   ONLINE       0     0     0
              c2t15d0  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t4d0     ONLINE       0     0     0
            c3t4d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t5d0     ONLINE       0     0     0
            c3t5d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c3t8d0     ONLINE       0     0     0
        spares
          c2t15d0      INUSE     currently in use
          c3t15d0      INUSE     currently in use

errors: No known data errors

Question 1: How do I get the spares to go back into 'spare' mode?

Question 2: Why did the system lock up when I issued the zpool replace?

-- 
Jeremy Kister
http://jeremy.kister.net./
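
To make it easier to see which hot spare is standing in for which disk in a listing like the one in this message, here is a minimal sketch that reads `zpool status` text and pairs the devices under each `spare` row. It is not part of the original post. It assumes the normal one-row-per-line layout, where nesting is shown by indentation, and it assumes the first child of a `spare` row is the original disk and the second is the hot spare, as in the output quoted here. The names `spare_pairs` and `SAMPLE` are made up for illustration. The sketch runs no zpool commands.

```python
# Hypothetical helper: pair each original disk with the hot spare that is
# attached alongside it under a "spare" row of `zpool status` text.
# SAMPLE is a shortened copy of the output quoted in this message.

SAMPLE = """\
  pool: dbzpool
 state: ONLINE
config:

        NAME           STATE     READ WRITE CKSUM
        dbzpool        ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c2t2d0   ONLINE       0     0     0
              c3t15d0  ONLINE       0     0     0
            c3t2d0     ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c2t3d0     ONLINE       0     0     0
            spare      ONLINE       0     0     0
              c3t3d0   ONLINE       0     0     0
              c2t15d0  ONLINE       0     0     0
        spares
          c2t15d0      INUSE     currently in use
          c3t15d0      INUSE     currently in use

errors: No known data errors
"""


def spare_pairs(status_text):
    """Return (original_device, hot_spare) pairs, one per 'spare' row."""
    groups = []           # one list of child device names per 'spare' row
    spare_indent = None   # indentation of the 'spare' row being read, if any
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # blank line
        if fields[0] == "config:":
            in_config = True              # the device table starts here
            continue
        if fields[0] == "errors:":
            break                         # the device table ends here
        if not in_config or fields[0] == "NAME":
            continue
        indent = len(line) - len(line.lstrip())
        if spare_indent is not None and indent > spare_indent:
            groups[-1].append(fields[0])  # a child of the open 'spare' row
            continue
        spare_indent = None               # indentation dropped: row is done
        if fields[0] == "spare":
            spare_indent = indent
            groups.append([])
    return [(g[0], g[1]) for g in groups if len(g) >= 2]


if __name__ == "__main__":
    for original, spare in spare_pairs(SAMPLE):
        print(original, "is paired with hot spare", spare)
```

On question 1, zpool(1M) describes `zpool detach` on an in-use hot spare as the way to return it to the spares list. Nothing in this thread confirms that it works on this system, so treat it as something to check rather than as a tested fix.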