Jorgen Lundman
2009-Aug-06  06:07 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
x4540 snv_117
We lost a HDD last night, and it seemed to take out most of the bus or
something and forced us to reboot. (We have yet to experience losing a
disk that didn't force a reboot mind you).
So today, I'm looking at replacing the broken HDD, but no amount of work
makes it "turn on the blue LED". After trying that for an hour, we just
replaced the HDD anyway. But no amount of work will make it
use/recognise it. (We tried more than one working spare HDD too).
For example:
# zpool status
           raidz1      DEGRADED     0     0     0
             c5t1d0    ONLINE       0     0     0
             c0t5d0    ONLINE       0     0     0
             spare     DEGRADED     0     0  285K
               c1t5d0  UNAVAIL      0     0     0  cannot open
               c4t7d0  ONLINE       0     0     0  4.13G resilvered
             c2t5d0    ONLINE       0     0     0
             c3t5d0    ONLINE       0     0     0
         spares
           c4t7d0      INUSE     currently in use
# zpool offline zpool1 c1t5d0
           raidz1      DEGRADED     0     0     0
             c5t1d0    ONLINE       0     0     0
             c0t5d0    ONLINE       0     0     0
             spare     DEGRADED     0     0  285K
               c1t5d0  OFFLINE      0     0     0
               c4t7d0  ONLINE       0     0     0  4.13G resilvered
             c2t5d0    ONLINE       0     0     0
             c3t5d0    ONLINE       0     0     0
# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t5d0                 disk         connected    configured   failed
# cfgadm -c unconfigure c1::dsk/c1t5d0
# cfgadm -al
c1::dsk/c1t5d0                 disk         connected    configured   failed
# cfgadm -c unconfigure c1::dsk/c1t5d0
# cfgadm -c unconfigure c1::dsk/c1t5d0
# cfgadm -fc unconfigure c1::dsk/c1t5d0
# cfgadm -fc unconfigure c1::dsk/c1t5d0
# cfgadm -al
c1::dsk/c1t5d0                 disk         connected    configured   failed
# hdadm offline slot 13
  1:    5:    9:   13:   17:   21:   25:   29:   33:   37:   41:   45:
c0t1  c0t5  c1t1  c1t5  c2t1  c2t5  c3t1  c3t5  c4t1  c4t5  c5t1  c5t5
^b+   ^++   ^b+   ^--   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
# cfgadm -al
c1::dsk/c1t5d0                 disk         connected    configured   failed
  # fmadm faulty
FRU         : "HD_ID_47" 
(hc://:product-id=Sun-Fire-X4540:chassis-id=0915AMR048:server-id=x4500-10.unix:serial=9QMB024K:part=SEAGATE-ST35002NSSUN500G-09107B024K:revision=SU0D/chassis=0/bay=47/disk=0)
                   faulty
  # fmadm repair HD_ID_47
fmadm: recorded repair to HD_ID_47
  # format | grep c1t5d0
  #
  # hdadm offline slot 13
  1:    5:    9:   13:   17:   21:   25:   29:   33:   37:   41:   45:
c0t1  c0t5  c1t1  c1t5  c2t1  c2t5  c3t1  c3t5  c4t1  c4t5  c5t1  c5t5
^b+   ^++   ^b+   ^--   ^++   ^++   ^++   ^++   ^++   ^++   ^++   ^++
  # cfgadm -al
c1::dsk/c1t5d0                 disk         connected    configured   failed
  # ipmitool sunoem led get|grep 13
  hdd13.fail.led   | ON
  hdd13.ok2rm.led  | OFF
# zpool online zpool1 c1t5d0
warning: device 'c1t5d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
# cfgadm -c disconnect c1::dsk/c1t5d0
cfgadm: Hardware specific failure: operation not supported for SCSI device
Bah, why were they changed to SCSI? Increasing the size of the hammer...
# cfgadm -x replace_device c1::sd37
Replacing SCSI device: /devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0
This operation will suspend activity on SCSI bus: c1
Continue (yes/no)? y
SCSI bus quiesced successfully.
It is now safe to proceed with hotplug operation.
Enter y if operation is complete or n to abort (yes/no)? y
# cfgadm -al
c1::dsk/c1t5d0                 disk         connected    configured   failed
I am fairly certain that if I reboot, it will all come back ok again.
But I would like to believe that I should be able to replace a disk
without rebooting on an X4540.
Any other commands I should try?
Lund
-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
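
For cross-referencing the names used in this message: in the hdadm rows shown, the slot number works out to 8 x controller + target (c0t1 is slot 1, c1t5 is slot 13, c5t5 is slot 45). A minimal Python sketch of that arithmetic follows. The function name `hdadm_slot` is made up for illustration, and the formula is only inferred from the twelve slots printed above, not taken from any hdadm documentation. fmadm reports the faulty FRU as HD_ID_47 (bay=47), which appears to be a different numbering scheme that this sketch does not cover.

```python
import re

def hdadm_slot(ctd):
    """Map a cXtY or cXtYdZ disk name to the slot number hdadm prints.

    Assumption: slot = 8 * controller + target, inferred only from the
    twelve slots visible in the hdadm output quoted in this thread.
    """
    m = re.fullmatch(r"c(\d+)t(\d+)(?:d\d+)?", ctd)
    if m is None:
        raise ValueError("not a cXtY[dZ] name: %r" % ctd)
    controller, target = int(m.group(1)), int(m.group(2))
    return 8 * controller + target

# The slot -> disk columns exactly as hdadm printed them.
SHOWN = {1: "c0t1", 5: "c0t5", 9: "c1t1", 13: "c1t5", 17: "c2t1", 21: "c2t5",
         25: "c3t1", 29: "c3t5", 33: "c4t1", 37: "c4t5", 41: "c5t1", 45: "c5t5"}

# Check the inferred formula against every column that was shown.
for slot, disk in SHOWN.items():
    assert hdadm_slot(disk) == slot

print(hdadm_slot("c1t5d0"))
```

Slots that hdadm did not print in this thread are untested, so treat any result for them as a guess until checked on the machine.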
Jorgen Lundman
2009-Aug-06  06:48 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
I suspect this is what it is all about:

# devfsadm -v
devfsadm[16283]: verbose: no devfs node or mismatched dev_t for /devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0:a
[snip]

and indeed:

brw-r-----   1 root     sys       30, 2311 Aug  6 15:34 sd@4,0:wd
crw-r-----   1 root     sys       30, 2311 Aug  6 15:24 sd@4,0:wd,raw
drwxr-xr-x   2 root     sys            2 Aug  6 14:31 sd@5,0
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@6,0
brw-r-----   1 root     sys       30, 2432 Jul  6 09:50 sd@6,0:a
crw-r-----   1 root     sys       30, 2432 Jul  6 09:48 sd@6,0:a,raw

Perhaps because it was booted with the dead disk in place, it never
configured the entire "sd5" mpt driver. Why the other hard-disks work I
don't know.

I suspect the only way to fix this, is to reboot again.

Lund
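
The check done by eye in this message (a target directory exists under /devices but has no block or character nodes next to it) can be sketched in Python. This is only an illustration: `targets_without_nodes` is a hypothetical helper, the sample text is the `ls -l` excerpt from this message with node names written using `@` as they appear on a live system, and it assumes the usual `ls -l` column layout with the name in the last field.

```python
def targets_without_nodes(ls_output):
    """Return target directories (e.g. 'sd@5,0') that have no device nodes.

    A device node is a block ('b') or character ('c') entry whose name is
    '<target>:<minor name>'; a target is a directory ('d') entry.
    """
    dirs = []
    with_nodes = set()
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 9:          # skips blank lines and "total N"
            continue
        mode, name = fields[0], fields[-1]
        if mode.startswith("d"):
            dirs.append(name)
        elif mode[0] in "bc":
            with_nodes.add(name.split(":", 1)[0])
    return [d for d in dirs if d not in with_nodes]

SAMPLE = """\
brw-r-----   1 root     sys       30, 2311 Aug  6 15:34 sd@4,0:wd
crw-r-----   1 root     sys       30, 2311 Aug  6 15:24 sd@4,0:wd,raw
drwxr-xr-x   2 root     sys            2 Aug  6 14:31 sd@5,0
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@6,0
brw-r-----   1 root     sys       30, 2432 Jul  6 09:50 sd@6,0:a
crw-r-----   1 root     sys       30, 2432 Jul  6 09:48 sd@6,0:a,raw
"""

print(targets_without_nodes(SAMPLE))
```

On the excerpt this reports only `sd@5,0`, the target for c1t5d0. It says nothing about why the nodes are missing.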
Brent Jones
2009-Aug-06  07:38 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
On Wed, Aug 5, 2009 at 11:48 PM, Jorgen Lundman<lundman at gmo.jp> wrote:
> I suspect the only way to fix this, is to reboot again.

I have a pair of X4540s also, and getting any kind of drive status, or
failure alert is a lost cause.

I've opened several cases with Sun with the following issues:
ILOM/BMC can't see any drives (status, FRU, firmware, etc)
FMA cannot see a drive failure (you can pull a drive, and it could be hours before 'zpool status' will show a failed drive, even during a 'zpool scrub')
Hot swapping drives rarely works, system will not see new drive until a reboot

Things I've tried that Sun has suggested:
New BIOS
New controller firmware
New ILOM firmware
Upgrading to new releases of Osol (currently on 118, no luck)
Replacing ILOM card
Custom FMA configs

Nothing works, and my cases with Sun have been open for about 6 months
now, with no resolution in sight.

Given that Sun now makes the 7000, I can only assume their support on
the more "whitebox" version, AKA X4540, is either near an end, or they
don't intend to support any advanced monitoring whatsoever.

Sad, really.. as my $900 Dell and HP servers can send SMS, Jabber
messages, SNMP traps, etc, on ANY IPMI event, hardware issue, and what
have you without any tinkering or excuses.

-- 
Brent Jones
brent at servuhome.net
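
Where FMA and the ILOM give no alerts, one stopgap is to poll `zpool status` and flag any device that is not healthy. The Python sketch below shows the parsing side only. It is not something suggested by Sun or anyone in this thread; `unhealthy_devices` and `read_zpool_status` are hypothetical names, the list of states is not exhaustive, and the sample text is the degraded raidz1 group from the first message. It only reports what `zpool status` already says, so it cannot notice a failure any sooner than ZFS itself does.

```python
import subprocess

# States that need no attention. AVAIL is an idle hot spare.
HEALTHY = {"ONLINE", "AVAIL"}
# States recognised in the STATE column (assumption: not an exhaustive list).
KNOWN = HEALTHY | {"DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED", "INUSE"}

def unhealthy_devices(status_text):
    """Return (name, state) for each config row that is not healthy."""
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].endswith(":"):
            continue                 # skips "pool:", "state:", "spares", etc.
        name, state = fields[0], fields[1]
        if state in KNOWN and state not in HEALTHY:
            found.append((name, state))
    return found

def read_zpool_status():
    """Run `zpool status` and return its text (needs a host with ZFS)."""
    result = subprocess.run(["zpool", "status"], capture_output=True,
                            text=True, check=True)
    return result.stdout

SAMPLE = """\
           raidz1      DEGRADED     0     0     0
             c5t1d0    ONLINE       0     0     0
             c0t5d0    ONLINE       0     0     0
             spare     DEGRADED     0     0  285K
               c1t5d0  UNAVAIL      0     0     0  cannot open
               c4t7d0  ONLINE       0     0     0  4.13G resilvered
             c2t5d0    ONLINE       0     0     0
             c3t5d0    ONLINE       0     0     0
         spares
           c4t7d0      INUSE     currently in use
"""

for name, state in unhealthy_devices(SAMPLE):
    print(name, state)
```

A cron job could call `unhealthy_devices(read_zpool_status())` and send mail when the list is non-empty. An in-use spare is reported deliberately, since it means a disk has already dropped out.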
Ross
2009-Aug-06  08:09 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Whoa!

"We have yet to experience losing a disk that didn't force a reboot"

Do you have any notes on how many times this has happened Jorgen, or
what steps you've taken each time?

I appreciate you're probably more concerned with getting an answer to
your question, but if ZFS needs a reboot to cope with failures on even
an x4540, that's an absolute deal breaker for everything we want to do
with ZFS.

Ross
-- 
This message posted from opensolaris.org
Jorgen Lundman
2009-Aug-06  12:47 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Well, to be fair, there were some special cases.

I know we had 3 separate occasions with broken HDDs, when we were using
UFS. 2 of these appeared to hang, and the 3rd only hung once we replaced
the disk. This is most likely due to us using UFS in zvol (for quotas).
We got an IDR patch, and eventually this was released as "UFS 3-way
deadlock writing log with zvol". I forget the number right now, but the
patch is out.

This is the very first time we have lost a disk in a purely-ZFS system,
and I was somewhat hoping that this would be the time everything went
smoothly. But it did not.

However, I have also experienced (once) a disk dying in such a way that
it took out the chain in a netapp, so perhaps the disk died like this
here too (it is really dead).

But still disappointing.

Power cycling the x4540 takes about 7 minutes (service to service), but
with Sol snv_116(?) and up it can do quiesce-reboots, which take about
57 seconds. In this case, we had to power cycle.
Jorgen Lundman
2009-Aug-20  02:57 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Finally came to the reboot maintenance to reboot the x4540 to make it
see the newly replaced HDD.
I tried, reboot, then power-cycle, and reboot -- -r,
but I can not make the x4540 accept any HDD in that bay. I'm starting to
think that perhaps we did not lose the original HDD, but rather the
slot, and there is a hardware problem.
This is what I see after a reboot, the disk is c1t5d0, sd37, sd@5,0 or
slot 13.
c1::dsk/c1t4d0                 disk         connected    configured   unknown
c1::dsk/c1t5d0                 disk         connected    configured   unknown
c1::dsk/c1t6d0                 disk         connected    configured   unknown
# devfsadm -v
devfsadm[893]: verbose: no devfs node or mismatched dev_t for /devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0:a
devfsadm[893]: verbose: symlink /dev/dsk/c1t5d0s0 -> ../../devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0:a
devfsadm[893]: verbose: no devfs node or mismatched dev_t for /devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0:b
devfsadm[893]: verbose: symlink /dev/dsk/c1t5d0s1 -> ../../devices/pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0:b
[snip]
Only messages in dmesg are:
Aug 20 02:23:05 x4500-10.unix rootnex: [ID 349649 kern.info] xsvc1 at root: space 0 offset 0
Aug 20 02:23:05 x4500-10.unix genunix: [ID 936769 kern.info] xsvc1 is /xsvc@0,0
Aug 20 02:23:09 x4500-10.unix scsi: [ID 583861 kern.info] sd37 at mpt1: target 5 lun 0
Aug 20 02:23:09 x4500-10.unix genunix: [ID 936769 kern.info] sd37 is /pci@0,0/pci10de,375@b/pci1000,1000@0/sd@5,0
Aug 20 02:23:09 x4500-10.unix pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
Aug 20 02:23:09 x4500-10.unix genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
root@x4500-10.unix # Aug 20 02:23:12 x400-10.unix genunix: WARNING: constraints forbid retire: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@7,0
# cd ../../devices/pci@0,0/pci10de,375@b/pci1000,1000@0/
root@x4500-10.unix # ls -l
./sd@5,0:a: No such device or address
./sd@5,0:a,raw: No such device or address
./sd@5,0:b: No such device or address
./sd@5,0:b,raw: No such device or address
./sd@5,0:c: No such device or address
[snip lots, these errors only show up the first time you ls]
total 24
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@0,0
brw-r-----   1 root     sys       30, 2048 Jul  6 09:34 sd@0,0:a
crw-r-----   1 root     sys       30, 2048 Jul  6 09:34 sd@0,0:a,raw
brw-r-----   1 root     sys       30, 2049 Jul  6 09:34 sd@0,0:b
crw-r-----   1 root     sys       30, 2049 Jul  6 09:34 sd@0,0:b,raw
[snip]
crw-r-----   1 root     sys       30, 2067 Jul  6 09:44 sd@0,0:t,raw
brw-r-----   1 root     sys       30, 2068 Jul  6 09:50 sd@0,0:u
crw-r-----   1 root     sys       30, 2068 Jul  6 09:44 sd@0,0:u,raw
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@1,0
brw-r-----   1 root     sys       30, 2112 Jul  6 09:50 sd@1,0:a
crw-r-----   1 root     sys       30, 2112 Jul  6 09:48 sd@1,0:a,raw
brw-r-----   1 root     sys       30, 2113 Jul  6 09:50 sd@1,0:b
[snip]
brw-r-----   1 root     sys       30, 2132 Jul  6 09:50 sd@1,0:u
crw-r-----   1 root     sys       30, 2132 Jul  6 09:48 sd@1,0:u,raw
brw-r-----   1 root     sys       30, 2119 Aug 20 02:23 sd@1,0:wd
crw-r-----   1 root     sys       30, 2119 Aug 20 02:23 sd@1,0:wd,raw
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@2,0
brw-r-----   1 root     sys       30, 2176 Jul  6 09:50 sd@2,0:a
crw-r-----   1 root     sys       30, 2176 Jul  6 09:48 sd@2,0:a,raw
brw-r-----   1 root     sys       30, 2177 Jul  6 09:50 sd@2,0:b
[snip]
brw-r-----   1 root     sys       30, 2196 Jul  6 09:50 sd@2,0:u
crw-r-----   1 root     sys       30, 2196 Jul  6 09:48 sd@2,0:u,raw
brw-r-----   1 root     sys       30, 2183 Aug 20 02:23 sd@2,0:wd
crw-r-----   1 root     sys       30, 2183 Aug 20 02:23 sd@2,0:wd,raw
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@3,0
brw-r-----   1 root     sys       30, 2240 Jul  2 15:30 sd@3,0:a
crw-r-----   1 root     sys       30, 2240 Jul  6 09:48 sd@3,0:a,raw
brw-r-----   1 root     sys       30, 2241 Jul  6 09:50 sd@3,0:b
[snip]
brw-r-----   1 root     sys       30, 2260 Jul  6 09:50 sd@3,0:u
crw-r-----   1 root     sys       30, 2260 Jul  6 09:48 sd@3,0:u,raw
brw-r-----   1 root     sys       30, 2247 Jul  6 09:50 sd@3,0:wd
crw-r-----   1 root     sys       30, 2247 Jul  6 09:43 sd@3,0:wd,raw
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@4,0
brw-r-----   1 root     sys       30, 2304 Jul  6 09:50 sd@4,0:a
crw-r-----   1 root     sys       30, 2304 Jul  6 09:48 sd@4,0:a,raw
brw-r-----   1 root     sys       30, 2305 Jul  6 09:50 sd@4,0:b
[snip]
brw-r-----   1 root     sys       30, 2324 Jul  6 09:50 sd@4,0:u
crw-r-----   1 root     sys       30, 2324 Jul  6 09:48 sd@4,0:u,raw
brw-r-----   1 root     sys       30, 2311 Aug 20 02:23 sd@4,0:wd
crw-r-----   1 root     sys       30, 2311 Aug 20 02:23 sd@4,0:wd,raw
drwxr-xr-x   2 root     sys            2 Aug  6 14:31 sd@5,0
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@6,0
brw-r-----   1 root     sys       30, 2432 Jul  6 09:50 sd@6,0:a
crw-r-----   1 root     sys       30, 2432 Jul  6 09:48 sd@6,0:a,raw
brw-r-----   1 root     sys       30, 2433 Jul  6 09:50 sd@6,0:b
crw-r-----   1 root     sys       30, 2433 Jul  6 09:48 sd@6,0:b,raw
brw-r-----   1 root     sys       30, 2434 Jul  6 09:50 sd@6,0:c
[snip]
brw-r-----   1 root     sys       30, 2452 Jul  6 09:50 sd@6,0:u
crw-r-----   1 root     sys       30, 2452 Jul  6 09:48 sd@6,0:u,raw
brw-r-----   1 root     sys       30, 2439 Aug 20 02:24 sd@6,0:wd
crw-r-----   1 root     sys       30, 2439 Aug 20 02:23 sd@6,0:wd,raw
drwxr-xr-x   2 root     sys            2 Apr 17 17:52 sd@7,0
brw-r-----   1 root     sys       30, 2496 Jul  2 15:30 sd@7,0:a
crw-r-----   1 root     sys       30, 2496 Jul  6 09:48 sd@7,0:a,raw
brw-r-----   1 root     sys       30, 2497 Jul  6 09:50 sd@7,0:b
crw-r-----   1 root     sys       30, 2497 Jul  6 09:48 sd@7,0:b,raw
brw-r-----   1 root     sys       30, 2498 Jul  6 09:50 sd@7,0:c
crw-r-----   1 root     sys       30, 2498 Jul  6 09:43 sd@7,0:c,raw
brw-r-----   1 root     sys       30, 2499 Jul  6 09:50 sd@7,0:d
crw-r-----   1 root     sys       30, 2499 Jul  6 09:48 sd@7,0:d,raw
brw-r-----   1 root     sys       30, 2500 Jul  6 09:50 sd@7,0:e
crw-r-----   1 root     sys       30, 2500 Jul  6 09:48 sd@7,0:e,raw
So it seems sd@5,0 is empty, it is peculiar that all other HDDs on c1tX
work though.
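
The minor numbers in the listing above follow a regular pattern: each target's `:a` node starts a block of 64, so dividing a minor number by 64 gives what looks like the sd instance number (2304 / 64 = 36 for target 4, 2432 / 64 = 38 for target 6). The one instance missing from that run is 37, which matches the `sd37 at mpt1: target 5 lun 0` line in dmesg. A small Python sketch of the arithmetic, with the caveat that the stride of 64 is read off this listing and is an assumption here rather than something stated in the thread; the helper names are made up.

```python
MINORS_PER_INSTANCE = 64  # assumption: stride observed in the ls -l listing

def sd_instance(minor):
    """Instance number implied by a minor number (minor // 64)."""
    return minor // MINORS_PER_INSTANCE

def partition_index(minor):
    """Offset of a node within its instance's block of minors."""
    return minor % MINORS_PER_INSTANCE

# Minor number of the ":a" node for each target that has nodes in the listing.
FIRST_MINOR = {
    "sd@0,0": 2048, "sd@1,0": 2112, "sd@2,0": 2176, "sd@3,0": 2240,
    "sd@4,0": 2304, "sd@6,0": 2432, "sd@7,0": 2496,
}

def missing_instances(first_minor):
    """Instances absent from the contiguous range covered by the listing."""
    seen = sorted(sd_instance(m) for m in first_minor.values())
    return [i for i in range(seen[0], seen[-1] + 1) if i not in seen]

print(missing_instances(FIRST_MINOR))
print(sd_instance(2311), partition_index(2311))
```

So the kernel attached sd37 at target 5, but none of its minor nodes were created under /devices, which is consistent with the devfsadm complaints above.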
Eventually I notice that cfgadm goes to:
c1::dsk/c1t4d0                 disk         connected    configured   unknown
c1::dsk/c1t5d0                 disk         connected    configured   failed
c1::dsk/c1t6d0                 disk         connected    configured   unknown
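
Since the condition only flips to failed some time after boot, it can be easier to let a script watch for it than to keep re-running cfgadm by hand. A minimal Python sketch that picks the failed attachment points out of `cfgadm -al` text is below. `failed_attachment_points` is a hypothetical helper, and it assumes the five-column layout shown in this thread with one entry per line.

```python
def failed_attachment_points(cfgadm_output):
    """Return the Ap_Id of every row whose Condition column is 'failed'."""
    failed = []
    for line in cfgadm_output.splitlines():
        fields = line.split()
        # Expect: Ap_Id, Type, Receptacle, Occupant, Condition.
        if len(fields) != 5 or fields[0] == "Ap_Id":
            continue
        ap_id, condition = fields[0], fields[4]
        if condition == "failed":
            failed.append(ap_id)
    return failed

SAMPLE = """\
Ap_Id                          Type         Receptacle   Occupant     Condition
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t4d0                 disk         connected    configured   unknown
c1::dsk/c1t5d0                 disk         connected    configured   failed
c1::dsk/c1t6d0                 disk         connected    configured   unknown
"""

print(failed_attachment_points(SAMPLE))
```

Rows with a different number of columns are skipped rather than guessed at, so output from other attachment point types may need its own handling.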
We promoted the Spare in use to replace c1t5d0, so now the pool looks like:
   pool: zpool1
  state: ONLINE
  scrub: none requested
config:
         NAME        STATE     READ WRITE CKSUM
         zpool1      ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c0t3d0  ONLINE       0     0     0
             c1t3d0  ONLINE       0     0     0
             c2t3d0  ONLINE       0     0     0
             c3t3d0  ONLINE       0     0     0
             c4t3d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c5t3d0  ONLINE       0     0     0
             c0t7d0  ONLINE       0     0     0
             c1t7d0  ONLINE       0     0     0
             c2t7d0  ONLINE       0     0     0
             c3t7d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c2t0d0  ONLINE       0     0     0
             c3t0d0  ONLINE       0     0     0
             c4t0d0  ONLINE       0     0     0
             c5t0d0  ONLINE       0     0     0
             c0t6d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c1t6d0  ONLINE       0     0     0
             c2t6d0  ONLINE       0     0     0
             c3t6d0  ONLINE       0     0     0
             c4t6d0  ONLINE       0     0     0
             c5t6d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c0t1d0  ONLINE       0     0     0
             c1t1d0  ONLINE       0     0     0
             c2t1d0  ONLINE       0     0     0
             c3t1d0  ONLINE       0     0     0
             c4t1d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c5t1d0  ONLINE       0     0     0
             c0t5d0  ONLINE       0     0     0
             c4t7d0  ONLINE       0     0     0   [was c1t5d0]
             c2t5d0  ONLINE       0     0     0
             c3t5d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c4t5d0  ONLINE       0     0     0
             c5t5d0  ONLINE       0     0     0
             c0t2d0  ONLINE       0     0     0
             c1t2d0  ONLINE       0     0     0
             c2t2d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c3t2d0  ONLINE       0     0     0
             c4t2d0  ONLINE       0     0     0
             c5t2d0  ONLINE       0     0     0
             c0t4d0  ONLINE       0     0     0
             c1t4d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c2t4d0  ONLINE       0     0     0
             c3t4d0  ONLINE       0     0     0
             c4t4d0  ONLINE       0     0     0
             c5t4d0  ONLINE       0     0     0
             c5t7d0  ONLINE       0     0     0
Ian Collins
2009-Aug-21  08:29 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Jorgen Lundman wrote:
> This is what I see after a reboot, the disk is c1t5d0, sd37, sd@5,0 or
> slot 13.
>
> c1::dsk/c1t5d0                 disk         connected    configured   unknown

Does format show it?

-- 
Ian.
Jorgen Lundman
2009-Aug-21  12:42 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Nope, that it does not.

Ian Collins wrote:
> Does format show it?
Ian Collins
2009-Aug-21  20:37 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
Jorgen Lundman wrote:
> Ian Collins wrote:
>> Does format show it?
>
> Nope, that it does not.

Time to call the repair man!

-- 
Ian.
John Ryan
2009-Sep-18  12:15 UTC
[zfs-discuss] x4540 dead HDD replacement, remains "configured".
I have exactly these symptoms on 3 thumpers now: 2 x x4540s and 1 x
x4500.

Rebooting/power cycling doesn't even bring them back. The only thing I
found is that if I boot from the osol.2009.06 CD, I can see all the
drives. I had to reinstall the OS on one box.

I've only just recently upgraded them to snv_122. Before that, I could
change disks without problems. Could it be something introduced since
snv_111?

John
-- 
This message posted from opensolaris.org