thr3ads.net - zfs discuss - [zfs-discuss] First trouble, dying HDD. [Apr 2008]

Jorgen Lundman

2008-Apr-16 00:34 UTC

[zfs-discuss] First trouble, dying HDD.

Today, we found the x4500 NFS stopped responding for about 1 minute, but 
the server itself was idle. After a little looking around, we found this:

Apr 16 09:16:00 x4500-01.unix fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-0X, TYPE: Fault, VER: 1, SEVERITY: Major
Apr 16 09:16:00 x4500-01.unix EVENT-TIME: Wed Apr 16 09:16:00 JST 2008
Apr 16 09:16:00 x4500-01.unix PLATFORM: Sun Fire X4500, CSN: 0738AMT047            , HOSTNAME: x4500-01.unix
Apr 16 09:16:00 x4500-01.unix SOURCE: eft, REV: 1.16
Apr 16 09:16:00 x4500-01.unix EVENT-ID: 949038eb-f474-421b-bb79-981810c48f25
Apr 16 09:16:00 x4500-01.unix DESC: SMART health-monitoring firmware reported that a disk
Apr 16 09:16:00 x4500-01.unix failure is imminent.
Apr 16 09:16:00 x4500-01.unix   Refer to http://sun.com/msg/DISK-8000-0X for more information.
Apr 16 09:16:00 x4500-01.unix AUTO-RESPONSE: None.
Apr 16 09:16:00 x4500-01.unix IMPACT: It is likely that the continued operation of
Apr 16 09:16:00 x4500-01.unix this disk will result in data loss.
Apr 16 09:16:00 x4500-01.unix REC-ACTION: Schedule a repair procedure to replace the affected disk.
Apr 16 09:16:00 x4500-01.unix Use fmdump -v -u <EVENT_ID> to identify the disk.


# fmdump -v -u 949038eb-f474-421b-bb79-981810c48f25

TIME                 UUID                                 SUNW-MSG-ID
Apr 16 09:16:00.6628 949038eb-f474-421b-bb79-981810c48f25 DISK-8000-0X
   100%  fault.io.disk.predictive-failure

         Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
            Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHDPX3H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0
                FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
           Location: HD_ID_9




The server is back and operating again, so we are happy with how it 
handled it. It would have been nice if the page at the URL above linked 
to further pages on "how to remove a HDD from x4500/other platforms".

Can I tell zfs to just not use the HDD for now, until we can pull it out 
and replace it? We have 14TB spare, so no risk of running out of space 
right now.

Lund




-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)

Jorgen Lundman

2008-Apr-16 01:05 UTC

Although, I'm having some issues finding the exact process to go from
fmdump's "HD_ID_9" to zpool offline "c0t?d?" style input.

I cannot run the "zpool status" or "format" commands, as they hang. All
the Sun documentation already assumes you know the c?t?d? disk name. Today,
it is easier to just pull out the HDD since the x4500 doesn't want to
mark it dead.




James C. McPherson

2008-Apr-16 04:18 UTC

Hi Jorgen,

Jorgen Lundman wrote:
> Although, I'm having some issues finding the exact process to go from
> fmdump's "HD_ID_9" to zpool offline "c0t?d?" style input.
> 
> I cannot run the "zpool status" or "format" commands, as they hang. All
> the Sun documentation already assumes you know the c?t?d? disk name. Today,
> it is easier to just pull out the HDD since the x4500 doesn't want to
> mark it dead.
> ...
>> # fmdump -v -u 949038eb-f474-421b-bb79-981810c48f25
>>
>> TIME                 UUID                                 SUNW-MSG-ID
>> Apr 16 09:16:00.6628 949038eb-f474-421b-bb79-981810c48f25 DISK-8000-0X
>>    100%  fault.io.disk.predictive-failure
>>
>>          Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
>>             Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHDPX3H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0
>>                 FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
>>            Location: HD_ID_9

You get three extra pieces of information in the fmdump output:
the device path, the devid and, from the devid, the device
manufacturer, model number and serial number.
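
As a rough illustration of how those pieces sit inside the "Affects:"
string, here is a minimal Python sketch. The function name split_affects
is made up for this example, the device path is written with "@"
separators as Solaris prints it, and treating the last
underscore-separated field of the devid as the serial number is only an
assumption drawn from this one fmdump output, not a documented format.

```python
def split_affects(fmri: str) -> dict:
    """Split a 'dev:///:devid=<devid>//<path>' string into its parts.

    Hypothetical helper: the layout is inferred from the single fmdump
    output in this thread, not from a format specification.
    """
    prefix = "dev:///:devid="
    if not fmri.startswith(prefix):
        raise ValueError("not a dev-scheme string carrying a devid")
    # The devid ends at the first '//'; the rest is the device path.
    devid, sep, path = fmri[len(prefix):].partition("//")
    if not sep:
        raise ValueError("no device path found after the devid")
    # Assumption: the serial number is the last '_'-separated field.
    serial = devid.rsplit("_", 1)[-1]
    return {"devid": devid, "path": "/" + path, "serial": serial}


# The "Affects:" value from the fmdump output in this thread.
affects = (
    "dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHDPX3H"
    "//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0"
)
info = split_affects(affects)
print(info["path"])
print(info["serial"])
```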

The device path is
	/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0
which you can grep for in /etc/path_to_inst and then map using
the output from iostat -En.
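
The path_to_inst lookup could be sketched roughly as below, assuming
the usual three-field layout of that file (quoted physical path,
instance number, quoted driver name). The function name find_instance
is hypothetical, and the instance numbers in the sample text are
invented for illustration; the real file on the x4500 will have its
own values.

```python
import shlex


def find_instance(path_to_inst_text: str, device_path: str):
    """Return the driver instance name (for example 'sd12') for a path.

    Hypothetical helper that scans text in /etc/path_to_inst layout:
        "<physical path>" <instance number> "<driver>"
    Returns None when the path is not listed.
    """
    for line in path_to_inst_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = shlex.split(line)  # strips the double quotes
        if len(fields) == 3 and fields[0] == device_path:
            return fields[2] + fields[1]
    return None


# Invented sample; only the layout is meant to be realistic.
sample = '''\
#
#	Caution! This file contains critical kernel state
#
"/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@0,0" 8 "sd"
"/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0" 12 "sd"
'''
instance = find_instance(
    sample, "/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0"
)
print(instance)
```

On a live system the text would come from reading /etc/path_to_inst
rather than from a sample string.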

The devid is the unique device identifier and this shows up
in the output from prtpicl -v and prtconf -v. Both of these
utilities should also then show you the "devfs-path" property
which you should be able to use to map to a cXtYdZ number.

Finally, you can see that you've got a Hitachi HDS7250S
with serial number KRVN67ZBHDPX3H - this will definitely be
reported in your iostat -En output.
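
Matching on the serial number could look something like the following
Python sketch. It assumes iostat -En prints one block per disk with the
cXtYdZ name on a "Soft Errors:" line and the serial after "Serial No:".
The function name device_for_serial is hypothetical, and the sample
text, including both device names and the second serial, is invented
for illustration; it is not the real answer for this machine.

```python
import re


def device_for_serial(iostat_text: str, serial: str):
    """Return the cXtYdZ name whose block lists the given serial number.

    Hypothetical helper working on captured 'iostat -En' text.
    Returns None when no block carries that serial.
    """
    current = None
    for line in iostat_text.splitlines():
        # A new device block starts with the cXtYdZ name.
        start = re.match(r"(c\d+t\d+d\d+)\s+Soft Errors:", line)
        if start:
            current = start.group(1)
            continue
        found = re.search(r"Serial No:\s*(\S+)", line)
        if found and found.group(1) == serial:
            return current
    return None


# Invented sample output; device names are placeholders.
sample = """\
c5t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: HITACHI HDS7250S Revision: AJ0A Serial No: EXAMPLESERIAL0
Size: 500.11GB <500107862016 bytes>
c5t4d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: HITACHI HDS7250S Revision: AJ0A Serial No: KRVN67ZBHDPX3H
Size: 500.11GB <500107862016 bytes>
"""
disk = device_for_serial(sample, "KRVN67ZBHDPX3H")
print(disk)
```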


cheers,
James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
               http://www.jmcp.homeunix.com/blog
                   http://blogs.sun.com/jmcp
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
