Today, we found that NFS on the x4500 stopped responding for about 1 minute, although the server itself was idle. After a little looking around, we found this:

Apr 16 09:16:00 x4500-01.unix fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-0X, TYPE: Fault, VER: 1, SEVERITY: Major
Apr 16 09:16:00 x4500-01.unix EVENT-TIME: Wed Apr 16 09:16:00 JST 2008
Apr 16 09:16:00 x4500-01.unix PLATFORM: Sun Fire X4500, CSN: 0738AMT047 , HOSTNAME: x4500-01.unix
Apr 16 09:16:00 x4500-01.unix SOURCE: eft, REV: 1.16
Apr 16 09:16:00 x4500-01.unix EVENT-ID: 949038eb-f474-421b-bb79-981810c48f25
Apr 16 09:16:00 x4500-01.unix DESC: SMART health-monitoring firmware reported that a disk
Apr 16 09:16:00 x4500-01.unix failure is imminent.
Apr 16 09:16:00 x4500-01.unix Refer to http://sun.com/msg/DISK-8000-0X for more information.
Apr 16 09:16:00 x4500-01.unix AUTO-RESPONSE: None.
Apr 16 09:16:00 x4500-01.unix IMPACT: It is likely that the continued operation of
Apr 16 09:16:00 x4500-01.unix this disk will result in data loss.
Apr 16 09:16:00 x4500-01.unix REC-ACTION: Schedule a repair procedure to replace the affected disk.
Apr 16 09:16:00 x4500-01.unix Use fmdump -v -u <EVENT_ID> to identify the disk.

# fmdump -v -u 949038eb-f474-421b-bb79-981810c48f25

TIME                 UUID                                 SUNW-MSG-ID
Apr 16 09:16:00.6628 949038eb-f474-421b-bb79-981810c48f25 DISK-8000-0X
  100%  fault.io.disk.predictive-failure

Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
   Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHDPX3H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0
       FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
  Location: HD_ID_9

The server is back and operating again, so we are happy with how it handled it.
It would have been nice if the URL above linked to further pages on "how to remove a HDD from x4500/other platforms".

Can I tell zfs to just not use the HDD for now, until we can pull it out and replace it? We have 14TB spare, so there is no risk of running out of space right now.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)

However, I'm having trouble finding the exact process to go from fmdump's "HD_ID_9" to the "c0t?d?" style name that zpool offline expects.

I cannot run "zpool status" or "format", as both commands hang. All the Sun documentation already assumes you know the c?t?d? disk name. For today, it is easier to just pull out the HDD, since the x4500 doesn't want to mark it dead.

Jorgen Lundman wrote:
> Today, we foudn the x4500 NFS stopped responding for about 1 minute, but
> the server itself was idle. After a little looking around, we found this:
> [...]
> Can I tell zfs to just not use the HDD for now, until we can pull it out
> and replace it? We have 14TB spare, so no risk of running out of space
> right now.

--
Jorgen Lundman

Hi Jorgen,

Jorgen Lundman wrote:
> Although, I'm having some issues finding the exact process to go from
> fmdump's "HD_ID_9" to zpool offline "c0t?d?" style input.
>
> I can not run "zpool status", nor "format" command as they hang. All the
> Sun documentation already assume you know the c?t?d? disk name. Today,
> it is easier to just pull out the HDD since the x4500 doesn't want to
> mark it dead.
> ...
>> # fmdump -v -u 949038eb-f474-421b-bb79-981810c48f25
>>
>> TIME UUID SUNW-MSG-ID
>> Apr 16 09:16:00.6628 949038eb-f474-421b-bb79-981810c48f25 DISK-8000-0X
>> 100% fault.io.disk.predictive-failure
>>
>> Problem in:
>> hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
>> Affects:
>> dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHDPX3H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0
>> FRU:
>> hc://:product-id=Sun-Fire-X4500:chassis-id=0738AMT047:server-id=x4500-01.unix:serial=KRVN67ZBHDPX3H:part=HITACHI-HDS7250SASUN500G-0726KDPX3H:revision=K2AOAJ0A/bay=9/disk=0
>> Location: HD_ID_9

You get three extra pieces of information in the fmdump output: the device path, the devid and, from the devid, the device manufacturer, model number and serial number.

The device path is

/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@4,0

which you can grep for in /etc/path_to_inst and then map using the output from iostat -En.

The devid is the unique device identifier, and it shows up in the output from prtpicl -v and prtconf -v. Both of these utilities should also show you the "devfs-path" property, which you should be able to use to map to a cXtYdZ number.

Finally, you can see that you've got a Hitachi HDS7250S with serial number KRVN67ZBHDPX3H. This will definitely be reported in your iostat -En output.

cheers,
James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
http://www.jmcp.homeunix.com/blog
http://blogs.sun.com/jmcp
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
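
For readers following along, here is a minimal shell sketch of the mapping described above. It is not taken from the thread: the function names are made up for illustration, and it rests on two assumptions. The first is that the entries in /dev/dsk are symlinks into /devices that end in the physical device path, which is the usual Solaris layout. The second is that iostat -En prints the device name on a "Soft Errors:" line followed by a line containing "Serial No:". Neither helper opens the disk itself; each one only filters text on stdin.

```shell
# Hypothetical helpers; names and sample layouts are assumptions, not Sun tools.

# Print the physical device path embedded in an fmdump "Affects:" string.
# Assumed input shape: dev:///:devid=<devid>//<device path>
affects_to_devpath() {
    printf '%s\n' "$1" | sed -n 's|^dev:///:devid=[^/]*//\(.*\)$|/\1|p'
}

# Read `ls -l /dev/dsk` output on stdin and print the cXtYdZ name whose
# symlink target ends in the given device path ($1).
devpath_to_ctd() {
    awk -v p="$1" '
        NF >= 3 && $(NF-1) == "->" && index($NF, "/devices" p ":") {
            name = $(NF-2)
            sub(/[sp][0-9]+$/, "", name)      # drop the slice/partition suffix
            if (!(name in seen)) { seen[name] = 1; print name }
        }'
}

# Read `iostat -En` output on stdin and print the device whose record
# mentions the given serial number ($1).
serial_to_ctd() {
    awk -v s="$1" '
        /Soft Errors:/ { dev = $1 }           # first line of each record
        /Serial No:/ && index($0, s) { print dev }'
}

# Possible usage on the affected host (not run here):
#   path=$(affects_to_devpath 'dev:///:devid=...//pci@0,0/.../disk@4,0')
#   ls -l /dev/dsk | devpath_to_ctd "$path"
#   iostat -En | serial_to_ctd KRVN67ZBHDPX3H
```

If both lookups print the same cXtYdZ name, that is a reasonable cross-check before handing the name to zpool offline. Note that iostat -En may itself stall if the failing disk is unresponsive, in which case the /dev/dsk symlink lookup is the safer of the two.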