Nothing like a "heart in mouth moment" to shave tears from your life. I rebooted a snv_132 box in perfect heath, and it came back up with two FAULTED disks in the same vdisk group. Everything an hour on Google I found basically said "your data is gone". All 45Tb of it. A postmortem of fmadm showed a single disk failed with smart predictive failure. No indication why the second failed. I don''t give up easily, and it is now back up and scrubbing - no errors so far. I checked both the drives were readable, so it didn''t seem to be a hardware fault. I moved one into a different server and ran a zpool import to see what it made of it. The disk was ONLINE, and it''s vdisk buddies were unavailable. Ok, so I moved the disks into different bays and booted from the snv_134 cdrom. Ran zpool import and the zpool came back with everything online. That was encouraging, so I exported it and booted from the origional 132 boot drive. Well, it came back, and at 1:00AM I was able to get back to the origional issue I was chasing. So, don''t give up hope when all hope appears to be lost. Mark. Still an Open_Solaris fan keen to help the community achieve a 2010 release on it''s own. -- This message posted from opensolaris.org
Hi Mark,

I would recheck with fmdump to see if you have any persistent errors on the second disk. The fmdump command displays diagnosed faults, and fmdump -eV displays the underlying error reports (the persistent errors that get turned into faults once they meet certain criteria).

If fmdump -eV doesn't show any activity for that second disk, then review /var/adm/messages or iostat -En for driver-level resets and so on.

Thanks,

Cindy

On 08/16/10 18:53, Mark Bennett wrote:
> Nothing like a "heart in mouth" moment to shave years from your life.
> I rebooted a snv_132 box in perfect health, and it came back up with two FAULTED disks in the same vdev group.
> [snip]
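A minimal way to run the checks Cindy describes; the grep pattern is illustrative and the UUID placeholder has to be filled in from fmdump's own output:

  fmdump                                  # diagnosed faults, one line per fault UUID
  fmdump -v -u <uuid>                     # detail for one fault (substitute a UUID from the line above)
  fmdump -eV | less                       # raw error reports; search for the second disk's devid
  iostat -En                              # per-device soft/hard/transport error counters
  egrep -i "reset|timeout|retryable" /var/adm/messages   # driver-level resets and retries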
Hi Cindy,

Not very enlightening. No previous errors for the disks. I did replace one about a month earlier when it showed a rise in I/O errors, and before it reached a level where fault management would have failed it.

The disk mentioned is not one of those that went FAULTED. Also, no more SMART error events since. The ZFS pool faulted on boot after a reboot command.

The scrub was eventually stopped at 75% due to the performance impact. No errors were found up to that point.

One thing I see from the (attached) messages log is that the ZFS error occurs before all the disks have been logged as enumerated. This was probably the first reboot since at least 8, and maybe 16, extra disks were hot-plugged and added to the pool.

The hardware is a Supermicro 3U plus 2 x 4U SAS storage chassis. The SAS controller has 16 disks on one SAS port, and 32 on the other.

Aug 16 18:44:39.2154 02f57499-ae0a-c46c-b8f8-825205a8505d ZFS-8000-D3
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
           Affects: zfs://pool=drgvault/vdev=d79c5fc5b5c3b789
               FRU: -
          Location: -

Aug 16 18:44:39.5569 25e0bdc2-0171-c4b5-b530-a268f8572bd1 ZFS-8000-D3
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=drgvault/vdev=e912d259d7829903
           Affects: zfs://pool=drgvault/vdev=e912d259d7829903
               FRU: -
          Location: -

Aug 16 18:44:39.8964 8e9cff35-8e9d-c0f1-cd5b-bd1d0276cda1 ZFS-8000-CS
  100%  fault.fs.zfs.pool
        Problem in: zfs://pool=drgvault
           Affects: zfs://pool=drgvault
               FRU: -
          Location: -

Aug 16 18:45:47.2604 3848ba46-ee18-4aad-b632-9baf25b532ea DISK-8000-0X
  100%  fault.io.disk.predictive-failure
        Problem in: hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
           Affects: dev:///:devid=id1,sd@n5000c50021f4916f//pci@0,0/pci8086,4023@3/pci15d9,a680@0/sd@24,0
               FRU: hc://:product-id=LSILOGIC-SASX36-A.1:server-id=:chassis-id=50030480005a337f:serial=6XW15V2S:part=ST32000542AS-ST32000542AS:revision=CC34/ses-enclosure=1/bay=6/disk=0
          Location: 006

Mark.

[Attachment: messages.zip, 9324 bytes - http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100818/58907440/attachment.obj]
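For reference, checking on and cancelling an in-progress scrub looks roughly like this (pool name taken from the fmdump output above):

  zpool status -v drgvault     # shows scrub progress ("75.0% done") and any errors found so far
  zpool scrub -s drgvault      # stop the scrub; on these releases it cannot be resumed, only restarted from zero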
It's hard to tell what caused the SMART predictive-failure message; it could be something as simple as a temperature fluctuation. If ZFS noticed that a disk wasn't available yet, then I would expect a message to that effect. In any case, I think I would have a replacement disk available.

The important thing is that you continue to monitor your hardware for failures. We recommend using ZFS redundancy and always having backups of your data.

Thanks,

Cindy

On 08/18/10 02:38, Mark Bennett wrote:
> Hi Cindy,
>
> Not very enlightening. No previous errors for the disks.
> [snip]
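A sketch of the kind of ongoing monitoring and spare provisioning suggested here; the spare device name is purely illustrative, not from this system:

  zpool status -x                      # quick health check: "all pools are healthy" or details of any problem pool
  fmadm faulty                         # any outstanding diagnosed faults
  zpool add drgvault spare c5t30d0     # keep a hot spare attached (illustrative device name)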
Well, I do have a plan.

Thanks to the portability of ZFS boot disks, I'll build two new OS disks on another machine with the next Nexenta release, export the data pool, and swap in the new ones. That way, I can at least manage a zfs scrub without killing performance, and get the Intel SSDs I have been testing to work properly.

On the other hand, I could just use the spare 7210 Appliance boot disk I have lying about.

Mark.
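A rough sketch of that hand-off, with one caveat that isn't part of the plan described above: upgrading the pool version on the newer release would prevent importing the pool back on snv_132.

  zpool export drgvault        # on the current snv_132 boot environment

  # after booting the new OS disks built on the other machine
  zpool import drgvault
  zpool upgrade -v             # list the on-disk pool versions the new release supports
  # note: "zpool upgrade drgvault" would make the pool unimportable on older releases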