Hi John,
comment below...
On Oct 11, 2012, at 3:10 AM, Carsten John <cjohn at mpi-bremen.de> wrote:
> Hello everybody,
>
> I just wanted to share my experience with a (partially) broken SSD that was
> in use in a ZIL mirror.
>
> We experienced a dramatic performance problem with one of our zpools,
> serving home directories. Mainly NFS clients were affected. Our SunRay
> infrastructure came to a complete halt.
>
> Finally we were able to identify one SSD as the root cause. The SSD was
> still working, but quite slow.
>
> The issue didn't trigger ZFS to detect the disk as faulty. FMA
> didn't detect it either.
>
> We identified the broken disk by issuing "iostat -en". After
> replacing the SSD, everything went back to normal.
>
> To prevent outages like this in the future I hacked together a "quick
> and dirty" bash script to detect disks with a given rate of total errors.
> The script might be used in conjunction with nagios.
This shouldn't be needed. All of the iostat fields are available as kstats, and
nagios can already collect kstats:
kstat -pm sderr
The good thing about using this method is that it works with or without ZFS.
The bad thing is that some SMART tools and devices trigger complaints that
show up as errors (these can be safely ignored).
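
To make that a bit more concrete, here is a rough sketch of how the parseable
kstat output could be boiled down to one error total per device. It assumes
the usual "module:instance:name:statistic<TAB>value" layout of "kstat -p" and
that the sderr counters are named "Soft Errors", "Hard Errors" and
"Transport Errors"; the function name sum_sderr is made up for illustration,
and the awk should be nawk or gawk.

```shell
#!/bin/bash
# Sketch only: sum the soft, hard and transport error counters per device
# from "kstat -p" style input on stdin. sum_sderr is a hypothetical name.
sum_sderr() {
    awk -F'\t' '
        # Assumed line layout: module:instance:name:statistic<TAB>value
        $1 ~ /:(Soft|Hard|Transport) Errors$/ {
            split($1, key, ":")        # key[3] is the kstat name, e.g. "sd0,err"
            total[key[3]] += $2
        }
        END { for (dev in total) print dev, total[dev] }
    ' | sort
}

# Intended use on a Solaris-like host (not run here):
#   kstat -pm sderr | sum_sderr
```

Other sderr statistics, such as "Media Error" or "Predictive Failure Analysis",
are skipped on purpose so the total stays comparable to the error total that
iostat -e reports.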
-- richard
>
> Perhaps it's of use for others as well:
>
> ###################################################################
> #!/bin/bash
> # Check the disks in all pools for errors.
> # Partially failing (or slow) disks
> # may result in horribly degraded
> # performance of zpools despite the fact
> # that the pool is still healthy.
>
> # exit codes
> # 0 OK
> # 1 WARNING
> # 2 CRITICAL
> # 3 UNKNOWN
>
> OUTPUT=""
> WARNING="0"
> CRITICAL="0"
> SOFTLIMIT="5"
> HARDLIMIT="20"
>
> LIST=$(zpool status | grep "c[1-9].*d0 " | awk '{print $1}')
> for DISK in $LIST
> do
>     ERROR=$(iostat -enr $DISK | cut -d "," -f 4 | grep "^[0-9]")
>     # Check the hard limit first so each disk is reported only once.
>     if [[ $ERROR -gt $HARDLIMIT ]]
>     then
>         OUTPUT="$OUTPUT, $DISK:$ERROR"
>         CRITICAL="1"
>     elif [[ $ERROR -gt $SOFTLIMIT ]]
>     then
>         OUTPUT="$OUTPUT, $DISK:$ERROR"
>         WARNING="1"
>     fi
> done
>
> if [[ $CRITICAL -gt 0 ]]
> then
>     echo "CRITICAL: Disks with error count > $HARDLIMIT found: ${OUTPUT#, }"
>     exit 2
> fi
> if [[ $WARNING -gt 0 ]]
> then
>     echo "WARNING: Disks with error count > $SOFTLIMIT found: ${OUTPUT#, }"
>     exit 1
> fi
>
> echo "OK: No significant disk errors found"
> exit 0
>
>
> ###################################################################
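
For what it's worth, the threshold logic does not have to be tied to where
the counts come from. Below is a minimal sketch of it as a function that reads
"device count" pairs on stdin and returns the usual nagios exit codes
(0 OK, 1 WARNING, 2 CRITICAL). The name nagios_status and its two arguments
(soft limit, hard limit) are illustrative assumptions, not part of the script
above.

```shell
#!/bin/bash
# Sketch only: turn "device count" lines on stdin into a nagios status line
# and return code. nagios_status is a hypothetical name.
nagios_status() {
    local soft=$1 hard=$2 dev count warn="" crit=""
    while read -r dev count; do
        # Skip lines whose count is missing or not a plain number.
        [[ $count =~ ^[0-9]+$ ]] || continue
        if (( count > hard )); then
            crit+=" $dev:$count"
        elif (( count > soft )); then
            warn+=" $dev:$count"
        fi
    done
    if [[ -n $crit ]]; then
        echo "CRITICAL: error count > $hard:$crit"
        return 2
    elif [[ -n $warn ]]; then
        echo "WARNING: error count > $soft:$warn"
        return 1
    fi
    echo "OK: No significant disk errors found"
    return 0
}

# Example call with the limits used in the script above:
#   printf 'c1t0d0 3\nc1t1d0 25\n' | nagios_status 5 20
```

Skipping non-numeric counts keeps a device with empty or malformed output from
being misread as an error; whether such a device should instead map to
UNKNOWN (3) is a design choice left open here.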
>
>
>
> cu
>
> Carsten
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Richard.Elling at RichardElling.com
+1-760-896-4422