thr3ads.net - zfs discuss - [zfs-discuss] horrible slow pool [Oct 2012]

Carsten John

2012-Oct-11 10:10 UTC

[zfs-discuss] horrible slow pool

Hello everybody,

I just wanted to share my experience with a (partially) broken SSD that was in
use in a ZIL mirror.

We experienced a dramatic performance problem with one of our zpools, serving
home directories. Mainly NFS clients were affected. Our SunRay infrastructure
came to a complete halt.

Finally we were able to identify one SSD as the root cause. The SSD was still
working, but quite slow.

The issue didn't trigger ZFS to detect the disk as faulty. FMA
didn't detect it, either.

We identified the broken disk by issuing "iostat -en". After
replacing the SSD, everything went back to normal.
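
As a rough illustration of that check (not part of the original post), the
sketch below filters "iostat -en" style output for devices with a nonzero
total error count. It assumes the usual five-column layout (s/w, h/w, trn,
tot, device); the function name nonzero_errors is made up for this example.

```bash
#!/bin/bash
# Print "device total" for every line whose 4th column (tot) is a
# positive integer. Header lines are skipped because their 4th
# column is not numeric.
nonzero_errors() {
    awk '$4 ~ /^[0-9]+$/ && $4 > 0 { print $5, $4 }'
}

# On the affected host (not run here):
#   iostat -en | nonzero_errors
```

Reading from standard input keeps the filter usable on saved iostat output
as well as on a live system.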

To prevent outages like this in the future I hacked together a "quick and
dirty" bash script to detect disks with a given rate of total errors. The
script might be used in conjunction with nagios.

Perhaps it's of use to others as well:

###################################################################
#!/bin/bash
# Check the disks in all pools for errors.
# Partially failing (or slow) disks
# may result in horribly degraded
# performance of zpools despite the fact
# that the pool is still healthy.

# exit codes
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING="0"
CRITICAL="0"
SOFTLIMIT="5"
HARDLIMIT="20"

LIST=$(zpool status | grep "c[1-9].*d0 " | awk '{print $1}')
    for DISK in $LIST 
    do  
        # field 4 of the comma-separated iostat output is the total error count
        ERROR=$(iostat -enr "$DISK" | cut -d "," -f 4 | grep "^[0-9]")
        # check the hard limit first so each disk is reported only once
        if [[ $ERROR -gt $HARDLIMIT ]]
        then
            OUTPUT="$OUTPUT, $DISK:$ERROR"
            CRITICAL="1"
        elif [[ $ERROR -gt $SOFTLIMIT ]]
        then
            OUTPUT="$OUTPUT, $DISK:$ERROR"
            WARNING="1"
        fi
    done

if [[ $CRITICAL -gt 0 ]]
then
    echo "CRITICAL: Disks with error count > $HARDLIMIT found: ${OUTPUT#, }"
    exit 2
fi
if [[ $WARNING -gt 0 ]]
then
    echo "WARNING: Disks with error count > $SOFTLIMIT found: ${OUTPUT#, }"
    exit 1
fi

echo "OK: No significant disk errors found"
exit 0

###########################################################################################



cu

Carsten

Richard Elling

2012-Oct-11 18:01 UTC

[zfs-discuss] horrible slow pool

Hi John,
comment below...

On Oct 11, 2012, at 3:10 AM, Carsten John <cjohn at mpi-bremen.de> wrote:
> Hello everybody,
> 
> I just wanted to share my experience with a (partially) broken SSD that was
> in use in a ZIL mirror.
> 
> We experienced a dramatic performance problem with one of our zpools,
> serving home directories. Mainly NFS clients were affected. Our SunRay
> infrastructure came to a complete halt.
> 
> Finally we were able to identify one SSD as the root cause. The SSD was
> still working, but quite slow.
> 
> The issue didn't trigger ZFS to detect the disk as faulty. FMA
> didn't detect it, either.
> 
> We identified the broken disk by issuing "iostat -en". After
> replacing the SSD, everything went back to normal.
> 
> To prevent outages like this in the future I hacked together a "quick
> and dirty" bash script to detect disks with a given rate of total errors.
> The script might be used in conjunction with nagios.
This shouldn't be needed. All of the fields of iostat are in kstats, and
nagios can already collect kstats:
	kstat -pm sderr

The good thing about using this method is that it works with or without ZFS.
The bad thing is that some SMART tools and devices trigger complaints that
show up as errors (which can be safely ignored).
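
A minimal sketch of how those counters could be totalled per device (an
illustration, not something from this thread): it reads "kstat -p -m sderr"
style lines, which have the form module:instance:name:statistic followed by
a tab and the value. The statistic names "Hard Errors", "Soft Errors" and
"Transport Errors" are assumed here, and the function name sum_sderr is
hypothetical.

```bash
#!/bin/bash
# Sum the Hard, Soft and Transport error counters per device from
# parseable kstat output read on standard input.
# Assumption: the sderr statistics carry exactly these three names.
sum_sderr() {
    awk -F'\t' '
        {
            # $1 is module:instance:name:statistic, $2 is the value
            n = split($1, key, ":")
            stat = key[n]
            if (stat == "Hard Errors" || stat == "Soft Errors" ||
                stat == "Transport Errors") {
                total[key[3]] += $2
            }
        }
        END {
            for (dev in total) print dev, total[dev]
        }
    ' | sort
}

# On a Solaris or illumos host (not run here):
#   kstat -p -m sderr | sum_sderr
```

The totals could then be compared against warning and critical thresholds in
the same way as the iostat-based script.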
 -- richard
--

Richard.Elling at RichardElling.com
+1-760-896-4422








