Ian Collins
2010-Aug-15 23:59 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.

Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.

I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?

-- 
Ian.
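A minimal sketch of how the checksum errors might be spotted and cleared before a drive is faulted, assuming a pool named "tank"; the pool and device names are illustrative only:

    # show per-device error counters and any affected files
    zpool status -v tank

    # inspect the underlying FMA error reports logged by the driver
    fmdump -eV | less

    # reset the error counters for one device
    # (omit the device to clear the whole pool)
    zpool clear tank c4t3d0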
Richard Elling
2010-Aug-16 00:37 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
On Aug 15, 2010, at 4:59 PM, Ian Collins wrote:

> I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.
>
> Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.
>
> I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?

There is not a documented, system-admin-visible interface to this. The settings in question can be set as properties in the zfs-diagnosis.conf file, similar to props set in other FMA modules.

The source is also currently available:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c#957

Examples of setting FMA module properties are in /usr/lib/fm/fmd/plugins/cpumem-retire.conf and other .conf files.

If you get this to work, please publicly document your changes and why you felt the new settings were better.
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
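For reference, a hedged sketch of what such a module property override might look like, assuming the setprop syntax used by the other fmd plugin .conf files; "checksum_limit" is a hypothetical property name used only for illustration, not one confirmed to exist in zfs-diagnosis:

    # /usr/lib/fm/fmd/plugins/zfs-diagnosis.conf
    # fmd module properties are set as "setprop <name> <value>"
    # NOTE: checksum_limit is a made-up example property
    setprop checksum_limit 50

    # reload the module so the new property value is read
    fmadm unload zfs-diagnosis
    fmadm load /usr/lib/fm/fmd/plugins/zfs-diagnosis.so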
Ian Collins
2010-Aug-16 01:08 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
On 08/16/10 12:37 PM, Richard Elling wrote:

> On Aug 15, 2010, at 4:59 PM, Ian Collins wrote:
>
>> I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.
>>
>> Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.
>>
>> I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?
>>
> There is not a documented, system-admin-visible interface to this. The settings in question can be set as properties in the zfs-diagnosis.conf file, similar to props set in other FMA modules.
>
> The source is also currently available:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c#957
>
> Examples of setting FMA module properties are in /usr/lib/fm/fmd/plugins/cpumem-retire.conf and other .conf files.
>

Thanks for the links Richard.

Looking through the code, the only configurable property read from the file is remove_timeout. Anything else will require code changes.

Maybe it's time to upgrade the box to something newer than Solaris 10!

-- 
Ian.
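For completeness, a hedged sketch of overriding the one property that does appear to be read from the file; the time-string value format shown is an assumption based on how other fmd time properties are commonly written, not verified against zfs_de.c:

    # /usr/lib/fm/fmd/plugins/zfs-diagnosis.conf
    # remove_timeout is the only property the module reads from its .conf file
    setprop remove_timeout 30sec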