Ian Collins
2010-Aug-15 23:59 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.

Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.

I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?

-- 
Ian.
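A minimal sketch of how the checksum errors might be spotted and cleared before a drive is faulted, assuming a pool named "tank"; the pool and device names are illustrative only:

    # show per-device error counters and any affected files
    zpool status -v tank

    # inspect the underlying FMA error reports logged by the driver
    fmdump -eV | less

    # reset the error counters for one device
    # (omit the device to clear the whole pool)
    zpool clear tank c4t3d0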
Richard Elling
2010-Aug-16 00:37 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
On Aug 15, 2010, at 4:59 PM, Ian Collins wrote:

> I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.
>
> Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.
>
> I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?

There is not a documented, system-admin-visible interface to this. The settings in question can be set as properties in the zfs-diagnosis.conf file, similar to props set in other FMA modules.

The source is also currently available:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c#957

Examples of setting FMA module properties are in /usr/lib/fm/fmd/plugins/cpumem-retire.conf and other .conf files.

If you get this to work, please publicly document your changes and why you felt the new settings were better.
 -- richard

-- 
Richard Elling
richard at nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com
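For reference, a hedged sketch of what such a module property override might look like, assuming the setprop syntax used by the other fmd plugin .conf files; "checksum_limit" is a hypothetical property name used only for illustration, not one confirmed to exist in zfs-diagnosis:

    # /usr/lib/fm/fmd/plugins/zfs-diagnosis.conf
    # fmd module properties are set as "setprop <name> <value>"
    # NOTE: checksum_limit is a made-up example property
    setprop checksum_limit 50

    # reload the module so the new property value is read
    fmadm unload zfs-diagnosis
    fmadm load /usr/lib/fm/fmd/plugins/zfs-diagnosis.so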
Ian Collins
2010-Aug-16 01:08 UTC
[zfs-discuss] Is the error threshold for a degraded device configurable?
On 08/16/10 12:37 PM, Richard Elling wrote:

> On Aug 15, 2010, at 4:59 PM, Ian Collins wrote:
>
>> I look after an x4500 for a client and we keep getting drives marked as degraded with just over 20 checksum errors.
>>
>> Most of these errors appear to be driver or hardware related and their frequency increases during a resilver, which can lead to a death spiral. The increase in errors within a vdev during a resilver (I recently had three drives in an 8 drive raidz vdev "degraded") points to high read activity triggering the bug.
>>
>> I would like to raise the threshold for marking a drive degraded, to give me more time to spot and clear the checksum errors. Is this possible?
>>
> There is not a documented, system-admin-visible interface to this. The settings in question can be set as properties in the zfs-diagnosis.conf file, similar to props set in other FMA modules.
>
> The source is also currently available:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/fm/modules/common/zfs-diagnosis/zfs_de.c#957
>
> Examples of setting FMA module properties are in /usr/lib/fm/fmd/plugins/cpumem-retire.conf and other .conf files.
>

Thanks for the links Richard.

Looking through the code, the only configurable property read from the file is remove_timeout. Anything else will require code changes.

Maybe it's time to upgrade the box to something newer than Solaris 10!

-- 
Ian.
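For completeness, a hedged sketch of overriding the one property that does appear to be read from the file; the time-string value format shown is an assumption based on how other fmd time properties are commonly written, not verified against zfs_de.c:

    # /usr/lib/fm/fmd/plugins/zfs-diagnosis.conf
    # remove_timeout is the only property the module reads from its .conf file
    setprop remove_timeout 30sec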