On Thu, 30 Jun 2011 20:02:19 -0700 Timothy Smith wrote:
TS> First posting here, hopefully I'm doing it right =)
TS> I also posted this to the FreeBSD forum, but I know some hast folks
TS> monitor this list regularly and not so much there, so...
TS> Basically, I'm testing failure scenarios with HAST/ZFS. I got two
TS> nodes, scripted up a bunch of checks and failover actions between the
TS> nodes. Looking good so far, though more complex than I expected. It
TS> would be cool to post it somewhere to get some pointers/critiques, but
TS> that's another thing.
TS> Anyway, now I'm just seeing what happens when a drive fails on the
TS> primary node.
TS> Oddly/sadly, NOTHING!
TS> HAST just keeps on ticking and doesn't change the state of the failed
TS> drive, so the zpool has no clue the drive is offline. The
TS> /dev/hast/<resource> device remains. hastd does log some errors to the
TS> system log like this, but nothing more.
TS> messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Unable
TS> to flush activemap to disk: Device not configured.
TS> messages.0:Jun 30 18:39:59 nas1 hastd[11066]: [ada6] (primary) Local
TS> request failed (Device not configured): WRITE(4736512, 512).
Although the request to the local drive failed, it succeeded on the remote
node, so no data was lost: the write was considered successful and no error
was returned to ZFS.
TS> So, I guess the question is, "Do I have to script a cronjob to check
TS> for these kinds of errors and then change the hast resource to 'init'
TS> or something to handle this?" Or is there some kind of hastd config
TS> setting that I need to set? What's the SOP for this?
Currently the only way to know about such errors is to monitor the logs. It
would not be difficult to hook an event for these errors into the HAST code
(as is already done for connect/disconnect, syncstart/syncdone, etc.) so that
one could script what to do when an error occurs, but I am not sure it is a
good idea: the errors may be generated at a high rate.
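
If you do go the log-monitoring route, something along these lines, run from
cron, could demote the resource when such errors show up. It is only an
untested sketch: the resource name and log path are assumptions, and in
practice you would also want to remember which messages you have already
acted on so the same error does not trigger the demotion again and again.

    #!/bin/sh
    # Untested sketch: demote a HAST resource to init when hastd has
    # logged local I/O failures for it.  LOG and RES are assumptions.
    LOG=/var/log/messages
    RES=ada6

    if grep -q "\[${RES}\] (primary) Local request failed" "${LOG}"; then
        # Take the resource out of service so its /dev/hast/ device goes
        # away and ZFS finally sees the drive as missing.
        hastctl role init "${RES}"
        logger -t hast-check "demoted ${RES} to init after local I/O errors"
    fi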
TS> As something related too, when the zpool in FreeBSD does finally
TS> notice that the drive is missing because I have manually changed the
TS> hast resource to INIT (so the /dev/hast/<res> is gone), my zpool
TS> (raidz2) hot spare doesn't engage, even with "autoreplace=on". The
TS> zpool status of the degraded pool seems to indicate that I should
TS> manually replace the failed drive. If that's the case, it's not
TS> really a "hot spare". Does this mean the "FMA Agent" referred to in
TS> the ZFS manual is not implemented in FreeBSD?
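
As far as I know there is nothing in FreeBSD yet that pulls the spare in
automatically, so the replacement has to be done by hand, roughly like this
(the pool and device names below are only an example):

    # Sketch only: "tank", hast/ada6 and hast/ada8 are assumed names.
    # Attach the hot spare in place of the failed HAST device:
    zpool replace tank hast/ada6 hast/ada8
    # Once the original disk is repaired and resilvered back in, the
    # spare can be returned to the spare list with:
    #   zpool detach tank hast/ada8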
TS> thanks!
--
Mikolaj Golub