Jeremy Mann wrote:
> I have set up 20 compute nodes as OSTs, paired with each other like
> compute-0-0 -> 0-1, 0-2 -> 0-3 and so on. However, this morning one of
> the drives in an OST failed. The node didn't reboot, it just remounted
> its Lustre OST device read-only. This caused our normal storage scripts
> to fail.
>
You could mount your devices with errors=panic to panic the node instead
of remounting read-only, giving your HA scripts something more useful to
work with.

> I had to reboot the node anyway to replace the drive, so that's when the
> failover to the next node happened. I can see on the Meta server that
> Lustre did indeed switch to the failover node, however, the files that
> were associated with that node are visible but not readable. Shouldn't
> the failover node have prevented this?
>
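Regarding the errors=panic suggestion above, the exact mechanism depends
on your setup; on an ext3/ldiskfs-backed OST, something along these lines
should work (the device and mount point below are placeholders -- substitute
your own):

```shell
# Make the backing ext3/ldiskfs filesystem panic on errors
# (/dev/sdb1 is a placeholder for your OST's block device):
tune2fs -e panic /dev/sdb1

# Or pass the option at mount time instead:
mount -t lustre -o errors=panic /dev/sdb1 /mnt/ost
```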
The files are visible because the namespace is contained on the MDT, not
on the individual OSTs. All files will be visible; files on the affected
OST will be inaccessible.

> The drive that failed is completely dead, I can't even mount it to try a
> dd to restore the filesystem, so it looks like I'm going to have to
> rebuild the filesystem.
>
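Before rebuilding, you may want a list of which files had objects on the
dead OST; lfs can produce one. A sketch -- the OST UUID and client mount
point below are placeholders for your own:

```shell
# Find the UUID of the affected OST in the per-OST listing:
lfs df /mnt/lustre

# List files with objects on that OST (UUID is a placeholder):
lfs find --obd lustre-OST0003_UUID /mnt/lustre
```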
A disk failure is considered an unrecoverable error as far as Lustre is
concerned. Your back-end storage must be reliable for Lustre to function
-- that's what RAID is for. Dual-ported standalone RAID boxes allow
failover Lustre servers to take over from each other in case of _node_
failure, not _disk_ failure.

In the meantime, you can deactivate the affected OST using lctl on the
clients and the MDT; this will allow access functions to complete without
errors (the files on the affected OST will appear zero-length, but the
rest of your files will be OK).
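A sketch of the deactivation step, run on the MDS and on each client; the
OST name and device number below are placeholders -- check lctl dl output
on your own nodes:

```shell
# Find the OSC device number corresponding to the failed OST:
lctl dl | grep OST0003

# Deactivate it by device number (7 here is just an example):
lctl --device 7 deactivate
```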