From my experience I'll try to answer the questions... and then the
experts will probably correct me.
So my question is, once a Lustre cluster is set up and configured, how much
administration does it require in practice?
Not much, assuming users don't abuse it, fill it up, etc. Beyond that it's
standard sysadmin of the nodes: updates, monitoring, and so on.
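For routine monitoring, a couple of standard commands cover most of it. A
sketch, assuming a client mount at /mnt/lustre (just an example path):

    # Per-OST/MDT space usage; a filling target shows up here early
    lfs df -h /mnt/lustre
    # On a server, ask whether Lustre considers itself healthy
    lctl get_param health_check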
More precisely: Apart from hardware failures, how often (and under what
circumstances) should I expect the file system itself to lose integrity and
require manual intervention?
By design, none. In reality there are times when a filesystem may need some
help; mostly these are caused by hardware issues.
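When that happens, the usual first steps are checking the server's kernel log
and, if a backing device is suspect, fscking it with the target unmounted. A
sketch, assuming ldiskfs-backed targets; /dev/sdb is only an example device:

    # On the affected server, look for Lustre errors in the kernel log
    dmesg | grep -i LustreError
    # With the OST/MDT unmounted, check its backing filesystem
    # (ldiskfs targets need the Lustre-patched e2fsprogs)
    e2fsck -f /dev/sdb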
For example, if someone hard resets a client, will the file system always
recover automatically? How about if they reset a server (MDS or OSS)?
(Obviously unwritten data at the time of such resets could be lost or otherwise
damaged; that is not what I mean. I mean, should I expect to need to manually
run fsck, or to track down and release locks, or to do anything else to restore
the consistency of the
cluster?)
Apart from the normal in-flight data loss, once the servers are back online
the cluster should recover into a usable state on its own. Given some help,
the filesystem can even come back without an OSS and run until it can be
repaired (files whose objects were on the missing OSTs will appear empty).
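The "given some help" part is usually marking the lost OST inactive so the
MDS stops striping onto it and clients stop waiting for it. A sketch of that
procedure; the OST name lustre-OST0002 and the device number are examples,
use whatever lctl dl actually reports:

    # On the MDS, find the osc device number for the dead OST
    lctl dl | grep OST0002
    # Deactivate it so new files are not placed there
    lctl --device 11 deactivate
    # On clients, stop I/O from blocking on the missing OST
    lctl set_param osc.lustre-OST0002-*.active=0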
For another example, if the file system runs out of space, can I recover from
that merely by deleting some files? Or would additional Lustre-specific action
be needed to restore the cluster to a consistent state?
Yes, deleting files frees up space. But consider that there are many places
you can run out of space: each OST has its own capacity, and if a file you
are appending to resides on a full one, removing files from other OSTs will
not help. New files will be placed preferentially on less-used OSTs.
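To see which OSTs are full and to track down files living on a particular
one, something like this works (a sketch; OST index 2 and the paths are
examples, and older versions may want --obd with the OST UUID instead of
--ost with an index):

    # Per-OST usage; a single full OST stands out here
    lfs df -h /mnt/lustre
    # List files that have objects on OST index 2
    lfs find /mnt/lustre --ost 2
    # Show where a specific file's stripes live
    lfs getstripe /mnt/lustre/some/file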
In general, how "self-healing" is a Lustre cluster?
It does pretty well once you get past the hardware issues and it's configured
correctly. Most of the time I feel very comfortable just telling other admins
to reboot a node, or to fail one over for maintenance, since Lustre recovers
and users only see short pauses in I/O, if anything.
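After a reboot or failover you can watch clients reconnect via the recovery
status on the servers; a sketch (exact parameter names vary a bit between
Lustre versions):

    # On an OSS: per-OST recovery state (COMPLETE once clients are back)
    lctl get_param obdfilter.*.recovery_status
    # On the MDS, the equivalent for the MDT
    lctl get_param mdt.*.recovery_status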
Evan Felix