Hi Anselm,
On Thu, 2006-05-11 at 16:38 +0200, Anselm Strauss wrote:
> hi.
>
> i had the feeling most people are using lustre in striping mode to
> reach high bandwidth. there are certainly good techniques in lustre
> to reach also good availability, but i was still wondering what is
> the typical setup for a lustre system striped over multiple servers
> but not missing too much availability.
> so, if there are people willing to tell a bit about their lustre
> setup and availability issues i would welcome this a lot, and they
> might also get some feedback.
> is it, that
>
> 1) you are not providing a high available storage for the users, e.g.
> forcing them to store only short-lived data on the system which they
> are archiving somewhere else?
Yes. In our case, the somewhere else is supposed to be IBM's HPSS.
Other
clusters, personal storage, large servers, and NFS filers are also
leveraged by the users.
>
> 2) you use one or more fail-overed hosts for each oss?
Yes. One fail-over host per OSS. Active-active.
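Roughly, the shape of it is something like this (a sketch only -- the
hostnames, NIDs, and devices are made up, and the mkfs.lustre-style syntax
shown here may not match what a given Lustre release actually uses):

  # Format an OST on a shared LUN so that either oss1 or oss2 can serve it:
  mkfs.lustre --fsname=scratch --ost --mgsnode=mds1@tcp0 \
      --failnode=oss2@tcp0 /dev/sdb

  # Normal operation: oss1 mounts (serves) the OST.
  oss1# mount -t lustre /dev/sdb /mnt/ost0

  # If oss1 dies: oss2 mounts the same shared LUN and takes over.
  oss2# mount -t lustre /dev/sdb /mnt/ost0

Active-active just means each OSS normally serves its own OSTs while also
standing by for its partner's.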
>
> 3) you use failover and even mirror your disks behind lustre, e.g. drbd?
Nothing else. Just the fail-over.
>
> 4) you do not have redundancy but good soft- and hardware that is
> highly available by itself?
Yes. Kind of. The disks are all attached to hardware RAID 5 controllers.
Host-controller link is FC2 but no switches are present. So, each host
is direct-attached to two RAID controllers. We've run two cable configs
at our site on various machines:
1)
[host A] ----- [ctlr 1] \
                         | - [storage]
[host B] ----- [ctlr 2] /
2)
               [ctlr 1] \
                         | - [storage]
[host A] ----- [ctlr 2] /
         \
          ---- [ctlr 1] \
                         | - [storage]
[host B] ----- [ctlr 2] /
         \
          ---- [ctlr 1] \
                         | - [storage]
In (2), above, the last OSS is connected to the last RAID and the first.
We don't use that cabling anymore, anywhere, that I know of.
>
> we are planning to use lustre for a linux cluster of about 150 nodes.
> what we need for sure is a fast network storage on each node which is
> only possible using multiple oss and striping. but our users are also
> accustomed to store all their research data on the cluster and have
> it regularly backed up. i was thinking of providing a second lustre
> service for archiving data which is not striped therefore slow, but
> highly available. users will have to copy their data from one system
> to the other for computation, and copy back again their results to
> archive them. i'm also not sure how the crash of one oss will affect
> a striped system. i think already striped data is gone but new data
> will be written to the remaining servers. i don't know if lustre is
> able to balance striped data if a crashed server comes back online,
> as it is possible with lvm.
We use multiple Lustre file systems on the same machine to achieve some
form of non-stop availability. Certainly not high-availability of a
given FS, though. Altogether, five Lustre file systems of various sizes and
capabilities are available for production use, ranging from 16 to 64 OSS nodes in
size. There are about 10,000 Lustre clients. Lustre has been in use for
about a year.
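From a client's point of view that just looks like several independent
mounts, something like the following (names are illustrative only, and the
exact client mount syntax depends on the Lustre version):

  mount -t lustre mds-a@tcp0:/scr1 /lus/scr1
  mount -t lustre mds-b@tcp0:/scr2 /lus/scr2

If one file system is down, the others stay usable; that is the extent of
the "non-stop" part.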
We have never contemplated back-ups for our production parallel file
systems. Only home directories.
Right now, the death of an OSS is problematic due to a bug in Lustre on
our machine. A bug unique to our machine architecture, Cray XT3? Maybe
someone from CFS will weigh in and verify that? In any case, fail-over
does not work for us. We try to unmount what we can, remove knowledge of
the problem file system from the compute and login service nodes, and
continue production runs. Not wonderful but functional, mostly.
Depending on what went wrong, data may or may not be lost. If it was an
OSS that died, or the network partitioned, the data is often still there.
When it's the storage hardware, we have lost data on the affected
volume(s).
We have never taken advantage of Lustre's ability to continue operation
after replacing a failed storage volume -- we mount read-only, tell
users to get their surviving data off, reformat, and restart the FS.
To my knowledge, Lustre does not rebalance striped data after a faulty
OSS or storage volume is repaired. We've never tried it though, as I
mentioned above.
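For what it's worth, striping is a per-file/per-directory attribute, so how
much a dead OST hurts depends on how the data was striped when written. A
sketch of checking and setting it (the path is made up, and lfs option
syntax varies between releases -- older ones take positional arguments):

  # New files under this directory get striped across 4 OSTs:
  lfs setstripe -c 4 /lus/scratch/mydir

  # See which OSTs hold the stripes of an existing file:
  lfs getstripe /lus/scratch/mydir/output.dat

  # Per-OST space usage summary:
  lfs df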
Hope something in there is useful.
--Lee
>
> cheers,
> anselm strauss
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>