Howdy all,

I had a question from a colleague and did not have a ready answer. What is the community's experience with putting together a small and inexpensive cluster that serves Lustre from (some of) the compute nodes' local disks?

They have run some simple tests using a) just local disk, b) a simple NFS service mounted on the compute nodes, and c) Lustre with the OSS and MDS on the same node.

A typical workload for them is to compile the "VisIt" visualization package. On a local disk this takes 2 to 3 hours. On NFS it was closer to 24 hours, and on the small Lustre example it was about 5 hours. Now they'd like to go a little further and try to find a Lustre solution that would improve performance compared to local disk. Their workload will be mostly metadata intensive rather than bulk I/O intensive. Is there any experience like that out there?

Cheers,
Andrew

Notes from the one asking the question:
----------------------------------------------------------
What I would like to do now is to develop the cheapest small cluster possible that still has good I/O performance. NetApp appliances raise the cost significantly. Also, I think the whole system must come out of the box with the application and all of its dependencies built, and with good I/O.

So one possible approach would be a system with a head node and N compute nodes, each with multiple CPUs and cores, of course. I can then imagine a Lustre file system with the MDS on the head node and perhaps M OSSs on the compute nodes, which then serve up their local disks. Of course, the compute nodes would then be running both the computational application (likely on all cores) and zero or one OSS each.

It sounds like, from what you are saying, that at a minimum I would need two interfaces per node: one over which the MPI communication for the apps goes, and one for serving the Lustre file system on those nodes that serve it. Is this right?

Is this a reasonable direction to go? (Having both OSS and computation on some nodes.)

Are there examples of good system designs out there?
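To make that concrete, here is a rough sketch of the layout I have in mind. This is not a tested recipe: the filesystem name, hostnames, device names, and interface names below are all placeholders.

    # /etc/modprobe.d/lustre.conf on every node: keep MPI on one NIC (eth0)
    # and Lustre traffic on a second NIC (eth1); interface names are assumed
    options lnet networks="tcp0(eth0),tcp1(eth1)"

    # head node: combined MGS/MDT on a spare disk (hypothetical /dev/sdb)
    mkfs.lustre --fsname=scratch --mgs --mdt /dev/sdb
    mount -t lustre /dev/sdb /mnt/mdt

    # each compute node acting as an OSS: format its local disk as an OST
    mkfs.lustre --fsname=scratch --ost --mgsnode=head@tcp1 /dev/sdb
    mount -t lustre /dev/sdb /mnt/ost

    # every node then mounts the filesystem as a client
    mount -t lustre head@tcp1:/scratch /mnt/scratch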
On 2011-04-21, at 11:52 AM, Andrew Uselton wrote:
> I had a question from a colleague and did not have a ready answer.
> What is the community's experience with putting together a small and
> inexpensive cluster that serves Lustre from (some of) the compute
> nodes' local disks?
>
> They have run some simple tests using a) just local disk, b) a simple
> NFS service mounted on the compute nodes, and c) Lustre with the OSS
> and MDS on the same node.
>
> A typical workload for them is to compile the "VisIt" visualization
> package. On a local disk this takes 2 to 3 hours. On NFS it was closer
> to 24 hours, and on the small Lustre example it was about 5 hours. Now
> they'd like to go a little further and try to find a Lustre solution
> that would improve performance compared to local disk. Their workload
> will be mostly metadata intensive rather than bulk I/O intensive. Is
> there any experience like that out there?

Running a client on an OSS is a configuration that isn't "officially" supported, but I've done it for ages on my home system. There are potential memory deadlocks if the client is consuming a lot of RAM and also doing heavy I/O, but over time we've removed a lot of them. That said, there is no harm in trying this, and it is relatively straightforward to set up. One important factor is to ensure that the OSTs on a node are mounted before the client mountpoint (e.g. via fstab).

Unfortunately, there is as yet no MDS policy that would preferentially allocate and store file objects on an OST local to the client. That would be an interesting optimization, and not too hard for someone to implement.

I also heard once about someone using Lustre OSTs backed by a ramdisk for truly "scratch" filesystems that were quite fast. The filesystem would lose data whenever a node rebooted, but it is interesting for some limited use cases.

That said, there are also parallel compilation tools, and ccache, that can speed up compiles dramatically and don't depend on shared storage at all. That doesn't help if compilation isn't their only workload, but it is worth mentioning.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
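P.S. In case it helps, a minimal sketch of the fstab ordering I mentioned. The device name, filesystem name, and mount points are made up; the point is only that the local OST line comes before the client line, since fstab entries are mounted in order at boot.

    # local OST first, so it is mounted before the client
    /dev/sdb            /mnt/ost      lustre  defaults,_netdev  0 0
    # then the client mount of the whole filesystem
    head@tcp1:/scratch  /mnt/scratch  lustre  defaults,_netdev  0 0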
Greetings,

I have been part of a team that has done this twice: once at the NASA Goddard Space Flight Center Hydrological Sciences Branch and once more at the Center for Research on Environment & Water. Both were successful experiences, I thought.

We used commercial off-the-shelf PC hardware and managed switches to build a Beowulf-style cluster consisting of compute nodes, OSS nodes, and MDS nodes. The OSS and the MGS/MDS units were separate, as per the recommendation of the Lustre team. The back-end OST storage units were 4U boxes of SATA disks connected to the OSS units via CX4 (I think) cables. We used PERC 6/i RAID and the corresponding MegaCli64 software tool on the OSS units to manage the disks within. The OS was Red Hat-based CentOS 4, upgraded before I left to CentOS 5.5. The OST disks were formatted for the Lustre file system.

We were able to successfully export the Lustre mount points via NFS from the main client box. We used the data on the Lustre file system to produce and display Earth science images on an ordinary web interface (using a combination of IDL proprietary imaging software and the freely available GrADS imaging software from IGES).

We chose the Lustre file system for the project because of its price point (free and open source) and the fact that it performed better for our purposes than GFS and the, back then early, Gluster.

Just a data point for you.

megan
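For what it's worth, the NFS re-export amounted to something like the following on the Lustre client box that did the exporting; the mount point and network range here are placeholders, not our actual configuration.

    # /etc/exports on the Lustre client that re-exports the filesystem
    /mnt/scratch  192.168.1.0/24(rw,no_root_squash,sync)

    # reload the export table after editing
    exportfs -ra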