Hi,

I wonder what's the current status of Lustre Lite, is it still
supported? We're looking for a CFS for an embedded system without a lot
of nodes. Lustre sounds a bit too heavyweight for that.

Lin Shen
Cisco Systems
On Oct 27, 2006 11:37 -0700, Lin Shen (lshen) wrote:
> I wonder what's the current status of Lustre Lite, is it still
> supported? We're looking for a CFS for an embedded system without a lot
> of nodes. Lustre sounds a bit too heavyweight for that.

How much RAM is in the embedded system? With tuning of various buffer
sizes and thread counts, the amount of RAM used on a client could be
reduced substantially (at some cost in performance), depending on just
how tight the RAM is. Also, it would likely be possible to shrink the
code in a number of ways by adding #ifdefs for different functionality
you might not need:

- disable flock
- disable quotas
- disable O_DIRECT and the associated IO path
- make the debugging macros no-ops (a sketch of this follows below)

We'd likely accept patches to make these configure options if it is
done cleanly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
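On the last list item above: a minimal userspace sketch of compiling
debug macros down to no-ops with an #ifdef. The macro names mirror
Lustre's CDEBUG/CERROR, but the bodies here are invented for
illustration and are not the kernel implementations.

/* Hedged sketch: compile trace macros out with an #ifdef while
 * keeping error messages. Not the actual Lustre definitions. */
#include <stdio.h>

#ifdef MINIMAL_DEBUG
/* No-op: the compiler discards the arguments entirely. */
#define CDEBUG(mask, fmt, ...) do { } while (0)
#else
#define CDEBUG(mask, fmt, ...) \
        fprintf(stderr, "[%s:%d] " fmt, __FILE__, __LINE__, ##__VA_ARGS__)
#endif

/* Errors survive even in the minimal build. */
#define CERROR(fmt, ...) \
        fprintf(stderr, "ERROR [%s:%d] " fmt, __FILE__, __LINE__, ##__VA_ARGS__)

int main(void)
{
        CDEBUG(0x1, "entering %s\n", "main"); /* dropped if MINIMAL_DEBUG */
        CERROR("this message always survives\n");
        return 0;
}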
Lin Shen (lshen) wrote:
> Hi,
>
> I wonder what's the current status of Lustre Lite, is it still
> supported? We're looking for a CFS for an embedded system without a
> lot of nodes. Lustre sounds a bit too heavyweight for that.

Lustre Lite is just the historical name of the client software -- there
is no reduced-footprint Lustre client.
On Oct 30, 2006 10:20 -0800, Lin Shen (lshen) wrote:
> Thanks for getting back to me. The CFS would be on a series of
> platforms, so the RAM number varies. The low end could be as little as
> 512MB and that's for the whole system including routing, VoIP etc.

The lustre clients (IO nodes) on the BG/L system (#1 supercomputer)
"only" have 1GB of RAM and they do IO for up to 128 compute clients at
a time. Lustre as a whole (MDS + 5 OSTs + 2 client mounts) can run
inside a 64MB UML.

The lustre code already has internal tunables that adjust the default
buffer size, thread count, etc based on the amount of RAM on the node.
The only issue is that by restricting the amount of memory that lustre
uses it might negatively impact the performance, because the client
cannot keep enough IO in flight to saturate the network.

> What's the footprint of Lustre and how much lower could it be reduced
> to with the code optimization you mentioned? Maybe we can cut it even
> more w/o metadata server cluster etc.

The clustered metadata isn't in the released Lustre code, and even if
it were available there wouldn't be a requirement to use it.

With debugging symbols the lustre modules are a whopping 3-8MB apiece,
and you need about 10 of them loaded to have a functioning client. If
you strip out the debugging symbols you lose about 70% of that.

If you _really_ wanted to slim down the modules you could make the
CDEBUG() macro a no-op for anything except error messages, at the
expense of making debugging much harder. That could be mitigated (and
I'd be thrilled to see it) by having systemtap scripts to replace the
CDEBUG trace/debug functionality.

In terms of Lustre buffers and such, on small clients this can be
limited to a few MB in total. On larger clients it grows into the 10s
of MB, plus of course whatever the VM gives the filesystem for data
cache. Lustre has its own tunable limits on how much dirty data a
client can have, as older kernels didn't do a good job with that.

> And I don't think the number of nodes will ever exceed a dozen.

The number of clients doesn't impact the amount of memory on other
clients. Lustre is strictly a client-server filesystem in that regard.
It does impact memory usage on the server, which would be more
important if you also want the server to live inside this environment.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
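As a footnote to the dirty-data limits mentioned above, a small sketch
of setting a client tunable by writing to /proc. The path convention
(/proc/fs/lustre/osc/<target>/max_dirty_mb) follows the Lustre 1.x
layout, but treat the exact path as an assumption to verify on your
installation.

/* Hedged sketch: cap a client-side tunable such as the dirty-data
 * cache by writing a value into a /proc file. */
#include <stdio.h>
#include <stdlib.h>

static int set_tunable(const char *path, const char *value)
{
        FILE *f = fopen(path, "w");

        if (f == NULL) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", value);      /* e.g. "4" for a 4MB cap */
        return fclose(f);
}

int main(int argc, char *argv[])
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <proc-path> <value>\n", argv[0]);
                return EXIT_FAILURE;
        }
        /* e.g. /proc/fs/lustre/osc/<target>/max_dirty_mb (assumed path) */
        return set_tunable(argv[1], argv[2]) ? EXIT_FAILURE : EXIT_SUCCESS;
}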
On Nov 01, 2006 10:27 -0800, Lin Shen (lshen) wrote:
> Yes, the server will be running in the embedded environment as well.

With only tens of clients there shouldn't be a problem with 512MB of
RAM.

> A few more things to clarify:
>
> 1. The definition of a client. Is it one client per OS instance?

Correct.

> 2. In our system, there will potentially be a wide range of storage
> media such as disk array, hard disk, flash and USB devices and we want
> to share all of them. Is it feasible? I'm having a hard time picturing
> sharing among those devices with very different performance capability
> and reliability.

This isn't really what Lustre was designed to do. It currently assumes
uniform IO performance (and with 1.4 it is also best to have the same
size devices). That isn't to say you cannot have Lustre running on
different block devices, but rather (a) it isn't very efficient to use
on small devices because of overhead, and (b) aggregating all of these
devices into a single filesystem wouldn't work very well.

> 3. We also want the system to share a database. Is this related to
> CFS?

You should be able to run a database on top of Lustre, but this is not
a common mode of operation for our current customers and I don't know
how well it functions.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Fri, 3 Nov 2006, Andreas Dilger wrote:
> You should be able to run a database on top of Lustre, but this is not
> a common mode of operation for our current customers and I don't know
> how well it functions.

Maybe poor performance is what limits adoption of Lustre for
transactional workloads: I heard some stories of users trying
database-like access (i.e. updates of many small files) on Lustre, with
mixed or terrible results.

NFS gives much better results with small files, but it does not scale
well, and it lacks good POSIX conformance (e.g. wrt caching).

Too bad we don't have the best of both worlds!

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
Hi Jean-Marc,

Do you have a test on which NFS performs much better? We'd like to know
more about that.

If you go back on this list, you will see that Lustre can handsomely
beat NFS at unpacking archives, for example. So I am curious.

- Peter -
Jean-Marc,

Perhaps a quick run of "fileop -f 100" from the Iozone suite might help
identify which meta-data operations are problematic?

Enjoy,
Don Capps
On Sat, 4 Nov 2006, Peter J. Braam wrote:
> Do you have a test on which NFS performs much better? We'd like to
> know more about that.

One test I can mention is postmark, originally written by Netapp folks,
now available from Debian:
http://packages.debian.org/stable/utils/postmark

In the test, postmark was configured to perform 10k "transactions" on
10k files between 500B and 22kB.

This was run with up to 16 processes on *one* client connecting to
various server configs on several interconnects and storage systems,
running 1.4.6 and 1.6beta4.

For convenience, I wrote the attached wrapper script:
$ ssh $client postmark.sh -n 10000 -t 10000 -j $numprocs > $log

The metric was the number of transactions per second (sum of column 3
across all lines of $log). NFS reached 800-1000 t/s with 2-8 processes
(then performance dropped) over GigE, while Lustre peaked at 200 t/s
with 2 processes on a tiny GigE cluster (1 MDS, 1 OSS serving 1 OST). A
bigger cluster could not give more than 900 t/s.

Back then I suspected that the test was probably network bound (because
of high interrupt and packet rates vs. low CPU and disk usage), but I
could not find a good measurement. A tool showing metadata RPC
statistics (such as LMT maybe, but I didn't know it then) would have
been useful.

> If you go back on this list, you will see that Lustre can handsomely
> beat NFS at unpacking archives, for example. So I am curious.

I suppose you refer to the thread "I/O performance on small files" (end
of June). IIRC the recipes mentioned there had been used in the tests.

Now I no longer have time to spend on these issues, but ideas for
improvement would still be interesting.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net

Attachment: postmark.sh (application/x-sh, 2599 bytes)
http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061103/983ead4e/postmark.sh
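For readers without postmark at hand, a rough C stand-in for the
small-file transaction pattern it exercises (create + write + delete
counted as one transaction); this is an illustrative approximation, not
postmark itself, and the sizes mirror the 500B-22kB range above.

/* Micro-benchmark sketch: time N small-file create/write/delete
 * "transactions" and report transactions per second.
 * Compile with -lrt on older glibc for clock_gettime(). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NFILES  1000
#define MINSIZE 500
#define MAXSIZE 22528

int main(void)
{
        static char buf[MAXSIZE];
        char name[64];
        struct timespec t0, t1;
        double elapsed;
        int i;

        memset(buf, 'x', sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (i = 0; i < NFILES; i++) {
                size_t size = MINSIZE + rand() % (MAXSIZE - MINSIZE);
                int fd;

                snprintf(name, sizeof(name), "pm_%d.dat", i);
                fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0 || write(fd, buf, size) != (ssize_t)size) {
                        perror(name);
                        return EXIT_FAILURE;
                }
                close(fd);
                unlink(name);   /* create+write+delete = 1 transaction */
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f transactions/sec\n", NFILES / elapsed);
        return 0;
}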
On Nov 04, 2006 02:58 +0100, Jean-Marc Saffroy wrote:
> Back then I suspected that the test was probably network bound
> (because of high interrupt and packet rates vs. low CPU and disk
> usage), but I could not find a good measurement. A tool showing
> metadata RPC statistics (such as LMT maybe, but I didn't know it then)
> would have been useful.

You could use llstat.pl to show MDS RPC stats, or are you thinking of
something else?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Your email explains a fair amount, and here are two reasons
contributing to this.

One is the mdc semaphore - our clients are effectively single-threading
metadata updates. It will be a little while before we get to this; it
is not that hard.

The second is that many operations (but not the most critical
open/create/read/write) have too many RPCs - this one we have a good
plan for, and I think it might be fixed quite soon.

Thanks!

- Peter -
On Sat, 4 Nov 2006, 'Andreas Dilger' wrote:
> You could use llstat.pl to show MDS RPC stats, or are you thinking of
> something else?

No, actually that's exactly what I had in mind. :-) Thanks for the tip!

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
> > 2. In our system, there will potentially be a wide range of storage
> > media such as disk array, hard disk, flash and USB devices and we
> > want to share all of them. Is it feasible? I'm having a hard time
> > picturing sharing among those devices with very different
> > performance capability and reliability.
>
> This isn't really what Lustre was designed to do. It currently assumes
> uniform IO performance (and with 1.4 it is also best to have the same
> size devices). That isn't to say you cannot have Lustre running on
> different block devices, but rather (a) it isn't very efficient to use
> on small devices because of overhead, and (b) aggregating all of these
> devices into a single filesystem wouldn't work very well.

How hard would it be to make Lustre work reasonably well with hybrid
storage media? Are you aware of any other Open Source or Commercial CFS
that does this?

Thanks
lin
On Nov 11, 2006 15:04 -0800, Lin Shen (lshen) wrote:
> In talking to the application team, they think there are a number of
> reasons that NFS is not good enough. Do you think Lustre can
> adequately address the following issues?
>
> 1) NFS adds too many data copies and complex marshaling of messages.
> Particularly for virtual memory paging, this is a performance killer.

Lustre supports O_DIRECT and RDMA network transfers (zero-copy send
only for TCP). With O_DIRECT IO and, say, InfiniBand, it is possible to
have no data copies (except RDMA over the network) all the way to the
disk.

> 2) NFS directory traversal is very slow because every path element
> requires message exchanges between client and server. This means that
> administrative tasks like backups are really expensive.

Lustre is currently the same in this regard.

> 3) NFS locking has always been problematic, sometimes with deadlock
> cases, sometimes problems with recovery after node failure.
> Simultaneous access of a single file from multiple nodes has
> additional caching and other coherency problems.

Lustre has full data coherency between clients.

> 4) NFS doesn't provide any notion of raw device access, so the
> optimizations created in storage layers running on top of NFS
> (databases, specialized file systems, etc.) don't work as expected.

Is this O_DIRECT, or something else?

> 5) The layering of volume management under NFS doesn't really work. If
> the storage media is spread across several nodes, with the NFS server
> located on a single node, the read/write requests have to move from
> client to server to media node. With a cluster file system, the
> read/write requests should always be from client to media node.

This is the case with Lustre - client IO always goes directly (and
only) to the storage node(s) that contain the data. A single file can
be striped over multiple storage nodes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
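A minimal sketch of the O_DIRECT path from point 1: the buffer must be
allocated with suitable alignment (4096 bytes is a common requirement,
but verify it for the target filesystem).

/* Hedged sketch: write one aligned block with O_DIRECT, bypassing the
 * page cache so data moves straight from this buffer to disk/network. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096

int main(int argc, char *argv[])
{
        void *buf;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return EXIT_FAILURE;
        }

        fd = open(argv[1], O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }

        /* O_DIRECT requires an aligned buffer, length, and offset. */
        if (posix_memalign(&buf, ALIGN, ALIGN)) {
                perror("posix_memalign");
                return EXIT_FAILURE;
        }
        memset(buf, 'x', ALIGN);

        if (write(fd, buf, ALIGN) != ALIGN) {
                perror("write");
                return EXIT_FAILURE;
        }

        free(buf);
        close(fd);
        return 0;
}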