Hi,

I am considering a new storage system with 30 TB of usable space and
2 GB/s sustained read/write performance in clustered mode, but I am not
able to figure out the sizing: how many OSSs, what OSTs, and what MDS.
Urgent help would be highly appreciated.

Thanks and regards,
Deval
Megan Larko" <dobsonunit at gmail.com> Subject: [Lustre-discuss] Error on restarted Lustre disk--follow-up To: Lustre User Discussion Mailing List <lustre-discuss at lists.lustre.org> Message-ID: <9e24b8301001070919q6b4a24fcm9a5d2fa72d125999 at mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Hello, Replying to my own subject line, I had to reboot the Lustre-1.6.7.1smp client for reasons completely unrelated to Lustre (very related to NFS). After the reboot, the error messages regarding the handle change went away. The Lustre disk mounted correctly and is usable after the client reboot. Just FYI, megan ------------------------------ Message: 3 Date: Thu, 7 Jan 2010 12:27:39 -0500 From: Erik Froese <erik.froese at gmail.com> Subject: Re: [Lustre-discuss] Lustre Monitoring Tools To: Jim Garlick <garlick at llnl.gov> Cc: Cliff White <Cliff.White at sun.com>, "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org> Message-ID: <f9ba03e41001070927u1888f7edh29a063703b958dc5 at mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" I have it running on 1.8. 1.1. It works well. I had to edit the SQL it generated though. Erik On Wed, Jan 6, 2010 at 2:37 PM, Jim Garlick <garlick at llnl.gov> wrote:> I''m using LMT with 1.8 on our test system and it seems to be OK. > We''re still 1.6.6 in production so it hasn''t been extensively tested with > 1.8. > > Jim > > On Wed, Jan 06, 2010 at 11:23:54AM -0800, Cliff White wrote: > > Jeffrey Bennett wrote: > > > Last time I checked, LMT was designed for Lustre 1.4. LLNL stopped > development of LMT some time ago. Not sure if LMT will work with Lustre1.8.> If somebody has tried, please let everyone know. > > > > > > > Ah, it has moved to Google: > > http://code.google.com/p/lmt/ > > > > "The current release has been tested with Lustre 1.6.6." > > So, yup, seems a bit old. But might be worth looking into. > > cliffw > > > > > jab > > > > > > > > > -----Original Message----- > > > From: lustre-discuss-bounces at lists.lustre.org [mailto: > lustre-discuss-bounces at lists.lustre.org] On Behalf Of Cliff White > > > Sent: Wednesday, January 06, 2010 11:12 AM > > > To: Jagga Soorma > > > Cc: lustre-discuss at lists.lustre.org > > > Subject: Re: [Lustre-discuss] Lustre Monitoring Tools > > > > > > Jagga Soorma wrote: > > >> Hi Guys, > > >> > > >> I would like to monitor the performance and usage of my Lustre > > >> filesystem and was wondering what are the commonly used monitoring > tools > > >> for this? Cacti? Nagios? Any input would be greatly appreciated. > > >> > > >> Regards, > > >> -Simran > > >> > > > > > > LLNL''s LMT tool is very good. It''s available on Sourceforge, afaik. 
> > > cliffw > > > > > >> > ------------------------------------------------------------------------ > > >> > > >> _______________________________________________ > > >> Lustre-discuss mailing list > > >> Lustre-discuss at lists.lustre.org > > >> http://*lists.lustre.org/mailman/listinfo/lustre-discuss > > > > > > _______________________________________________ > > > Lustre-discuss mailing list > > > Lustre-discuss at lists.lustre.org > > > http://*lists.lustre.org/mailman/listinfo/lustre-discuss > > > > _______________________________________________ > > Lustre-discuss mailing list > > Lustre-discuss at lists.lustre.org > > http://*lists.lustre.org/mailman/listinfo/lustre-discuss > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100107/3063de ce/attachment-0001.html ------------------------------ _______________________________________________ Lustre-discuss mailing list Lustre-discuss at lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss End of Lustre-discuss Digest, Vol 48, Issue 13 ********************************************** ==========================================================Privileged or confidential information may be contained in this message. If you are not the addressee indicated in this message (or responsible for delivery of the message to such person), please delete this message and kindly notify the sender by an emailed reply. Opinions, conclusions and other information in this message that do not relate to the official business of Progression and its associate entities shall be understood as neither given nor endorsed by them. ----------------------------------------------------------------------- Progression Infonet Private Limited, Gurgaon (Haryana), India Authorised dealer of PostMaster, by QuantumLink Communications Pvt. Ltd Get your free copy of PostMaster at http://www.postmaster.co.in/
Hi Deval,

Lustre storage sizing is largely driven by:

* Capacity required
* Performance required
* Type of workload

Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Let's say you
are using SATA disks for the OSTs. A Seagate enterprise 1 TB SATA disk
can do around 90 MB/s with a 1 MB block size using dd (and can go up to
110 MB/s if the block size is really large). Assuming you want RAID6
protection for each OST, you need 10 SATA disks to form an 8 TB LUN,
and four such OSTs to give you 32 TB of unformatted space.

Now let's consider performance. Ideally you should get 720 MB/s per OST
[90 MB/s/disk x 8 data disks in an (8+2) RAID6 set], but you have to
allow for the overhead of software/hardware RAID and the limits of the
SAS PCIe HCA (or FC hardware RAID HCA). A 4 Gb/s FC HCA tops out at
about 500 MB/s, so you need 5-6 FC HCAs to use the full storage
bandwidth of the 4 RAID6 OSTs [total bandwidth = 4 x 720 MB/s/OST =
2.88 GB/s].

So now you have a storage system that delivers 32 TB of unformatted
space and roughly 2.8 GB/s for a large sequential read/write workload.
If you are planning a mixed or small-I/O workload and still want to
achieve 2 GB/s of throughput, you have to double the specs. Small,
random I/O (think of home directories) kills storage performance.

Now let's size the MDS. There is no direct relation between the size of
the OSTs and that of the MDT; the MDT is sized purely by the number of
files required. It is a good idea to use FC or SAS disks for the MDS,
as they spin faster and have better IOPS performance. For example,
consider Seagate enterprise 15K rpm 300 GB SAS disks: four of them in a
RAID10 configuration give you a 600 GB MDT. Lustre needs about 4 KB of
metadata for each file created, so a 600 GB MDT can hold about 150
million files. In practice the usable number of files is often much
smaller, limited by the OST capacity and your average file size
[number of files = total OST size / average file size].

Hope this helps.

Cheers,
_Atul
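P.S. For what it is worth, here is a back-of-the-envelope sketch (in
Python) of the arithmetic above. The per-disk, per-HCA and per-file
figures are just the assumptions from this message, not measured
values, so adjust them for your actual hardware:

  # Rough Lustre sizing sketch using the assumptions above:
  # 90 MB/s per 1 TB SATA disk, 8+2 RAID6 OSTs of 8 TB each,
  # ~500 MB/s per 4 Gb/s FC HCA, ~4 KB of MDT space per file.
  import math

  disk_mb_s        = 90      # assumed streaming rate per SATA disk
  raid6_data_disks = 8       # 8 data + 2 parity per OST
  ost_size_tb      = 8       # OST size limit in Lustre 1.8.1.1
  capacity_tb      = 32      # unformatted capacity target
  fc_hca_mb_s      = 500     # practical limit of a 4 Gb/s FC HCA

  osts       = math.ceil(capacity_tb / float(ost_size_tb))   # 4 OSTs
  ost_mb_s   = disk_mb_s * raid6_data_disks                  # 720 MB/s per OST
  total_mb_s = osts * ost_mb_s                               # 2880 MB/s aggregate
  fc_hcas    = math.ceil(total_mb_s / float(fc_hca_mb_s))    # 6 HCAs

  mdt_gb    = 2 * 300                    # 4 x 300 GB SAS in RAID10
  max_files = mdt_gb * 10**9 // 4096     # ~146 million, ~150 M in round numbers

  print("%d OSTs at %d MB/s each, %.2f GB/s aggregate, %d FC HCAs"
        % (osts, ost_mb_s, total_mb_s / 1000.0, fc_hcas))
  print("MDT of %d GB holds about %d million files"
        % (mdt_gb, max_files // 10**6))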
Dear Atul,

Thanks for your valuable inputs; this also helped me understand the
fundamentals of Lustre storage sizing. I really appreciate your help.

Thanks and regards,
Deval

-----Original Message-----
From: Atul.Vidwansa at Sun.COM [mailto:Atul.Vidwansa at Sun.COM]
Sent: Friday, January 08, 2010 3:09 PM
To: Deval kulshrestha
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre Storage Sizing- How?

[ ... ]
[ ... ]

>> I am considering a new storage of 30 TB usable space with a 2
>> GB/s sustained read write performance in clustered mode.

This spec is comically vague (see below) and anyhow it is going to be
quite challenging. Some people who usually know what they are doing
(CERN) currently expect around 20 MB/s duplex transfer rate per TB, and
I know of one storage system that is getting 3-4 GB/s duplex (on a
fresh install) with around 240 1 TB drives (including overheads). If
you want to guarantee 2 GB/s sustained duplex, perhaps aiming for
3-4 GB/s is a good idea. See at the end for a similar conclusion.

So getting to 2 GB/s sustained, and duplex at that, is going to require
quite careful consideration of the particular circumstances if one
wants it done with just 30 TB worth of drives.

As to vague: one Lustre storage system I know of was initially
specified with the same level of (un)detail, and with a bit of prodding
it got a more definite target performance envelope. That is vital.

>> But not able to figure out sizing part of it like what OSS, what
>> OST and what MDS.

Partitioning space between the various types of Lustre data areas is
the least of your problems. The bigger issue is the structure of the
storage system on which Lustre runs.

>> Urgent help would be highly appreciable

People usually pay for urgent help, especially for difficult cases. You
should hire a good consultant (e.g. from Sun) who will ask you a lot of
questions.

> Hi Deval, Lustre storage sizing is largely driven by: * Capacity
> required * Performance required * Type of workload

Just about only by the capacity required. The performance required,
given the type of workload (both static, the distribution of file
sizes, and dynamic, the patterns of access), drives storage structure
more than storage sizing, and indeed later you talk about structure
without quite considering it.

Storage and Lustre filesystem structure (not mere sizing) depends
greatly on things like the size of files, sequentiality of access, size
of IO operations, number of files being concurrently worked on, number
of processes concurrently working on the same file, etc.; a list of
several of these is here:

  http://www.sabi.co.uk/blog/0804apr.html#080415

A pretty vital detail here is how many clients will be behind that
target of 2 GB/s duplex, distinctly for reading and writing. It could
be 20 clients on 1 Gb/s links writing at 100 MB/s each, and 1 analysis
client reading at the same time at 2 GB/s over multiple 10 Gb/s links,
for example.

Another interesting dimension is whether a single storage pool is
necessary or not, or just a single namespace (to some extent Lustre is
in between) with multiple pools and suitable use of mountpoints:

  http://www.sabi.co.uk/blog/0906Jun.html#090614

It is also important to know the availability requirements for the
storage system. Does the "sustained" in "2 GB/s sustained" mean for a
stretch of time or 24x7?

Someone who asks for "Urgent help" should be nice enough to provide all
these interesting aspects of the requirements to the storage consultant
they are going to hire.

> Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Lets say
> you are using SATA disks for OST. A Seagate enterprise 1TB SATA
> disk can do around 90 MB/Sec with 1 MB blocksize using dd (can go
> upto 110 MB/Sec if blocksize is really large).

Unfortunately only on the outer tracks and on a fresh filesystem.
See for example:

  https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007

  "Performance degradation on xc2
   After 6 months of production we lost half of the file system
   performance
   Problem is under investigation by HP
   We had a similar problem on xc1 which was due to fragmentation
   Current solution for defragmentation is to recreate file systems"

Note that it is not just "due to fragmentation"; even without it, as a
filesystem fills, blocks will (usually) start being allocated from the
inner tracks and thus the raw transfer rate will eventually nearly
halve:

  base# disktype /dev/sdd | grep 'device, size'
  Block device, size 931.5 GiB (1000203804160 bytes)
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=0
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 9.53604 seconds, 110 MB/s
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=950000
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 18.3843 seconds, 57.0 MB/s

Amazingly, in the "outer tracks and on a fresh filesystem" case a
modern 1 TB SATA disk with a reasonable filesystem type can do 110 MB/s
even with smallish block sizes:

  base# fdisk -l /dev/sdd | grep sdd3
  /dev/sdd3   *        2      1769    14201460   17  Hidden HPFS/NTFS
  base# mkfs.ext4 -q /dev/sdd3
  base# mount -t ext4 /dev/sdd3 /mnt
  base# dd bs=64k count=100000 conv=fsync if=/dev/zero of=/mnt/TEST
  100000+0 records in
  100000+0 records out
  6553600000 bytes (6.6 GB) copied, 59.7662 seconds, 110 MB/s

(BTW I have used 'ext4' because this is about Lustre; I usually prefer
JFS for various reasons.)

I have been quite impressed that one can get 90 MB/s "outer tracks and
on a fresh filesystem" from a contemporary low-power laptop drive:

  http://www.sabi.co.uk/blog/0906Jun.html#090605

> Assuming that you are looking for RAID6 protection for OST,
> you need 10 SATA disks to form a 8 TB lun.

Why would you assume that? (See below on formulaic approaches.) Why use
parity RAID, which is known to cause performance problems on writes
unless they are all stripe-aligned, when the only detail provided is a
target for writing? Perhaps it is a DAQ or other recording application
given the 2 GB/s goal, but perhaps it does not do large writes.

> You will need 4 such OSTs to give you 32 TB unformatted space.
> Lets consider performance:

> Ideally, you should get 720 MB/Sec/OST [ 90 MB/sec/disk X 8 data
> disks in (8+2) RAID6 set]. But you have to cater for overhead of
> software/hardware RAID and limits of SAS PCIe HCA (or FC hardware
> RAID HCA). A 4gbps FC HCA tops out at 500 MB/Sec so you need 5-6
> FC HCAs to utilize storage bandwidth of 4 RAID6 OSTs [Total
> bandwidth = 4 X 720 MB/Sec/OST = 2.8 GB/Sec].

That "4 X 720 MB/Sec" applies only if there are at least 4 stripes per
file and they get written in bulk, as you imply immediately below, or
if there are at least 4 files being written at the same time and they
end up on different OSTs. The very vague requirement for "clustered
mode" does not quite make clear which one.

> So, now you have a storage system that delivers 32 TB unformatted
> space and 2.8 GB/Sec of performance for large sequential read/write
> workload.

The read and write performance may be quite different on many workloads
because of the RAID6 (stripe alignment), even if "large sequential" is
probably going to be fine, and again only if the concurrency is just
right and on the outer tracks of a fresh filesystem.

> If you are planning to have mixed or small io workload and still
> want to achieve 2 GB/Sec throughput, you have to double the specs.

Why just double?
Why not consider other storage arrangements, like RAID10 or SSDs?

> Small, random IO (think of home directories) kills storage
> performance.

Depends on the storage system...

> Lets size MDS now.

> It is a good idea to use FC or SAS disks for MDS as they spin at
> higher rate and have better IOPS performance. For example, lets
> consider Seagate enterprise 15 K rpm 300 GB SAS disks. You can put
> 4 such SAS disks in RAID10 configuration for MDT which will give
> you 600 GB of unformatted space. [ ... ]

The MDS is another story indeed.

But I seem to detect here a formulaic approach: the Lustre "don't need
to think" recipe seems to be SAS RAID10 for metadata and SATA RAID6 for
data, and this is what is being discussed here, straight out of the
3-ring binder, without asking any further questions despite the extreme
vagueness of the target. Which is mostly better, BTW, than what a site
I know got from EMC, whose "don't need to think" formula at the time
seemed to be RAID3 of all things.

Fine (perhaps), but I have a different formulaic approach. Without
knowing all the details, and even if in some cases parity RAID does
make sense:

  http://www.sabi.co.uk/blog/0709sep.html#070923b

my "generic" formula (apparently shared by several academic sites that
use Lustre for HPC storage) is to get Sun X4500 "Thumper"s (or their
more recent equivalents) and RAID10 a bunch of disks inside them (and
then use JFS/XFS or Lustre on top, possibly with DRBD in between). With
Lustre it is then easy to aggregate them by spreading the OSTs across
multiple "Thumper"s.

In this case the goal is roughly 70 MB/s duplex sustained per TB of
storage, which is rather high, so I would use either SSDs or lots of
small fast SAS drives for data (or lots of large SATA ones with the
data partition only in the outer 1/3 of the disk, which some people
call "short stroking"). All of this depends on how big the writes are,
how big the files are, the degree of concurrency, the availability
target, and all the other important aspects of the requirements, most
importantly the read and write access patterns, since writing and then
reading implies quite a bit of head movement.

If we assume the 20 MB/s duplex rule per TB that CERN uses for bulk
storage, that translates to 100x 1 TB SATA drives, or around 200x 1 TB
drives with RAID10 (spread across 5-6 "Thumper"s). Or perhaps smaller
but higher-IOPS 15k SAS drives. Perhaps large SSDs would be a nice
idea.

But the details matter a great deal. Your mileage may vary.
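P.S. As a rough illustration of that 20 MB/s-per-TB rule of thumb, a
tiny sketch in Python; the rule and the drive figures are only the
assumptions discussed above, not guarantees:

  # Drive-count estimate from the ~20 MB/s duplex-per-TB rule of
  # thumb, for a 2 GB/s duplex target with 1 TB SATA drives.
  import math

  target_mb_s   = 2000    # 2 GB/s sustained duplex
  mb_s_per_tb   = 20      # CERN-style bulk-storage rule of thumb
  drive_size_tb = 1       # 1 TB SATA drives

  spindle_tb    = target_mb_s / float(mb_s_per_tb)            # 100 TB of spindles
  drives_raw    = int(math.ceil(spindle_tb / drive_size_tb))  # ~100 drives
  drives_raid10 = drives_raw * 2                              # mirrored -> ~200 drives

  print("Need ~%d TB of spindles: ~%d drives raw, ~%d in RAID10"
        % (spindle_tb, drives_raw, drives_raid10))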
Peter,

You have provided an excellent explanation. One thing that was clearly
missing from my answer is the number of Lustre clients required to get
2 GB/s of sustained unidirectional throughput. Lustre clients connected
via standard gigabit Ethernet get at most about 110 MB/s each, which is
almost impossible to achieve in real life; if we assume a modest
50 MB/s per client over GbE, you will need about 40 clients writing a
single 4-way striped file with a large block size to reach 2 GB/s. With
10 gigabit Ethernet you will need at least 4-5 clients to get 2 GB/s of
aggregate throughput while writing a single 4-way striped file with a
large block size. And, as Peter said, this performance is for a fresh
Lustre filesystem with only the outer platters of the disks in use.

On a different note, we have hacked the sgpdd-survey tool that comes
with Lustre-iokit to benchmark the whole disk; you can get details in
Bugzilla Bug 17218. In my experience, IO performance drops by more than
60% when the disks start using the inner platters.

Cheers,
_Atul
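P.S. The client-count arithmetic above as a quick sketch in Python; the
per-client rates are just the assumptions from this message, not
benchmark results:

  # Clients needed to reach 2 GB/s at an assumed per-client rate:
  # ~50 MB/s over GbE (conservative), ~400-500 MB/s over 10GbE.
  import math

  target_mb_s = 2000
  for link, per_client_mb_s in [("GbE", 50), ("10GbE", 450)]:
      clients = int(math.ceil(target_mb_s / float(per_client_mb_s)))
      print("%-5s at ~%d MB/s per client: about %d clients"
            % (link, per_client_mb_s, clients))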