Hi,

I am considering a new storage system with 30 TB of usable space and
2 GB/s sustained read/write performance in clustered mode, but I am not
able to figure out the sizing: how many OSSs, what OSTs, and what MDS.
Urgent help would be highly appreciated.

Thanks and regards,
Deval
Megan Larko" <dobsonunit at gmail.com> Subject: [Lustre-discuss] Error on restarted Lustre disk--follow-up To: Lustre User Discussion Mailing List <lustre-discuss at lists.lustre.org> Message-ID: <9e24b8301001070919q6b4a24fcm9a5d2fa72d125999 at mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Hello, Replying to my own subject line, I had to reboot the Lustre-1.6.7.1smp client for reasons completely unrelated to Lustre (very related to NFS). After the reboot, the error messages regarding the handle change went away. The Lustre disk mounted correctly and is usable after the client reboot. Just FYI, megan ------------------------------ Message: 3 Date: Thu, 7 Jan 2010 12:27:39 -0500 From: Erik Froese <erik.froese at gmail.com> Subject: Re: [Lustre-discuss] Lustre Monitoring Tools To: Jim Garlick <garlick at llnl.gov> Cc: Cliff White <Cliff.White at sun.com>, "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org> Message-ID: <f9ba03e41001070927u1888f7edh29a063703b958dc5 at mail.gmail.com> Content-Type: text/plain; charset="iso-8859-1" I have it running on 1.8. 1.1. It works well. I had to edit the SQL it generated though. Erik On Wed, Jan 6, 2010 at 2:37 PM, Jim Garlick <garlick at llnl.gov> wrote:> I''m using LMT with 1.8 on our test system and it seems to be OK. > We''re still 1.6.6 in production so it hasn''t been extensively tested with > 1.8. > > Jim > > On Wed, Jan 06, 2010 at 11:23:54AM -0800, Cliff White wrote: > > Jeffrey Bennett wrote: > > > Last time I checked, LMT was designed for Lustre 1.4. LLNL stopped > development of LMT some time ago. Not sure if LMT will work with Lustre1.8.> If somebody has tried, please let everyone know. > > > > > > > Ah, it has moved to Google: > > http://code.google.com/p/lmt/ > > > > "The current release has been tested with Lustre 1.6.6." > > So, yup, seems a bit old. But might be worth looking into. > > cliffw > > > > > jab > > > > > > > > > -----Original Message----- > > > From: lustre-discuss-bounces at lists.lustre.org [mailto: > lustre-discuss-bounces at lists.lustre.org] On Behalf Of Cliff White > > > Sent: Wednesday, January 06, 2010 11:12 AM > > > To: Jagga Soorma > > > Cc: lustre-discuss at lists.lustre.org > > > Subject: Re: [Lustre-discuss] Lustre Monitoring Tools > > > > > > Jagga Soorma wrote: > > >> Hi Guys, > > >> > > >> I would like to monitor the performance and usage of my Lustre > > >> filesystem and was wondering what are the commonly used monitoring > tools > > >> for this? Cacti? Nagios? Any input would be greatly appreciated. > > >> > > >> Regards, > > >> -Simran > > >> > > > > > > LLNL''s LMT tool is very good. It''s available on Sourceforge, afaik. 
> > > cliffw > > > > > >> > ------------------------------------------------------------------------ > > >> > > >> _______________________________________________ > > >> Lustre-discuss mailing list > > >> Lustre-discuss at lists.lustre.org > > >> http://*lists.lustre.org/mailman/listinfo/lustre-discuss > > > > > > _______________________________________________ > > > Lustre-discuss mailing list > > > Lustre-discuss at lists.lustre.org > > > http://*lists.lustre.org/mailman/listinfo/lustre-discuss > > > > _______________________________________________ > > Lustre-discuss mailing list > > Lustre-discuss at lists.lustre.org > > http://*lists.lustre.org/mailman/listinfo/lustre-discuss > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100107/3063de ce/attachment-0001.html ------------------------------ _______________________________________________ Lustre-discuss mailing list Lustre-discuss at lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss End of Lustre-discuss Digest, Vol 48, Issue 13 ********************************************** ==========================================================Privileged or confidential information may be contained in this message. If you are not the addressee indicated in this message (or responsible for delivery of the message to such person), please delete this message and kindly notify the sender by an emailed reply. Opinions, conclusions and other information in this message that do not relate to the official business of Progression and its associate entities shall be understood as neither given nor endorsed by them. ----------------------------------------------------------------------- Progression Infonet Private Limited, Gurgaon (Haryana), India Authorised dealer of PostMaster, by QuantumLink Communications Pvt. Ltd Get your free copy of PostMaster at http://www.postmaster.co.in/
Hi Deval,

Lustre storage sizing is largely driven by:

* Capacity required
* Performance required
* Type of workload

Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Let's say you
are using SATA disks for the OSTs. A Seagate enterprise 1 TB SATA disk
can do around 90 MB/s with a 1 MB block size using dd (and can go up to
110 MB/s if the block size is really large). Assuming you want RAID6
protection for each OST, you need 10 SATA disks to form an 8 TB LUN,
and four such OSTs to give you 32 TB of unformatted space.

Now let's consider performance. Ideally you should get 720 MB/s per OST
[90 MB/s/disk x 8 data disks in an (8+2) RAID6 set], but you have to
allow for the overhead of software/hardware RAID and the limits of the
SAS PCIe HCA (or FC hardware RAID HCA). A 4 Gb/s FC HCA tops out at
about 500 MB/s, so you need 5-6 FC HCAs to use the full storage
bandwidth of the 4 RAID6 OSTs [total bandwidth = 4 x 720 MB/s/OST =
2.88 GB/s].

So now you have a storage system that delivers 32 TB of unformatted
space and roughly 2.8 GB/s for a large sequential read/write workload.
If you are planning a mixed or small-I/O workload and still want to
achieve 2 GB/s of throughput, you have to double the specs. Small,
random I/O (think of home directories) kills storage performance.

Now let's size the MDS. There is no direct relation between the size of
the OSTs and that of the MDT; the MDT is sized purely by the number of
files required. It is a good idea to use FC or SAS disks for the MDS,
as they spin faster and have better IOPS performance. For example,
consider Seagate enterprise 15K rpm 300 GB SAS disks: four of them in a
RAID10 configuration give you a 600 GB MDT. Lustre needs about 4 KB of
metadata for each file created, so a 600 GB MDT can hold about 150
million files. In practice the usable number of files is often much
smaller, limited by the OST capacity and your average file size
[number of files = total OST size / average file size].

Hope this helps.

Cheers,
_Atul
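P.S. For what it is worth, here is a back-of-the-envelope sketch (in
Python) of the arithmetic above. The per-disk, per-HCA and per-file
figures are just the assumptions from this message, not measured
values, so adjust them for your actual hardware:

  # Rough Lustre sizing sketch using the assumptions above:
  # 90 MB/s per 1 TB SATA disk, 8+2 RAID6 OSTs of 8 TB each,
  # ~500 MB/s per 4 Gb/s FC HCA, ~4 KB of MDT space per file.
  import math

  disk_mb_s        = 90      # assumed streaming rate per SATA disk
  raid6_data_disks = 8       # 8 data + 2 parity per OST
  ost_size_tb      = 8       # OST size limit in Lustre 1.8.1.1
  capacity_tb      = 32      # unformatted capacity target
  fc_hca_mb_s      = 500     # practical limit of a 4 Gb/s FC HCA

  osts       = math.ceil(capacity_tb / float(ost_size_tb))   # 4 OSTs
  ost_mb_s   = disk_mb_s * raid6_data_disks                  # 720 MB/s per OST
  total_mb_s = osts * ost_mb_s                               # 2880 MB/s aggregate
  fc_hcas    = math.ceil(total_mb_s / float(fc_hca_mb_s))    # 6 HCAs

  mdt_gb    = 2 * 300                    # 4 x 300 GB SAS in RAID10
  max_files = mdt_gb * 10**9 // 4096     # ~146 million, ~150 M in round numbers

  print("%d OSTs at %d MB/s each, %.2f GB/s aggregate, %d FC HCAs"
        % (osts, ost_mb_s, total_mb_s / 1000.0, fc_hcas))
  print("MDT of %d GB holds about %d million files"
        % (mdt_gb, max_files // 10**6))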
Dear Atul,

Thanks for your valuable inputs; this also helped me understand the
fundamentals of Lustre storage sizing. I really appreciate your help.

Thanks and regards,
Deval

-----Original Message-----
From: Atul.Vidwansa at Sun.COM [mailto:Atul.Vidwansa at Sun.COM]
Sent: Friday, January 08, 2010 3:09 PM
To: Deval kulshrestha
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre Storage Sizing- How?

[ ... ]
[ ... ]

>> I am considering a new storage of 30 TB usable space with a 2
>> GB/s sustained read write performance in clustered mode.

This spec is comically vague (see below) and anyhow it is going to be
quite challenging. Some people who usually know what they are doing
(CERN) currently expect around 20 MB/s duplex transfer rate per TB, and
I know of one storage system that is getting 3-4 GB/s duplex (on a
fresh install) with around 240 1 TB drives (including overheads). If
you want to guarantee 2 GB/s sustained duplex, perhaps aiming for
3-4 GB/s is a good idea. See at the end for a similar conclusion.

So getting to 2 GB/s sustained, and duplex at that, is going to require
quite careful consideration of the particular circumstances if one
wants it done with just 30 TB worth of drives.

As to vague: one Lustre storage system I know of was initially
specified with the same level of (un)detail, and with a bit of prodding
it got a more definite target performance envelope. That is vital.

>> But not able to figure out sizing part of it like what OSS, what
>> OST and what MDS.

Partitioning space between the various types of Lustre data areas is
the least of your problems. The bigger issue is the structure of the
storage system on which Lustre runs.

>> Urgent help would be highly appreciable

People usually pay for urgent help, especially for difficult cases. You
should hire a good consultant (e.g. from Sun) who will ask you a lot of
questions.

> Hi Deval, Lustre storage sizing is largely driven by: * Capacity
> required * Performance required * Type of workload

Just about only by the capacity required. The performance required,
given the type of workload (both static, the distribution of file
sizes, and dynamic, the patterns of access), drives storage structure
more than storage sizing, and indeed later you talk about structure
without quite considering it.

Storage and Lustre filesystem structure (not mere sizing) depends
greatly on things like the size of files, sequentiality of access, size
of IO operations, number of files being concurrently worked on, number
of processes concurrently working on the same file, etc.; a list of
several of these is here:

  http://www.sabi.co.uk/blog/0804apr.html#080415

A pretty vital detail here is how many clients will be behind that
target of 2 GB/s duplex, distinctly for reading and writing. It could
be 20 clients on 1 Gb/s links writing at 100 MB/s each, and 1 analysis
client reading at the same time at 2 GB/s over multiple 10 Gb/s links,
for example.

Another interesting dimension is whether a single storage pool is
necessary or not, or just a single namespace (to some extent Lustre is
in between) with multiple pools and suitable use of mountpoints:

  http://www.sabi.co.uk/blog/0906Jun.html#090614

It is also important to know the availability requirements for the
storage system. Does the "sustained" in "2 GB/s sustained" mean for a
stretch of time or 24x7?

Someone who asks for "Urgent help" should be nice enough to provide all
these interesting aspects of the requirements to the storage consultant
they are going to hire.

> Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Lets say
> you are using SATA disks for OST. A Seagate enterprise 1TB SATA
> disk can do around 90 MB/Sec with 1 MB blocksize using dd (can go
> upto 110 MB/Sec if blocksize is really large).

Unfortunately only on the outer tracks and on a fresh filesystem.
See for example:

  https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007

  "Performance degradation on xc2
   After 6 months of production we lost half of the file system
   performance
   Problem is under investigation by HP
   We had a similar problem on xc1 which was due to fragmentation
   Current solution for defragmentation is to recreate file systems"

Note that it is not just "due to fragmentation"; even without it, as a
filesystem fills, blocks will (usually) start being allocated from the
inner tracks and thus the raw transfer rate will eventually nearly
halve:

  base# disktype /dev/sdd | grep 'device, size'
  Block device, size 931.5 GiB (1000203804160 bytes)
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=0
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 9.53604 seconds, 110 MB/s
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=950000
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 18.3843 seconds, 57.0 MB/s

Amazingly, in the "outer tracks and on a fresh filesystem" case a
modern 1 TB SATA disk with a reasonable filesystem type can do 110 MB/s
even with smallish block sizes:

  base# fdisk -l /dev/sdd | grep sdd3
  /dev/sdd3   *        2      1769    14201460   17  Hidden HPFS/NTFS
  base# mkfs.ext4 -q /dev/sdd3
  base# mount -t ext4 /dev/sdd3 /mnt
  base# dd bs=64k count=100000 conv=fsync if=/dev/zero of=/mnt/TEST
  100000+0 records in
  100000+0 records out
  6553600000 bytes (6.6 GB) copied, 59.7662 seconds, 110 MB/s

(BTW I have used 'ext4' because this is about Lustre; I usually prefer
JFS for various reasons.)

I have been quite impressed that one can get 90 MB/s "outer tracks and
on a fresh filesystem" from a contemporary low-power laptop drive:

  http://www.sabi.co.uk/blog/0906Jun.html#090605

> Assuming that you are looking for RAID6 protection for OST,
> you need 10 SATA disks to form a 8 TB lun.

Why would you assume that? (See below on formulaic approaches.) Why use
parity RAID, which is known to cause performance problems on writes
unless they are all stripe-aligned, when the only detail provided is a
target for writing? Perhaps it is a DAQ or other recording application
given the 2 GB/s goal, but perhaps it does not do large writes.

> You will need 4 such OSTs to give you 32 TB unformatted space.
> Lets consider performance:

> Ideally, you should get 720 MB/Sec/OST [ 90 MB/sec/disk X 8 data
> disks in (8+2) RAID6 set]. But you have to cater for overhead of
> software/hardware RAID and limits of SAS PCIe HCA (or FC hardware
> RAID HCA). A 4gbps FC HCA tops out at 500 MB/Sec so you need 5-6
> FC HCAs to utilize storage bandwidth of 4 RAID6 OSTs [Total
> bandwidth = 4 X 720 MB/Sec/OST = 2.8 GB/Sec].

That "4 X 720 MB/Sec" applies only if there are at least 4 stripes per
file and they get written in bulk, as you imply immediately below, or
if there are at least 4 files being written at the same time and they
end up on different OSTs. The very vague requirement for "clustered
mode" does not quite make clear which one.

> So, now you have a storage system that delivers 32 TB unformatted
> space and 2.8 GB/Sec of performance for large sequential read/write
> workload.

The read and write performance may be quite different on many workloads
because of the RAID6 (stripe alignment), even if "large sequential" is
probably going to be fine, and again only if the concurrency is just
right and on the outer tracks of a fresh filesystem.

> If you are planning to have mixed or small io workload and still
> want to achieve 2 GB/Sec throughput, you have to double the specs.

Why just double?
Why not consider other storage arrangements, like RAID10 or SSDs?

> Small, random IO (think of home directories) kills storage
> performance.

Depends on the storage system...

> Lets size MDS now.

> It is a good idea to use FC or SAS disks for MDS as they spin at
> higher rate and have better IOPS performance. For example, lets
> consider Seagate enterprise 15 K rpm 300 GB SAS disks. You can put
> 4 such SAS disks in RAID10 configuration for MDT which will give
> you 600 GB of unformatted space. [ ... ]

The MDS is another story indeed.

But I seem to detect here a formulaic approach: the Lustre "don't need
to think" recipe seems to be SAS RAID10 for metadata and SATA RAID6 for
data, and this is what is being discussed here, straight out of the
3-ring binder, without asking any further questions despite the extreme
vagueness of the target. Which is mostly better, BTW, than what a site
I know got from EMC, whose "don't need to think" formula at the time
seemed to be RAID3 of all things.

Fine (perhaps), but I have a different formulaic approach. Without
knowing all the details, and even if in some cases parity RAID does
make sense:

  http://www.sabi.co.uk/blog/0709sep.html#070923b

my "generic" formula (apparently shared by several academic sites that
use Lustre for HPC storage) is to get Sun X4500 "Thumper"s (or their
more recent equivalents) and RAID10 a bunch of disks inside them (and
then use JFS/XFS or Lustre on top, possibly with DRBD in between). With
Lustre it is then easy to aggregate them by spreading the OSTs across
multiple "Thumper"s.

In this case the goal is roughly 70 MB/s duplex sustained per TB of
storage, which is rather high, so I would use either SSDs or lots of
small fast SAS drives for data (or lots of large SATA ones with the
data partition only in the outer 1/3 of the disk, which some people
call "short stroking"). All of this depends on how big the writes are,
how big the files are, the degree of concurrency, the availability
target, and all the other important aspects of the requirements, most
importantly the read and write access patterns, since writing and then
reading implies quite a bit of head movement.

If we assume the 20 MB/s duplex rule per TB that CERN uses for bulk
storage, that translates to 100x 1 TB SATA drives, or around 200x 1 TB
drives with RAID10 (spread across 5-6 "Thumper"s). Or perhaps smaller
but higher-IOPS 15k SAS drives. Perhaps large SSDs would be a nice
idea.

But the details matter a great deal. Your mileage may vary.
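P.S. As a rough illustration of that 20 MB/s-per-TB rule of thumb, a
tiny sketch in Python; the rule and the drive figures are only the
assumptions discussed above, not guarantees:

  # Drive-count estimate from the ~20 MB/s duplex-per-TB rule of
  # thumb, for a 2 GB/s duplex target with 1 TB SATA drives.
  import math

  target_mb_s   = 2000    # 2 GB/s sustained duplex
  mb_s_per_tb   = 20      # CERN-style bulk-storage rule of thumb
  drive_size_tb = 1       # 1 TB SATA drives

  spindle_tb    = target_mb_s / float(mb_s_per_tb)            # 100 TB of spindles
  drives_raw    = int(math.ceil(spindle_tb / drive_size_tb))  # ~100 drives
  drives_raid10 = drives_raw * 2                              # mirrored -> ~200 drives

  print("Need ~%d TB of spindles: ~%d drives raw, ~%d in RAID10"
        % (spindle_tb, drives_raw, drives_raid10))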
Peter,

You have provided an excellent explanation. One thing that was clearly
missing from my answer is the number of Lustre clients required to get
2 GB/s of sustained unidirectional throughput. Lustre clients connected
via standard gigabit Ethernet get at most about 110 MB/s each, which is
almost impossible to achieve in real life; if we assume a modest
50 MB/s per client over GbE, you will need about 40 clients writing a
single 4-way striped file with a large block size to reach 2 GB/s. With
10 gigabit Ethernet you will need at least 4-5 clients to get 2 GB/s of
aggregate throughput while writing a single 4-way striped file with a
large block size. And, as Peter said, this performance is for a fresh
Lustre filesystem with only the outer platters of the disks in use.

On a different note, we have hacked the sgpdd-survey tool that comes
with Lustre-iokit to benchmark the whole disk; you can get details in
Bugzilla Bug 17218. In my experience, IO performance drops by more than
60% when the disks start using the inner platters.

Cheers,
_Atul
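P.S. The client-count arithmetic above as a quick sketch in Python; the
per-client rates are just the assumptions from this message, not
benchmark results:

  # Clients needed to reach 2 GB/s at an assumed per-client rate:
  # ~50 MB/s over GbE (conservative), ~400-500 MB/s over 10GbE.
  import math

  target_mb_s = 2000
  for link, per_client_mb_s in [("GbE", 50), ("10GbE", 450)]:
      clients = int(math.ceil(target_mb_s / float(per_client_mb_s)))
      print("%-5s at ~%d MB/s per client: about %d clients"
            % (link, per_client_mb_s, clients))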