Klaus Steden
2008-Jan-14 20:41 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Hi there,

I was asked by a friend of a business contact of mine the other day to share some information about Lustre; it seems he's planning to build what will eventually be about a 3 PB file system.

The CFS website doesn't appear to have any information on field deployments worth bragging about, so I figured I'd ask, just for fun. Does anyone know:

- the size of the largest working Lustre file system currently in the field?
- the highest sustained number of IOPS seen with Lustre, and what the backend was?

cheers,
Klaus
D. Marc Stearman
2008-Jan-14 21:36 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Klaus,

We currently have a 1.2 PB Lustre filesystem that we will be expanding to 2.4 PB in the near future. I'm not sure about the highest sustained IOPS, but we did have a user peak at 19 GB/s to one of our 500 TB filesystems recently. The backend for that was 16 DDN 8500 couplets with write-cache turned OFF.

-Marc

----
D. Marc Stearman
LC Lustre Systems Administrator
marc at llnl.gov
925.423.9670
Pager: 1.888.203.0641
D. Marc Stearman
2008-Jan-14 21:41 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Sorry, I confused myself. We have a 1.2 PB filesystem that will remain that size. We will be expanding our 958 TB filesystem to ~2.4 PB in the near future.

-Marc

----
D. Marc Stearman
LC Lustre Systems Administrator
marc at llnl.gov
925.423.9670
Pager: 1.888.203.0641
Canon, Richard Shane
2008-Jan-14 21:48 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Klaus,

Here are some that I know are pretty large:

* RedStorm - I think it has two roughly 50 GB/s file systems. The capacity may not be quite as large, though. I think they used FC drives. It was DDN 8500, although that may have changed.
* CEA - I think they have a file system approaching 100 GB/s. I think it is DDN 9550. Not sure about the capacities.
* TACC has a large Thumper-based system. Not sure of the specs.
* ORNL - We have a 44 GB/s file system with around 800 TB of total capacity. That is DDN 9550. We also have two new file systems (20 GB/s and 10 GB/s, on LSI XBB2 and DDN 9550 respectively). Those have around 800 TB each (after RAID6).
* We are planning a 200 GB/s, roughly 10 PB file system now.

--Shane
Kennedy, Jeffrey
2008-Jan-14 21:55 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Any specs on IOPS rather than throughput?

Thanks.

Jeff Kennedy
QCT Engineering Compute
858-651-6592
Canon, Richard Shane
2008-Jan-14 23:11 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Jeff,

I'm not aware of any. For parallel file systems the focus is usually on bandwidth rather than IOPS.

--Shane
mlbarna
2008-Jan-16 17:43 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Could you elaborate on the benchmarking application(s) that produced these bandwidth numbers? I have a particular interest in MPI-coded programs that perform collective I/O. In discussions I find this topic sometimes confused; what I mean is streamed, appending output in which all the data from all the processors goes into a single, atomic write operation filling disjoint sections of the same file. In MPI-IO, the MPI_File_write_all* family defines my focus area, run with or without two-phase aggregation. Imitating the operation with simple POSIX I/O is acceptable, as far as I am concerned.

In tests on redstorm last year, I appended to a single open file at a rate of 26 GB/s. I had to use exceptional parameters to achieve this, however: the file had an lfs stripe count of 160, and I sent a 20 MB buffer from each rank of a 160-processor job, for an aggregate of 3.2 GB per write_all operation. I consider this configuration outside the range of any normal usage.

I believe that a faster rate could be achieved by a similar program that wrote independently -- that is, one file per processor -- such as via NetCDF. For this case, I would set the lfs stripe count down to one.

Marty Barnaby
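For readers less familiar with the MPI_File_write_all* family, a minimal sketch of the access pattern described above (each rank contributing one disjoint 20 MB block to a shared file through a single collective call) might look like the following. The file path, buffer contents, and lack of error checking are placeholders for illustration, not the actual Red Storm test code:

    #include <mpi.h>
    #include <stdlib.h>

    #define CHUNK (20 * 1024 * 1024)    /* 20 MB contributed by each rank */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(CHUNK);            /* payload omitted; contents do not matter here */

        /* All ranks open the same (hypothetical) Lustre file. */
        MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/shared_dump",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank owns a disjoint CHUNK-sized region of the shared file,
         * and all ranks write their region in one collective operation. */
        offset = (MPI_Offset)rank * CHUNK;
        MPI_File_write_at_all(fh, offset, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }

To reproduce the striping described, the file would also need to be created with a wide stripe count beforehand (roughly, lfs setstripe with a stripe count of 160 on the target file or directory), so that the 160 writers land on distinct OSTs.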
Canon, Richard Shane
2008-Jan-30 21:09 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Marty,

Our benchmark measurements were made using IOR doing POSIX I/O to a single shared file (I believe).

Since you mentioned MPI-IO... Weikuan Yu (at ORNL) has done some work to improve the MPI-IO Lustre ADIO driver. Also, we have been sponsoring work through a Lustre Centre of Excellence to further improve the ADIO driver. I'm optimistic that this can make collective I/O perform at the level one would expect. File-per-process runs often do run faster, up until the metadata activity associated with creating 10k+ files starts to slow things down. I'm a firm believer that collective I/O through libraries like MPI-IO, HDF5, and pNetCDF is the way things should move. It should be possible to embed enough intelligence in these middle layers to do good stripe alignment, automatically tune stripe counts and widths, etc. Some of this will hopefully be accomplished with the improvements being made to the ADIO driver.

--Shane
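As a rough illustration of what "intelligence in the middle layers" can look like from the application side today, ROMIO-based MPI-IO implementations accept striping hints at file-open time; whether a given Lustre ADIO driver honors them depends on the version, so treat the hint names below (striping_factor, striping_unit) as an assumption rather than a guaranteed interface:

    #include <mpi.h>

    /* Open a shared file while asking the MPI-IO layer to create it with a
     * given Lustre stripe count and stripe size. Hints the underlying ADIO
     * driver does not recognize are silently ignored. */
    MPI_File open_with_striping_hints(const char *path)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "160");    /* stripe count */
        MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripe size */

        MPI_File_open(MPI_COMM_WORLD, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }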
Weikuan Yu
2008-Jan-31 17:13 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
I would throw in some of my experience for discussion, since Shane mentioned my name here :)

(1) First, I am not under the impression that collective I/O is designed to reveal the peak performance of a particular system. There are publications claiming that collective I/O might be a preferred case for some particular architectures, e.g., for BG/P (HPCA06), but I am not fully positive or clear on the context of that claim.

(2) With the intended disjoint access as Marty mentioned, using collective I/O would only add a phase of coordination before each process streams its data to the file system.

(3) The test Marty describes below seems small, both in terms of process count and the amount of data. Aside from whether this is good for revealing the system's best performance, I can confirm that there are applications at ORNL with much larger process counts and bigger data volumes from each process.

> In tests on redstorm last year, I appended to a single open file at a rate
> of 26 GB/s. I had to use exceptional parameters to achieve this, however:
> the file had an lfs stripe count of 160, and I sent a 20 MB buffer from
> each rank of a 160-processor job, for an aggregate of 3.2 GB per write_all
> operation. I consider this configuration outside the range of any normal
> usage.

IIRC, something regarding collective I/O has been discussed earlier, also with Marty (?).

In the upcoming IPDPS08 conference, ORNL has two papers on the I/O performance of Jaguar, so you may find the numbers interesting. The final versions of the papers should be available from me or Mark Fahey (the author of the other paper).

--Weikuan
Marty Barnaby
2008-Jan-31 19:25 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
I concur that at 160 my processor count was low. At the time I had access to as many as 1000 processors on our big Cray XT3, Redstorm, but now it is not available to me at all. Through my trials I found that, for appending to a single shared file, matching the lfs maximum stripe count of 160 with an equivalent job size was the only combination that got me up to this rate. My interest in this case ends here, because real usage involves processor counts in the thousands.

I'm not certain about the distinctions between collective I/O and shared files. It seems the latter is to be determined by the application authors and their users. With, for instance, a mesh-type application there might be trade-offs; but having, at least, all the data for one computed field or dynamic, for one time step that is saved to the output (we usually call these dumps), stored in the same file is regularly more advantageous for the entire work cycle. Actually, having all the output data in a single file seems like the most desirable approach.

Though MPI-IO independent write operations can be on the lowest level, mustn't there still be some global coordination to determine where each processor's chunk of a communicator's complete vector of values is written, respectively, in a shared file? Further, what we now call two-phase aggregation, as a means of turning many tiny block-size writes into a small number of large ones to leverage the greater efficiency of the FS at responding to this type of activity, has potential benefits. However, I've seen the considerable ROMIO provisions for this technique used incorrectly, delivering a decrease in performance.

Marty Barnaby
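To make the coordination question concrete: even with independent (non-collective) MPI-IO writes to a shared file, the ranks typically agree on disjoint offsets up front, for example with an exclusive prefix sum over their chunk sizes. A minimal sketch with a hypothetical helper (assuming each chunk fits in an int count):

    #include <mpi.h>

    /* Write one rank's chunk into a shared file at an offset computed from
     * the sizes of all lower-ranked chunks. The coordination is a single
     * MPI_Exscan; the write itself is independent (no collective buffering). */
    void write_shared_chunk(MPI_File fh, const void *chunk, long long nbytes)
    {
        int rank;
        long long my_offset = 0;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Exclusive prefix sum: rank r receives the total bytes of ranks 0..r-1. */
        MPI_Exscan(&nbytes, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            my_offset = 0;   /* MPI_Exscan leaves rank 0's result undefined */

        MPI_File_write_at(fh, (MPI_Offset)my_offset, (void *)chunk,
                          (int)nbytes, MPI_BYTE, MPI_STATUS_IGNORE);
    }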
Tom.Wang
2008-Jan-31 21:11 UTC
[Lustre-discuss] Off-topic: largest existing Lustre file system?
Marty Barnaby wrote:
> Further, what we now call two-phase aggregation, as a means of turning
> many tiny block-size writes into a small number of large ones to leverage
> the greater efficiency of the FS at responding to this type of activity,
> has potential benefits. However, I've seen the considerable ROMIO
> provisions for this technique used incorrectly, delivering a decrease in
> performance.

The collective I/O layer should reorganize the data among the I/O nodes into some kind of pattern the underlying file system prefers, IMHO. Lustre is sometimes sensitive to I/O size (or alignment), especially on redstorm: since there is no client cache, the client throws the I/O requests at the server as soon as it gets them from the application. So collective I/O is the only layer in which I/O requests might be optimized on the client, apart from the application itself.

But the current two-phase aggregation does not take much file-system-specific information into account. For example, it splits data evenly across all the I/O clients instead of considering the stripe size (alignment) and which OST the data goes to (sometimes an even client I/O load != an even OST I/O load); it sometimes makes improper use of read-modify-write in data sieving; and it does no collection for non-interleaved data. Our current experience is to be careful with collective I/O and data sieving unless we are very clear about how the ADIO driver will reorganize the data. For example, for the HDF5 library, the mpiposix layer (no collective I/O or data sieving) is favorable. But I do agree with Shane: collective I/O with HDF5 and pNetCDF should be the way forward.

WangDi
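For anyone who wants to experiment with the behaviors WangDi describes, ROMIO exposes collective buffering (two-phase aggregation) and data sieving as per-file info hints. A minimal sketch, assuming a ROMIO-based MPI-IO with the usual hint names (romio_cb_write, romio_ds_write); hint handling varies by implementation and version:

    #include <mpi.h>

    /* Open a file for writing with two-phase aggregation and write-side data
     * sieving switched off, so each rank's requests go to the file system as
     * issued. Unrecognized hints are ignored by the MPI-IO layer. */
    MPI_File open_without_aggregation(const char *path)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "disable");  /* no collective buffering on writes */
        MPI_Info_set(info, "romio_ds_write", "disable");  /* no read-modify-write data sieving */

        MPI_File_open(MPI_COMM_WORLD, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }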