Marty Barnaby
2007-Jul-22 16:09 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
I'm attempting to establish an absolute maximum byte-rate performance value, running a bare-bones MPI_File_write_at_all benchmark program, for our Cray XT3 installation, RedStorm, here at Sandia National Laboratories. Processor time is at a premium, and I only run in the standard queue, so I'm not able to do everything I would imagine, though maybe what I can run is adequate.

I have a directory under our Lustre, redstorm:/scratch_grande, which I have defined with:

lfs setstripe -1 0 -1

Though there are 320 OSTs comprising the FS, these defaults give me a stripe_count of 160 (I'm sure someone could explain that), and I don't know the stripe_size. With a job of 160 processors, each of which has a contiguous chunk of 20 MB of memory to append to an open file in an iterative series of singular, atomic write_at_all operations, I can normally average 25 GB/s. To curb any confusion here, that represents only an experimental maximum to me; none of our many complex science and engineering simulation applications perform their output dumping with per-processor blocks as large as a single MB.

I would like any succinct suggestions on explicitly setting my lfs stripe_size, given the configuration and parameters I've mentioned here, to optimize it and perhaps see a decrease in the time spent storing my data on the FS.

Thank you,
Marty Barnaby
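[To make the access pattern above concrete, here is a minimal sketch of a shared-file collective-write benchmark of this kind. It is not the actual test program discussed in this thread ('rb', described later); the file path, record size, and iteration count are illustrative assumptions.]

/* Minimal sketch of a shared-file collective write benchmark:
 * each rank appends one contiguous 20 MB record per iteration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RECORD_BYTES (20 * 1024 * 1024)   /* per-rank contiguous block */
#define NRECS        10                   /* number of write iterations */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(RECORD_BYTES);           /* contents do not matter for a bandwidth test */

    /* Hypothetical path on the striped scratch directory. */
    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < NRECS; i++) {
        /* Each rank writes its block at a disjoint, explicit offset,
           so every iteration appends one stripe of nprocs records. */
        MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * RECORD_BYTES;
        MPI_File_write_at_all(fh, off, buf, RECORD_BYTES, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0)
        printf("%.2f GB/s\n",
               (double)NRECS * nprocs * RECORD_BYTES / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}

[With 160 ranks and 20 MB records, each collective call moves about 3.2 GB; the reported rate is simply total bytes written divided by wall time across all iterations.]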
There is a new Lustre ADIO driver (bug 12521) with some optimizations for collective writes (MPI_File_write_at_all). If you are interested, you can try it. Could you please tell us more about this benchmark program? For example, its I/O pattern, whether each client writes to a shared or a separate file, and how many clients take part in the I/O?

Thanks

--
Regards,
Tom Wangdi
Cluster File Systems, Inc
Software Engineer
http://www.clusterfs.com
How much data will each client write in each I/O? With this driver, you can specify stripe_size, stripe_count, and stripe index through MPI_Info_set. You may need --with-file-system=lustre when building the driver with MPICH2, so that it does not use the UFS ADIO driver, and you may need the "lustre:" prefix before your file name so that it uses this driver. (CC to Weikuan, who can provide more details.)

Thanks
WangDi

Marty Barnaby wrote:
> In this case I'm running what I think is the simplest collective I/O
> case. All processors are writing their respective buffers,
> concurrently, to the same file, though not with two-phase aggregation
> or any other special hints. I'm not aware of any non-POSIX Lustre
> API. What will the ADIO driver have beyond the standard UFS options?
>
> MLB

--
Regards,
Tom Wangdi
Cluster File Systems, Inc
Software Engineer
http://www.clusterfs.com
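[For reference, here is a sketch of how striping parameters can be passed at file-open time through MPI_Info, as described above. The hint names shown ("striping_unit", "striping_factor", "start_iodevice") are the conventional ROMIO names; whether this particular Lustre driver honors exactly these keys is an assumption, and the path in the usage comment is hypothetical.]

/* Sketch: open a shared file with striping hints and the "lustre:"
 * prefix so the Lustre ADIO driver (rather than UFS) handles it. */
#include <mpi.h>

MPI_File open_striped(char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_unit",   "2097152");  /* stripe_size: 2 MB */
    MPI_Info_set(info, "striping_factor", "160");      /* stripe_count */
    MPI_Info_set(info, "start_iodevice",  "0");        /* stripe index */

    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}

/* Usage (path is hypothetical):
 *   MPI_File fh = open_striped("lustre:/scratch_grande/mlbarna/testfile");
 */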
On Sun, 2007-07-22 at 16:09 -0600, Marty Barnaby wrote:
> Though there are 320 OSTs comprising the FS, these defaults give me a
> stripe_count of 160 (I'm sure someone could explain that), and I don't
> know the stripe_size.

Lustre can currently stripe over a maximum of 160 OSTs. (This is a limitation due to the maximum extended attribute size available on the MDT.) The default stripe_size can be seen at /proc/fs/lustre/lov/lustre-mdtlov/stripesize.

Thanks,
Kalpak.
If the test is only meant to do 20 MB per process, with no two-phase collective I/O, then there is not much benefit to using MPI-IO, so I don't quite see the point of MPI_File_write_at_all here. In any case, the Cray XT is a somewhat different story; other folks may be better placed to suggest striping parameters for this workload on RedStorm.

--Weikuan
Nathaniel Rutman
2007-Jul-23 13:50 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
Marty Barnaby wrote:
> I would like any succinct suggestions on explicitly setting my lfs
> stripe_size, given the configuration and parameters I've mentioned
> here, to optimize it and perhaps see a decrease in the time spent
> storing my data on the FS.

Try setting your stripe_size to 20 MB. As Kalpak mentioned, we currently have a limit of 160 OSTs for any one file (although, of course, there are plans to remove this limitation soon).

Would you mind posting your test program? I can imagine others (besides me) might be interested in such experimental maximums.
Nathaniel Rutman
2007-Jul-24 15:09 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
Marty Barnaby wrote:
> I made a new directory and set the parameters as you suggest. I
> verified them by touching a file, then checking it with 'lfs getstripe
> -v', which returned:
>
> /scratch_grande/mlbarna/max.ss20MB/t
> lmm_magic:          0x0BD10BD0
> lmm_object_gr:      0
> lmm_object_id:      0x501819f
> lmm_stripe_count:   160
> lmm_stripe_size:    20971520
> lmm_stripe_pattern: 1
>
> Since yesterday, I got a dozen runs each for the default stripe_size
> directory (which is 2 MB) and for the new 20 MB stripe_size, with an
> allocation of 160 client processes. The 20 MB stripe_size case was
> slower by 10%. Am I mistaken in my understanding that, in the faster
> case, which I've run many times this week, the individual
> per-processor writes of 20 MB are distributed across 10 of the 2 MB
> stripes of the 160 OSTs in the stripe_count of my file?

Correct.

> If so, and correct my math or my understanding if I am not seeing this
> right, each OST is responding to write requests in one
> MPI_File_write_at_all operation from 10 separate clients, and doing it
> faster than when there is just one client and the OST stripe_size is
> set to 20 MB.

Kind of a series-vs-parallel thing: 1 client writing to 10 OSTs 10 times in series, or 10 clients each writing to 1 OST in parallel. If you're still tweaking, you might try numbers in between - 1 MB stripe size, 4 MB, 8 MB, 10 MB. Might be fun to plot.

> For my benchmark program, I am running a fairly simple item, left over
> as a legacy from when the DOE ASCI had a project to create a standard
> data format. The package became known as SAF (Sets and Fields), and
> the lower layers were HDF5 and, ultimately, MPI-IO in collective _all
> calls. A few years ago, in a bid to increase throughput, the HDF5
> project implemented an optional POSIX virtual file driver, because it
> was imagined that MPI-IO might be an impediment.
>
> Rob Matzke authored the testing client, which he named 'rb', with some
> tricks I like to leverage for the various approaches for which I have
> agendas, including the ability to choose the layer, which can be HDF5
> via MPI-IO or the POSIX virtual file driver, MPI-IO write_all, or
> plain POSIX, with a simple design for collective, strided writing. For
> the main activity, it is a plain loop, iterating --nrecs <number>
> times with a --record <size> buffer per processor, calling the API
> routine appropriate to the library level chosen.
>
> In my community, IOR is the most commonly accepted benchmark program.
> Though rb isn't as transparent as some of the absolutely
> johnny-one-note executables I have created for my own information in
> the past, I find IOR overly convoluted. I have seen several cases
> where people ran IOR without really understanding all the parameters
> they were getting.
>
> How would I post 'rb'? I could merely send you a .tar.gz?

If it's not huge, post it to the list? Or we could stick it on a wiki page.