Marty Barnaby
2007-Jul-22 16:09 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
I'm attempting to establish an absolute maximum byte-rate performance value, running a bare-bones MPI_File_write_at_all benchmark program, for our Cray XT3 installation, RedStorm, here at Sandia National Laboratories. Processor time is at a premium, and I only run in the standard queue, so I'm not able to do everything I would imagine, though maybe what I can run is adequate.

I have a directory under our Lustre, redstorm:/scratch_grande, which I have defined with:

lfs setstripe -1 0 -1

Though there are 320 OSTs comprising the FS, these defaults give me a stripe_count of 160 (I'm sure someone could explain that), and I don't know the stripe_size. With a job of 160 processors, each of which has a contiguous chunk of 20 MB of memory to append to an open file in an iterative series of singular, atomic write_at_all operations, I can normally average 25 GB/s. To curb any confusion here, that represents only an experimental maximum to me; none of our many complex science and engineering simulation applications perform their output dumping with per-processor blocks as large as a single MB.

I would like any succinct suggestions on explicitly setting my lfs stripe_size, given the configuration and parameters I've mentioned here, to optimize it and perhaps see a decrease in the time spent storing my data on the FS.

Thank you,
Marty Barnaby
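[To make the access pattern above concrete, here is a minimal sketch of a shared-file collective-write benchmark of this kind. It is not the actual test program discussed in this thread ('rb', described later); the file path, record size, and iteration count are illustrative assumptions.]

/* Minimal sketch of a shared-file collective write benchmark:
 * each rank appends one contiguous 20 MB record per iteration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RECORD_BYTES (20 * 1024 * 1024)   /* per-rank contiguous block */
#define NRECS        10                   /* number of write iterations */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(RECORD_BYTES);           /* contents do not matter for a bandwidth test */

    /* Hypothetical path on the striped scratch directory. */
    MPI_File_open(MPI_COMM_WORLD, "/scratch_grande/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < NRECS; i++) {
        /* Each rank writes its block at a disjoint, explicit offset,
           so every iteration appends one stripe of nprocs records. */
        MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * RECORD_BYTES;
        MPI_File_write_at_all(fh, off, buf, RECORD_BYTES, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    MPI_File_close(&fh);
    if (rank == 0)
        printf("%.2f GB/s\n",
               (double)NRECS * nprocs * RECORD_BYTES / (t1 - t0) / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}

[With 160 ranks and 20 MB records, each collective call moves about 3.2 GB; the reported rate is simply total bytes written divided by wall time across all iterations.]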
There is a new Lustre ADIO driver (bug 12521) with some optimizations for collective writes (MPI_File_write_at_all). If you are interested, you can try it. Could you please tell us more about this benchmark program? For example, its I/O pattern, whether each client writes to a shared or a separate file, and how many clients take part in the I/O?

Thanks

--
Regards,
Tom Wangdi
Cluster File Systems, Inc
Software Engineer
http://www.clusterfs.com
How much data will each client write in each I/O? With this driver, you can specify stripe_size, stripe_count, and stripe index through MPI_Info_set. You may need --with-file-system=lustre when building the driver with MPICH2, so that it does not use the UFS ADIO driver, and you may need the "lustre:" prefix before your file name so that it uses this driver. (CC to Weikuan, who can provide more details.)

Thanks
WangDi

Marty Barnaby wrote:
> In this case I'm running what I think is the simplest collective I/O
> case. All processors are writing their respective buffers,
> concurrently, to the same file, though not with two-phase aggregation
> or any other special hints. I'm not aware of any non-POSIX Lustre
> API. What will the ADIO driver have beyond the standard UFS options?
>
> MLB

--
Regards,
Tom Wangdi
Cluster File Systems, Inc
Software Engineer
http://www.clusterfs.com
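[For reference, here is a sketch of how striping parameters can be passed at file-open time through MPI_Info, as described above. The hint names shown ("striping_unit", "striping_factor", "start_iodevice") are the conventional ROMIO names; whether this particular Lustre driver honors exactly these keys is an assumption, and the path in the usage comment is hypothetical.]

/* Sketch: open a shared file with striping hints and the "lustre:"
 * prefix so the Lustre ADIO driver (rather than UFS) handles it. */
#include <mpi.h>

MPI_File open_striped(char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_unit",   "2097152");  /* stripe_size: 2 MB */
    MPI_Info_set(info, "striping_factor", "160");      /* stripe_count */
    MPI_Info_set(info, "start_iodevice",  "0");        /* stripe index */

    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}

/* Usage (path is hypothetical):
 *   MPI_File fh = open_striped("lustre:/scratch_grande/mlbarna/testfile");
 */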
On Sun, 2007-07-22 at 16:09 -0600, Marty Barnaby wrote:
> Though there are 320 OSTs comprising the FS, these defaults give me a
> stripe_count of 160 (I'm sure someone could explain that), and I don't
> know the stripe_size.

Lustre can currently stripe over a maximum of 160 OSTs. (This is a limitation due to the maximum extended attribute size available on the MDT.) The default stripe_size can be seen at /proc/fs/lustre/lov/lustre-mdtlov/stripesize.

Thanks,
Kalpak.
If the test is only meant to do 20 MB per process, with no two-phase collective I/O, then there is not much benefit to using MPI-IO, so I don't quite see the point of MPI_File_write_at_all here. In any case, the Cray XT is a somewhat different story; other folks may be better placed to suggest striping parameters for this workload on RedStorm.

--Weikuan
Nathaniel Rutman
2007-Jul-23 13:50 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
Marty Barnaby wrote:
> I would like any succinct suggestions on explicitly setting my lfs
> stripe_size, given the configuration and parameters I've mentioned
> here, to optimize it and perhaps see a decrease in the time spent
> storing my data on the FS.

Try setting your stripe_size to 20 MB. As Kalpak mentioned, we currently have a limit of 160 OSTs for any one file (although, of course, there are plans to remove this limitation soon).

Would you mind posting your test program? I can imagine others (besides me) might be interested in such experimental maximums.
Nathaniel Rutman
2007-Jul-24 15:09 UTC
[Lustre-discuss] lfs setstripe stripe_size optimization
Marty Barnaby wrote:
> I made a new directory and set the parameters as you suggest. I
> verified them by touching a file, then checking it with 'lfs getstripe
> -v', which returned:
>
> /scratch_grande/mlbarna/max.ss20MB/t
> lmm_magic:          0x0BD10BD0
> lmm_object_gr:      0
> lmm_object_id:      0x501819f
> lmm_stripe_count:   160
> lmm_stripe_size:    20971520
> lmm_stripe_pattern: 1
>
> Since yesterday, I got a dozen runs each for the default stripe_size
> directory (which is 2 MB) and for the new 20 MB stripe_size, with an
> allocation of 160 client processes. The 20 MB stripe_size case was
> slower by 10%. Am I mistaken in my understanding that, in the faster
> case, which I've run many times this week, the individual
> per-processor writes of 20 MB are distributed across 10 of the 2 MB
> stripes of the 160 OSTs in the stripe_count of my file?

Correct.

> If so, and correct my math or my understanding if I am not seeing this
> right, each OST is responding to write requests in one
> MPI_File_write_at_all operation from 10 separate clients, and doing it
> faster than when there is just one client and the OST stripe_size is
> set to 20 MB.

Kind of a series-vs-parallel thing: 1 client writing to 10 OSTs 10 times in series, or 10 clients each writing to 1 OST in parallel. If you're still tweaking, you might try numbers in between - 1 MB stripe size, 4 MB, 8 MB, 10 MB. Might be fun to plot.

> For my benchmark program, I am running a fairly simple item, left over
> as a legacy from when the DOE ASCI had a project to create a standard
> data format. The package became known as SAF (Sets and Fields), and
> the lower layers were HDF5 and, ultimately, MPI-IO in collective _all
> calls. A few years ago, in a bid to increase throughput, the HDF5
> project implemented an optional POSIX virtual file driver, because it
> was imagined that MPI-IO might be an impediment.
>
> Rob Matzke authored the testing client, which he named 'rb', with some
> tricks I like to leverage for the various approaches for which I have
> agendas, including the ability to choose the layer, which can be HDF5
> via MPI-IO or the POSIX virtual file driver, MPI-IO write_all, or
> plain POSIX, with a simple design for collective, strided writing. For
> the main activity, it is a plain loop, iterating --nrecs <number>
> times with a --record <size> buffer per processor, calling the API
> routine appropriate to the library level chosen.
>
> In my community, IOR is the most commonly accepted benchmark program.
> Though rb isn't as transparent as some of the absolutely
> johnny-one-note executables I have created for my own information in
> the past, I find IOR overly convoluted. I have seen several cases
> where people ran IOR without really understanding all the parameters
> they were getting.
>
> How would I post 'rb'? I could merely send you a .tar.gz?

If it's not huge, post it to the list? Or we could stick it on a wiki page.