Hello all,

I'm in the process of evaluating Lustre for a project I'm working on, and I'd like to ask for some advice on tuning my configuration for better performance. For my evaluation work, I've got one MGS/MDS and four OSSes, each hosting one OST. This storage cluster was put together using some spare nodes that we had from a small, currently unused compute cluster, and the disks are all single SCSI drives. All of the Lustre servers are running 2.6.18-92.1.17.el5_lustre.1.8.0smp kernels, and the clients are patchless. All networking is over 1Gb Ethernet.

In our application we have an instrument streaming data to a (compute) cluster, which then does some work and writes results to a file, all of which generally has to occur in real time (that is, keep up with the streaming data). The files are written by processes running on the cluster concurrently; that is, for a particular data set, multiple processes are writing to one file. Due to the way the instrument distributes data to the cluster nodes, as well as the format of the output files, each cluster process generally writes a relatively small amount of data in a block, but at a high frequency (about every 10-100 ms). It might be important to note that the blocks written by a single process are not in general contiguous. The aggregate data rate being written to the output files is approximately 100 MB/s at this time, although this may ramp up considerably at a later date.

While my brief testing with IOR showed acceptable write throughput to the Lustre filesystem, I have been unable to achieve anywhere near that figure with our application doing the writes --- I'm concerned that the write pattern being used is a severely limiting factor. In this situation, does anyone have any advice about what I ought to be looking at to improve performance on Lustre?

-- 
Martin
On Thu, 2009-07-02 at 13:34 -0600, Martin Pokorny wrote:

> While my brief testing with IOR showed acceptable write throughput to
> the Lustre filesystem,

Was the IOR test file-per-client or single-file, segment-per-client? If the latter, how big was the segment, and was the file being written to striped? If yes, what were its striping parameters?

> I have been unable to achieve anywhere near that
> figure with our application doing the writes --- I'm concerned that the
> write pattern being used is a severely limiting factor.

If you need more bandwidth to the file than a single client/network connection/OSS/disk can provide, then you need to do some parallelization, as you have surmised. But the parallelization has to be effective. The most effective parallelization you can get is by mapping a single client to a single OST on an unshared OSS across an unshared network connection. Frequently one or more of those components will be underutilized, though, so some economy can be introduced by sharing an OSS, or even an OST or network connection, etc., among several clients, up to the limit of the performance of the shared resource.

> In this
> situation, does anyone have any advice about what I ought to be looking
> at to improve performance on Lustre?

Well, you need to figure out why your application is not able to take advantage of the parallelization that IOR has demonstrated can be achieved. Perhaps you need to (more effectively) stripe your file across some OSTs so that clients are not competing for the same OST in their write operations.

b.
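As a concrete form of the striping suggestion: the shared output file's layout can be set before it is written, either with the lfs setstripe command or from the application via liblustreapi. A minimal sketch follows, assuming the llapi_file_create() interface documented for Lustre 1.8; the header name, the 1 MB stripe size, and the path below are illustrative choices, not values from this thread.

/*
 * Sketch: create the shared output file with an explicit stripe layout
 * before any process writes to it, so the writes are spread over all
 * of the OSTs instead of landing on one. Assumes the liblustreapi
 * interface shipped with Lustre 1.8 (link with -llustreapi); the
 * header name may differ between releases.
 */
#include <stdio.h>
#include <stdlib.h>
#include <lustre/liblustreapi.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/lustre/output.dat";

    /* 1 MB stripe size, start on any OST (-1), stripe over all
     * available OSTs (-1), default RAID0 pattern (0). These values
     * are illustrative only. */
    int rc = llapi_file_create(path, 1 << 20, -1, -1, 0);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create(%s) failed: %d\n", path, rc);
        return EXIT_FAILURE;
    }
    printf("created %s striped across all OSTs\n", path);
    return EXIT_SUCCESS;
}

With the layout created up front, each client's writes to its own portion of the file can land on different OSTs rather than all contending for a single object.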
On Jul 02, 2009 13:34 -0600, Martin Pokorny wrote:
> I'm in the process of evaluating Lustre for a project I'm working on,
> and I'd like to ask for some advice on tuning my configuration for
> better performance. For my evaluation work, I've got one MGS/MDS and
> four OSSes, each hosting one OST. This storage cluster was put together
> using some spare nodes that we had from a small, currently unused
> compute cluster, and the disks are all single SCSI drives. All of the
> Lustre servers are running 2.6.18-92.1.17.el5_lustre.1.8.0smp kernels,
> and the clients are patchless. All networking is over 1Gb Ethernet.

Note that using single SCSI disks means you have no redundancy of your data. If any disk is lost, and you are striping your files over all of the OSTs (as it seems from below), then all of your files will also lose data. That might be fine if Lustre is just used as a scratch filesystem, but it might also not be what you are expecting.

> In our application we have an instrument streaming data to a (compute)
> cluster, which then does some work and writes results to a file, all of
> which generally has to occur in real time (that is, keep up with the
> streaming data). The files are written by processes running on the
> cluster concurrently; that is, for a particular data set, multiple
> processes are writing to one file. Due to the way the instrument
> distributes data to the cluster nodes, as well as the format of the
> output files, each cluster process generally writes a relatively small
> amount of data in a block, but at a high frequency (about every
> 10-100 ms). It might be important to note that the blocks written by a
> single process are not in general contiguous. The aggregate data rate
> being written to the output files is approximately 100 MB/s at this
> time, although this may ramp up considerably at a later date.
>
> While my brief testing with IOR showed acceptable write throughput to
> the Lustre filesystem, I have been unable to achieve anywhere near that
> figure with our application doing the writes --- I'm concerned that the
> write pattern being used is a severely limiting factor. In this
> situation, does anyone have any advice about what I ought to be looking
> at to improve performance on Lustre?

Writing small file chunks from many clients to a single file is definitely one way to have very bad IO performance with Lustre.

Some ways to improve this:
- have the application aggregate writes some amount before submitting
  them to Lustre. Lustre by default enforces POSIX coherency semantics,
  so it will result in lock ping-pong between client nodes if they are
  all writing to the same file at one time
- have the application do 4kB O_DIRECT-sized IOs to the file and disable
  locking on the output file. That will avoid partial-page IO submissions,
  and by disabling locking you will at least avoid the contention between
  the clients.
- I thought there was also an option to have clients do lockless/uncached
  IO without changing the app, but I can't recall the details on how to
  activate it. Possibly another of the Lustre engineers will recall.
- have the application write contiguous data?
- add more disks, or use SSD disks for the OSTs. This will improve your
  IOPS rate dramatically. It probably makes sense to create larger OSTs
  rather than many smaller OSTs due to less overhead (journal, connections,
  etc).
- using MPI-IO might also help

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
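As a minimal sketch of the first two suggestions (aggregating writes, and issuing 4kB O_DIRECT-sized IOs), a client process could buffer its small records into a page-aligned chunk and flush that chunk with aligned direct writes into its own region of the file. The chunk size, record size, path, and offset scheme below are illustrative assumptions, not code from this thread; disabling Lustre locking on the file is not shown.

/*
 * Sketch: aggregate small records into a page-aligned buffer and flush
 * it with 4 kB-multiple O_DIRECT writes at 4 kB-aligned file offsets.
 * Standard POSIX only; all sizes and the output path are illustrative.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 4096)       /* flush in 256 kB aligned chunks */

static char *buf;               /* page-aligned aggregation buffer */
static size_t used;             /* bytes currently buffered */

/* Flush the buffer at offset 'off', padded with zeros up to a 4 kB
 * multiple. Padding is fine for a sketch; a real writer would handle
 * the final partial chunk according to its file format. */
static int flush_chunk(int fd, off_t off)
{
    size_t len = (used + 4095) & ~(size_t)4095;
    memset(buf + used, 0, len - used);
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    used = 0;
    return 0;
}

int main(void)
{
    int fd = open("/mnt/lustre/output.dat",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 || posix_memalign((void **)&buf, 4096, CHUNK) != 0) {
        perror("setup");
        return EXIT_FAILURE;
    }

    off_t my_offset = 0;        /* this process's 4 kB-aligned region */
    char record[512];           /* one small record from the instrument */
    memset(record, 'x', sizeof(record));

    for (int i = 0; i < 10000; i++) {
        memcpy(buf + used, record, sizeof(record));
        used += sizeof(record);
        if (used == CHUNK) {    /* buffer full: one large aligned write */
            if (flush_chunk(fd, my_offset) < 0) {
                perror("pwrite");
                return EXIT_FAILURE;
            }
            my_offset += CHUNK;
        }
    }
    if (used && flush_chunk(fd, my_offset) < 0)
        perror("pwrite");

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}

Each client then issues one large, aligned write per chunk into its own region of the file instead of many sub-page writes, which avoids partial-page read-modify-write on the OSTs and most of the lock ping-pong described above.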
Hi Andreas,

Thanks for your informative reply. In general terms, what you've written confirms my suspicions as to the underlying factors limiting the filesystem's performance in my application. I've interspersed a few comments below.

Andreas Dilger wrote:
> Note that using single SCSI disks means you have no redundancy of your
> data. If any disk is lost, and you are striping your files over all
> of the OSTs (as it seems from below), then all of your files will also
> lose data. That might be fine if Lustre is just used as a scratch
> filesystem, but it might also not be what you are expecting.

The Lustre filesystem in this application is, in fact, a scratch filesystem. Once the files have been written, they are copied to an archive area. Although I might be interested in availability/reliability for this filesystem to some degree in the future, presently it's performance that I'm after.

> Writing small file chunks from many clients to a single file is definitely
> one way to have very bad IO performance with Lustre.
>
> Some ways to improve this:
> - have the application aggregate writes some amount before submitting
>   them to Lustre. Lustre by default enforces POSIX coherency semantics,
>   so it will result in lock ping-pong between client nodes if they are
>   all writing to the same file at one time

That's a possibility, but limited to a degree by the instrument streaming the raw data into the cluster, and by the output file format. I'm already in discussion with others on the project about this approach.

> - have the application do 4kB O_DIRECT-sized IOs to the file and disable
>   locking on the output file. That will avoid partial-page IO submissions,
>   and by disabling locking you will at least avoid the contention between
>   the clients.

I'll try this out. Luckily, no application-level locking is being done at this time.

> - I thought there was also an option to have clients do lockless/uncached
>   IO without changing the app, but I can't recall the details on how to
>   activate it. Possibly another of the Lustre engineers will recall.

I'd be interested in finding out how to do that.

> - add more disks, or use SSD disks for the OSTs. This will improve your
>   IOPS rate dramatically. It probably makes sense to create larger OSTs
>   rather than many smaller OSTs due to less overhead (journal, connections,
>   etc).

I have been wondering about the effect SSD disks might have. Unfortunately, for now, I need to show that it's worth my time to keep working on a Lustre solution.

> - using MPI-IO might also help

MPI-IO is already on my list of things to try.

Thanks again.

-- 
Martin
On Jul 02, 2009 16:43 -0600, Martin Pokorny wrote:
>> - add more disks, or use SSD disks for the OSTs. This will improve your
>>   IOPS rate dramatically. It probably makes sense to create larger OSTs
>>   rather than many smaller OSTs due to less overhead (journal, connections,
>>   etc).
>
> I have been wondering about the effect SSD disks might have.
> Unfortunately, for now, I need to show that it's worth my time to keep
> working on a Lustre solution.

Another option, if you are using 1.8, is to enable the async journal commit on the OSTs. This can improve performance for small IO dramatically, but it is currently not supported in a production environment due to recovery issues. Since you are using a scratch filesystem only, you could do this to test the performance. It is "lctl set_param obdfilter.*.syncjournal=0" on the OSTs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.