Andrei Maslennikov
2007-Apr-13 04:00 UTC
[Lustre-discuss] Max obtainable performance for a single OST
We are currently evaluating possible commodity hardware candidates suitable for a single OSS with a single OST served to the clients via IB/RDMA. The goal is to provide peak performance of around 1 GB/sec for large streaming I/O on a single file at the client level, *without* striping. In other words, we want to see if we could build a high-performance standalone box acting as a Lustre head for a couple of clients (obviously, we will also have to run the metadata service on it).

Economically, the most attractive scenario is to use a "storage-in-a-box" element, as it saves on FC/SCSI cards and external disk enclosures. One such candidate box that we tried had three RAID-6 controllers, with 8 disk modules per controller. The machine is an Intel dual-core 3 GHz, with 8 GB of RAM. We are able to get aggregate disk performance of 300+, 600+, 900+ MB/sec on writes if we run 1, 2, 3 processes against 1, 2, 3 distinct logical drives.

Now comes the interesting point: if we run a single write process against a striped logical volume built upon the three available drives, we are only able to obtain 750 MB/sec. The writer process eats 100% of CPU, and there is no way to improve this. This behaviour, of course, is perfectly normal, but for us it means that if we base our OST on this combination of CPU + striped volume, we will probably never be able to push out more than 750 MB/sec of peak I/O to the clients. Unless the OST backend service itself is multithreaded!

As we do not have a running Lustre/IB environment at the moment to check it, I would appreciate it if someone could comment on how OST processes are organized internally. If only one thread is doing I/O towards the backend ext3 partition, we won't be able to go over 750 MB/sec on such a machine. Otherwise, we could probably grow up to 900 MB/sec.

Thanks ahead for any comment - Andrei.
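For concreteness, a rough sketch of the kind of streaming-write test behind the numbers above; the device names, mount points, sizes and LVM parameters below are illustrative only, not the actual setup:

    # One writer per RAID logical drive; run 1, 2 or 3 of these in parallel
    # to reproduce the 300+/600+/900+ MB/sec aggregate figures.
    dd if=/dev/zero of=/mnt/raid0/f0 bs=1M count=32768 &
    dd if=/dev/zero of=/mnt/raid1/f1 bs=1M count=32768 &
    dd if=/dev/zero of=/mnt/raid2/f2 bs=1M count=32768 &
    wait

    # Single writer against an LVM volume striped over the three drives
    # (this is the case that tops out around 750 MB/sec with 100% CPU).
    lvcreate -i 3 -I 256 -L 500G -n lvstripe vg_raid
    dd if=/dev/zero of=/dev/vg_raid/lvstripe bs=1M count=131072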
Andreas Dilger
2007-Apr-13 04:24 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Apr 13, 2007 12:00 +0200, Andrei Maslennikov wrote:
> We are currently evaluating possible commodity hardware candidates
> suitable for a single OSS with a single OST served to the clients via
> IB/RDMA. The goal is to provide peak performance of around 1 GB/sec
> for large streaming I/O on a single file at the client level, *without*
> striping. In other words, we want to see if we could build a
> high-performance standalone box acting as a Lustre head for a couple
> of clients (obviously, we will also have to run the metadata service on it).
>
> Now comes the interesting point: if we run a single write process against
> a striped logical volume built upon the three available drives, we are
> only able to obtain 750 MB/sec. The writer process eats 100% of CPU, and
> there is no way to improve this. This behaviour, of course, is perfectly
> normal, but for us it means that if we base our OST on this combination
> of CPU + striped volume, we will probably never be able to push out more
> than 750 MB/sec of peak I/O to the clients. Unless the OST backend
> service itself is multithreaded!

You should try with 3 OSTs on the box, and set the filesystem default stripe count to 3, to see if the multi-threading on the client side can do better than the LVM stripe. You will get 3x as many requests in flight and 3x as much write cache space in Lustre.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
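A minimal sketch of what that test setup might look like from a client node, assuming the lfs setstripe argument order of this era is <stripe-size> <starting-OST-index> <stripe-count> (0 = default size, -1 = no fixed starting OST) and that /mnt/lustre is an illustrative client mount point:

    # Default layout for the whole filesystem: stripe every new file over
    # 3 OSTs, default stripe size, MDS chooses the starting OST.
    lfs setstripe /mnt/lustre 0 -1 3

    # Single streaming writer; the client should now keep all three OSTs busy.
    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=32768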
Andrei Maslennikov
2007-Apr-13 06:34 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Thanks Andreas, clearly the 3-OST configuration will work better (we see that three distinct processes are able to use the full power of all RAID controllers), but we wanted to avoid striping at the Lustre file system level inside one OSS. This is why: what we are aiming at is a single-mount-point file system composed of one or more first-level subdirectories, each of which is based on one standalone OSS capable of delivering around 1 GB/sec of peak performance. This configuration has an obvious "affinity" advantage of only partial damage in case of loss of one of the OSS machines (only one of the first-level subdirectories will become unavailable).

I hence thought that a filesystem default stripe count of 3 would also apply to the mother (level-0) directory, and then there would be cases where one file could be spread over two or three OSSes. And this is exactly the case we wish to avoid.

To conclude: is it possible to configure a Lustre file system spread over multiple OSS machines, each of which contains 3 OSTs with striping, and still ensure (at the subdirectory level) that any file will always end up in one and only one OSS box and will always be striped over the 3 OSTs available inside that given OSS?

Thanks ahead again for clarifying this - Andrei.

On 4/13/07, Andreas Dilger <adilger@clusterfs.com> wrote:
> On Apr 13, 2007 12:00 +0200, Andrei Maslennikov wrote:
> > ....
> > for large streaming I/O on a single file at the client level, *without*
> > striping.
>
> You should try with 3 OSTs on the box, and set the filesystem default
> stripe count to 3, to see if the multi-threading on the client side can
> do better than the LVM stripe. You will get 3x as many requests in flight
> and 3x as much write cache space in Lustre.
Stephen Simms
2007-Apr-13 06:57 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Hi Andrei-

750 MB/s or so is about the max that we have seen from a single client to multiple OSSs across TCP; however, we discovered that you can use both front-side buses if you perform two simultaneous writes (turning ksocklnd's IRQ affinity off on the server side). This got us over 1 GB/s aggregate writes with multiple OSSs on the back end. Reads have been less - roughly 400 MB/s and 600 MB/s respectively. These numbers were using Myri-10G cards in Ethernet mode with DDN 9550 controllers on the back end. So I believe that front-side bus speed and internal memory copies have prevented us from better single-file performance (reads are worse than writes because you can't use zero-copy for reads). My suspicion is that this is the case for you as well.

Our network performance (measured with netperf) has been 9.1 Gb/s or better using the Myricom cards in Ethernet mode, so we know that is not the limiting factor. Likewise, we see better than 350 MB/s per port on the DDN side (using sgpdd), so that's not the limiting factor either.

I hope this helps,
simms

On Fri, 13 Apr 2007, Andrei Maslennikov wrote:
> We are currently evaluating possible commodity hardware candidates
> suitable for a single OSS with a single OST served to the clients via
> IB/RDMA. The goal is to provide peak performance of around 1 GB/sec
> for large streaming I/O on a single file at the client level, *without*
> striping.
> ....
> As we do not have a running Lustre/IB environment at the moment to check
> it, I would appreciate it if someone could comment on how OST processes
> are organized internally. If only one thread is doing I/O towards the
> backend ext3 partition, we won't be able to go over 750 MB/sec on such a
> machine. Otherwise, we could probably grow up to 900 MB/sec.
>
> Thanks ahead for any comment - Andrei.
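For reference, a minimal sketch of the kind of baseline measurements mentioned above; the exact options used are not stated in the post, so the hostnames, device names and transfer sizes here are assumptions:

    # Raw TCP stream throughput to the OSS (netserver must be running there).
    netperf -H oss1 -t TCP_STREAM -l 60

    # Raw back-end throughput with sgp_dd from sg3_utils, bypassing the
    # filesystem. Destructive: run only against a scratch device.
    # 512-byte blocks, 2048 blocks per transfer (1 MiB), 1 GiB total, 4 threads.
    sgp_dd if=/dev/zero of=/dev/sdb bs=512 bpt=2048 count=2097152 thr=4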
Andrei Maslennikov
2007-Apr-13 09:30 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Thanks Stephen, we are looking at the possible peak performance of a single OSS with an IB outlet. I understand that at the client level the tradeoffs may be visible, and the 750 MB/sec aggregate that you observe is not bad at all. But we want to ensure that our OSS is able to unleash around 1 GB/sec into the clients' IB network...

The performance of a single OSS depends on the performance of the local ext3 backend file system, and we were unable to push it over 750 MB/sec. The advice of Andreas from ClusterFS is to use 3 OSTs inside one OSS and stripe files over all three of them. Some time ago, however, we considered and discarded this solution, as we wanted to ensure that every file is confined to one and only one OSS capable of delivering 0.9-1 GB/sec. Setting the filesystem default stripe count to 3 may lead to a situation where a file ends up on different OSS machines, and that's exactly what we want to avoid. (I have asked Andreas to comment on the configuration; if it were possible to move to striping over 3 OSTs per OSS and still ensure the OSS confinement, we would certainly follow the 3-OST solution.)

Greetings - Andrei.

On 4/13/07, Stephen Simms <ssimms@indiana.edu> wrote:
> Hi Andrei-
>
> 750 MB/s or so is about the max that we have seen from a single client to
> multiple OSSs across TCP; however, we discovered that you can use both
> front-side buses if you perform two simultaneous writes (turning
> ksocklnd's IRQ affinity off on the server side). This got us over 1 GB/s
> aggregate writes with multiple OSSs on the back end. Reads have been less
> - roughly 400 MB/s and 600 MB/s respectively. These numbers were using
> Myri-10G cards in Ethernet mode with DDN 9550 controllers on the back end.
> So I believe that front-side bus speed and internal memory copies have
> prevented us from better single-file performance (reads are worse than
> writes because you can't use zero-copy for reads). My suspicion is that
> this is the case for you as well.
>
> Our network performance (measured with netperf) has been 9.1 Gb/s or
> better using the Myricom cards in Ethernet mode, so we know that is not
> the limiting factor. Likewise, we see better than 350 MB/s per port on
> the DDN side (using sgpdd), so that's not the limiting factor either.
>
> I hope this helps,
> simms
>
> ....
Brian J. Murrell
2007-Apr-13 09:44 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Fri, 2007-13-04 at 17:30 +0200, Andrei Maslennikov wrote:
> The performance of a single OSS depends on the performance of the local
> ext3 backend file system, and we were unable to push it over 750 MB/sec.

Nomenclature in such a discussion is very important. An OSS is an Object Storage Server. An OSS can contain many OSTs. An OST is an Object Storage Target. An OST is a Lustre building block constructed from a single block device, be it a single physical disk (/dev/sda for example), a partition on a disk (/dev/sda1 for example), an LVM LV (/dev/vg00/ost1 for example), a software RAID device (/dev/md0 for example), or the block device a hardware RAID card presents to the operating system.

> The advice of Andreas from ClusterFS is to use 3 OSTs inside one OSS and
> stripe files over all three of them.

Right. I believe the 3 OSTs he recommended you use would be the three RAID disks you have in the machine.

> Some time ago, however, we considered and discarded this solution, as we
> wanted to ensure that every file is confined to one and only one OSS
> capable of delivering 0.9-1 GB/sec.

Do you really mean OSS here or OST? Given the 3 OSTs you would put on the 3 RAID cards you have in your OSS, this would meet your requirements, wouldn't it?

> Setting the filesystem default stripe count to 3 may lead to a situation
> where a file ends up on different OSS machines,

I believe Andreas described striping across the 3 OSTs in that single OSS machine.

> and that's exactly what we want to avoid. (I have asked Andreas to comment
> on the configuration; if it were possible to move to striping over 3 OSTs
> per OSS and still ensure the OSS confinement,

Sure. I do think that is what Andreas was suggesting.

b.
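To make those building blocks concrete, a minimal sketch of how three such OSTs might be created on one OSS, assuming a Lustre 1.6-style setup where OSTs are formatted directly with mkfs.lustre; the fsname, MGS NID and block device names are illustrative only:

    # Format each hardware-RAID block device as a separate OST
    # ("testfs" and the MGS NID 192.168.0.10@o2ib are placeholders).
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdb
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdc
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdd

    # Start the OSTs by mounting them on the OSS.
    mount -t lustre /dev/sdb /mnt/ost0
    mount -t lustre /dev/sdc /mnt/ost1
    mount -t lustre /dev/sdd /mnt/ost2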
Andrei Maslennikov
2007-Apr-13 10:34 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On 4/13/07, Brian J. Murrell <brian@clusterfs.com> wrote:
> On Fri, 2007-13-04 at 17:30 +0200, Andrei Maslennikov wrote:
> > Some time ago, however, we considered and discarded this solution, as we
> > wanted to ensure that every file is confined to one and only one OSS
> > capable of delivering 0.9-1 GB/sec.
>
> Do you really mean OSS here or OST? Given the 3 OSTs you would put on
> the 3 RAID cards you have in your OSS, this would meet your requirements,
> wouldn't it?

Hi Brian,

I meant OSS here. If our configuration included only one OSS with 3 OSTs and the default stripe count for the file system were 3, the solution mentioned by Andreas would be fine (and we even considered it before asking the first question on this list). The point is, if we add other 3-OST-based OSS machines to the same file system, it may happen that files end up on different OSS machines (with a default stripe count of 3, one file may be split over 3 OSTs in three different OSS hosts).

Our main requirement is to ensure that every file ends up in one and only one OSS box. If this is not possible with multi-OST OSS machines, we may only be able to use one OST per OSS, and hence will have to stay with a maximum of 750 MB/sec single-file performance per OSS.

Best regards - Andrei.
Nathaniel Rutman
2007-Apr-13 11:55 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Andrei Maslennikov wrote:
> I meant OSS here. If our configuration included only one OSS with 3 OSTs
> and the default stripe count for the file system were 3, the solution
> mentioned by Andreas would be fine (and we even considered it before
> asking the first question on this list). The point is, if we add other
> 3-OST-based OSS machines to the same file system, it may happen that
> files end up on different OSS machines (with a default stripe count of 3,
> one file may be split over 3 OSTs in three different OSS hosts).
>
> Our main requirement is to ensure that every file ends up in one and only
> one OSS box. If this is not possible with multi-OST OSS machines, we may
> only be able to use one OST per OSS, and hence will have to stay with a
> maximum of 750 MB/sec single-file performance per OSS.

You can use 'lfs setstripe' to designate a starting OST index for a file or a directory. With that, you could set up different single-OSS directories.

If you have more than a single client reading from an OSS at the same time, you don't even need to do that -- use three OSTs per OSS, with everything single-striped. Your multiple clients can use the different OSTs simultaneously, increasing your total performance.
Andreas Dilger
2007-Apr-13 16:00 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Apr 13, 2007 14:34 +0200, Andrei Maslennikov wrote:
> What we are aiming at is a single-mount-point file system composed of one
> or more first-level subdirectories, each of which is based on one
> standalone OSS capable of delivering around 1 GB/sec of peak performance.
> This configuration has an obvious "affinity" advantage of only partial
> damage in case of loss of one of the OSS machines (only one of the
> first-level subdirectories will become unavailable).

Use "lfs setstripe 0 0 3" on the directory for ost[0,1,2], "lfs setstripe 0 3 3" on the directory for ost[3,4,5], etc.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
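Spelled out, and assuming the setstripe arguments are <stripe-size> <starting-OST-index> <stripe-count> (0 = default stripe size), that OSTs 0-2 live on the first OSS and OSTs 3-5 on the second, and that /mnt/lustre and the subdirectory names are illustrative, this might look like:

    # Directory whose files stay on OSS #1, striped over its three OSTs (0,1,2).
    mkdir /mnt/lustre/oss1
    lfs setstripe /mnt/lustre/oss1 0 0 3

    # Directory whose files stay on OSS #2, striped over OSTs 3,4,5.
    mkdir /mnt/lustre/oss2
    lfs setstripe /mnt/lustre/oss2 0 3 3

    # Files created under each directory inherit that layout, so every file
    # is striped across the three OSTs of a single OSS and never crosses
    # OSS boxes.
    dd if=/dev/zero of=/mnt/lustre/oss1/bigfile bs=1M count=10240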