Andrei Maslennikov
2007-Apr-13 04:00 UTC
[Lustre-discuss] Max obtainable performance for a single OST
We are currently evaluating possible commodity hardware candidates suitable for a single OSS with a single OST served to the clients via IB/RDMA. The goal is to provide peak performance of around 1 GB/sec for large streaming I/O on a single file at the client level, *without* striping. In other words, we want to see if we could build a high-performance standalone box acting as a Lustre head for a couple of clients (obviously, we will also have to run the metadata service on it).

Economically, the most attractive scenario is to use a "storage-in-a-box" element, as it saves on FC/SCSI cards and external disk enclosures. One such candidate box that we tried had three RAID-6 controllers, with 8 disk modules per controller. The machine is an Intel dual-core 3 GHz, with 8 GB of RAM. We are able to get aggregate disk performance of 300+, 600+, 900+ MB/sec on writes if we run 1, 2, 3 processes against 1, 2, 3 distinct logical drives.

Now comes the interesting point: if we run a single write process against a striped logical volume built upon the three available drives, we are only able to obtain 750 MB/sec. The writer process eats 100% of CPU, and there is no way to improve this. This behaviour, of course, is perfectly normal, but for us it means that if we base our OST on this combination of CPU + striped volume, we will probably never be able to push out more than 750 MB/sec of peak I/O to the clients. Unless the OST backend service itself is multithreaded!

As we do not have a running Lustre/IB environment at the moment to check it, I would appreciate it if someone could comment on how OST processes are organized internally. If only one thread is doing I/O towards the backend ext3 partition, we won't be able to go over 750 MB/sec on such a machine. Otherwise, we could probably grow up to 900 MB/sec.

Thanks ahead for any comment - Andrei.
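For concreteness, a rough sketch of the kind of streaming-write test behind the numbers above; the device names, mount points, sizes and LVM parameters below are illustrative only, not the actual setup:

    # One writer per RAID logical drive; run 1, 2 or 3 of these in parallel
    # to reproduce the 300+/600+/900+ MB/sec aggregate figures.
    dd if=/dev/zero of=/mnt/raid0/f0 bs=1M count=32768 &
    dd if=/dev/zero of=/mnt/raid1/f1 bs=1M count=32768 &
    dd if=/dev/zero of=/mnt/raid2/f2 bs=1M count=32768 &
    wait

    # Single writer against an LVM volume striped over the three drives
    # (this is the case that tops out around 750 MB/sec with 100% CPU).
    lvcreate -i 3 -I 256 -L 500G -n lvstripe vg_raid
    dd if=/dev/zero of=/dev/vg_raid/lvstripe bs=1M count=131072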
Andreas Dilger
2007-Apr-13 04:24 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Apr 13, 2007 12:00 +0200, Andrei Maslennikov wrote:
> We are currently evaluating possible commodity hardware candidates
> suitable for a single OSS with a single OST served to the clients via
> IB/RDMA. The goal is to provide peak performance of around 1 GB/sec
> for large streaming I/O on a single file at the client level, *without*
> striping. In other words, we want to see if we could build a
> high-performance standalone box acting as a Lustre head for a couple
> of clients (obviously, we will also have to run the metadata service on it).
>
> Now comes the interesting point: if we run a single write process against
> a striped logical volume built upon the three available drives, we are
> only able to obtain 750 MB/sec. The writer process eats 100% of CPU, and
> there is no way to improve this. This behaviour, of course, is perfectly
> normal, but for us it means that if we base our OST on this combination
> of CPU + striped volume, we will probably never be able to push out more
> than 750 MB/sec of peak I/O to the clients. Unless the OST backend
> service itself is multithreaded!

You should try with 3 OSTs on the box, and set the filesystem default stripe count to 3, to see if the multi-threading on the client side can do better than the LVM stripe. You will get 3x as many requests in flight and 3x as much write cache space in Lustre.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
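A minimal sketch of what that test setup might look like from a client node, assuming the lfs setstripe argument order of this era is <stripe-size> <starting-OST-index> <stripe-count> (0 = default size, -1 = no fixed starting OST) and that /mnt/lustre is an illustrative client mount point:

    # Default layout for the whole filesystem: stripe every new file over
    # 3 OSTs, default stripe size, MDS chooses the starting OST.
    lfs setstripe /mnt/lustre 0 -1 3

    # Single streaming writer; the client should now keep all three OSTs busy.
    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=32768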
Andrei Maslennikov
2007-Apr-13 06:34 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Thanks Andreas, clearly the 3-OST configuration will work better (we see that three distinct processes are able to use the full power of all RAID controllers), but we wanted to avoid striping at the Lustre file system level inside one OSS. This is why: what we are aiming at is a single-mount-point file system composed of one or more first-level subdirectories, each of which is based on one standalone OSS capable of delivering around 1 GB/sec of peak performance. This configuration has an obvious "affinity" advantage of only partial damage in case of loss of one of the OSS machines (only one of the first-level subdirectories will become unavailable).

I hence thought that a filesystem default stripe count of 3 would also apply to the mother (level-0) directory, and then there would be cases where one file could be spread over two or three OSSes. And this is exactly the case we wish to avoid.

To conclude: is it possible to configure a Lustre file system spread over multiple OSS machines, each of which contains 3 OSTs with striping, and still ensure (at the subdirectory level) that any file will always end up in one and only one OSS box and will always be striped over the 3 OSTs available inside that given OSS?

Thanks ahead again for clarifying this - Andrei.

On 4/13/07, Andreas Dilger <adilger@clusterfs.com> wrote:
> On Apr 13, 2007 12:00 +0200, Andrei Maslennikov wrote:
> > ....
> > for large streaming I/O on a single file at the client level, *without*
> > striping.
>
> You should try with 3 OSTs on the box, and set the filesystem default
> stripe count to 3, to see if the multi-threading on the client side can
> do better than the LVM stripe. You will get 3x as many requests in flight
> and 3x as much write cache space in Lustre.
Stephen Simms
2007-Apr-13 06:57 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Hi Andrei-

750 MB/s or so is about the max that we have seen from a single client to multiple OSSs across TCP; however, we discovered that you can use both front-side buses if you perform two simultaneous writes (turning ksocklnd's IRQ affinity off on the server side). This got us over 1 GB/s aggregate writes with multiple OSSs on the back end. Reads have been less - roughly 400 MB/s and 600 MB/s respectively. These numbers were using Myri-10G cards in Ethernet mode with DDN 9550 controllers on the back end. So I believe that front-side bus speed and internal memory copies have prevented us from better single-file performance (reads are worse than writes because you can't use zero-copy for reads). My suspicion is that this is the case for you as well.

Our network performance (measured with netperf) has been 9.1 Gb/s or better using the Myricom cards in Ethernet mode, so we know that is not the limiting factor. Likewise, we see better than 350 MB/s per port on the DDN side (using sgpdd), so that's not the limiting factor either.

I hope this helps,
simms

On Fri, 13 Apr 2007, Andrei Maslennikov wrote:
> We are currently evaluating possible commodity hardware candidates
> suitable for a single OSS with a single OST served to the clients via
> IB/RDMA. The goal is to provide peak performance of around 1 GB/sec
> for large streaming I/O on a single file at the client level, *without*
> striping.
> ....
> As we do not have a running Lustre/IB environment at the moment to check
> it, I would appreciate it if someone could comment on how OST processes
> are organized internally. If only one thread is doing I/O towards the
> backend ext3 partition, we won't be able to go over 750 MB/sec on such a
> machine. Otherwise, we could probably grow up to 900 MB/sec.
>
> Thanks ahead for any comment - Andrei.
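For reference, a minimal sketch of the kind of baseline measurements mentioned above; the exact options used are not stated in the post, so the hostnames, device names and transfer sizes here are assumptions:

    # Raw TCP stream throughput to the OSS (netserver must be running there).
    netperf -H oss1 -t TCP_STREAM -l 60

    # Raw back-end throughput with sgp_dd from sg3_utils, bypassing the
    # filesystem. Destructive: run only against a scratch device.
    # 512-byte blocks, 2048 blocks per transfer (1 MiB), 1 GiB total, 4 threads.
    sgp_dd if=/dev/zero of=/dev/sdb bs=512 bpt=2048 count=2097152 thr=4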
Andrei Maslennikov
2007-Apr-13 09:30 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Thanks Stephen, we are looking at the possible peak performance of a single OSS with an IB outlet. I understand that at the client level the tradeoffs may be visible, and the 750 MB/sec aggregate that you observe is not bad at all. But we want to ensure that our OSS is able to unleash around 1 GB/sec into the clients' IB network...

The performance of a single OSS depends on the performance of the local ext3 backend file system, and we were unable to push it over 750 MB/sec. The advice of Andreas from ClusterFS is to use 3 OSTs inside one OSS and stripe files over all three of them. Some time ago, however, we considered and discarded this solution, as we wanted to ensure that every file is confined to one and only one OSS capable of delivering 0.9-1 GB/sec. Setting the filesystem default stripe count to 3 may lead to a situation where a file ends up on different OSS machines, and that's exactly what we want to avoid. (I have asked Andreas to comment on the configuration; if it were possible to move to striping over 3 OSTs per OSS and still ensure the OSS confinement, we would certainly follow the 3-OST solution.)

Greetings - Andrei.

On 4/13/07, Stephen Simms <ssimms@indiana.edu> wrote:
> Hi Andrei-
>
> 750 MB/s or so is about the max that we have seen from a single client to
> multiple OSSs across TCP; however, we discovered that you can use both
> front-side buses if you perform two simultaneous writes (turning
> ksocklnd's IRQ affinity off on the server side). This got us over 1 GB/s
> aggregate writes with multiple OSSs on the back end. Reads have been less
> - roughly 400 MB/s and 600 MB/s respectively. These numbers were using
> Myri-10G cards in Ethernet mode with DDN 9550 controllers on the back end.
> So I believe that front-side bus speed and internal memory copies have
> prevented us from better single-file performance (reads are worse than
> writes because you can't use zero-copy for reads). My suspicion is that
> this is the case for you as well.
>
> Our network performance (measured with netperf) has been 9.1 Gb/s or
> better using the Myricom cards in Ethernet mode, so we know that is not
> the limiting factor. Likewise, we see better than 350 MB/s per port on
> the DDN side (using sgpdd), so that's not the limiting factor either.
>
> I hope this helps,
> simms
>
> ....
Brian J. Murrell
2007-Apr-13 09:44 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Fri, 2007-13-04 at 17:30 +0200, Andrei Maslennikov wrote:
> The performance of a single OSS depends on the performance of the local
> ext3 backend file system, and we were unable to push it over 750 MB/sec.

Nomenclature in such a discussion is very important. An OSS is an Object Storage Server. An OSS can contain many OSTs. An OST is an Object Storage Target. An OST is a Lustre building block constructed from a single block device, be it a single physical disk (/dev/sda for example), a partition on a disk (/dev/sda1 for example), an LVM LV (/dev/vg00/ost1 for example), a software RAID device (/dev/md0 for example), or the block device a hardware RAID card presents to the operating system.

> The advice of Andreas from ClusterFS is to use 3 OSTs inside one OSS and
> stripe files over all three of them.

Right. I believe the 3 OSTs he recommended you use would be the three RAID disks you have in the machine.

> Some time ago, however, we considered and discarded this solution, as we
> wanted to ensure that every file is confined to one and only one OSS
> capable of delivering 0.9-1 GB/sec.

Do you really mean OSS here or OST? Given the 3 OSTs you would put on the 3 RAID cards you have in your OSS, this would meet your requirements, wouldn't it?

> Setting the filesystem default stripe count to 3 may lead to a situation
> where a file ends up on different OSS machines,

I believe Andreas described striping across the 3 OSTs in that single OSS machine.

> and that's exactly what we want to avoid. (I have asked Andreas to comment
> on the configuration; if it were possible to move to striping over 3 OSTs
> per OSS and still ensure the OSS confinement,

Sure. I do think that is what Andreas was suggesting.

b.
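To make those building blocks concrete, a minimal sketch of how three such OSTs might be created on one OSS, assuming a Lustre 1.6-style setup where OSTs are formatted directly with mkfs.lustre; the fsname, MGS NID and block device names are illustrative only:

    # Format each hardware-RAID block device as a separate OST
    # ("testfs" and the MGS NID 192.168.0.10@o2ib are placeholders).
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdb
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdc
    mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@o2ib /dev/sdd

    # Start the OSTs by mounting them on the OSS.
    mount -t lustre /dev/sdb /mnt/ost0
    mount -t lustre /dev/sdc /mnt/ost1
    mount -t lustre /dev/sdd /mnt/ost2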
Andrei Maslennikov
2007-Apr-13 10:34 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On 4/13/07, Brian J. Murrell <brian@clusterfs.com> wrote:
> On Fri, 2007-13-04 at 17:30 +0200, Andrei Maslennikov wrote:
> > Some time ago, however, we considered and discarded this solution, as we
> > wanted to ensure that every file is confined to one and only one OSS
> > capable of delivering 0.9-1 GB/sec.
>
> Do you really mean OSS here or OST? Given the 3 OSTs you would put on
> the 3 RAID cards you have in your OSS, this would meet your requirements,
> wouldn't it?

Hi Brian,

I meant OSS here. If our configuration included only one OSS with 3 OSTs and the default stripe count for the file system were 3, the solution mentioned by Andreas would be fine (and we even considered it before asking the first question on this list). The point is, if we add other 3-OST-based OSS machines to the same file system, it may happen that files end up on different OSS machines (with a default stripe count of 3, one file may be split over 3 OSTs in three different OSS hosts).

Our main requirement is to ensure that every file ends up in one and only one OSS box. If this is not possible with multi-OST OSS machines, we may only be able to use one OST per OSS, and hence will have to stay with a maximum of 750 MB/sec single-file performance per OSS.

Best regards - Andrei.
Nathaniel Rutman
2007-Apr-13 11:55 UTC
[Lustre-discuss] Max obtainable performance for a single OST
Andrei Maslennikov wrote:
> I meant OSS here. If our configuration included only one OSS with 3 OSTs
> and the default stripe count for the file system were 3, the solution
> mentioned by Andreas would be fine (and we even considered it before
> asking the first question on this list). The point is, if we add other
> 3-OST-based OSS machines to the same file system, it may happen that
> files end up on different OSS machines (with a default stripe count of 3,
> one file may be split over 3 OSTs in three different OSS hosts).
>
> Our main requirement is to ensure that every file ends up in one and only
> one OSS box. If this is not possible with multi-OST OSS machines, we may
> only be able to use one OST per OSS, and hence will have to stay with a
> maximum of 750 MB/sec single-file performance per OSS.

You can use 'lfs setstripe' to designate a starting OST index for a file or a directory. With that, you could set up different single-OSS directories.

If you have more than a single client reading from an OSS at the same time, you don't even need to do that -- use three OSTs per OSS, with everything single-striped. Your multiple clients can use the different OSTs simultaneously, increasing your total performance.
Andreas Dilger
2007-Apr-13 16:00 UTC
[Lustre-discuss] Max obtainable performance for a single OST
On Apr 13, 2007 14:34 +0200, Andrei Maslennikov wrote:
> What we are aiming at is a single-mount-point file system composed of one
> or more first-level subdirectories, each of which is based on one
> standalone OSS capable of delivering around 1 GB/sec of peak performance.
> This configuration has an obvious "affinity" advantage of only partial
> damage in case of loss of one of the OSS machines (only one of the
> first-level subdirectories will become unavailable).

Use "lfs setstripe 0 0 3" on the directory for ost[0,1,2], "lfs setstripe 0 3 3" on the directory for ost[3,4,5], etc.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
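Spelled out, and assuming the setstripe arguments are <stripe-size> <starting-OST-index> <stripe-count> (0 = default stripe size), that OSTs 0-2 live on the first OSS and OSTs 3-5 on the second, and that /mnt/lustre and the subdirectory names are illustrative, this might look like:

    # Directory whose files stay on OSS #1, striped over its three OSTs (0,1,2).
    mkdir /mnt/lustre/oss1
    lfs setstripe /mnt/lustre/oss1 0 0 3

    # Directory whose files stay on OSS #2, striped over OSTs 3,4,5.
    mkdir /mnt/lustre/oss2
    lfs setstripe /mnt/lustre/oss2 0 3 3

    # Files created under each directory inherit that layout, so every file
    # is striped across the three OSTs of a single OSS and never crosses
    # OSS boxes.
    dd if=/dev/zero of=/mnt/lustre/oss1/bigfile bs=1M count=10240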