Anybody had any experience using an IB-based storage target as an OST?

Apart from the obvious issue of separating the IB SAN (SRP/iSER) storage
traffic from the Lustre traffic, are there any issues?

What about failover?

--
Brian O'Connor
-----------------------------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA
http://www.sgi.com/support/services
-----------------------------------------------------------------------
Hi Brian,

Here at ORNL we don't separate the IB SAN from the Lustre fabric, and we
don't see any performance degradation on any of it. We have two links to
the back-end storage from the OSS nodes - one is direct to the storage
controller, the other (primary) is through the SAN.

Any specific questions you have?

Thanks,

--
-Jason
-------------------------------------------------
// Jason J. Hill                                //
// HPC Systems Administrator                    //
// National Center for Computational Sciences   //
// Oak Ridge National Laboratory                //
// e-mail: hilljj at ornl.gov                    //
// Phone: (865) 576-5867                        //
-------------------------------------------------

On Mon, Mar 28, 2011 at 01:21:26AM -0400, Brian O'Connor wrote:
>
> Anybody had any experience using an IB-based storage
> target as an OST?
>
> Apart from the obvious issue of separating the IB SAN (SRP/iSER)
> storage traffic from the Lustre traffic, are there any issues?
>
> What about failover?
On 28 Mar 2011, at 06:21, Brian O'Connor wrote:
> Anybody had any experience using an IB-based storage
> target as an OST?

We do this all the time at DDN.

> Apart from the obvious issue of separating the IB SAN (SRP/iSER)
> storage traffic from the Lustre traffic, are there any issues?

Personally I don't see a huge problem sharing traffic; some people like to
have separate networks for MPI and Lustre, but I don't see a real advantage
in that either.

What is important about using IB devices is what happens in case of
failure. If your IB network has problems and the Lustre clients can't talk
to the servers, they get evicted and have to re-connect. Inconvenient
perhaps, but not usually a major issue. If, however, your back-end devices
get detached from your Lustre servers, that is very different, as the
on-disk data may then be stale or inconsistent. People will tell you the
on-disk design is such that any corruption caused by a sudden loss of
access is minimal, but since this typically happens across many OSTs
simultaneously, the risk adds up. I would estimate around 5-10% of devices
which see a disconnect in this way require a fsck before they will mount;
when you take into account the number of OSTs involved, you can see that
even for a fairly small number of OSTs, having at least one that needs a
fsck becomes the norm rather than the exception.

As such, I would pay attention to this when designing your network: does
your SRP traffic need to go over the core switch, or can you isolate it at
all? Ideally, direct-connect the Lustre servers to whatever is serving the
SRP targets. If you have a small number of servers and need, say, 10 ports
to connect them all to the back-end storage, don't be tempted to use 10
ports on your large site switch, as this will be a lot more error-prone
than using a small, dedicated switch.

> What about failover?

Make sure that MMP is enabled; it will be if you specify more than one
failnode when you format the OST.

Ashley.
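P.S. To make the failover point concrete, formatting the OST with the
failover NID(s) given up front is what switches MMP on. A minimal sketch,
with made-up fsname, NIDs and device path (adjust to your own setup):

  # Format an OST served by a failover pair of OSS nodes; specifying
  # --failnode is what enables MMP on the resulting ldiskfs device.
  # The NIDs and device path below are examples only.
  mkfs.lustre --ost --fsname=testfs \
      --mgsnode=10.10.0.1@o2ib \
      --failnode=10.10.0.3@o2ib \
      /dev/mapper/ost0

  # Sanity check that the mmp feature is present afterwards:
  dumpe2fs -h /dev/mapper/ost0 | grep -i features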
> Anybody had any experience using an IB-based storage
> target as an OST?

We do that.

> Apart from the obvious issue of separating the IB SAN (SRP/iSER)
> storage traffic from the Lustre traffic, are there any issues?

We don't actually separate the IB traffic from the Lustre traffic; in our
case they run over the same IB HCAs. That isn't the setup I would have
chosen, but it was the system that was available.

Here is one implementation detail that stands out in my mind. Because the
IB storage tends to come online rather late in the boot process, we had to
develop a custom boot script that waits around for the IB device nodes to
appear before attempting to mount the Lustre filesystems. That was a bit of
a pain until we had it all worked out.

As others have pointed out, if your back-end storage disappears (which
happens more often than I would prefer, though in our case the issues that
caused it have mostly been resolved), that makes Lustre very unhappy very
quickly. We've been able to recover from those situations, but it can be a
royal pain.

> What about failover?

We use MMP as others have mentioned, but we don't actually have the Lustre
failover stuff all up and running; mostly it hasn't been an issue for us,
so we haven't seen a need to finish it.

--Ken
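P.S. The wait logic itself was nothing clever; a stripped-down sketch of
the idea (the device path, timeout, and mount point are placeholders, not
our actual script) looks something like:

  #!/bin/sh
  # Poll for the SRP-backed block device before trying the Lustre mount;
  # /dev/mapper/ost0, the 300s timeout, and /mnt/ost0 are example values.
  DEV=/dev/mapper/ost0
  TIMEOUT=300
  waited=0
  while [ ! -b "$DEV" ]; do
      sleep 5
      waited=$((waited + 5))
      if [ "$waited" -ge "$TIMEOUT" ]; then
          echo "timed out waiting for $DEV" >&2
          exit 1
      fi
  done
  mount -t lustre "$DEV" /mnt/ost0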
We direct-connected to LSI back-ends, and the only real stumbling blocks
were in the initial setup phase, where we assigned WWNs to the various
hosts and host groups. The WWNs were a concatenation of the port IDs on
both ends, rather than just on the host end, so if we moved wires around we
had to re-assign WWNs to the hosts and host groups.

Failover was no problem; MMP worked correctly. IIRC, we had to specify
which devices used SRP and which used IPoIB, but that's about it. It's been
a while since we did it, so my memory may be flaky.

-Ben Evans
ben at terascala.com

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org
[mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Monday, March 28, 2011 1:21 AM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] IB storage as an OST target

Anybody had any experience using an IB-based storage target as an OST?

Apart from the obvious issue of separating the IB SAN (SRP/iSER) storage
traffic from the Lustre traffic, are there any issues?

What about failover?
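P.S. For what it's worth, the SRP half of "which devices used SRP" on a
stock OFED stack usually comes down to discovering targets and logging in
through a particular HCA port, roughly like the following (srp-mthca0-1 is
an example HCA/port name; your vendor tooling may wrap this step for you):

  # Print the visible SRP targets in the format the kernel expects, then
  # log in to each of them through the chosen HCA port:
  ibsrpdm -c
  ibsrpdm -c | while read tgt; do
      echo "$tgt" > /sys/class/infiniband_srp/srp-mthca0-1/add_target
  done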
On Mon, Mar 28, 2011 at 09:36:43AM -0400, Jason Hill wrote:
> Hi Brian,
>
> Here at ORNL we don't separate the IB SAN from the Lustre fabric

I misspoke yesterday. Our OSSes have a dual-port card used for storage and
a single-port card used for LNET. The storage has half of its ports
connected to the in-rack switch, and the other half are directly connected
to the OSS via one port on the dual-port card, with the other port going to
the switch. This is a legacy of risk mitigation during acceptance, but
since the storage traffic is local it does not traverse the uplinks to the
core network, so it does not congest or interfere with the LNET traffic.

As others have said, the things you have to be aware of are which failure
modes in your network affect your connections to the storage, and whether
any of the MPI traffic from a compute cluster could interfere with those
connections as well. I would definitely suggest putting your LNET and SRP
connections on different physical HCAs to keep the traffic at least
isolated on the OSS side.

--
-Jason
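P.S. On the OSS side the split is mostly a matter of which ports each layer
is told to use; a minimal sketch (interface names here are illustrative,
not our exact config) is to pin LNET to the single-port card and issue the
SRP logins through the storage HCA's ports:

  # /etc/modprobe.conf (or /etc/modprobe.d/) on the OSS: keep LNET's o2ib
  # traffic on ib0, the IPoIB interface of the single-port LNET card, so
  # it never rides the storage HCA.
  options lnet networks="o2ib0(ib0)"

The SRP logins then go through the other HCA's ports via srp_daemon or the
add_target interface shown earlier in the thread.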
chas williams - CONTRACTOR
2011-Mar-30 14:44 UTC
[Lustre-discuss] IB storage as an OST target
On Tue, 29 Mar 2011 09:23:59 -0400
Jason Hill <hilljj at ornl.gov> wrote:
> On Mon, Mar 28, 2011 at 09:36:43AM -0400, Jason Hill wrote:
> storage as well. I would definitely suggest putting your LNET and SRP
> connections on different physical HCAs to keep the traffic at least
> isolated on the OSS side.

i doubt this matters as much as it once did. pci/pci-x wasn't capable of
reading and writing at the same time due to its bus nature (even though
pci-x was point-to-point to the bridge chip). pci-express (and infiniband)
can read and write at the same time, so you can stream data in from srp and
stream it out via lnet at the same time on the same port.

a bigger concern might be the number of luns behind a single port on your
storage controller. most people have more osts/osses than ports on the
storage controllers.
On Wed, 2011-03-30 at 10:44 -0400, chas williams - CONTRACTOR wrote:
> On Tue, 29 Mar 2011 09:23:59 -0400
> Jason Hill <hilljj at ornl.gov> wrote:
>
> > On Mon, Mar 28, 2011 at 09:36:43AM -0400, Jason Hill wrote:
> > storage as well. I would definitely suggest putting your LNET and SRP
> > connections on different physical HCAs to keep the traffic at least
> > isolated on the OSS side.
>
> i doubt this matters as much as it once did. pci/pci-x wasn't capable
> of reading and writing at the same time due to its bus nature (even
> though pci-x was point-to-point to the bridge chip). pci-express (and
> infiniband) can read and write at the same time, so you can stream data
> in from srp and stream it out via lnet at the same time on the same
> port.

While it is true both are full duplex, there are also setup messages
flowing in both directions to set up the large transfers. In the past,
we've certainly seen problems at scale with small messages getting blocked
behind large bulk traffic on LNET. It would be interesting to see how much
self-interference is generated when running storage over the same HCA as
LNET, versus having them on separate NICs -- especially when we're maxing
out NIC capacity at peak demand. That experiment could give some good
guidance as to whether or not this is actually something to worry about.

> a bigger concern might be the number of luns behind a single port on
> your storage controller. most people have more osts/osses than ports
> on the storage controllers.

We've found that on the IB storage systems we have in production -- as
well as under test -- we can easily saturate the controller with 4 OSSes.
Each OSS is driving 7 OSTs -- 5 are needed to saturate bandwidth -- and
this has worked pretty well. It's not clear that having more OSSes than
needed brings a win, other than perhaps having more memory available for a
read cache. Adding more memory to the existing OSSes can also achieve
that, up to a certain economic bound, and at larger scales it isn't
completely clear that the cache is a win -- we tend to blow through it.
YMMV, of course.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
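P.S. To put rough numbers on that -- purely illustrative round figures,
not our measured rates: if a single OST stream sustains on the order of
300 MB/s, then 5 OSTs is roughly 1.5 GB/s, which is about what one IB link
carries for bulk data after overheads, so the link rather than the disks
becomes the limit on that OSS. Four OSSes at that rate then put you in the
6 GB/s class on the controller side, which is why a fifth OSS on the same
controller buys little. The exact crossover obviously moves with your
controller, link speed, and disk layout.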
Andrus, Brian Contractor
2011-Mar-31 09:17 UTC
[Lustre-discuss] IB storage as an OST target
Brian,

We do that here at NPS without much trouble - anymore ;)

One thing to watch: if you are using opensm for your subnet manager and the
latest kernels for Lustre (1.8.4/5 or 2.0), you need a newer opensm than
the one on the CentOS 5.5 disks. Get the one from the updates. It seems
there was a change in the ib_srp kernel module that made it incompatible
with the opensm on the same disk.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org
[mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Sunday, March 27, 2011 10:21 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] IB storage as an OST target

Anybody had any experience using an IB-based storage target as an OST?

Apart from the obvious issue of separating the IB SAN (SRP/iSER) storage
traffic from the Lustre traffic, are there any issues?

What about failover?
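P.S. Concretely, the check and update are just the following (package and
service names as shipped on stock CentOS 5.5; adjust if you run a vendor
OFED stack):

  # See which opensm build you currently have:
  rpm -q opensm

  # Pull the newer build from the updates repository and restart the SM:
  yum update opensm
  service opensmd restart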
chas williams - CONTRACTOR
2011-Mar-31 14:19 UTC
[Lustre-discuss] IB storage as an OST target
On Wed, 30 Mar 2011 12:00:04 -0400
David Dillow <dave at thedillows.org> wrote:
> While it is true both are full duplex, there are also setup messages
> flowing in both directions to set up the large transfers. In the past,
> we've certainly seen problems at scale with small messages getting
> blocked behind large bulk traffic on LNET. It would be interesting to
> see how much self-interference is generated when running storage over
> the same HCA as LNET, versus having them on separate NICs -- especially

i kind of gather that if your clients are doing a mix of i/o (reads and
writes) this is going to happen regardless. two hcas/ports give you a sort
of parallelism to help alleviate some of this congestion, but wouldn't a
faster port (say 40G) be just as helpful?
On Thu, 2011-03-31 at 02:17 -0700, Andrus, Brian Contractor wrote:
> Brian,
>
> We do that here at NPS without much trouble - anymore ;)
>
> One thing to watch: if you are using opensm for your subnet manager and
> the latest kernels for Lustre (1.8.4/5 or 2.0), you need a newer opensm
> than the one on the CentOS 5.5 disks. Get the one from the updates. It
> seems there was a change in the ib_srp kernel module that made it
> incompatible with the opensm on the same disk.

Actually, IIRC, the issue with that opensm version is that it improperly
includes an extra field in the path records, which confuses most SRP
targets on the market -- they think the initiator is coming from LID
0xffff, which is illegal. It doesn't matter which SRP initiator version
you use.

We had a fun time tracking that down as well...

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
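P.S. If you want to confirm whether you are hitting this, compare the LID
the OSS port actually has against whatever your storage controller reports
for the SRP initiator at login time. The host side of that check is just
something like the following (mlx4_0 is an example device name; use
whatever ibstat lists for your HCA):

  # Show the local HCA port state and LIDs on the OSS:
  ibstat mlx4_0 1 | grep -i lid

If the target side claims the login came from LID 0xffff while ibstat shows
a sane LID, you are almost certainly looking at this opensm bug.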
On Thu, 2011-03-31 at 10:19 -0400, chas williams - CONTRACTOR wrote:
> On Wed, 30 Mar 2011 12:00:04 -0400
> David Dillow <dave at thedillows.org> wrote:
>
> > While it is true both are full duplex, there are also setup messages
> > flowing in both directions to set up the large transfers. In the past,
> > we've certainly seen problems at scale with small messages getting
> > blocked behind large bulk traffic on LNET. It would be interesting to
> > see how much self-interference is generated when running storage over
> > the same HCA as LNET, versus having them on separate NICs -- especially
>
> i kind of gather that if your clients are doing a mix of i/o (reads and
> writes) this is going to happen regardless. two hcas/ports give you a
> sort of parallelism to help alleviate some of this congestion, but
> wouldn't a faster port (say 40G) be just as helpful?

That makes sense, but then if we're doing QDR ports, we'd probably be
matching that with QDR on the storage as well. If you have DDR on the
storage, then using a single QDR HCA would probably work; I'd still want
to test a bit before running with it to be sure.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office