Hi all!

I have the following setup:

* 4 x OSS servers with 2 x 10GbE
* 4 x clients with 2 x 10GbE (these act as NFS servers that re-export the filesystem to all other clients)

I would like to use multi-rail LNET over Ethernet instead of bonding, for performance reasons (if there are any to be had). I am using 192.168.110.x for one adapter and 192.168.111.x for the other.

I mount the filesystem on the clients as:

mount.lustre 192.168.110.11@tcp0,192.168.111.11@tcp1:/lustre /mnt/lustre

But all LNET traffic from every client goes to 192.168.110.11@tcp0 only, regardless of the current load on that NID. lctl ping works fine for both NIDs.

I know that I could mount 2 clients via tcp0 and the other 2 via tcp1, but I would like to use both interfaces on each client for performance reasons.

I've looked through the Lustre manual and only found multi-rail described for InfiniBand.

What is recommended these days for LNET over Ethernet with multiple adapters, where fault tolerance is less important than performance?

Thank you,
Alex.
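For reference, both interfaces are brought into LNET as two separate tcp networks via the lnet module options, roughly like this (eth0/eth1 are placeholders for the actual interface names):

# /etc/modprobe.d/lustre.conf -- sketch; interface names are examples
# tcp0 rides on the 192.168.110.x adapter, tcp1 on the 192.168.111.x adapter
options lnet networks="tcp0(eth0),tcp1(eth1)"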
Hi Alexander,

If I understand correctly, you have two subnets: 192.168.110.0/24 is for tcp0, and 192.168.111.0/24 is for tcp1. So you have to configure two LNET networks on the clients as well as on the servers. Could you please give the LNET configuration you have set up on your OSSes and clients?

What you have to understand is that multi-rail LNET provides static load balancing over several LNET networks. Because LNET is not able to choose between several available routes, you have to restrict each target to a specific network. This is the purpose of the '--network' option of mkfs.lustre.

In your case, you would format half of the OSTs of each OSS with '--network=tcp0', and the other half with '--network=tcp1'. This would make clients use tcp0 or tcp1 alternately, depending on the targets they communicate with. So if you write or read on all the OSTs at the same time, you would aggregate the performance of your two 10GbE links, on the clients and on the servers.

And finally, we really do not care about LNET traffic with the MGS: it is not what defines the networks that will be used for communication between the clients and the target servers.

HTH,
Sebastien.
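P.S. As a rough sketch of what that formatting could look like (fsname, MGS NID, indices and devices below are only examples, not taken from your setup):

# Sketch only -- adapt fsname, --mgsnode, --index and devices to your system.
# First half of the OSTs on an OSS pinned to tcp0:
mkfs.lustre --ost --fsname=lustre --mgsnode=192.168.110.11@tcp0 --index=0 --network=tcp0 /dev/sdb
# Second half pinned to tcp1:
mkfs.lustre --ost --fsname=lustre --mgsnode=192.168.110.11@tcp0 --index=1 --network=tcp1 /dev/sdc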
Hi Sébastien,

Please see my answers in-line.

On Tue, 18 Dec 2012 13:48:54 +0100, Sébastien Buisson wrote:

> If I understand correctly, you have two subnets: 192.168.110.0/24 is
> for tcp0, and 192.168.111.0/24 is for tcp1. So you have to configure
> two LNET networks on the clients as well as on the servers. Could you
> please give the LNET configuration you have set up on your OSSes and
> clients?

This is one of the OSSes:

[root@oss1 ~]# lctl list_nids
192.168.110.21@tcp
192.168.111.21@tcp1

This is one of the clients:

nid00030:~ # lctl list_nids
192.168.110.101@tcp
192.168.111.101@tcp1

> What you have to understand is that multi-rail LNET provides static
> load balancing over several LNET networks. Because LNET is not able
> to choose between several available routes, you have to restrict each
> target to a specific network. This is the purpose of the '--network'
> option of mkfs.lustre.
>
> In your case, you would format half of the OSTs of each OSS with
> '--network=tcp0', and the other half with '--network=tcp1'. This
> would make clients use tcp0 or tcp1 alternately, depending on the
> targets they communicate with. So if you write or read on all the
> OSTs at the same time, you would aggregate the performance of your
> two 10GbE links, on the clients and on the servers.

Thank you for the clear explanation. I will try the --network option distributed over the OSTs and get back to you. By the way, do you know whether this setup performs better than simply bonding the NICs?

> And finally, we really do not care about LNET traffic with the MGS:
> it is not what defines the networks that will be used for
> communication between the clients and the target servers.

Does this mean that it is enough to mount like this?

mount.lustre 192.168.110.11@tcp0:/lustre /mnt/lustre

Thank you,
Alex.
Hi,

Le 18/12/2012 19:27, Alexander Oltu a écrit :
> This is one of the OSSes:
>
> [root@oss1 ~]# lctl list_nids
> 192.168.110.21@tcp
> 192.168.111.21@tcp1
>
> This is one of the clients:
>
> nid00030:~ # lctl list_nids
> 192.168.110.101@tcp
> 192.168.111.101@tcp1

Everything seems fine.

> Thank you for the clear explanation. I will try the --network option
> distributed over the OSTs and get back to you. By the way, do you know
> whether this setup performs better than simply bonding the NICs?

I have no experience in doing multi-rail on Ethernet, sorry. The principle is exactly the same as for InfiniBand, but as InfiniBand interfaces cannot be bonded (except for IPoIB, which is not of interest when considering performance), I cannot tell.

> Does this mean that it is enough to mount like this?
>
> mount.lustre 192.168.110.11@tcp0:/lustre /mnt/lustre

Absolutely. Adding a second NID only serves a high-availability purpose when establishing the initial connection.

Cheers,
Sebastien.
> I have no experience in doing multi-rail on Ethernet, sorry. The
> principle is exactly the same as for InfiniBand, but as InfiniBand
> interfaces cannot be bonded (except for IPoIB, which is not of
> interest when considering performance), I cannot tell.

It looks like the --network option is not helping. I unmounted the OSTs and ran, for half of the OSTs:

tunefs.lustre --network=tcp0

and for the other half:

tunefs.lustre --network=tcp1

Then I mounted the OSTs back. The client still sends all traffic through tcp0. I suspect the --network option is meant to separate different network types, like tcp, o2ib, etc.

I will probably go the bonding way.

Thank you for your help,
Alex.
On 12/19/12 2:03 PM, "Alexander Oltu" <Alexander.Oltu@uni.no> wrote:

> It looks like the --network option is not helping. I unmounted the OSTs
> and ran, for half of the OSTs:
>
> tunefs.lustre --network=tcp0
>
> and for the other half:
>
> tunefs.lustre --network=tcp1
>
> Then I mounted the OSTs back. The client still sends all traffic
> through tcp0. I suspect the --network option is meant to separate
> different network types, like tcp, o2ib, etc.
>
> I will probably go the bonding way.

For Ethernet, bonding is the preferred method. Use the standard method, and then point Lustre at the 'bond' interface.

The modprobe option 'networks=' is used to bind hardware interfaces; for kernel bonding you would normally use:

options lnet networks=tcp(bond0)

See http://wiki.lustre.org/manual/LustreManual20_HTML/SettingUpBonding.html#50438258_99571 for more (perhaps a little old) info.

cliffw
If you go back to the bonding solution, you may have to be careful about, and aware of, the bonding mode/policy available and configured on all ends (including on intermediate switches, for some modes)!
Le 19/12/2012 23:03, Alexander Oltu a écrit :
> It looks like the --network option is not helping. I unmounted the OSTs
> and ran, for half of the OSTs:
>
> tunefs.lustre --network=tcp0
>
> and for the other half:
>
> tunefs.lustre --network=tcp1
>
> Then I mounted the OSTs back. The client still sends all traffic
> through tcp0.

Maybe your commands lack some '--erase-params' or '--writeconf' options... You should try reformatting your targets.

> I suspect the --network option is meant to separate different
> network types, like tcp, o2ib, etc.

No, it is not. We use it to split targets between 2 InfiniBand interfaces.

Regards,
Sebastien.
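P.S. A sketch of what the full tunefs.lustre invocations might look like; the device names and MGS NID below are examples only. Note that '--erase-params' wipes all stored parameters, so '--mgsnode' (and anything else you rely on) has to be given again, and '--writeconf' regenerates the configuration logs the next time the targets are mounted:

# Sketch only -- repeat per OST, adjust devices and NIDs to your system.
tunefs.lustre --erase-params --mgsnode=192.168.110.11@tcp0 --network=tcp0 --writeconf /dev/sdb
tunefs.lustre --erase-params --mgsnode=192.168.110.11@tcp0 --network=tcp1 --writeconf /dev/sdc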
On 19/12/12 23:42, White, Cliff wrote:
> For Ethernet, bonding is the preferred method. Use the standard method,
> and then point Lustre at the 'bond' interface.

One point worth remembering with bonding is that the onboard Ethernet cards (at least on the machine we bought) have adjacent MAC addresses. With mode 4 (LACP) bonding and the default Linux (layer2) hashing algorithm, the traffic is likely to be unevenly distributed. We ended up using xmit_hash_policy=layer3+4 to distribute traffic more evenly between the cards. We didn't perform detailed tests on whether layer2+3 would have been enough, or how much difference the switch (which I suspect uses layer2+3) made.

We subsequently upgraded to 10Gig, which has made this moot for a while.

Chris
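For illustration, a bonding setup along those lines might look roughly like this; where the options go differs per distribution (modprobe options as shown here, or BONDING_OPTS in ifcfg-bond0 on Red Hat style systems), and the interface name is just an example:

# /etc/modprobe.d/bonding.conf -- sketch only
# mode=802.3ad is LACP ("mode 4"); layer3+4 hashing spreads flows across the slaves
options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4

# LNET then sits on the bonded interface, as Cliff showed:
options lnet networks=tcp(bond0)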
On Mon, 07 Jan 2013 14:45:02 +0100, Sébastien Buisson wrote:

> Maybe your commands lack some '--erase-params' or '--writeconf'
> options... You should try reformatting your targets.

Sébastien, you are right. I did --erase-params and --writeconf, and the traffic is now distributed the way I wanted.

Eventually I came back from bonding (mode6 + xmit_hash_policy:layer3+4) to multi-rail, because I can squeeze out an extra 20% of performance.

The problem is that I am using Lustre routers to provide Lustre to a single network, and apparently this is not quite supported. I tried:

routes="gni0 xxx.xxx.110.xxx@tcp0 xxx.xxx.111.xxx@tcp1"

and I am getting:

LustreError: 5598:0:(router.c:399:lnet_check_routes()) Routes to gni via xxx.xxx.111.xxx@tcp1 and xxx.xxx.110.xxx@tcp not supported

I checked the manual and it says (p. 301): "It is an error to specify routes to the same destination with routers on different local networks."

So I wonder how you export your Lustre through multi-rail?

Thank you,
Alex.
On Wed, Jan 23, 2013 at 02:10:54PM +0100, Alexander Oltu wrote:
> routes="gni0 xxx.xxx.110.xxx@tcp0 xxx.xxx.111.xxx@tcp1"
>
> and I am getting:
>
> LustreError: 5598:0:(router.c:399:lnet_check_routes()) Routes to gni
> via xxx.xxx.111.xxx@tcp1 and xxx.xxx.110.xxx@tcp not supported
>
> I checked the manual and it says (p. 301): "It is an error to specify
> routes to the same destination with routers on different local
> networks."

This restriction came from an old routing implementation. I believe it can now be removed. Please report a bug at http://jira.whamcloud.com/ and CC me on it - it should not be hard to fix.

Thanks,
Isaac
Hi,

Le 23/01/2013 14:10, Alexander Oltu a écrit :
> So I wonder how you export your Lustre through multi-rail?

In our particular configuration, we have 2 InfiniBand interfaces per server, so 2 LNET networks, say o2ib0 and o2ib1. For the routing, we dedicated half of the routers' IB interfaces to routing o2ib0 to the outside, and the other half to routing o2ib1 to the outside. That way, we comply with the limitation that routers serving the same routing purpose must be on the same local network.

In your case, I think it would mean:

routes="gni0 xxx.xxx.110.xxx@tcp0 \
        gni1 xxx.xxx.111.xxx@tcp1"

HTH,
Sebastien.
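P.S. Combined with the two tcp networks you already have, the lnet module options on the tcp side would then look roughly like this (a sketch only; the interface names are placeholders and the xxx addresses stand for your real router addresses):

# Sketch only -- interface names and router addresses are placeholders.
options lnet networks="tcp0(eth0),tcp1(eth1)" \
        routes="gni0 xxx.xxx.110.xxx@tcp0 \
                gni1 xxx.xxx.111.xxx@tcp1"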
On Thu, 24 Jan 2013 17:29:00 +0100, Sébastien Buisson wrote:

> In your case, I think it would mean:
>
> routes="gni0 xxx.xxx.110.xxx@tcp0 \
>         gni1 xxx.xxx.111.xxx@tcp1"

It looks like this could be a workaround until multi-net routing becomes possible. I can just add gni1 to all nodes.

Does anyone know how to add an additional network to LNET without reloading the kernel module?

I looked in 'lctl help' and tried some googling - no results; "/sys/module/lnet/parameters/networks" is read-only.

Thanks,
Alex.
You cannot add additional NIDs to a node without reloading the kernel modules. You can, however, add routes via lctl:

lctl --net tcp1 add_route <nid>

-Marc

----
D. Marc Stearman
Lustre Operations Lead
stearman2@llnl.gov
925.423.9670
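To make Marc's syntax concrete for this thread: on a tcp-side node, a route to the gni network via a router reachable on the local tcp0 network could presumably be added like this (a sketch with the placeholder addresses from earlier; the argument to --net is the remote network the route leads to, and the NID is the gateway on a directly connected local network):

# Sketch only -- placeholder router NID from earlier in the thread.
lctl --net gni0 add_route xxx.xxx.110.xxx@tcp0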
On Mon, Jan 28, 2013 at 04:23:37PM +0100, Alexander Oltu wrote:
> It looks like this could be a workaround until multi-net routing
> becomes possible. I can just add gni1 to all nodes.
>
> Does anyone know how to add an additional network to LNET without
> reloading the kernel module?

Currently it can't be done without reloading the kernel modules. However, the LNet Dynamic Control project is close to completion, which will make it possible to do something similar to ifdown/ifup without reloading kernel modules.

- Isaac