Hendelman, Rob
2008-Oct-21 16:41 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hello everyone,

I'd like to dedicate 1 gigabit Ethernet interface per OST. What is the way to do this?

My OSTs are fairly slow (RAID 10, 4 SATA drives each) by enterprise standards, but can still sustain 80-90 MB/s on large sequential reads. I imagine that if I striped a file over 3 OSTs, I would hit speed limitations using GigE.

What I'd like to do is have multiple GigE NICs connected to a (dumb) switch to start with. I'd put each GigE NIC in a separate /28. My first OSS box has 3 OSTs (12 drives) and 3 GigE NICs. My MDS/MGS has a single GigE NIC. I have assigned 3 IPs (on different /28s) in Linux to the GigE interface on the MDS/MGS box, and 1 IP (also on different /28s) to each interface on the OSS server. The client (only 1 to start) is also going to have 3 GigE NICs, with 1 IP on each subnet.

First off, what does modprobe.conf look like on the MGS/MDS? Should it be something like:

  options lnet networks=tcp0(eth0),tcp1(eth0:0),tcp2(eth0:1)

Should the modprobe.conf on the OSS nodes and clients be something like:

  options lnet networks=tcp0(eth0),tcp1(eth1),tcp2(eth2)

Can I then format each OST with the MGS/MDS parameter pointing to tcp0, tcp1, or tcp2 to force it to use a specific GigE interface?

I have been reading through the Lustre 1.6 operations manual but have not gained a clear understanding of exactly how to set this up so it works properly. I am running Lustre 1.6.5.1 on CentOS 5.1 with the RPMs provided by the Sun website.

Thanks,

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
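P.S. To make the last question concrete, here is the sort of thing I imagine when formatting the OSTs (untested; the filesystem name, MGS addresses, and device names are placeholders):

  # on the OSS, pointing each OST at the MGS over a different LNET network:
  mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.1@tcp0 /dev/sdb
  mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.1.1@tcp1 /dev/sdc
  mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.2.1@tcp2 /dev/sdd

Is that the right idea, or does the NID in --mgsnode not actually constrain which interface the traffic uses?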
Brian J. Murrell
2008-Oct-21 16:55 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Tue, 2008-10-21 at 11:41 -0500, Hendelman, Rob wrote:
> Hello everyone,

Hi,

> I'd like to dedicate 1 gigabit Ethernet interface per OST.

So to be clear, you have multiple OSTs and multiple GigE interfaces in each OSS? I wonder why you want to (effectively partition the aggregate network bandwidth of all your GigE NICs and) dedicate bandwidth on a per-OST basis. Why not create a large pool of bandwidth on each OSS, usable by all clients and all OSTs, and just let demand sort it out? As long as your network is not the bottleneck, things should just work out.

> My OSTs are fairly slow (RAID 10, 4 SATA drives each) by
> enterprise standards, but can still sustain 80-90 MB/s on large
> sequential reads. I imagine that if I striped a file over 3 OSTs,
> I would hit speed limitations using GigE.

A single GigE? Yes. A single GigE link has a theoretical maximum of 125 MB/s (1 Gbit/s divided by 8 bits per byte), so two of your OSTs at 80-90 MB/s each would already saturate it.

> What I'd like to do is have multiple GigE NICs connected to a (dumb)
> switch to start with. I'd put each GigE NIC in a separate /28. My
> first OSS box has 3 OSTs (12 drives) and 3 GigE NICs. My MDS/MGS has
> a single GigE NIC.

But why not just keep it all on the same network and bond the GigE NICs to increase your total per-OSS network bandwidth (i.e. present them as if they were a single interface), and ensure it's always more than the total disk bandwidth of your OSTs? Sounds like you are making it a _lot_ more complicated than it needs to be.

b.
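P.S. Roughly what I mean, as a sketch (RHEL/CentOS-style config files; the addresses are placeholders, and mode 802.3ad assumes a switch that supports link aggregation):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=802.3ad miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=10.0.0.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (same for eth1, eth2)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none

  # LNET then only needs to know about the one logical interface:
  options lnet networks=tcp0(bond0)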
Hendelman, Rob
2008-Oct-21 17:15 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
I was under the impression that bonding NICs required a managed switch to support this. I do not have a managed switch available. If this is not required, then I will look into this.

Thanks,

Robert
Brian J. Murrell
2008-Oct-21 17:23 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Tue, 2008-10-21 at 12:15 -0500, Hendelman, Rob wrote:
> I was under the impression that bonding NICs required a managed switch
> to support this.

It does require a switch that supports link aggregation, yes. Sorry, I overlooked that you only had a dumb switch.

b.
Andreas Dilger
2008-Oct-21 18:47 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Oct 21, 2008 11:41 -0500, Hendelman, Rob wrote:
> I have assigned 3 IPs (on different /28s) in Linux to the GigE
> interface on the MDS/MGS box, and 1 IP (also on different /28s) to each
> interface on the OSS server. The client (only 1 to start) is also going
> to have 3 GigE NICs, with 1 IP on each subnet.
>
> First off, what does modprobe.conf look like on the MGS/MDS? Should it
> be something like:
>
>   options lnet networks=tcp0(eth0),tcp1(eth0:0),tcp2(eth0:1)
>
> Should the modprobe.conf on the OSS nodes and clients be something like:
>
>   options lnet networks=tcp0(eth0),tcp1(eth1),tcp2(eth2)
>
> Can I then format each OST with the MGS/MDS parameter pointing to tcp0,
> tcp1, or tcp2 to force it to use a specific GigE interface?

You can use the "ip2nets" option to keep the configuration the same on all of the nodes, instead of explicitly specifying the mapping. This allows mapping IP subnets to Lustre networks regardless of what the interface names are. This is in section 7.1.1 of the manual.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
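P.S. A sketch of what that would look like, with placeholder subnets standing in for your three /28s (see section 7.1.1 for the exact pattern syntax):

  # the same /etc/modprobe.conf line on the MDS, OSSes, and clients;
  # each node joins whichever tcpN network(s) its own IPs fall into:
  options lnet 'ip2nets="tcp0 10.0.0.*; tcp1 10.0.1.*; tcp2 10.0.2.*"'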
Steden Klaus
2008-Oct-21 21:17 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hi Rob,

I'm not sure on this point (but Andreas no doubt will know), but you might be able to get more overall bandwidth with a single logical network interface if you use a different aggregation mode, such as adaptive load balancing, instead of LACP. ALB and TLB modes will still operate on a dumb switch (at least, they operate without a configuration change on intelligent switches). As long as Lustre doesn't care about the internal details of 'bond0', you should be able to set it up with ALB on a dumb switch and use 'bond0' as your interface.

Just a thought.

cheers,
Klaus
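P.S. Concretely, the only difference from an LACP-style bond would be the mode; a sketch (untested with Lustre, names as before):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-alb miimon=100

  # LNET still just sees the one logical interface:
  options lnet networks=tcp0(bond0)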
Joe Georger
2008-Oct-22 12:00 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Even for bonding mode 6? From the kernel bonding driver documentation:

  balance-alb or 6
    Adaptive load balancing: includes balance-tlb plus receive load
    balancing (rlb) for IPv4 traffic, and does not require any special
    switch support. The receive load balancing is achieved by ARP
    negotiation. The bonding driver intercepts the ARP Replies sent by
    the local system on their way out and overwrites the source hardware
    address with the unique hardware address of one of the slaves in the
    bond, such that different peers use different hardware addresses for
    the server.

    Receive traffic from connections created by the server is also
    balanced. When the local system sends an ARP Request, the bonding
    driver copies and saves the peer's IP information from the ARP
    packet. When the ARP Reply arrives from the peer, its hardware
    address is retrieved and the bonding driver initiates an ARP reply
    to this peer, assigning it to one of the slaves in the bond.

    A problematic outcome of using ARP negotiation for balancing is that
    each time an ARP request is broadcast, it uses the hardware address
    of the bond. Hence, peers learn the hardware address of the bond and
    the balancing of receive traffic collapses to the current slave. This
    is handled by sending updates (ARP Replies) to all the peers with
    their individually assigned hardware address, such that the traffic
    is redistributed. Receive traffic is also redistributed when a new
    slave is added to the bond and when an inactive slave is
    re-activated. The receive load is distributed sequentially (round
    robin) among the group of highest-speed slaves in the bond.

    When a link is reconnected or a new slave joins the bond, the receive
    traffic is redistributed among all active slaves in the bond by
    initiating ARP Replies with the selected MAC address to each of the
    clients. The updelay parameter must be set to a value equal to or
    greater than the switch's forwarding delay, so that the ARP Replies
    sent to the peers will not be blocked by the switch.

    Prerequisites:
    1. Ethtool support in the base drivers for retrieving the speed of
       each slave.
    2. Base driver support for setting the hardware address of a device
       while it is open. This is required so that there will always be
       one slave in the team using the bond hardware address (the
       curr_active_slave) while having a unique hardware address for
       each slave in the bond. If the curr_active_slave fails, its
       hardware address is swapped with the new curr_active_slave that
       was chosen.

Brian J. Murrell wrote:
> On Tue, 2008-10-21 at 12:15 -0500, Hendelman, Rob wrote:
>> I was under the impression that bonding NICs required a managed
>> switch to support this.
>
> It does require a switch that supports link aggregation, yes. Sorry,
> I overlooked that you only had a dumb switch.
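P.S. Prerequisite 1, at least, is easy to verify per slave (the sample output is what I'd expect from a GigE NIC; prerequisite 2 depends on the base driver):

  # run on each slave interface:
  ethtool eth0 | grep -i speed
  # Speed: 1000Mb/s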
Steden Klaus
2008-Oct-22 19:22 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hi Joe,

If you want to use a true link aggregation protocol such as LACP or Cisco's EtherChannel, you'll need a managed switch that supports the protocol as well (and a Cisco switch at that, in the case of EtherChannel). Both partners in a link aggregate must be aware of the aggregate, and ports that are not part of the aggregate cannot be connected to it, and vice-versa (although typically a switch will simply stop forwarding if it detects agg links on non-agg ports, or non-agg links on agg ports).

In the case of ALB, the uplink switch does not need to be made aware that there's an aggregate, since the kernel that manages the aggregate will transparently remap everything. Peers will simply notice that a given IP is now associated with a new MAC address and update their ARP caches. This can be noisy in switch logs, but on a dumb switch, nobody's the wiser.

My gut tells me this works, even with a dumb switch. The only way to know for sure, though, is simply to test it out yourself. I would try it out with a pair of dumb switches connected together rather than putting it directly on your network. If STP is active, plugging in this configuration may shut down your whole network if STP thinks it found a loop, so test it out in a sandbox before you go live.

hth,
Klaus
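P.S. While testing in the sandbox, the bonding driver's procfs entry will show whether the mode and slaves came up as intended (paths are standard for the in-kernel driver):

  # bonding mode, MII status, and current active slave:
  cat /proc/net/bonding/bond0

  # rough check that traffic is actually spread across the slaves:
  grep -E "eth[0-2]" /proc/net/dev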