Hendelman, Rob
2008-Oct-21  16:41 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hello everyone,

I'd like to dedicate 1 gigabit Ethernet interface per OST. What is the way to do this?

My OSTs are fairly slow (RAID 10, 4 SATA drives each) by enterprise standards, but can still sustain 80-90 MB/second on large sequential reads. I imagine if I striped a file over 3 OSTs I would hit speed limitations using GigE.

What I'd like to do is have multiple GigE NICs connected to a (dumb) switch to start with. I'd put each GigE NIC in a separate /28. My first OSS box has 3 OSTs (12 drives) and 3 GigE NICs. My MDS/MGS has a single GigE NIC.

I have assigned 3 IPs (on different /28s) in Linux to the GigE interface on the MDS/MGS box, and 1 IP (each on a different /28) to each interface on the OSS server. The client (only 1 to start) will also have 3 GigE NICs, with 1 IP on each subnet.

First off, what does modprobe.conf look like on the MGS/MDS? Should it be something like:

    options lnet networks=tcp0(eth0),tcp1(eth0:0),tcp2(eth0:1)

Should the modprobe.conf on the OSS/clients be something like:

    options lnet networks=tcp0(eth0),tcp1(eth1),tcp2(eth2)

Can I then format each OST with the MGS/MDS parameter pointing to tcp0, tcp1, or tcp2 to force it to use a specific GigE interface?

I have been reading through the Lustre 1.6 operations manual but have not gained a clear understanding of exactly how to set this up so it works properly. I am running Lustre 1.6.5.1 on CentOS 5.1 with the RPMs provided by the Sun website.

Thanks,

Robert Hendelman Jr
Magnetar Capital LLC
Rob.Hendelman at magnetar.com
1-847-905-4557
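P.S. For concreteness, here is roughly what I mean by pointing each OST at a specific network (the hostnames, addresses, and devices below are made up):

    # hypothetical: the MGS is reachable as 10.0.0.17@tcp0, 10.0.0.33@tcp1, 10.0.0.49@tcp2
    mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.17@tcp0 /dev/sdb
    mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.33@tcp1 /dev/sdc
    mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.49@tcp2 /dev/sdd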
Brian J. Murrell
2008-Oct-21  16:55 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Tue, 2008-10-21 at 11:41 -0500, Hendelman, Rob wrote:
> Hello everyone,

Hi,

> I'd like to dedicate 1 gigabit Ethernet interface per OST.

So to be clear, you have multiple OSTs and multiple GigE interfaces in each OSS? I wonder why you want to (effectively partition the aggregate network bandwidth of all your GigE NICs and) dedicate bandwidth on a per-OST basis. Why not create a large pool of bandwidth on each OSS, usable by all clients and all OSTs, and just let demand sort it out? As long as your network is not the bottleneck, things should just work out.

> My OSTs are fairly slow (RAID 10, 4 SATA drives each) by
> enterprise standards, but can still sustain 80-90 MB/second on large
> sequential reads. I imagine if I striped a file over 3 OSTs I would
> hit speed limitations using GigE.

A single GigE? Yes. A single GigE has a theoretical maximum of 125 MB/s, so two of your OSTs would saturate it.

> What I'd like to do is have multiple GigE NICs connected to a (dumb)
> switch to start with. I'd put each GigE NIC in a separate /28. My
> first OSS box has 3 OSTs (12 drives) and 3 GigE NICs. My MDS/MGS has
> a single GigE NIC.

But why not just keep it all on the same network and bond the GigE NICs to increase your total per-OSS network bandwidth (i.e. treat them as a single interface), making sure it is always more than the total disk bandwidth of your OSTs? It sounds like you are making this a _lot_ more complicated than it needs to be.

b.
Hendelman, Rob
2008-Oct-21  17:15 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
I was under the impression that bonding NICs required a managed switch to support this. I do not have a managed switch available. If this is not required, then I will look into it.

Thanks,

Robert
Brian J. Murrell
2008-Oct-21  17:23 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Tue, 2008-10-21 at 12:15 -0500, Hendelman, Rob wrote:
> I was under the impression that bonding NICs required a managed switch
> to support this.

It does require a switch that supports link aggregation, yes. Sorry, I overlooked that you only had a dumb switch.

b.
Andreas Dilger
2008-Oct-21  18:47 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
On Oct 21, 2008 11:41 -0500, Hendelman, Rob wrote:
> I have assigned 3 IPs (on different /28s) in Linux to the GigE
> interface on the MDS/MGS box, and 1 IP (each on a different /28) to
> each interface on the OSS server. The client (only 1 to start) will
> also have 3 GigE NICs, with 1 IP on each subnet.
>
> First off, what does modprobe.conf look like on the MGS/MDS? Should it
> be something like:
>
>     options lnet networks=tcp0(eth0),tcp1(eth0:0),tcp2(eth0:1)
>
> Should the modprobe.conf on the OSS/clients be something like:
>
>     options lnet networks=tcp0(eth0),tcp1(eth1),tcp2(eth2)
>
> Can I then format each OST with the MGS/MDS parameter pointing to tcp0,
> tcp1, or tcp2 to force it to use a specific GigE interface?

You can use the "ip2nets" option to keep the configuration the same on all of the nodes, instead of explicitly specifying the mapping. This allows mapping IP subnets to Lustre networks regardless of what the interface names are. This is in section 7.1.1 of the manual.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
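P.S. A minimal ip2nets sketch that could go unchanged into modprobe.conf on every node (the address ranges below are placeholders standing in for your three /28s):

    options lnet 'ip2nets="tcp0 10.0.0.[17-30]; tcp1 10.0.0.[33-46]; tcp2 10.0.0.[49-62]"'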
Steden Klaus
2008-Oct-21  21:17 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hi Rob,

I'm not sure on this point (though Andreas no doubt will know), but you might be able to get more overall bandwidth with a single logical network interface if you use a different aggregation mode, such as adaptive load balancing, instead of LACP. ALB and TLB modes will still operate on a dumb switch (at least, they operate without a configuration change on intelligent switches).

As long as Lustre doesn't care about the internal details of 'bond0', you should be able to set it up with ALB on a dumb switch and use 'bond0' as your interface.

Just a thought.

cheers,
Klaus
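P.S. A rough, untested sketch of the CentOS 5 pieces involved (device names and the address are placeholders; the last modprobe line assumes you want LNet to ride on the bond):

    # /etc/modprobe.conf -- load the bonding driver in adaptive load balancing mode
    alias bond0 bonding
    options bonding mode=balance-alb miimon=100
    # point LNet at the bonded interface instead of the individual NICs
    options lnet networks=tcp0(bond0)

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=10.0.0.17        # placeholder address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1 and eth2)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none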
Joe Georger
2008-Oct-22  12:00 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Even for bonding mode 6?
    * balance-alb or 6
          Adaptive load balancing: includes balance-tlb plus receive
          load balancing (rlb) for IPv4 traffic, and does not require
          any special switch support. The receive load balancing is
          achieved by ARP negotiation.
    The bonding driver intercepts the ARP Replies sent by the local
    system on their way out and overwrites the source hardware address
    with the unique hardware address of one of the slaves in the bond
    such that different peers use different hardware addresses for the
    server. 
    Receive traffic from connections created by the server is also
    balanced. When the local system sends an ARP Request the bonding
    driver copies and saves the peer's IP information from the ARP
    packet.
    When the ARP Reply arrives from the peer, its hardware address is
    retrieved and the bonding driver initiates an ARP reply to this peer
    assigning it to one of the slaves in the bond. 
    A problematic outcome of using ARP negotiation for balancing is that
    each time that an ARP request is broadcast it uses the hardware
    address of the bond. Hence, peers learn the hardware address of the
    bond and the balancing of receive traffic collapses to the current
    slave. This is handled by sending updates (ARP Replies) to all the
    peers with their individually assigned hardware address such that
    the traffic is redistributed. Receive traffic is also redistributed
    when a new slave is added to the bond and when an inactive slave is
    re-activated. The receive load is distributed sequentially (round
    robin) among the group of highest speed slaves in the bond. 
    When a link is reconnected or a new slave joins the bond the receive
    traffic is redistributed among all active slaves in the bond by
    initiating ARP Replies with the selected mac address to each of the
    clients. The updelay parameter (detailed below) must be set to a
    value equal to or greater than the switch's forwarding delay so that
    the ARP Replies sent to the peers will not be blocked by the switch. 
        * Prerequisites:
             1. Ethtool support in the base drivers for retrieving the
                speed of each slave.
             2. Base driver support for setting the hardware address of
                a device while it is open. This is required so that
                there will always be one slave in the team using the
                bond hardware address (the curr_active_slave) while
                having a unique hardware address for each slave in the
                bond. If the curr_active_slave fails its hardware
                address is swapped with the new curr_active_slave that
                was chosen.
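As a quick check of the first prerequisite on each slave NIC (generic Linux, nothing Lustre-specific):

    # should report the negotiated link speed, e.g. "Speed: 1000Mb/s"
    ethtool eth0 | grep -i speed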
Brian J. Murrell wrote:
> On Tue, 2008-10-21 at 12:15 -0500, Hendelman, Rob wrote:
>> I was under the impression that bonding NICs required a managed switch
>> to support this.
>
> It does require a switch that supports link aggregation, yes. Sorry, I
> overlooked that you only had a dumb switch.
>
> b.
Steden Klaus
2008-Oct-22  19:22 UTC
[Lustre-discuss] Lnet configuration: 1 ost per gige interface.
Hi Joe,
If you want to use a true link aggregation protocol such as LACP or
Cisco's EtherChannel, you'll need a managed switch that supports the
protocol as well (and a Cisco switch at that, in the case of
EtherChannel). Both partners in a link aggregate must be aware of the
aggregate, and ports that are not part of the aggregate cannot be
connected to it, and vice versa (although typically a switch will simply
stop forwarding if it detects agg links on non-agg ports, or non-agg
links on agg ports).
In the case of ALB, the uplink switch does not need to be made aware that
there's an aggregate, since the kernel that manages the aggregate will
transparently remap everything. The switch will simply notice that a
given IP is now associated with a new MAC address and update its ARP
cache. This can be noisy in switch logs, but on a dumb switch, nobody's
the wiser. My gut tells me this works, even with a dumb switch.
The only way to know for sure, though, is simply to test it out yourself. I
would try it out with a pair of dumb switches connected together rather than
putting it directly on your network. If STP is active, plugging in this
configuration may shut down your whole network if STP thinks it found a loop, so
test it out in a sandbox before you go live.
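One easy sanity check while testing (assumes the bond device is named bond0):

    # shows the bonding mode, MII status, and the state of each slave
    cat /proc/net/bonding/bond0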
hth,
Klaus