Duane, Scott,

> Duane Cloud wrote:
>
> Is each interface identified by an "lnet_ni" structure, with its own
> NID, driver, etc, similar to how interfaces are presented to IP? In
> other words, each remote node would be identified by X NIDs, which
> map to specific hardware interfaces, where X represents the number
> of interfaces (e.g. X = 4, if 2 IP interfaces and 2 IB interfaces
> are present). So which NIC is used would be identified by the
> lnet_ni associated with that peer's NID...And control of which NIC
> to use would be via which NID is used when talking to that peer.

Precisely, if by "interface" you mean "LNET interface". But the
socklnd confuses the issue by allowing a single LNET interface (= NID)
to span a number of hardware NICs. In this case, the NID is taken
from the "first" NIC.

> Or is each LND identified by one lnet_ni and each implements its
> own decision logic to control multiple interfaces if/when present?

This was the portals (== pre-LNET) model of lustre networking.

> I see where there's been some performance measurements using
> "socknal load balancing" and was wondering if there's some detailed
> information available as to how it works.

Only the socknal (now called the socklnd) was coded to load balance
over multiple NICs; see below...

> Any thoughts floating around about assigning one NID to each peer
> and then having a "list of paths" that can be used to get to that
> peer (e.g. peer NID001 can be reached using the SOCKLND with IP
> address "a.b.c.d", or using the MXLND with address "whatever is
> meaningful to it")?

Yes, this was how portals was configured. Information was specified
in 'lmc' scripts and configured by 'lconf' at runtime. But this
method of configuring lustre was becoming unwieldy. O(#nodes) network
configuration data was required to make these associations, and
routed configurations (i.e. where there are different physical
networks) were growing O(#nodes**2) rather than O(#nets**2).

So we decided to simplify lustre configuration by retiring lmc/lconf
and making network and filesystem configuration separate. This
allowed a single site-wide network configuration, configured by
module parameters, to provide connectivity to multiple lustre
filesystems, configured using the lustre mount command.

At this time we changed the "flat" portals NID into the LNET
<address-within-network>@<network> NID so that the routing table on
each node grows with the number of networks rather than the number of
nodes. Furthermore, we decided that <address-within-network> should
be determined by the underlying transport, so that once you have an
LNET NID, you know right away how to get there without needing any
additional configuration information.

However we still had the "legacy" requirement that the socklnd
support multiple NICs. We included that, but with the assumption
that there is full connectivity between all NICs assigned to the same
LNET network. For example...

  options lnet ip2nets="tcp(eth1,eth2) 192.168.1.*"

...says all nodes in my cluster have 2 NICs, their NIDs are
<IP address of eth1>@tcp, and when they connect, they do so via both
NICs. However this doesn't handle the case where a server has
multiple NICs but clients only have a single NIC e.g...

  options lnet ip2nets="tcp(eth1) 192.168.[1-2].[100-254]; #clients \
                        tcp(eth1,eth2) 192.168.[1-2].[1-99]; #servers"

...which says that my servers have 2 NICs on the 192.168.1.0/24 and
192.168.2.0/24 subnets, but my clients only have one NIC on either
one or the other subnet.
In this configuration, ALL clients will connect to the servers' eth1
IP addresses (i.e. the IP address that determines their NID), but
create no further connections, since they are already connected on
all their (count 'em... 1) NICs. In this case we could use IP routes
on the subnet 2 clients to ensure they can ping the subnet 1 server
addresses, or use 2 LNET networks to make the required connectivity
explicit...

  options lnet ip2nets="tcp0(eth1) 192.168.1.[100-254]; #subnet 1 clients \
                        tcp1(eth1) 192.168.2.[100-254]; #subnet 2 clients \
                        tcp0(eth1) 192.168.1.[1-99]; #servers \
                        tcp1(eth2) 192.168.2.[1-99]; #servers"

...so subnet 1 clients connect using the tcp0 server NIDs, and subnet
2 clients connect using the tcp1 server NIDs.

> Scott Atchley wrote:
>
> Can other LNDs handle multiple NICs of the same type (e.g. IB or
> Quadrics)? If so, how does LNET specify which NIC an instance uses?

No, not in the way you think; i.e. the socklnd is the only one that
you can configure with multiple NICs. But since the linux bonding
driver appears to have made sufficient progress regarding
load-balancing algorithms and has the upside of transparent
fault-tolerance, we'd now rather recommend it.

The Quadrics kernel API abstracts multiple NICs, which allows the
qswlnd to use them all.

The RapidArray LND (ralnd) "knows" that it may have one or 2 NICs and
uses either by XORing its own and the peer's NID.

All other LNDs only handle a single interface.

>> I have two cases that I want to make sure that I handle correctly.
>> Both involve a machine with two (or more) Myrinet NICs. The first
>> case is that I want the sysadmin to be able to specify which NIC to
>> use. The second case is to use both NICs (e.g. for routing).
>>
>> For the first case, I have a module parameter that allows the
>> caller to set the board ID (0, 1, 2, etc.). That is simple
>> enough. The sysadmin can then modify /etc/modprobe.conf (or
>> /etc/modprobe.d/kmxlnd) to add "options kmxlnd board=1".
>>
>> For the second case, I can't rely on the above since both instances
>> will be passed the same board number. How do other LNDs handle this
>> (or do they)?

In retrospect, implementing link aggregation specifically in each LND
seems like a cul-de-sac, especially w.r.t. code duplication. If LNET
needs it because the underlying networks don't support it, I'd rather
it was moved into LNET proper and restricted to "multi-rail"
configurations; i.e. where every node that connects on one rail
connects on all others. For example...

  options lnet ip2nets="tcp0(eth1) 192.168.1.*; #rail 1 \
                        tcp1(eth2) 192.168.2.*; #rail 2" \
              bond="(tcp0,tcp1)"

...which describes a cluster where every node has 2 NICs. Of course,
the devil is in the details :)

--

Cheers,
        Eric
Eric Barton wrote:
> Duane, Scott,
>
>> Duane Cloud wrote:
>>
>> Is each interface identified by an "lnet_ni" structure, with its own
>> NID, driver, etc, similar to how interfaces are presented to IP? In
>> other words, each remote node would be identified by X NIDs, which
>> map to specific hardware interfaces, where X represents the number
>> of interfaces (e.g. X = 4, if 2 IP interfaces and 2 IB interfaces
>> are present). So which NIC is used would be identified by the
>> lnet_ni associated with that peer's NID...And control of which NIC
>> to use would be via which NID is used when talking to that peer.
>
> Precisely, if by "interface" you mean "LNET interface". But the
> socklnd confuses the issue by allowing a single LNET interface
> (= NID) to span a number of hardware NICs. In this case, the NID is
> taken from the "first" NIC.
>

Thanks for the information...I'll have to take some time looking it
over. I just wanted to clarify a few things. When I used "interface"
I was referring to the actual physical hardware connection to the
network, which is different from the logical representation
implemented in software. I meant to use "interface" in the same sense
that I assumed Scott was using the term NIC. And I quoted "lnet_ni"
because I wasn't sure what term to use to describe the logical
representation of what's physically present in the system. ...I think
I need to stop using "interface" by itself!

--
Thank you,
Duane Cloud
Systems Programmer
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
cloud@ahpcrc.org, 612-337-3407 Desk
On May 31, 2006, at 12:06 PM, Eric Barton wrote:

>> Scott Atchley wrote:
>>
>> Can other LNDs handle multiple NICs of the same type (e.g. IB or
>> Quadrics)? If so, how does LNET specify which NIC an instance uses?
>
> No, not in the way you think; i.e. the socklnd is the only one that
> you can configure with multiple NICs. But since the linux bonding
> driver appears to have made sufficient progress regarding
> load-balancing algorithms and has the upside of transparent
> fault-tolerance, we'd now rather recommend it.
>
> The Quadrics kernel API abstracts multiple NICs, which allows the
> qswlnd to use them all.
>
> The RapidArray LND (ralnd) "knows" that it may have one or 2 NICs and
> uses either by XORing its own and the peer's NID.
>
> All other LNDs only handle a single interface.

What I am interested in is the case of routing. For example, a site
has 10G and older 2G Myrinet networks. They will need either (a)
routers or (b) servers connected to both networks. MX allows me to
use up to 4 NICs.

Does LNET support this type of setup? If so, how would it invoke
MXLND? I am assuming it would have two LNET nids. If so, I would then
need to determine which NIC to use for each nid, no?

>>> I have two cases that I want to make sure that I handle correctly.
>>> Both involve a machine with two (or more) Myrinet NICs. The first
>>> case is that I want the sysadmin to be able to specify which NIC to
>>> use. The second case is to use both NICs (e.g. for routing).
>>>
>>> For the first case, I have a module parameter that allows the
>>> caller to set the board ID (0, 1, 2, etc.). That is simple
>>> enough. The sysadmin can then modify /etc/modprobe.conf (or
>>> /etc/modprobe.d/kmxlnd) to add "options kmxlnd board=1".
>>>
>>> For the second case, I can't rely on the above since both instances
>>> will be passed the same board number. How do other LNDs handle this
>>> (or do they)?
>
> In retrospect, implementing link aggregation specifically in each LND
> seems like a cul-de-sac, especially w.r.t. code duplication. If LNET
> needs it because the underlying networks don't support it, I'd rather
> it was moved into LNET proper and restricted to "multi-rail"
> configurations; i.e. where every node that connects on one rail
> connects on all others. For example...
>
>   options lnet ip2nets="tcp0(eth1) 192.168.1.*; #rail 1 \
>                         tcp1(eth2) 192.168.2.*; #rail 2" \
>               bond="(tcp0,tcp1)"
>
> ...which describes a cluster where every node has 2 NICs. Of course,
> the devil is in the details :)

I'm not interested in link aggregation, but bridging separate
networks.

Thanks,

Scott
> What I am interested in is the case of routing. For example, a site
> has 10G and older 2G Myrinet networks. They will need either (a)
> routers or (b) servers connected to both networks. MX allows me to
> use up to 4 NICs.
>
> Does LNET support this type of setup? If so, how would it invoke
> MXLND? I am assuming it would have two LNET nids. If so, I would then
> need to determine which NIC to use for each nid, no?

Yes. The network driver's lnd_startup() API is called once for each
network specified by the LNET 'networks' module parameter (or each
matching network specified by the 'ip2nets' module parameter). So if
you were to specify...

  options lnet networks="mx0(0),mx1(1)"

...then mxlnd_startup() will be called twice, once for each network.
Since interfaces have been specified explicitly (i.e. "mx0(0)" as
opposed to plain "mx"), ni->ni_interfaces[0] != NULL points to the
string between the brackets for each network; i.e. "0" and "1" in the
example above. You can choose your own interface specification syntax
if you want, but a simple number seems unambiguous.

BTW, the only LND that currently supports > 1 instance of itself is
the socklnd, but that's bound to change. Once your driver supports
multiple instances of itself bound to different NICs like this,
serving on both and/or routing between them will "just work".

Cheers,
        Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
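To make the per-network startup flow above concrete, here is a minimal
sketch of an LND startup routine that parses the explicit interface
specification and stashes per-instance state. lnd_startup(),
ni_interfaces[] and ni_data are the hooks and fields discussed in this
thread; the kmx_instance type, the mxlnd_default_board parameter and
the include path are hypothetical illustrations, not the actual kmxlnd
code.

  #include <linux/kernel.h>
  #include <linux/slab.h>
  #include <lnet/lib-lnet.h>            /* lnet_ni_t; include path assumed */

  /* Hypothetical per-instance state; one is allocated per configured network. */
  struct kmx_instance {
          int        kmx_board;         /* Myrinet board this instance drives */
          lnet_ni_t *kmx_ni;            /* back pointer to our network interface */
  };

  static int mxlnd_default_board = 0;   /* stand-in for the "board" module parameter */

  static int mxlnd_startup(lnet_ni_t *ni)
  {
          struct kmx_instance *inst;
          int board = mxlnd_default_board;

          /* With networks="mx0(0),mx1(1)", LNET passes the string between
           * the brackets here, so each instance can pick its own board. */
          if (ni->ni_interfaces[0] != NULL)
                  board = (int)simple_strtoul(ni->ni_interfaces[0], NULL, 0);

          inst = kzalloc(sizeof(*inst), GFP_KERNEL);
          if (inst == NULL)
                  return -ENOMEM;

          inst->kmx_board = board;
          inst->kmx_ni    = ni;

          /* ...open the board, derive <address-within-network> for ni->ni_nid... */

          /* Keeping all state behind ni_data is what lets two instances,
           * bound to different boards, coexist and route between networks. */
          ni->ni_data = inst;
          return 0;
  }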
Hello Eric,

Again, I appreciate the information. I think I'm getting a better
handle on things. Also it looks like there are a few issues that have
been brought up by this one thread: for example, understanding what
routing and/or bridging capabilities exist, how to do link
aggregation, and the information concerning configuration complexity.
I'd like to continue discussing each of these...But first, some
thoughts/questions about the following:

Eric Barton wrote:
>> What I am interested in is the case of routing. For example, a site
>> has 10G and older 2G Myrinet networks. They will need either (a)
>> routers or (b) servers connected to both networks. MX allows me to
>> use up to 4 NICs.
>>
>> Does LNET support this type of setup? If so, how would it invoke
>> MXLND? I am assuming it would have two LNET nids. If so, I would then
>> need to determine which NIC to use for each nid, no?
>
> Yes. The network driver's lnd_startup() API is called once for each
> network specified by the LNET 'networks' module parameter (or each
> matching network specified by the 'ip2nets' module parameter). So if
> you were to specify...
>
>   options lnet networks="mx0(0),mx1(1)"
>
> ...then mxlnd_startup() will be called twice, once for each network.
> Since interfaces have been specified explicitly (i.e. "mx0(0)" as
> opposed to plain "mx"), ni->ni_interfaces[0] != NULL points to the
> string between the brackets for each network; i.e. "0" and "1" in the
> example above. You can choose your own interface specification syntax
> if you want, but a simple number seems unambiguous.
>

So is it correct to say that the LNET layer supports routing between
NIs (i.e. LNET's network interfaces) independent of what LND is used
to control the source NI and destination NI? If this is true, then I
guess routing and/or bridging are synonymous with regard to LNET. I'm
used to using the term "bridging" to identify connecting different
networking hardware, while "routing" is just providing access between
2 points on a network (e.g. what IP does).

Also, given your description, all an LND's startup function would
have to do is allocate and initialize some memory, to identify and
control the specified hardware interface, and stash a pointer to it
in the ni_data field in order to support multiple instances of
itself, correct? Of course the LND has to be reentrant with respect
to controlling its interface. What is the purpose of the additional
ni_interfaces[] pointers contained in the lnet_ni_t structure?

This would imply that routing/bridging between NIs is accomplished by
first receiving all of the data contained in the packet/message (i.e.
the entire payload as defined by the lnet_hdr_t structure) before
sending the data out via the next NI in the path to the destination
NID. Basically LNET's model is identical to how IP does routing, with
the lnet_hdr_t being equivalent to the IP header. Is this correct?

I appreciate the help understanding how all of this works. Please let
me know if my current understanding is on track.

--
Thank you,
Duane Cloud
Duane,

> Hello Eric,
>
> Is there a reason for not including this e-mail on Lustre-devel?

Nope; finger trouble I guess :)

> Eric Barton wrote:
>
> >> I'm used to using the term "bridging" to identify connecting
> >> different networking hardware, while "routing" is just providing
> >> access between 2 points on a network (e.g. what IP does).
> >
> > Packets may pass through arbitrarily many routers/bridges, so I
> > think of it as routing but a rose by any other name...
>
> Well, actually there's a significant difference in how this "task"
> is accomplished. The most apparent difference is in the impact on
> performance (see below).
>
> >> What is the purpose of the additional ni_interfaces[] pointers
> >> contained in the lnet_ni_t structure?
> >
> > socklnd can establish multiple connections via multiple NICs.
> > ni_interfaces[] are their names. I'll be happier when that's done
> > in LNET proper and ni_interfaces[] becomes ni_interface.
>
> I see...So the direction that LNET is being taken is TCP/IP oriented
> (i.e. the model you are using for the LNET layer is IP), correct?

W.r.t. how routing is done; yes. But not w.r.t. reliability,
fragmentation etc.

> Was there any consideration made to implementing something similar
> to InfiniBand's method of routing between clusters? When routing
> out of the local cluster, InfiniBand adds a Global Route Header.
> This allows for routing between clusters in such a way as to not
> impact performance when going between nodes local to each cluster.

No. I don't believe there is any measurable performance impact from
LNET's use of a single global network address space. The big issue is
taking an interrupt latency; this has to be avoided as much as
possible since the hardware networks underlying LNET can typically
move many kilobytes of data. So although "penalising" the local
network case by always using global addresses may add several bytes
to the message header, the additional latency is in the noise.

> >> This would imply that routing/bridging between NIs is
> >> accomplished by first receiving all of the data contained in the
> >> packet/message (i.e. the entire payload as defined by the
> >> lnet_hdr_t structure) before sending the data out via the next NI
> >> in the path to the destination NID.
> >
> > Yup
> >
> >> Basically LNET's model is identical to how IP does routing, with
> >> the lnet_hdr_t being equivalent to the IP header. Is this
> >> correct?
> >
> > Yup
>
> Given this approach, the best one can do with regards to throughput
> of a single "routed" stream is inversely proportional to the number
> of hops in the path between source and destination nodes. That is,
> as the number of hops increases, the available bandwidth decreases.
>
> This occurs because the Lustre protocol relies on a reliable,
> datagram oriented protocol, which is represented by an LNET message
> (this is true, correct?) and the time to get a message from source
> to destination node using LNET's routing scheme is:
>
>   time = time_to_send_message_to_router +
>          time_for_router_to_process_message +
>          time_to_send_message_to_destination;
>
> where:
>   time_to_send_message_to_router      ~= #bytes / 1st_network_bandwidth
>   time_for_router_to_process_message   = constant
>   time_to_send_message_to_destination ~= #bytes / 2nd_network_bandwidth
>
> So routing between 2 networks of equal bandwidth gives:
>
>   time = O( 2 * (#bytes / bandwidth) )
>
> which indicates that throughput would be O( 1/2 * bandwidth ).
>
> Have I made a mistake?

No, your analysis for a sequential stream of RPCs is quite correct.

> I haven't seen any performance numbers showing how LNET performs
> when routing between networks, and I see that Scott Atchley has done
> some analysis of Myrinet which shows single stream throughput at
> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
> which shows the synchronous nature of the protocol, and I was hoping
> he'd post results once he has things working using LNET's routing
> mechanism. My expectation being that single stream throughput would
> be approximately 320 MBytes...So the results would either confirm my
> analysis, or tell me that I have more to learn.

Indeed. However lustre issues many concurrent bulk RPCs to "keep the
pipe full". This not only ensures that we max out single networks,
but also pipelines dataflow through routers.

--

Cheers,
        Eric
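To put rough numbers on the sequential-versus-pipelined argument, here
is a small back-of-the-envelope model (plain user-space C, not Lustre
code). The 1250 MB/s link rate echoes the 10G Myrinet figure quoted
above; the 1 MB message size and the message count are illustrative
assumptions.

  /* Toy model of store-and-forward routing: each LNET message must be
   * fully received by the router before it is forwarded. */
  #include <stdio.h>

  int main(void)
  {
          double bw     = 1250.0;       /* MB/s on each hop (equal networks) */
          double msg_mb = 1.0;          /* one bulk message */
          int    nmsgs  = 64;           /* messages in a bulk transfer */
          double hop    = msg_mb / bw;  /* time to move one message one hop */

          /* One RPC at a time: hop 2 never overlaps hop 1, so every
           * message pays both hops back to back. */
          double t_seq  = nmsgs * 2.0 * hop;

          /* Many RPCs in flight: while the router forwards message i, the
           * source is already sending message i+1, so only the first
           * message pays the extra hop. */
          double t_pipe = (nmsgs + 1) * hop;

          printf("sequential: %6.1f MB/s (~bandwidth/2)\n",
                 nmsgs * msg_mb / t_seq);
          printf("pipelined : %6.1f MB/s (~full bandwidth)\n",
                 nmsgs * msg_mb / t_pipe);
          return 0;
  }

With one RPC outstanding the model lands at half the link rate,
matching Duane's analysis; with many in flight it approaches the full
rate, which is the point about keeping the pipe full.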
>> I haven't seen any performance numbers showing how LNET performs
>> when routing between networks, and I see that Scott Atchley has done
>> some analysis of Myrinet which shows single stream throughput at
>> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
>> which shows the synchronous nature of the protocol, and I was hoping
>> he'd post results once he has things working using LNET's routing
>> mechanism. My expectation being that single stream throughput would
>> be approximately 320 MBytes...So the results would either confirm my
>> analysis, or tell me that I have more to learn.
>
> Indeed. However lustre issues many concurrent bulk RPCs to "keep the
> pipe full". This not only ensures that we max out single networks,
> but also pipelines dataflow through routers.
>
> --
>
> Cheers,
>         Eric

I plan to test routing soon. I will post numbers as soon as I do. I
expect that we will hit some limits in the host trying to handle two
10G NICs simultaneously (either CPU or memory read/write performance).

To second Eric's statement, concurrent requests do help pipeline both
bulk IO (read and write) as well as metadata requests. Most LNDs
currently allow 8 concurrent requests per pair of hosts and 128 or 256
total requests outstanding.

Scott

--
Scott Atchley
Myricom Inc.
http://www.myri.com
Eric Barton wrote:
> Duane,
>
>> Hello Eric,
>>
>> Is there a reason for not including this e-mail on Lustre-devel?
>
> Nope; finger trouble I guess :)

Thanks...I'm just wondering about whether/when certain discussions
should be "taken off-line". I'm still getting my feet wet.

>> Eric Barton wrote:
>>
>>>> I'm used to using the term "bridging" to identify connecting
>>>> different networking hardware, while "routing" is just providing
>>>> access between 2 points on a network (e.g. what IP does).
>>>
>>> Packets may pass through arbitrarily many routers/bridges, so I
>>> think of it as routing but a rose by any other name...
>>
>> Well, actually there's a significant difference in how this "task"
>> is accomplished. The most apparent difference is in the impact on
>> performance (see below).
>>
>>>> What is the purpose of the additional ni_interfaces[] pointers
>>>> contained in the lnet_ni_t structure?
>>>
>>> socklnd can establish multiple connections via multiple NICs.
>>> ni_interfaces[] are their names. I'll be happier when that's done
>>> in LNET proper and ni_interfaces[] becomes ni_interface.
>>
>> I see...So the direction that LNET is being taken is TCP/IP oriented
>> (i.e. the model you are using for the LNET layer is IP), correct?
>
> W.r.t. how routing is done; yes. But not w.r.t. reliability,
> fragmentation etc.

Understood...Thanks.

>> Was there any consideration made to implementing something similar
>> to InfiniBand's method of routing between clusters? When routing
>> out of the local cluster, InfiniBand adds a Global Route Header.
>> This allows for routing between clusters in such a way as to not
>> impact performance when going between nodes local to each cluster.
>
> No. I don't believe there is any measurable performance impact from
> LNET's use of a single global network address space. The big issue is
> taking an interrupt latency; this has to be avoided as much as
> possible since the hardware networks underlying LNET can typically
> move many kilobytes of data. So although "penalising" the local
> network case by always using global addresses may add several bytes
> to the message header, the additional latency is in the noise.

Well, compared to previous releases, I wouldn't expect there to be any
noticeable impact on throughput for any interface type except Cray's
XT3 system, which uses the Seastar interconnect. For the XT3, this
method of routing represents additional host processing for each
message, not just additional bytes going across the wire. Isn't this
correct?

If the peer structure had knowledge of whether the node was remote or
local, the message could be sent to a "routing portal" with the
additional information containing whatever routing info is necessary.
Thus the "local node" case would remain unchanged.

For RDMA capable interconnects, performance could be enhanced by
eliminating the processing associated with the LNET header for "local"
nodes. The network layer could just RDMA the "payload" to the correct,
predefined location/queue. Of course something has to be done to
support a datagram protocol going over TCP/IP; I just don't like to
see other interconnects, which have reliable datagram/RDMA support,
suffer.

>>>> This would imply that routing/bridging between NIs is
>>>> accomplished by first receiving all of the data contained in the
>>>> packet/message (i.e. the entire payload as defined by the
>>>> lnet_hdr_t structure) before sending the data out via the next NI
>>>> in the path to the destination NID.
>>> Yup
>>>
>>>> Basically LNET's model is identical to how IP does routing, with
>>>> the lnet_hdr_t being equivalent to the IP header. Is this
>>>> correct?
>>>
>>> Yup
>>
>> Given this approach, the best one can do with regards to throughput
>> of a single "routed" stream is inversely proportional to the number
>> of hops in the path between source and destination nodes. That is,
>> as the number of hops increases, the available bandwidth decreases.
>>
>> This occurs because the Lustre protocol relies on a reliable,
>> datagram oriented protocol, which is represented by an LNET message
>> (this is true, correct?) and the time to get a message from source
>> to destination node using LNET's routing scheme is:
>>
>>   time = time_to_send_message_to_router +
>>          time_for_router_to_process_message +
>>          time_to_send_message_to_destination;
>>
>> where:
>>   time_to_send_message_to_router      ~= #bytes / 1st_network_bandwidth
>>   time_for_router_to_process_message   = constant
>>   time_to_send_message_to_destination ~= #bytes / 2nd_network_bandwidth
>>
>> So routing between 2 networks of equal bandwidth gives:
>>
>>   time = O( 2 * (#bytes / bandwidth) )
>>
>> which indicates that throughput would be O( 1/2 * bandwidth ).
>>
>> Have I made a mistake?
>
> No, your analysis for a sequential stream of RPCs is quite correct.
>
>> I haven't seen any performance numbers showing how LNET performs
>> when routing between networks, and I see that Scott Atchley has done
>> some analysis of Myrinet which shows single stream throughput at
>> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
>> which shows the synchronous nature of the protocol, and I was hoping
>> he'd post results once he has things working using LNET's routing
>> mechanism. My expectation being that single stream throughput would
>> be approximately 320 MBytes...So the results would either confirm my
>> analysis, or tell me that I have more to learn.
>
> Indeed. However lustre issues many concurrent bulk RPCs to "keep the
> pipe full". This not only ensures that we max out single networks,
> but also pipelines dataflow through routers.
>

This seems to imply that the onus for getting around limitations
imposed by the transport/network protocol is on the folks
coding/configuring Lustre. In other words, rather than have Lustre do
one send/receive of however many bytes of data and then let the
network do the work optimally, performance will be dependent on how
Lustre chops up the data into requests...So Lustre would have to be
"network aware". I guess it comes down to balancing costs/workloads...

--
Thank you,
Duane Cloud
Duane,

> >> Was there any consideration made to implementing something
> >> similar to InfiniBand's method of routing between clusters? When
> >> routing out of the local cluster, InfiniBand adds a Global Route
> >> Header. This allows for routing between clusters in such a way
> >> as to not impact performance when going between nodes local to
> >> each cluster.
> >
> > No. I don't believe there is any measurable performance impact
> > from LNET's use of a single global network address space. The big
> > issue is taking an interrupt latency; this has to be avoided as
> > much as possible since the hardware networks underlying LNET can
> > typically move many kilobytes of data. So although "penalising"
> > the local network case by always using global addresses may add
> > several bytes to the message header, the additional latency is in
> > the noise.
>
> Well, compared to previous releases, I wouldn't expect there to be
> any noticeable impact on throughput for any interface type except
> Cray's XT3 system, which uses the Seastar interconnect. For the
> XT3, this method of routing represents additional host processing
> for each message, not just additional bytes going across the wire.
> Isn't this correct?

? I don't see what you're driving at here w.r.t. Seastar, unless it's
the fact that we're layering a portals-like protocol (LNET) over a
"native" implementation of portals? True, we're suffering a latency
penalty for not using portals directly. True, the prime motivation was
to achieve routing. But the fact that LNET is _not_ portals means
we're not bound by a standard API primarily targeted at parallel as
opposed to distributed programming.

> If the peer structure had knowledge of whether the node was remote
> or local, the message could be sent to a "routing portal" with the
> additional information containing whatever routing info is
> necessary. Thus the "local node" case would remain unchanged.

Yes, it could.

> For RDMA capable interconnects, performance could be enhanced by
> eliminating the processing associated with the LNET header for
> "local" nodes. The network layer could just RDMA the "payload" to
> the correct, predefined location/queue. Of course something has to
> be done to support a datagram protocol going over TCP/IP; I just
> don't like to see other interconnects, which have reliable
> datagram/RDMA support, suffer.

As I alluded to above, we lost a latency advantage by layering LNET
over portals. But the service thread and backend F/S in the loop still
limit performance even with a perfect network, and mitigate the loss
in aggregate RPC service rates. And we only suffer a marginal
bandwidth penalty, given LNET's outrageous MTU.

> >> I haven't seen any performance numbers showing how LNET performs
> >> when routing between networks, and I see that Scott Atchley has
> >> done some analysis of Myrinet which shows single stream
> >> throughput at roughly 1/2 of the available bandwidth (640 MBytes
> >> / 1,250 MBytes), which shows the synchronous nature of the
> >> protocol, and I was hoping he'd post results once he has things
> >> working using LNET's routing mechanism. My expectation being
> >> that single stream throughput would be approximately 320
> >> MBytes...So the results would either confirm my analysis, or tell
> >> me that I have more to learn.
> >
> > Indeed. However lustre issues many concurrent bulk RPCs to "keep
> > the pipe full". This not only ensures that we max out single
> > networks, but also pipelines dataflow through routers.
>
> This seems to imply that the onus for getting around limitations
> imposed by the transport/network protocol is on the folks
> coding/configuring Lustre. In other words, rather than have Lustre
> do one send/receive of however many bytes of data and then let the
> network do the work optimally, performance will be dependent on how
> Lustre chops up the data into requests...So Lustre would have to be
> "network aware".

Indeed.

<tongue in cheek>
Worse still, lustre has to be nice to disks. They're very particular
about liking only nice large contiguous I/Os coming at them in decent
numbers. And there's even a damn filesystem in the way between lustre
and those disks.
</tongue in cheek>

> I guess it comes down to balancing costs/workloads...

Everywhere between the client and the disk needs to interact
efficiently. IMHO the best you can do is come up with some simple
rules for "keeping the pipe full", and engineer the subsystems to
implement them so that they work everywhere.

--

Cheers,
        Eric