Duane, Scott,

> Duane Cloud wrote:
>
> Is each interface identified by an "lnet_ni" structure, with its own
> NID, driver, etc, similar to how interfaces are presented to IP? In
> other words, each remote node would be identified by X NIDs, which
> map to specific hardware interfaces, where X represents the number
> of interfaces (e.g. X = 4, if 2 IP interfaces and 2 IB interfaces
> are present). So which NIC is used would be identified by the
> lnet_ni associated with that peer's NID...And control of which NIC
> to use would be via which NID is used when talking to that peer.

Precisely, if by "interface" you mean "LNET interface". But the
socklnd confuses the issue by allowing a single LNET interface (= NID)
to span a number of hardware NICs. In this case, the NID is taken
from the "first" NIC.

> Or is each LND identified by one lnet_ni and each implements its
> own decision logic to control multiple interfaces if/when present?

This was the portals (== pre-LNET) model of lustre networking.

> I see where there's been some performance measurements using
> "socknal load balancing" and was wondering if there's some detailed
> information available as to how it works.

Only the socknal (now called the socklnd) was coded to load balance
over multiple NICs; see below...

> Any thoughts floating around about assigning one NID to each peer
> and then having a "list of paths" that can be used to get to that
> peer (e.g. peer NID001 can be reached using the SOCKLND with IP
> address "a.b.c.d", or using the MXLND with address "whatever is
> meaningful to it")?

Yes, this was how portals was configured. Information was specified
in 'lmc' scripts and configured by 'lconf' at runtime. But this
method of configuring lustre was becoming unwieldy. O(#nodes) network
configuration data was required to make these associations, and
routed configurations (i.e. where there are different physical
networks) were growing O(#nodes**2) rather than O(#nets**2).

So we decided to simplify lustre configuration by retiring lmc/lconf
and making network and filesystem configuration separate. This
allowed a single site-wide network configuration, configured by
module parameters, to provide connectivity to multiple lustre
filesystems, configured using the lustre mount command.

At this time we changed the "flat" portals NID into the LNET
<address-within-network>@<network> NID so that the routing table on
each node grows with the number of networks rather than the number of
nodes. Furthermore, we decided that <address-within-network> should
be determined by the underlying transport, so that once you have an
LNET NID, you know right away how to get there without needing any
additional configuration information.

However we still had the "legacy" requirement that the socklnd
support multiple NICs. We included that, but with the assumption
that there is full connectivity between all NICs assigned to the same
LNET network. For example...

  options lnet ip2nets="tcp(eth1,eth2) 192.168.1.*"

...says all nodes in my cluster have 2 NICs, their NIDs are
<IP address of eth1>@tcp, and when they connect, they do so via both
NICs. However this doesn't handle the case where a server has
multiple NICs but clients only have a single NIC e.g...

  options lnet ip2nets="tcp(eth1) 192.168.[1-2].[100-254]; #clients \
                        tcp(eth1,eth2) 192.168.[1-2].[1-99]; #servers"

...which says that my servers have 2 NICs on the 192.168.1.0/24 and
192.168.2.0/24 subnets, but my clients only have one NIC on either
one or the other subnet.
In this configuration, ALL clients will connect to the servers' eth1
IP addresses (i.e. the IP address that determines their NID), but
create no further connections, since they are already connected on
all their (count 'em... 1) NICs. In this case we could use IP routes
on the subnet 2 clients to ensure they can ping the subnet 1 server
addresses, or use 2 LNET networks to make the required connectivity
explicit...

  options lnet ip2nets="tcp0(eth1) 192.168.1.[100-254]; #subnet 1 clients \
                        tcp1(eth1) 192.168.2.[100-254]; #subnet 2 clients \
                        tcp0(eth1) 192.168.1.[1-99]; #servers \
                        tcp1(eth2) 192.168.2.[1-99]; #servers"

...so subnet 1 clients connect using the tcp0 server NIDs, and subnet
2 clients connect using the tcp1 server NIDs.

> Scott Atchley wrote:
>
> Can other LNDs handle multiple NICs of the same type (e.g. IB or
> Quadrics)? If so, how does LNET specify which NIC an instance uses?

No, not in the way you think; i.e. the socklnd is the only one that
you can configure with multiple NICs. But since the linux bonding
driver appears to have made sufficient progress regarding
load-balancing algorithms and has the upside of transparent
fault-tolerance, we'd now rather recommend it.

The Quadrics kernel API abstracts multiple NICs, which allows the
qswlnd to use them all.

The RapidArray LND (ralnd) "knows" that it may have one or 2 NICs and
uses either by XORing its own and the peer's NID.

All other LNDs only handle a single interface.

>> I have two cases that I want to make sure that I handle correctly.
>> Both involve a machine with two (or more) Myrinet NICs. The first
>> case is that I want the sysadmin to be able to specify which NIC to
>> use. The second case is to use both NICs (e.g. for routing).
>>
>> For the first case, I have a module parameter that allows the
>> caller to set the board ID (0, 1, 2, etc.). That is simple
>> enough. The sysadmin can then modify /etc/modprobe.conf (or
>> /etc/modprobe.d/kmxlnd) to add "options kmxlnd board=1".
>>
>> For the second case, I can't rely on the above since both instances
>> will be passed the same board number. How do other LNDs handle this
>> (or do they)?

In retrospect, implementing link aggregation specifically in each LND
seems like a cul-de-sac, especially w.r.t. code duplication. If LNET
needs it because the underlying networks don't support it, I'd rather
it was moved into LNET proper and restricted to "multi-rail"
configurations; i.e. where every node that connects on one rail
connects on all others. For example...

  options lnet ip2nets="tcp0(eth1) 192.168.1.*; #rail 1 \
                        tcp1(eth2) 192.168.2.*; #rail 2" \
              bond="(tcp0,tcp1)"

...which describes a cluster where every node has 2 NICs. Of course,
the devil is in the details :)

--

Cheers,
        Eric
Eric Barton wrote:
> Duane, Scott,
>
>> Duane Cloud wrote:
>>
>> Is each interface identified by an "lnet_ni" structure, with its own
>> NID, driver, etc, similar to how interfaces are presented to IP? In
>> other words, each remote node would be identified by X NIDs, which
>> map to specific hardware interfaces, where X represents the number
>> of interfaces (e.g. X = 4, if 2 IP interfaces and 2 IB interfaces
>> are present). So which NIC is used would be identified by the
>> lnet_ni associated with that peer's NID...And control of which NIC
>> to use would be via which NID is used when talking to that peer.
>
> Precisely, if by "interface" you mean "LNET interface". But the
> socklnd confuses the issue by allowing a single LNET interface
> (= NID) to span a number of hardware NICs. In this case, the NID is
> taken from the "first" NIC.
>

Thanks for the information...I'll have to take some time looking it
over. I just wanted to clarify a few things. When I used "interface"
I was referring to the actual physical hardware connection to the
network, which is different from the logical representation
implemented in software. I meant to use "interface" in the same sense
that I assumed Scott was using the term NIC. And I quoted "lnet_ni"
because I wasn't sure what term to use to describe the logical
representation of what's physically present in the system. ...I think
I need to stop using "interface" by itself!

--
Thank you,
Duane Cloud
Systems Programmer
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
cloud@ahpcrc.org, 612-337-3407 Desk
On May 31, 2006, at 12:06 PM, Eric Barton wrote:

>> Scott Atchley wrote:
>>
>> Can other LNDs handle multiple NICs of the same type (e.g. IB or
>> Quadrics)? If so, how does LNET specify which NIC an instance uses?
>
> No, not in the way you think; i.e. the socklnd is the only one that
> you can configure with multiple NICs. But since the linux bonding
> driver appears to have made sufficient progress regarding
> load-balancing algorithms and has the upside of transparent
> fault-tolerance, we'd now rather recommend it.
>
> The Quadrics kernel API abstracts multiple NICs, which allows the
> qswlnd to use them all.
>
> The RapidArray LND (ralnd) "knows" that it may have one or 2 NICs and
> uses either by XORing its own and the peer's NID.
>
> All other LNDs only handle a single interface.

What I am interested in is the case of routing. For example, a site
has 10G and older 2G Myrinet networks. They will need either (a)
routers or (b) servers connected to both networks. MX allows me to
use up to 4 NICs.

Does LNET support this type of setup? If so, how would it invoke
MXLND? I am assuming it would have two LNET nids. If so, I would then
need to determine which NIC to use for each nid, no?

>>> I have two cases that I want to make sure that I handle correctly.
>>> Both involve a machine with two (or more) Myrinet NICs. The first
>>> case is that I want the sysadmin to be able to specify which NIC to
>>> use. The second case is to use both NICs (e.g. for routing).
>>>
>>> For the first case, I have a module parameter that allows the
>>> caller to set the board ID (0, 1, 2, etc.). That is simple
>>> enough. The sysadmin can then modify /etc/modprobe.conf (or
>>> /etc/modprobe.d/kmxlnd) to add "options kmxlnd board=1".
>>>
>>> For the second case, I can't rely on the above since both instances
>>> will be passed the same board number. How do other LNDs handle this
>>> (or do they)?
>
> In retrospect, implementing link aggregation specifically in each LND
> seems like a cul-de-sac, especially w.r.t. code duplication. If LNET
> needs it because the underlying networks don't support it, I'd rather
> it was moved into LNET proper and restricted to "multi-rail"
> configurations; i.e. where every node that connects on one rail
> connects on all others. For example...
>
>   options lnet ip2nets="tcp0(eth1) 192.168.1.*; #rail 1 \
>                         tcp1(eth2) 192.168.2.*; #rail 2" \
>               bond="(tcp0,tcp1)"
>
> ...which describes a cluster where every node has 2 NICs. Of course,
> the devil is in the details :)

I'm not interested in link aggregation, but bridging separate
networks.

Thanks,

Scott
> What I am interested in is the case of routing. For example, a site
> has 10G and older 2G Myrinet networks. They will need either (a)
> routers or (b) servers connected to both networks. MX allows me to
> use up to 4 NICs.
>
> Does LNET support this type of setup? If so, how would it invoke
> MXLND? I am assuming it would have two LNET nids. If so, I would then
> need to determine which NIC to use for each nid, no?

Yes. The network driver's lnd_startup() API is called once for each
network specified by the LNET 'networks' module parameter (or each
matching network specified by the 'ip2nets' module parameter). So if
you were to specify...

  options lnet networks="mx0(0),mx1(1)"

...then mxlnd_startup() will be called twice, once for each network.
Since interfaces have been specified explicitly (i.e. "mx0(0)" as
opposed to plain "mx"), ni->ni_interfaces[0] != NULL points to the
string between the brackets for each network; i.e. "0" and "1" in the
example above. You can choose your own interface specification syntax
if you want, but a simple number seems unambiguous.

BTW, the only LND that currently supports > 1 instance of itself is
the socklnd, but that's bound to change. Once your driver supports
multiple instances of itself bound to different NICs like this,
serving on both and/or routing between them will "just work".

Cheers,
        Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
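To make the per-network startup flow above concrete, here is a minimal
sketch of an LND startup routine that parses the explicit interface
specification and stashes per-instance state. lnd_startup(),
ni_interfaces[] and ni_data are the hooks and fields discussed in this
thread; the kmx_instance type, the mxlnd_default_board parameter and
the include path are hypothetical illustrations, not the actual kmxlnd
code.

  #include <linux/kernel.h>
  #include <linux/slab.h>
  #include <lnet/lib-lnet.h>            /* lnet_ni_t; include path assumed */

  /* Hypothetical per-instance state; one is allocated per configured network. */
  struct kmx_instance {
          int        kmx_board;         /* Myrinet board this instance drives */
          lnet_ni_t *kmx_ni;            /* back pointer to our network interface */
  };

  static int mxlnd_default_board = 0;   /* stand-in for the "board" module parameter */

  static int mxlnd_startup(lnet_ni_t *ni)
  {
          struct kmx_instance *inst;
          int board = mxlnd_default_board;

          /* With networks="mx0(0),mx1(1)", LNET passes the string between
           * the brackets here, so each instance can pick its own board. */
          if (ni->ni_interfaces[0] != NULL)
                  board = (int)simple_strtoul(ni->ni_interfaces[0], NULL, 0);

          inst = kzalloc(sizeof(*inst), GFP_KERNEL);
          if (inst == NULL)
                  return -ENOMEM;

          inst->kmx_board = board;
          inst->kmx_ni    = ni;

          /* ...open the board, derive <address-within-network> for ni->ni_nid... */

          /* Keeping all state behind ni_data is what lets two instances,
           * bound to different boards, coexist and route between networks. */
          ni->ni_data = inst;
          return 0;
  }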
Hello Eric,

Again, I appreciate the information. I think I'm getting a better
handle on things. Also it looks like there are a few issues that have
been brought up by this one thread: for example, understanding what
routing and/or bridging capabilities exist, how to do link
aggregation, and the information concerning configuration complexity.
I'd like to continue discussing each of these...But first, some
thoughts/questions about the following:

Eric Barton wrote:
>> What I am interested in is the case of routing. For example, a site
>> has 10G and older 2G Myrinet networks. They will need either (a)
>> routers or (b) servers connected to both networks. MX allows me to
>> use up to 4 NICs.
>>
>> Does LNET support this type of setup? If so, how would it invoke
>> MXLND? I am assuming it would have two LNET nids. If so, I would then
>> need to determine which NIC to use for each nid, no?
>
> Yes. The network driver's lnd_startup() API is called once for each
> network specified by the LNET 'networks' module parameter (or each
> matching network specified by the 'ip2nets' module parameter). So if
> you were to specify...
>
>   options lnet networks="mx0(0),mx1(1)"
>
> ...then mxlnd_startup() will be called twice, once for each network.
> Since interfaces have been specified explicitly (i.e. "mx0(0)" as
> opposed to plain "mx"), ni->ni_interfaces[0] != NULL points to the
> string between the brackets for each network; i.e. "0" and "1" in the
> example above. You can choose your own interface specification syntax
> if you want, but a simple number seems unambiguous.
>

So is it correct to say that the LNET layer supports routing between
NIs (i.e. LNET's network interfaces) independent of what LND is used
to control the source NI and destination NI? If this is true, then I
guess routing and/or bridging are synonymous with regard to LNET. I'm
used to using the term "bridging" to identify connecting different
networking hardware, while "routing" is just providing access between
2 points on a network (e.g. what IP does).

Also, given your description, all an LND's startup function would
have to do is allocate and initialize some memory, to identify and
control the specified hardware interface, and stash a pointer to it
in the ni_data field in order to support multiple instances of
itself, correct? Of course the LND has to be reentrant with respect
to controlling its interface. What is the purpose of the additional
ni_interfaces[] pointers contained in the lnet_ni_t structure?

This would imply that routing/bridging between NIs is accomplished by
first receiving all of the data contained in the packet/message (i.e.
the entire payload as defined by the lnet_hdr_t structure) before
sending the data out via the next NI in the path to the destination
NID. Basically LNET's model is identical to how IP does routing, with
the lnet_hdr_t being equivalent to the IP header. Is this correct?

I appreciate the help understanding how all of this works. Please let
me know if my current understanding is on track.

--
Thank you,
Duane Cloud
Duane,

> Hello Eric,
>
> Is there a reason for not including this e-mail on Lustre-devel?

Nope; finger trouble I guess :)

> Eric Barton wrote:
>
> >> I'm used to using the term "bridging" to identify connecting
> >> different networking hardware, while "routing" is just providing
> >> access between 2 points on a network (e.g. what IP does).
> >
> > Packets may pass through arbitrarily many routers/bridges, so I
> > think of it as routing but a rose by any other name...
>
> Well, actually there's a significant difference in how this "task"
> is accomplished. The most apparent difference is in the impact on
> performance (see below).
>
> >> What is the purpose of the additional ni_interfaces[] pointers
> >> contained in the lnet_ni_t structure?
> >
> > socklnd can establish multiple connections via multiple NICs.
> > ni_interfaces[] are their names. I'll be happier when that's done
> > in LNET proper and ni_interfaces[] becomes ni_interface.
>
> I see...So the direction that LNET is being taken is TCP/IP oriented
> (i.e. the model you are using for the LNET layer is IP), correct?

W.r.t. how routing is done; yes. But not w.r.t. reliability,
fragmentation etc.

> Was there any consideration made to implementing something similar
> to InfiniBand's method of routing between clusters? When routing
> out of the local cluster, InfiniBand adds a Global Route Header.
> This allows for routing between clusters in such a way as to not
> impact performance when going between nodes local to each cluster.

No. I don't believe there is any measurable performance impact from
LNET's use of a single global network address space. The big issue is
taking an interrupt latency; this has to be avoided as much as
possible since the hardware networks underlying LNET can typically
move many kilobytes of data. So although "penalising" the local
network case by always using global addresses may add several bytes
to the message header, the additional latency is in the noise.

> >> This would imply that routing/bridging between NIs is
> >> accomplished by first receiving all of the data contained in the
> >> packet/message (i.e. the entire payload as defined by the
> >> lnet_hdr_t structure) before sending the data out via the next NI
> >> in the path to the destination NID.
> >
> > Yup
> >
> >> Basically LNET's model is identical to how IP does routing, with
> >> the lnet_hdr_t being equivalent to the IP header. Is this
> >> correct?
> >
> > Yup
>
> Given this approach, the best one can do with regards to throughput
> of a single "routed" stream is inversely proportional to the number
> of hops in the path between source and destination nodes. That is,
> as the number of hops increases, the available bandwidth decreases.
>
> This occurs because the Lustre protocol relies on a reliable,
> datagram oriented protocol, which is represented by an LNET message
> (this is true, correct?) and the time to get a message from source
> to destination node using LNET's routing scheme is:
>
>   time = time_to_send_message_to_router +
>          time_for_router_to_process_message +
>          time_to_send_message_to_destination;
>
> where:
>   time_to_send_message_to_router      ~= #bytes / 1st_network_bandwidth
>   time_for_router_to_process_message   = constant
>   time_to_send_message_to_destination ~= #bytes / 2nd_network_bandwidth
>
> So routing between 2 networks of equal bandwidth gives:
>
>   time = O( 2 * (#bytes / bandwidth) )
>
> which indicates that throughput would be O( 1/2 * bandwidth ).
>
> Have I made a mistake?

No, your analysis for a sequential stream of RPCs is quite correct.

> I haven't seen any performance numbers showing how LNET performs
> when routing between networks, and I see that Scott Atchley has done
> some analysis of Myrinet which shows single stream throughput at
> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
> which shows the synchronous nature of the protocol, and I was hoping
> he'd post results once he has things working using LNET's routing
> mechanism. My expectation being that single stream throughput would
> be approximately 320 MBytes...So the results would either confirm my
> analysis, or tell me that I have more to learn.

Indeed. However lustre issues many concurrent bulk RPCs to "keep the
pipe full". This not only ensures that we max out single networks,
but also pipelines dataflow through routers.

--

Cheers,
        Eric
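To put rough numbers on the sequential-versus-pipelined argument, here
is a small back-of-the-envelope model (plain user-space C, not Lustre
code). The 1250 MB/s link rate echoes the 10G Myrinet figure quoted
above; the 1 MB message size and the message count are illustrative
assumptions.

  /* Toy model of store-and-forward routing: each LNET message must be
   * fully received by the router before it is forwarded. */
  #include <stdio.h>

  int main(void)
  {
          double bw     = 1250.0;       /* MB/s on each hop (equal networks) */
          double msg_mb = 1.0;          /* one bulk message */
          int    nmsgs  = 64;           /* messages in a bulk transfer */
          double hop    = msg_mb / bw;  /* time to move one message one hop */

          /* One RPC at a time: hop 2 never overlaps hop 1, so every
           * message pays both hops back to back. */
          double t_seq  = nmsgs * 2.0 * hop;

          /* Many RPCs in flight: while the router forwards message i, the
           * source is already sending message i+1, so only the first
           * message pays the extra hop. */
          double t_pipe = (nmsgs + 1) * hop;

          printf("sequential: %6.1f MB/s (~bandwidth/2)\n",
                 nmsgs * msg_mb / t_seq);
          printf("pipelined : %6.1f MB/s (~full bandwidth)\n",
                 nmsgs * msg_mb / t_pipe);
          return 0;
  }

With one RPC outstanding the model lands at half the link rate,
matching Duane's analysis; with many in flight it approaches the full
rate, which is the point about keeping the pipe full.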
>> I haven't seen any performance numbers showing how LNET performs
>> when routing between networks, and I see that Scott Atchley has done
>> some analysis of Myrinet which shows single stream throughput at
>> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
>> which shows the synchronous nature of the protocol, and I was hoping
>> he'd post results once he has things working using LNET's routing
>> mechanism. My expectation being that single stream throughput would
>> be approximately 320 MBytes...So the results would either confirm my
>> analysis, or tell me that I have more to learn.
>
> Indeed. However lustre issues many concurrent bulk RPCs to "keep the
> pipe full". This not only ensures that we max out single networks,
> but also pipelines dataflow through routers.
>
> --
>
> Cheers,
>         Eric

I plan to test routing soon. I will post numbers as soon as I do. I
expect that we will hit some limits in the host trying to handle two
10G NICs simultaneously (either CPU or memory read/write performance).

To second Eric's statement, concurrent requests do help pipeline both
bulk IO (read and write) as well as metadata requests. Most LNDs
currently allow 8 concurrent requests per pair of hosts and 128 or 256
total requests outstanding.

Scott

--
Scott Atchley
Myricom Inc.
http://www.myri.com
Eric Barton wrote:
> Duane,
>
>> Hello Eric,
>>
>> Is there a reason for not including this e-mail on Lustre-devel?
>
> Nope; finger trouble I guess :)

Thanks...I'm just wondering about whether/when certain discussions
should be "taken off-line". I'm still getting my feet wet.

>> Eric Barton wrote:
>>
>>>> I'm used to using the term "bridging" to identify connecting
>>>> different networking hardware, while "routing" is just providing
>>>> access between 2 points on a network (e.g. what IP does).
>>>
>>> Packets may pass through arbitrarily many routers/bridges, so I
>>> think of it as routing but a rose by any other name...
>>
>> Well, actually there's a significant difference in how this "task"
>> is accomplished. The most apparent difference is in the impact on
>> performance (see below).
>>
>>>> What is the purpose of the additional ni_interfaces[] pointers
>>>> contained in the lnet_ni_t structure?
>>>
>>> socklnd can establish multiple connections via multiple NICs.
>>> ni_interfaces[] are their names. I'll be happier when that's done
>>> in LNET proper and ni_interfaces[] becomes ni_interface.
>>
>> I see...So the direction that LNET is being taken is TCP/IP oriented
>> (i.e. the model you are using for the LNET layer is IP), correct?
>
> W.r.t. how routing is done; yes. But not w.r.t. reliability,
> fragmentation etc.

Understood...Thanks.

>> Was there any consideration made to implementing something similar
>> to InfiniBand's method of routing between clusters? When routing
>> out of the local cluster, InfiniBand adds a Global Route Header.
>> This allows for routing between clusters in such a way as to not
>> impact performance when going between nodes local to each cluster.
>
> No. I don't believe there is any measurable performance impact from
> LNET's use of a single global network address space. The big issue is
> taking an interrupt latency; this has to be avoided as much as
> possible since the hardware networks underlying LNET can typically
> move many kilobytes of data. So although "penalising" the local
> network case by always using global addresses may add several bytes
> to the message header, the additional latency is in the noise.

Well, compared to previous releases, I wouldn't expect there to be any
noticeable impact on throughput for any interface type except Cray's
XT3 system, which uses the Seastar interconnect. For the XT3, this
method of routing represents additional host processing for each
message, not just additional bytes going across the wire. Isn't this
correct?

If the peer structure had knowledge of whether the node was remote or
local, the message could be sent to a "routing portal" with the
additional information containing whatever routing info is necessary.
Thus the "local node" case would remain unchanged.

For RDMA capable interconnects, performance could be enhanced by
eliminating the processing associated with the LNET header for "local"
nodes. The network layer could just RDMA the "payload" to the correct,
predefined location/queue. Of course something has to be done to
support a datagram protocol going over TCP/IP; I just don't like to
see other interconnects, which have reliable datagram/RDMA support,
suffer.

>>>> This would imply that routing/bridging between NIs is
>>>> accomplished by first receiving all of the data contained in the
>>>> packet/message (i.e. the entire payload as defined by the
>>>> lnet_hdr_t structure) before sending the data out via the next NI
>>>> in the path to the destination NID.
>>> Yup
>>>
>>>> Basically LNET's model is identical to how IP does routing, with
>>>> the lnet_hdr_t being equivalent to the IP header. Is this
>>>> correct?
>>>
>>> Yup
>>
>> Given this approach, the best one can do with regards to throughput
>> of a single "routed" stream is inversely proportional to the number
>> of hops in the path between source and destination nodes. That is,
>> as the number of hops increases, the available bandwidth decreases.
>>
>> This occurs because the Lustre protocol relies on a reliable,
>> datagram oriented protocol, which is represented by an LNET message
>> (this is true, correct?) and the time to get a message from source
>> to destination node using LNET's routing scheme is:
>>
>>   time = time_to_send_message_to_router +
>>          time_for_router_to_process_message +
>>          time_to_send_message_to_destination;
>>
>> where:
>>   time_to_send_message_to_router      ~= #bytes / 1st_network_bandwidth
>>   time_for_router_to_process_message   = constant
>>   time_to_send_message_to_destination ~= #bytes / 2nd_network_bandwidth
>>
>> So routing between 2 networks of equal bandwidth gives:
>>
>>   time = O( 2 * (#bytes / bandwidth) )
>>
>> which indicates that throughput would be O( 1/2 * bandwidth ).
>>
>> Have I made a mistake?
>
> No, your analysis for a sequential stream of RPCs is quite correct.
>
>> I haven't seen any performance numbers showing how LNET performs
>> when routing between networks, and I see that Scott Atchley has done
>> some analysis of Myrinet which shows single stream throughput at
>> roughly 1/2 of the available bandwidth (640 MBytes / 1,250 MBytes),
>> which shows the synchronous nature of the protocol, and I was hoping
>> he'd post results once he has things working using LNET's routing
>> mechanism. My expectation being that single stream throughput would
>> be approximately 320 MBytes...So the results would either confirm my
>> analysis, or tell me that I have more to learn.
>
> Indeed. However lustre issues many concurrent bulk RPCs to "keep the
> pipe full". This not only ensures that we max out single networks,
> but also pipelines dataflow through routers.
>

This seems to imply that the onus for getting around limitations
imposed by the transport/network protocol is on the folks
coding/configuring Lustre. In other words, rather than have Lustre do
one send/receive of however many bytes of data and then let the
network do the work optimally, performance will be dependent on how
Lustre chops up the data into requests...So Lustre would have to be
"network aware". I guess it comes down to balancing costs/workloads...

--
Thank you,
Duane Cloud
Duane,

> >> Was there any consideration made to implementing something
> >> similar to InfiniBand's method of routing between clusters? When
> >> routing out of the local cluster, InfiniBand adds a Global Route
> >> Header. This allows for routing between clusters in such a way
> >> as to not impact performance when going between nodes local to
> >> each cluster.
> >
> > No. I don't believe there is any measurable performance impact
> > from LNET's use of a single global network address space. The big
> > issue is taking an interrupt latency; this has to be avoided as
> > much as possible since the hardware networks underlying LNET can
> > typically move many kilobytes of data. So although "penalising"
> > the local network case by always using global addresses may add
> > several bytes to the message header, the additional latency is in
> > the noise.
>
> Well, compared to previous releases, I wouldn't expect there to be
> any noticeable impact on throughput for any interface type except
> Cray's XT3 system, which uses the Seastar interconnect. For the
> XT3, this method of routing represents additional host processing
> for each message, not just additional bytes going across the wire.
> Isn't this correct?

? I don't see what you're driving at here w.r.t. Seastar, unless it's
the fact that we're layering a portals-like protocol (LNET) over a
"native" implementation of portals? True, we're suffering a latency
penalty for not using portals directly. True, the prime motivation was
to achieve routing. But the fact that LNET is _not_ portals means
we're not bound by a standard API primarily targeted at parallel as
opposed to distributed programming.

> If the peer structure had knowledge of whether the node was remote
> or local, the message could be sent to a "routing portal" with the
> additional information containing whatever routing info is
> necessary. Thus the "local node" case would remain unchanged.

Yes, it could.

> For RDMA capable interconnects, performance could be enhanced by
> eliminating the processing associated with the LNET header for
> "local" nodes. The network layer could just RDMA the "payload" to
> the correct, predefined location/queue. Of course something has to
> be done to support a datagram protocol going over TCP/IP; I just
> don't like to see other interconnects, which have reliable
> datagram/RDMA support, suffer.

As I alluded to above, we lost a latency advantage by layering LNET
over portals. But the service thread and backend F/S in the loop still
limit performance even with a perfect network, and mitigate the loss
in aggregate RPC service rates. And we only suffer a marginal
bandwidth penalty, given LNET's outrageous MTU.

> >> I haven't seen any performance numbers showing how LNET performs
> >> when routing between networks, and I see that Scott Atchley has
> >> done some analysis of Myrinet which shows single stream
> >> throughput at roughly 1/2 of the available bandwidth (640 MBytes
> >> / 1,250 MBytes), which shows the synchronous nature of the
> >> protocol, and I was hoping he'd post results once he has things
> >> working using LNET's routing mechanism. My expectation being
> >> that single stream throughput would be approximately 320
> >> MBytes...So the results would either confirm my analysis, or tell
> >> me that I have more to learn.
> >
> > Indeed. However lustre issues many concurrent bulk RPCs to "keep
> > the pipe full". This not only ensures that we max out single
> > networks, but also pipelines dataflow through routers.
>
> This seems to imply that the onus for getting around limitations
> imposed by the transport/network protocol is on the folks
> coding/configuring Lustre. In other words, rather than have Lustre
> do one send/receive of however many bytes of data and then let the
> network do the work optimally, performance will be dependent on how
> Lustre chops up the data into requests...So Lustre would have to be
> "network aware".

Indeed.

<tongue in cheek>
Worse still, lustre has to be nice to disks. They're very particular
about liking only nice large contiguous I/Os coming at them in decent
numbers. And there's even a damn filesystem in the way between lustre
and those disks.
</tongue in cheek>

> I guess it comes down to balancing costs/workloads...

Everywhere between the client and the disk needs to interact
efficiently. IMHO the best you can do is come up with some simple
rules for "keeping the pipe full", and engineer the subsystems to
implement them so that they work everywhere.

--

Cheers,
        Eric