Hi there,

I have two Lustre filesystems using InfiniBand as the interconnect. Each filesystem has its own InfiniBand network (physically separated from the other one), its own OSSs and its own storage devices (OSTs). From the Lustre clients of the first filesystem I need to access the second filesystem. For that purpose I would like to set up a Lustre router farm between the two InfiniBand networks. The routers would have several IB interfaces, some of them connected to the IB network of the first filesystem, the others connected to the IB network of the second filesystem.

I am wondering how many Lustre routers I will need. I mean, if I have an available bandwidth of 100 on each side of a router, what will be the maximum reachable bandwidth from clients on one side of the router to servers on the other side? Is it 50? 80? 99? Is the routing process CPU or memory hungry?

Thanks in advance.
Sebastien.
On Wed, 2008-04-09 at 19:07 +0200, Sébastien Buisson wrote:
> I mean, if I have an available bandwidth of 100 on each side of a router,
> what will be the max reachable bandwidth from clients on one side of the
> router to servers on the other side of the router? Is it 50? 80? 99?
> Is the routing process CPU or memory hungry?

While I can't answer these things specifically, another important consideration is the bus architecture involved. How many IB cards can you put on a bus before you saturate the bus?

b.
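As a rough illustration of Brian's point, using typical 2008-era figures (taken here as assumptions, not measurements of any particular router):

  # One 4x DDR IB HCA carries up to ~2 GB/s of data per direction
  #   (20 Gb/s signalling, 16 Gb/s after 8b/10b encoding).
  # A 64-bit / 133 MHz PCI-X bus tops out around 1 GB/s total,
  #   so a single DDR HCA can already saturate that bus on its own.
  # A PCIe x8 (Gen1) slot gives roughly 2 GB/s per direction,
  #   i.e. about one DDR HCA per slot if you want full line rate.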
Let's consider that the internal bus of the machine is big enough that it will not be saturated. In that case, what will be the limiting factor? Memory? CPU? I know that it depends on how many IB cards are plugged into the machine, but generally speaking, is the routing activity CPU or memory hungry?

By the way, are there people on this list who have feedback about Lustre router sizing? For instance, I know that Lustre routers have been set up at LLNL. What is the throughput obtained via the routers, compared to the raw bandwidth of the interconnect?

Thanks,
Sebastien.
Sebastien,

ORNL is assembling a fairly large configuration that will rely heavily on routers. We have a small configuration running that is similar to yours (IB to IB). Being a Cray shop, we also have IB to Portals running.

We are trying to reach 1.25 GB/s per router. We haven't achieved that yet. The best so far is about 1.1 GB/s sustained. We do see bursts up to 1.25-1.3 GB/s, but not sustained. This was measured on an IB to Portals setup with the XT having a PCIe riser (not GA yet). I don't have good numbers on the IB-to-IB routers. If I get some, I will post them to the list.

Depending on the scale of the cluster, memory can be a driver. You will need to allocate enough buffers on the routers to sustain traffic flow. The main driver for memory consumption is the number of 1 MB buffers that the routers will run with.

For IB we have not seen a huge CPU load, but I don't have good numbers to back that up, so CPU may still be a factor. We hope to repeat the above test on a dual-core box (versus the single core used thus far).

I will update the list as things evolve.

--Shane
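To make "allocating enough buffers" concrete, here is a minimal sketch of a router's /etc/modprobe.conf, assuming the LNet router-buffer module parameters of that vintage; the network names, interface names and buffer counts are placeholders, not recommendations:

  # Router node sitting on both IB fabrics (interface names are assumptions)
  options lnet networks="o2ib0(ib0),o2ib1(ib1)" forwarding=enabled
  # Router buffer pools; the large buffers are the 1 MB ones Shane refers to
  options lnet tiny_router_buffers=1024 small_router_buffers=8192 large_router_buffers=1024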
Sebastien,

For the most part we try to match the bandwidth of the disks, to the network, to the number of routers needed. I will be at the Lustre User Group meeting in Sonoma, CA at the end of this month giving a talk about Lustre at LLNL, including our network design and router usage, but here is a quick description.

We have a large federated Ethernet core. We then have edge switches for each of our clusters that have links up to the core, and back down to the routers or tcp-only clients. In a typical situation, if we think one filesystem can achieve 20 GB/s based on disk bandwidth, we try to make sure that the filesystem cluster has 20 GB/s of network bandwidth (1GigE, 10GigE, etc.), and that the routers for the compute cluster total up to 20 GB/s as well. So we may have a server cluster with servers having dual GigE links, and routers with 10GigE links, and we just try to match them up so the numbers are even. Typically, the routers in a cluster are the same node type as the compute cluster, just populated with additional network hardware.

In the future, we will likely be building a router cluster that will bridge our existing federated Ethernet core to a large InfiniBand network, but that is at least one year away.

Most of our routers are rather simple: they have one high-speed interconnect HCA (Quadrics, Mellanox IB) and one network card (dual GigE, or single 10GigE). I don't think we've hit any bus bandwidth limitation, and I haven't seen any of them really pressed for CPU or memory. We do make sure to turn off irq_affinity when we have a single network interface (the 10GigE routers), and we've had to tune the buffers and credits on the routers to get better throughput. We have noticed a problem with serialization of checksum processing on a single core (bz #14690).

The beauty of routers, though, is that if you find that they are all running at capacity, you can always add a couple more and move the bottleneck to the network or disks. We find we are mostly slowed down by the disks.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc@llnl.gov
925.423.9670
Pager: 1.888.203.0641
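As a worked example of the bandwidth-matching Marc describes (the per-router figures are assumptions for illustration, not LLNL's numbers):

  # Target filesystem bandwidth (disk-limited):  20 GB/s
  # Per-router link: one 10GigE NIC, ~1.25 GB/s nominal, ~1.0 GB/s realistic after overhead
  # Routers needed:  20 / 1.0  = 20 routers sized against the realistic rate
  #                  20 / 1.25 = 16 routers if you size against the nominal line rate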
LNet routing is always memory hungry. Be sure to use 64-bit routers, because router buffers can only come from low system memory (i.e. ZONE_DMA and ZONE_NORMAL), which is usually 896 MB on 32-bit systems.

Router CPU usage depends on the network types. When routing between IB networks, the router's CPU usage should be minimal. But when a router is homed on a TCP network, it can be CPU hungry, since the host CPU must be involved in copying data received from the TCP network.

Isaac
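To put rough numbers on that, reusing the illustrative buffer counts from the sketch above (not defaults):

  # Large router buffers are 1 MB each, so:
  #   1024 large buffers x 1 MB = 1024 MB
  #   plus the small and tiny buffer pools, LND queues, and kernel structures.
  # That already exceeds the ~896 MB of ZONE_DMA + ZONE_NORMAL a 32-bit kernel
  # can provide for these allocations -- hence the advice to use 64-bit routers.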
Hello,

Thanks for your feedback; I am very interested in the figures you got. I understand that you reached 1.1 GB/s sustained on an IB to Portals router. But on that router, what is the raw IB bandwidth available, and the raw Portals bandwidth available? In your specific case, what do you think is the bottleneck on the Lustre router?

Sebastien.
Hello Marc,

Thank you for this feedback; it is a very thorough description of how you set up routers at LLNL. Just one question, however: according to you, a simple way to increase routing bandwidth is to add more Lustre routers, so that they are not the bottleneck in the cluster. But at LLNL, how do you deal with the Lustre routing configuration when you add new routers? I mean, how is the network load balanced between all the routers? Is it done in a dynamic way that supports adding or removing routers?

Sebastien.
The routing configuration is in /etc/modprobe.conf, and routes can also be dynamically added with lctl add_route. All of our Lustre servers are tcp servers, and we have client clusters that are tcp only, and we also have IB and Elan clusters. Let's say you want to add a router to one of the IB clusters. We'll assume that there is either a free port on the IB fabric, or we change a compute node into a router node by adding some 10GigE hardware.

The IB cluster is o2ib0.
The Lustre server cluster is using tcp0.
The IB cluster routers have connections on o2ib0 and tcp0.

Assuming you have an existing setup in place using either ip2nets or networks in your modprobe.conf, and that you have existing routes listed in the modprobe.conf, adding a router should be simple. On the client side, add the routes to the modprobe.conf, and on the Lustre servers add the routes to the modprobe.conf. On the new router, make sure it has the same modprobe.conf as the existing routers. This will ensure that the configuration works after a reboot. Since these are production clusters, we don't want to reboot any of them, so we need to add the routes dynamically. Let's say the new router has IP address 172.16.1.100 on tcp0 and 192.168.120.100 on o2ib0; you would need to run the following commands:

On the Lustre servers:
lctl --net o2ib0 add_route 172.16.1.100@tcp0

On the IB clients:
lctl --net tcp0 add_route 192.168.120.100@o2ib0

The clients/servers will add the routes, and they will be down until you start LNET on the new router:
service lnet start

At LLNL, we have one large tcp0 network that all the server clusters belong to, and LNET is smart enough to use all the routers equally, so once we add a new router, it just becomes part of the router pool for that cluster, thereby increasing the bandwidth of that cluster.

In reality, we rarely add new routers. We typically spec out what we call scalable units, so when we add onto a compute cluster, we add a known chunk of compute servers with a known number of routers. For example, if a scalable unit is 144 compute nodes and 4 IB/10GigE routers, then we may buy 4 scalable units, ending up with 576 compute nodes and 16 Lustre routers.

Hope that helps answer your question.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc@llnl.gov
925.423.9670
Pager: 1.888.203.0641
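For reference, a minimal sketch of what the static modprobe.conf configuration Marc mentions might look like, reusing his example addresses; the interface names in parentheses are placeholders:

  # On the IB clients (o2ib0): reach tcp0 via the router's o2ib0 NID
  options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.100@o2ib0"

  # On the Lustre servers (tcp0): reach o2ib0 via the router's tcp0 NID
  options lnet networks="tcp0(eth0)" routes="o2ib0 172.16.1.100@tcp0"

  # On the router itself, which sits on both networks:
  options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding=enabled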
Yes, thank you Marc, that was exactly the piece of information I was looking for: among all the available routes, LNET is smart enough to use them equally. So I can define all the routes on all my clients and all my servers, and LNET will take care of using all the routers equally.

Sebastien.
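As an illustration of declaring "all the routes" in one line, assuming four routers at consecutive addresses and assuming the bracket range expansion accepted by the lnet routes option (an assumption worth checking against the manual for your Lustre version); LNet then balances traffic across all the listed gateways:

  # On the IB clients: point the tcp0 route at all four routers at once
  options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.[100-103]@o2ib0"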