Hi there,

I have two Lustre filesystems using InfiniBand as the interconnect. Each filesystem has its own InfiniBand network (physically separated from the other one), its own OSSs and its own storage devices (OSTs). From the Lustre clients of the first filesystem I need to access the second filesystem. For that purpose I would like to set up a Lustre router farm between the two InfiniBand networks. The routers would have several IB interfaces, some of them connected to the IB network of the first filesystem, the others connected to the IB network of the second filesystem.

I am wondering how many Lustre routers I will need. I mean, if I have an available bandwidth of 100 on each side of a router, what will be the maximum reachable bandwidth from clients on one side of the router to servers on the other side? Is it 50? 80? 99? Is the routing process CPU or memory hungry?

Thanks in advance.
Sebastien.
On Wed, 2008-04-09 at 19:07 +0200, Sébastien Buisson wrote:
> I mean, if I have an available bandwidth of 100 on each side of a router,
> what will be the max reachable bandwidth from clients on one side of the
> router to servers on the other side of the router? Is it 50? 80? 99?
> Is the routing process CPU or memory hungry?

While I can't answer these things specifically, another important consideration is the bus architecture involved. How many IB cards can you put on a bus before you saturate the bus?

b.
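As a rough illustration of Brian's point, using typical 2008-era figures (taken here as assumptions, not measurements of any particular router):

  # One 4x DDR IB HCA carries up to ~2 GB/s of data per direction
  #   (20 Gb/s signalling, 16 Gb/s after 8b/10b encoding).
  # A 64-bit / 133 MHz PCI-X bus tops out around 1 GB/s total,
  #   so a single DDR HCA can already saturate that bus on its own.
  # A PCIe x8 (Gen1) slot gives roughly 2 GB/s per direction,
  #   i.e. about one DDR HCA per slot if you want full line rate.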
Let's consider that the internal bus of the machine is big enough that it will not be saturated. In that case, what will be the limiting factor? Memory? CPU? I know that it depends on how many IB cards are plugged into the machine, but generally speaking, is the routing activity CPU or memory hungry?

By the way, are there people on this list who have feedback about Lustre router sizing? For instance, I know that Lustre routers have been set up at LLNL. What is the throughput obtained via the routers, compared to the raw bandwidth of the interconnect?

Thanks,
Sebastien.
Sebastien,

ORNL is assembling a fairly large configuration that will rely heavily on routers. We have a small configuration running that is similar to yours (IB to IB). Being a Cray shop, we also have IB to Portals running.

We are trying to reach 1.25 GB/s per router. We haven't achieved that yet. The best so far is about 1.1 GB/s sustained. We do see bursts up to 1.25-1.3 GB/s, but not sustained. This was measured on an IB to Portals setup with the XT having a PCIe riser (not GA yet). I don't have good numbers on the IB-to-IB routers. If I get some, I will post them to the list.

Depending on the scale of the cluster, memory can be a driver. You will need to allocate enough buffers on the routers to sustain traffic flow. The main driver for memory consumption is the number of 1 MB buffers that the routers will run with.

For IB we have not seen a huge CPU load, but I don't have good numbers to back that up, so CPU may still be a factor. We hope to repeat the above test on a dual-core box (versus the single core used thus far).

I will update the list as things evolve.

--Shane
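To make "allocating enough buffers" concrete, here is a minimal sketch of a router's /etc/modprobe.conf, assuming the LNet router-buffer module parameters of that vintage; the network names, interface names and buffer counts are placeholders, not recommendations:

  # Router node sitting on both IB fabrics (interface names are assumptions)
  options lnet networks="o2ib0(ib0),o2ib1(ib1)" forwarding=enabled
  # Router buffer pools; the large buffers are the 1 MB ones Shane refers to
  options lnet tiny_router_buffers=1024 small_router_buffers=8192 large_router_buffers=1024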
Sebastien,

For the most part we try to match the bandwidth of the disks, to the network, to the number of routers needed. I will be at the Lustre User Group meeting in Sonoma, CA at the end of this month giving a talk about Lustre at LLNL, including our network design and router usage, but here is a quick description.

We have a large federated Ethernet core. We then have edge switches for each of our clusters that have links up to the core, and back down to the routers or tcp-only clients. In a typical situation, if we think one filesystem can achieve 20 GB/s based on disk bandwidth, we try to make sure that the filesystem cluster has 20 GB/s of network bandwidth (1GigE, 10GigE, etc.), and that the routers for the compute cluster total up to 20 GB/s as well. So we may have a server cluster with servers having dual GigE links, and routers with 10GigE links, and we just try to match them up so the numbers are even. Typically, the routers in a cluster are the same node type as the compute cluster, just populated with additional network hardware.

In the future, we will likely be building a router cluster that will bridge our existing federated Ethernet core to a large InfiniBand network, but that is at least one year away.

Most of our routers are rather simple: they have one high-speed interconnect HCA (Quadrics, Mellanox IB) and one network card (dual GigE, or single 10GigE). I don't think we've hit any bus bandwidth limitation, and I haven't seen any of them really pressed for CPU or memory. We do make sure to turn off irq_affinity when we have a single network interface (the 10GigE routers), and we've had to tune the buffers and credits on the routers to get better throughput. We have noticed a problem with serialization of checksum processing on a single core (bz #14690).

The beauty of routers, though, is that if you find that they are all running at capacity, you can always add a couple more and move the bottleneck to the network or disks. We find we are mostly slowed down by the disks.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc@llnl.gov
925.423.9670
Pager: 1.888.203.0641
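As a worked example of the bandwidth-matching Marc describes (the per-router figures are assumptions for illustration, not LLNL's numbers):

  # Target filesystem bandwidth (disk-limited):  20 GB/s
  # Per-router link: one 10GigE NIC, ~1.25 GB/s nominal, ~1.0 GB/s realistic after overhead
  # Routers needed:  20 / 1.0  = 20 routers sized against the realistic rate
  #                  20 / 1.25 = 16 routers if you size against the nominal line rate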
LNet routing is always memory hungry. Be sure to use 64-bit routers, because router buffers can only come from low system memory (i.e. ZONE_DMA and ZONE_NORMAL), which is usually 896 MB on 32-bit systems.

Router CPU usage depends on the network types. When routing between IB networks, the router's CPU usage should be minimal. But when a router is homed on a TCP network, it can be CPU hungry, since the host CPU must be involved in copying data received from the TCP network.

Isaac
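To put rough numbers on that, reusing the illustrative buffer counts from the sketch above (not defaults):

  # Large router buffers are 1 MB each, so:
  #   1024 large buffers x 1 MB = 1024 MB
  #   plus the small and tiny buffer pools, LND queues, and kernel structures.
  # That already exceeds the ~896 MB of ZONE_DMA + ZONE_NORMAL a 32-bit kernel
  # can provide for these allocations -- hence the advice to use 64-bit routers.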
Hello,

Thanks for your feedback; I am very interested in the figures you got. I understand that you reached 1.1 GB/s sustained on an IB to Portals router. But on that router, what is the raw IB bandwidth available, and the raw Portals bandwidth available? In your specific case, what do you think is the bottleneck on the Lustre router?

Sebastien.
Hello Marc,

Thank you for this feedback; it is a very thorough description of how you set up routers at LLNL. Just one question, however: according to you, a simple way to increase routing bandwidth is to add more Lustre routers, so that they are not the bottleneck in the cluster. But at LLNL, how do you deal with the Lustre routing configuration when you add new routers? I mean, how is the network load balanced between all the routers? Is it done in a dynamic way that supports adding or removing routers?

Sebastien.
The routing configuration is in /etc/modprobe.conf, and routes can also be dynamically added with lctl add_route. All of our Lustre servers are tcp servers, and we have client clusters that are tcp only, and we also have IB and Elan clusters. Let's say you want to add a router to one of the IB clusters. We'll assume that there is either a free port on the IB fabric, or we change a compute node into a router node by adding some 10GigE hardware.

The IB cluster is o2ib0.
The Lustre server cluster is using tcp0.
The IB cluster routers have connections on o2ib0 and tcp0.

Assuming you have an existing setup in place using either ip2nets or networks in your modprobe.conf, and that you have existing routes listed in the modprobe.conf, adding a router should be simple. On the client side, add the routes to the modprobe.conf, and on the Lustre servers add the routes to the modprobe.conf. On the new router, make sure it has the same modprobe.conf as the existing routers. This will ensure that the configuration works after a reboot. Since these are production clusters, we don't want to reboot any of them, so we need to add the routes dynamically. Let's say the new router has IP address 172.16.1.100 on tcp0 and 192.168.120.100 on o2ib0; you would need to run the following commands:

On the Lustre servers:
lctl --net o2ib0 add_route 172.16.1.100@tcp0

On the IB clients:
lctl --net tcp0 add_route 192.168.120.100@o2ib0

The clients/servers will add the routes, and they will be down until you start LNET on the new router:
service lnet start

At LLNL, we have one large tcp0 network that all the server clusters belong to, and LNET is smart enough to use all the routers equally, so once we add a new router, it just becomes part of the router pool for that cluster, thereby increasing the bandwidth of that cluster.

In reality, we rarely add new routers. We typically spec out what we call scalable units, so when we add onto a compute cluster, we add a known chunk of compute servers with a known number of routers. For example, if a scalable unit is 144 compute nodes and 4 IB/10GigE routers, then we may buy 4 scalable units, ending up with 576 compute nodes and 16 Lustre routers.

Hope that helps answer your question.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc@llnl.gov
925.423.9670
Pager: 1.888.203.0641
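For reference, a minimal sketch of what the static modprobe.conf configuration Marc mentions might look like, reusing his example addresses; the interface names in parentheses are placeholders:

  # On the IB clients (o2ib0): reach tcp0 via the router's o2ib0 NID
  options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.100@o2ib0"

  # On the Lustre servers (tcp0): reach o2ib0 via the router's tcp0 NID
  options lnet networks="tcp0(eth0)" routes="o2ib0 172.16.1.100@tcp0"

  # On the router itself, which sits on both networks:
  options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding=enabled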
Yes, thank you Marc, that was exactly the piece of information I was looking for: among all the available routes, LNET is smart enough to use them equally. So I can define all the routes on all my clients and all my servers, and LNET will take care of using all the routers equally.

Sebastien.
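As an illustration of declaring "all the routes" in one line, assuming four routers at consecutive addresses and assuming the bracket range expansion accepted by the lnet routes option (an assumption worth checking against the manual for your Lustre version); LNet then balances traffic across all the listed gateways:

  # On the IB clients: point the tcp0 route at all four routers at once
  options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.[100-103]@o2ib0"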