Our OSSs with 2x1GB NICs (bonded) appear limited to 1GB worth of write throughput each.

Our setup:
2 OSS serving 1 OST each
Lustre 1.8.5
RHEL 5.4
New Dell M610 blade servers with plenty of CPU and RAM
All SAN fibre connections are at least 4GB

Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's ethernet wire speed.
- We've tried bond modes 6 and 0 on all systems. With mode 6 we see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS.

But 2 clients writing a file will get 2GB of total bandwidth to the filesystem. We have been unable to isolate any particular resource bottleneck. None of the systems (MDS, OSS, or client) seems to be working very hard.

The 1GB-per-OSS threshold is so consistent that it almost appears to be by design - and hopefully we're missing something obvious.

Any advice?

Thanks.

djm
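[A sketch of the kind of direct write test described above. The mount point is hypothetical; oflag=direct keeps the page cache from inflating the result, and the target should be a scratch filesystem on the same SAN, never a live OST:

    # on the OSS: 16GB sequential write straight to a SAN-backed scratch fs
    dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=16384 oflag=direct
]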
Balagopal Pillai
2011-Jan-27 13:48 UTC
[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
I guess you have two gigabit NICs bonded in mode 6 and not two 1GB NICs? (B = bytes, b = bits.) The max aggregate throughput could be about 200MBps out of the 2 bonded NICs. I think mode 0 bonding works only with Cisco EtherChannel or something similar on the switch side. Same with the FC connection - it's 4Gbps (not 4GBps), or about 400-500MBps max throughput.

Maybe you could also check the max read and write capabilities of the RAID controller, not just the network. When testing with dd, some of the data remains as dirty data until it's flushed to disk. I think the default dirty_background_ratio is 10% on RHEL 5, which would be sizable if your OSSs have lots of RAM. There is a chance of lockup of the OSS once it hits the dirty_ratio limit, which is 40% by default. So a bit more aggressive flushing to disk by lowering dirty_background_ratio, and a bit more headroom before it hits dirty_ratio, is generally desirable if your RAID controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps out of both OSSs if they both have two 1Gb NICs in them. Maybe if you have one of the Dell switches with 4 10Gb ports (their PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to increase the aggregate performance. I think over 1GBps from a client is possible in cases where you use InfiniBand and RDMA to deliver data.
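[A minimal sketch of that VM tuning on RHEL 5. The values here are illustrative, not recommendations; the right numbers depend on how much RAM the OSSs have and how fast the controller can drain dirty pages:

    # flush dirty pages to disk earlier (RHEL 5 defaults are 10 and 40)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=20
    # add both keys to /etc/sysctl.conf to persist across reboots
]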
Sorry - little b all the way around.

We're limited to 1Gb per OST.

djm
Kevin Van Maren
2011-Jan-27 17:16 UTC
[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Normally if you are having a problem with write BW, you need to futz with the switch. If you were having problems with read BW, you would need to futz with the server's config (xmit hash policy is the usual culprit).

Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your switch? I normally use 802.3ad mode, assuming your switch supports link aggregation.

I was bonding 2x1Gb links for Lustre back in 2004. That was before BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond xmit hash (with multiple NICs standard, layer2 hashing does not produce a uniform distribution, and can't work if going through a router).

Any one connection (socket, or node/node connection) will use only one gigabit link. While it is possible to use two links using round-robin, that normally only helps for client reads (the server can't choose which link it receives data on; the switch picks that), and it has the serious downside of out-of-order packets on the TCP stream.

[If you want clients to have better bandwidth to a single file, change your default stripe count to 2, so it will hit two different servers.]

Kevin
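[For reference, a sketch of an 802.3ad bond with a layer3+4 hash on RHEL 5. Interface and bond names are illustrative, and the switch ports must be configured as an LACP aggregation group:

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=802.3ad xmit_hash_policy=layer3+4 miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes
]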
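[And a sketch of the stripe-count change from the bracketed note. The directory is hypothetical; on Lustre 1.8, new files created under it inherit the layout:

    # stripe new files across 2 OSTs so a single file can hit both OSSs
    lfs setstripe -c 2 /mnt/lustre/scratch
]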
Appreciate the input.

We've been using mode 6 as I expect it provides the fewest configuration pratfalls. If single-stream throughput becomes our bottleneck, we'll mess with aggregation.

What I can't find is the bottleneck in our current setup. With 4 servers - 2 clients and 2 OSSs - I'd expect 4Gb of aggregate throughput, where each client has a single connection to each OST. Instead we're limited to 2Gb, where each OSS appears limited to 1Gb of I/O.

The strangeness is that iptraf on the OSSs shows traffic through the expected connections (2 x 2), but at only 35%-65% of bandwidth. And a third client writing to the filesystem will briefly increase aggregate throughput, but it quickly settles back to ~2Gb.

djm
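[One way to see how traffic is really distributed across the bond slaves while a test runs. The bond name is illustrative, and sar comes from the sysstat package:

    # per-slave link state and aggregation status
    cat /proc/net/bonding/bond0
    # per-interface throughput, sampled once a second
    sar -n DEV 1
]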