Ashley Pittman
2010-Jun-29 10:13 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hi,

Could anyone confirm the maximum achievable bandwidth over a single 4xQDR IB link into an OSS node? I have many clients doing a write test over IB and want to know the maximum bandwidth we can expect to see for each OSS node. For MPI over these links we see between 3 and 3.5GB/s, but I suspect Lustre is capable of more than this because it's not using DAPL. Is this correct?

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Kevin Van Maren
2010-Jun-29 13:57 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
DAPL is a high-performance interface that uses a small shim to provide a common DMA API on top of (in this case) the IB verbs layer. In general, there is a very small performance impact from using the common API, so you will not get more large-message bandwidth using native IB verbs.

I've never had enough disk bandwidth behind a node to saturate a QDR IB link, so I'm not sure how high LNET will go. If you have an IB test cluster, you should be able to measure the upper limits by creating an OST on a loopback device on tmpfs, although you have to ensure the client-side cache is not skewing your results (hint: boot the clients with something like "mem=1g" to limit the RAM they can use for the cache).

While the QDR IB link bandwidth is 4GB/s (or around 3.9GB/s with 2KB packets), the maximum HCA bandwidth is normally around 3.2GB/s (unidirectional), due to the PCIe overhead of breaking the transaction into (relatively) small packets and managing the packet flow control/credits. This is independent of the protocol, and limited by the PCIe Gen2 x8 interface. You will see somewhat higher bandwidth if your system supports and uses a 256-byte MaxPayload rather than 128 bytes. Use lspci to see what your system is using, as in: "lspci -vv -d 15b3: | grep MaxPayload"

Kevin

Ashley Pittman wrote:
> For MPI over these links we see between 3 and 3.5GB/s, but I suspect
> Lustre is capable of more than this because it's not using DAPL. Is
> this correct?
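To expand on the loopback-on-tmpfs idea, a rough sketch of that setup on a test OSS might look like the following. The tmpfs size, the filesystem name "testfs" and the MGS NID are illustrative placeholders, not values from this thread:

    # back an OST with RAM so the disks drop out of the measurement
    mkdir -p /mnt/ram /mnt/ost_ram
    mount -t tmpfs -o size=8g tmpfs /mnt/ram
    dd if=/dev/zero of=/mnt/ram/ost0 bs=1M seek=8191 count=1   # ~8GB sparse file
    losetup /dev/loop0 /mnt/ram/ost0

    # format and mount it as an ordinary OST (fsname and MGS NID are examples)
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@o2ib /dev/loop0
    mount -t lustre /dev/loop0 /mnt/ost_ram

    # on any node with a Mellanox HCA, check the negotiated PCIe MaxPayload
    lspci -vv -d 15b3: | grep MaxPayload

With the OST in RAM, client writes are bounded by the network and the OSS software stack rather than by storage, which is the upper limit being asked about; remember to boot the clients with something like mem=1g as above so their page cache does not inflate the numbers.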
Bernd Schubert
2010-Jun-29 14:15 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hello Ashley, hello Kevin,

I really see no point in using disks to benchmark network performance when lnet_selftest exists. The benchmark order should be:

- test how much the disks can provide
- test the network with lnet_selftest (a minimal lst sketch follows below)

=> make sure Lustre performance is not much below min(disks, lnet_selftest)

Cheers,
Bernd

On Tuesday, June 29, 2010, Kevin Van Maren wrote:
> I've never had enough disk bandwidth behind a node to saturate a QDR IB
> link, so I'm not sure how high LNET will go.

--
Bernd Schubert
DataDirect Networks
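The lnet_selftest step can be driven with the lst utility; a minimal sketch of a bulk-write test from a group of clients to a server, along the lines of the example in the Lustre manual, is below. The NIDs and group names are placeholders only:

    # load the self-test module on every node taking part
    modprobe lnet_selftest

    # on the node driving the test
    export LST_SESSION=$$
    lst new_session bw_test
    lst add_group clients 10.0.0.[2-5]@o2ib
    lst add_group servers 10.0.0.10@o2ib
    lst add_batch bulk_write
    lst add_test --batch bulk_write --from clients --to servers brw write size=1M
    lst run bulk_write
    lst stat servers        # prints bandwidth periodically; interrupt when done
    lst end_session

Because no storage is involved, the bandwidth reported here is the practical LNET ceiling for the o2ib interface, which is the number the later Lustre-level tests should be compared against.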
Atul Vidwansa
2010-Jul-01 04:20 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
I would do the following tests to see real-life performance with QDR IB:

1. See what bandwidth has been negotiated between the IB HCA and your system using ibv_devinfo.
2. Use ib_rdma_bw and ib_send_bw between a pair of Lustre client and server to see how much raw bandwidth you are getting.
3. Use lnet_selftest unidirectional (read OR write) and bidirectional (read AND write) tests to see how much LNET can give you. See the Lustre manual on using lnet_selftest.
4. Benchmark your storage using sgpdd_survey or XDD.
5. Run IOR or IOzone from "multiple" clients to see what throughput you are getting. If you are interested in single-client results, you can run a multi-threaded "dd" from a client on a Lustre filesystem (a short sketch of steps 1, 2 and 5 follows below).

Cheers,
_Atul

On 06/29/2010 07:45 PM, Bernd Schubert wrote:
> I really see no point in using disks to benchmark network performance
> when lnet_selftest exists.
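As a rough illustration of steps 1, 2 and 5, the commands might look like the following; the host name "oss-node", the mount point and the transfer sizes are placeholders:

    # step 1: check what link the HCA actually negotiated
    ibv_devinfo -v | egrep 'state|active_width|active_speed'

    # step 2: raw verbs bandwidth (perftest package)
    ib_send_bw -a                 # on the server
    ib_send_bw -a oss-node        # on the client; "oss-node" is a placeholder

    # step 5: multi-threaded dd from a single client onto its Lustre mount
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/mnt/lustre/ddtest.$i bs=1M count=4096 oflag=direct &
    done
    wait

If each layer measures roughly as expected, the Lustre-level results from IOR or IOzone should approach whichever is lower: the ~3.2GB/s PCIe-limited HCA figure Kevin quoted earlier, or the storage bandwidth behind the OSS.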