Ashley Pittman
2010-Jun-29 10:13 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hi,

Could anyone confirm the maximum achievable bandwidth over a single 4xQDR IB link into an OSS node? I have many clients doing a write test over IB and want to know the maximum bandwidth we can expect to see for each OSS node. For MPI over these links we see between 3 and 3.5GB/s, but I suspect Lustre is capable of more than this because it's not using DAPL. Is this correct?

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Kevin Van Maren
2010-Jun-29 13:57 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
DAPL is a high-performance interface that uses a small shim to provide a common DMA API on top of (in this case) the IB verbs layer. In general, there is a very small performance impact from using the common API, so you will not get more large-message bandwidth using native IB verbs.

I've never had enough disk bandwidth behind a node to saturate a QDR IB link, so I'm not sure how high LNET will go. If you have an IB test cluster, you should be able to measure the upper limits by creating an OST on a loopback device on tmpfs, although you have to ensure the client-side cache is not skewing your results (hint: boot the clients with something like "mem=1g" to limit the RAM they can use for the cache).

While the QDR IB link bandwidth is 4GB/s (or around 3.9GB/s with 2KB packets), the maximum HCA bandwidth is normally around 3.2GB/s (unidirectional), due to the PCIe overhead of breaking the transaction into (relatively) small packets and managing the packet flow control/credits. This is independent of the protocol, and limited by the PCIe Gen2 x8 interface. You will see somewhat higher bandwidth if your system supports and uses a 256-byte MaxPayload rather than 128 bytes. Use lspci to see what your system is using, as in: "lspci -vv -d 15b3: | grep MaxPayload"

Kevin

Ashley Pittman wrote:
> For MPI over these links we see between 3 and 3.5GB/s, but I suspect
> Lustre is capable of more than this because it's not using DAPL. Is
> this correct?
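To expand on the loopback-on-tmpfs idea, a rough sketch of that setup on a test OSS might look like the following. The tmpfs size, the filesystem name "testfs" and the MGS NID are illustrative placeholders, not values from this thread:

    # back an OST with RAM so the disks drop out of the measurement
    mkdir -p /mnt/ram /mnt/ost_ram
    mount -t tmpfs -o size=8g tmpfs /mnt/ram
    dd if=/dev/zero of=/mnt/ram/ost0 bs=1M seek=8191 count=1   # ~8GB sparse file
    losetup /dev/loop0 /mnt/ram/ost0

    # format and mount it as an ordinary OST (fsname and MGS NID are examples)
    mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@o2ib /dev/loop0
    mount -t lustre /dev/loop0 /mnt/ost_ram

    # on any node with a Mellanox HCA, check the negotiated PCIe MaxPayload
    lspci -vv -d 15b3: | grep MaxPayload

With the OST in RAM, client writes are bounded by the network and the OSS software stack rather than by storage, which is the upper limit being asked about; remember to boot the clients with something like mem=1g as above so their page cache does not inflate the numbers.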
Bernd Schubert
2010-Jun-29 14:15 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hello Ashley, hello Kevin,

I really see no point in using disks to benchmark network performance when lnet_selftest exists. The benchmark order should be:

- test how much the disks can provide
- test the network with lnet_selftest (a minimal lst sketch follows below)

=> make sure Lustre performance is not much below min(disks, lnet_selftest)

Cheers,
Bernd

On Tuesday, June 29, 2010, Kevin Van Maren wrote:
> I've never had enough disk bandwidth behind a node to saturate a QDR IB
> link, so I'm not sure how high LNET will go.

--
Bernd Schubert
DataDirect Networks
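The lnet_selftest step can be driven with the lst utility; a minimal sketch of a bulk-write test from a group of clients to a server, along the lines of the example in the Lustre manual, is below. The NIDs and group names are placeholders only:

    # load the self-test module on every node taking part
    modprobe lnet_selftest

    # on the node driving the test
    export LST_SESSION=$$
    lst new_session bw_test
    lst add_group clients 10.0.0.[2-5]@o2ib
    lst add_group servers 10.0.0.10@o2ib
    lst add_batch bulk_write
    lst add_test --batch bulk_write --from clients --to servers brw write size=1M
    lst run bulk_write
    lst stat servers        # prints bandwidth periodically; interrupt when done
    lst end_session

Because no storage is involved, the bandwidth reported here is the practical LNET ceiling for the o2ib interface, which is the number the later Lustre-level tests should be compared against.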
Atul Vidwansa
2010-Jul-01 04:20 UTC
[Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
I would do the following tests to see real-life performance with QDR IB:

1. See what bandwidth has been negotiated between the IB HCA and your system using ibv_devinfo.
2. Use ib_rdma_bw and ib_send_bw between a pair of Lustre client and server to see how much raw bandwidth you are getting.
3. Use lnet_selftest unidirectional (read OR write) and bidirectional (read AND write) tests to see how much LNET can give you. See the Lustre manual on using lnet_selftest.
4. Benchmark your storage using sgpdd_survey or XDD.
5. Run IOR or IOzone from "multiple" clients to see what throughput you are getting. If you are interested in single-client results, you can run a multi-threaded "dd" from a client on a Lustre filesystem (a short sketch of steps 1, 2 and 5 follows below).

Cheers,
_Atul

On 06/29/2010 07:45 PM, Bernd Schubert wrote:
> I really see no point in using disks to benchmark network performance
> when lnet_selftest exists.
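As a rough illustration of steps 1, 2 and 5, the commands might look like the following; the host name "oss-node", the mount point and the transfer sizes are placeholders:

    # step 1: check what link the HCA actually negotiated
    ibv_devinfo -v | egrep 'state|active_width|active_speed'

    # step 2: raw verbs bandwidth (perftest package)
    ib_send_bw -a                 # on the server
    ib_send_bw -a oss-node        # on the client; "oss-node" is a placeholder

    # step 5: multi-threaded dd from a single client onto its Lustre mount
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/mnt/lustre/ddtest.$i bs=1M count=4096 oflag=direct &
    done
    wait

If each layer measures roughly as expected, the Lustre-level results from IOR or IOzone should approach whichever is lower: the ~3.2GB/s PCIe-limited HCA figure Kevin quoted earlier, or the storage bandwidth behind the OSS.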